ML-PersRef: A Machine Learning-based Personalized Multimodal Fusion Approach for Referencing Outside Objects From a Moving Vehicle

Gomaa, Amr; Reyes, Guillermo; Feld, Michael · 2021 · ACM ICMI 2021

DOI: 10.1145/3462244.3479910

Archive: archived · Pipeline: cataloged · Verified

Summary

This paper addresses the challenge of referencing objects outside a moving vehicle using multimodal interaction, specifically combining eye gaze and pointing gestures. While modern vehicles possess numerous sensors enabling novel interaction methods, existing systems often rely on single-modality approaches or limited fusion where one modality merely triggers another. The authors argue that explicit multimodal fusion is necessary to improve performance in dynamic driving scenarios but is hindered by complexity and individual behavioral differences. The study proposes a machine learning-based personalized fusion approach to overcome these limitations, aiming to enhance prediction accuracy by adapting to individual driver behaviors.

The researchers conducted a within-subject experiment using a driving simulator with 39 participants. Participants performed referencing tasks on Points of Interest (PoIs) while driving, with data collected via hand-tracking cameras and eye-tracking glasses. The experimental design varied environment density and PoI distance to ensure robustness. The methodology compared three fusion strategies: Late Fusion (using pre-processed horizontal angles), Early Fusion (using raw sensor vectors), and a Hybrid approach that clusters participants based on their pointing and gaze accuracy using Gaussian Mixture Models. Frame selection methods were also tested, including using all frames, middle frames, or a combination of middle pointing and first gaze frames. Models included Support Vector Regression (SVR) and deep learning architectures (FCNN, CNN, LSTM). To evaluate performance, the authors introduced the Mean Relative Distance-agnostic Error (MRDE), a metric that normalizes error against the angular width of the target object, alongside Root Mean Squared Error (RMSE).

The results demonstrate that multimodal fusion approaches outperform single-modality methods across various conditions. Specifically, the Early Fusion approach generally yielded better performance than Late Fusion. The study found that personalized models, which adapt to individual user clusters, significantly outperformed universal background models. This personalization was effective even with small data sizes, leveraging the transfer-of-learning concept to enhance prediction accuracy for specific drivers. The Hybrid fusion method, which applies early fusion separately to different behavioral clusters, proved particularly effective in capturing individualistic referencing behaviors. Deep learning models were compared against traditional machine learning, with SVR showing competitive results given the dataset size.

The significance of this work lies in demonstrating that personalized multimodal fusion can substantially improve object referencing in dynamic environments like moving vehicles. By accounting for individual behavioral differences, the proposed system offers a more robust and adaptable solution for human-vehicle interaction. The findings suggest that future automotive interfaces should move beyond generic fusion rules toward personalized, learning-based systems that can handle the inherent variability in user behavior. This approach enhances the reliability of referencing tasks, which is critical for safety and usability in advanced driver assistance systems and autonomous vehicles.
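To make the two evaluation metrics concrete, here is a minimal sketch in plain Python. It assumes MRDE is the mean absolute angular error divided by each target's angular width, which is one reading of the summary's description; the paper's exact formula may differ. The function names and the sample angles are hypothetical.

```python
import math

def rmse(predicted, target):
    """Root Mean Squared Error between predicted and true angles (degrees)."""
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared) / len(squared))

def mrde(predicted, center, width):
    """Assumed form of MRDE: absolute angular error divided by the target's
    angular width, averaged over trials. Dividing by the width makes the
    score comparable between near (wide) and far (narrow) targets."""
    relative = [abs(p - c) / w for p, c, w in zip(predicted, center, width)]
    return sum(relative) / len(relative)

# Hypothetical trials: predicted angle, PoI center angle, PoI angular width.
predicted = [10.0, 20.0]
center = [12.0, 20.0]
width = [4.0, 8.0]

print(rmse(predicted, center))
print(mrde(predicted, center, width))
```

Under this assumed definition, a value of 0.5 for a single trial would mean the prediction missed the target's center by half of the target's angular width.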

Key finding

A personalized hybrid multimodal fusion approach combining gaze and pointing gestures outperforms both single-modality methods and non-personalized fusion models in accurately referencing objects outside a moving vehicle.
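The hybrid idea described above (cluster participants by pointing and gaze accuracy with a Gaussian Mixture Model, then train one fusion regressor per cluster) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the data are synthetic, the two noisy angles stand in for the raw sensor vectors, and the number of clusters, the SVR settings, and all names are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVR

rng = np.random.default_rng(0)
n_participants, n_trials = 12, 40

# Hypothetical per-participant [pointing error, gaze error] in degrees,
# drawn from two synthetic behavioural groups.
accuracy = np.vstack([
    rng.normal([3.0, 8.0], 0.5, size=(6, 2)),  # more accurate when pointing
    rng.normal([8.0, 3.0], 0.5, size=(6, 2)),  # more accurate with gaze
])

def make_trials(point_err, gaze_err):
    """Simulate one participant: noisy pointing and gaze angles per target."""
    target = rng.uniform(-40.0, 40.0, size=n_trials)
    pointing = target + rng.normal(0.0, point_err, size=n_trials)
    gaze = target + rng.normal(0.0, gaze_err, size=n_trials)
    # Concatenated modality features (stand-in for raw sensor vectors).
    return np.column_stack([pointing, gaze]), target

data = [make_trials(*accuracy[p]) for p in range(n_participants)]

# Step 1: cluster participants by their accuracy profile.
gmm = GaussianMixture(n_components=2, random_state=0).fit(accuracy)
labels = gmm.predict(accuracy)

# Step 2: fit one fusion regressor per cluster on the pooled member data.
models = {}
for c in np.unique(labels):
    members = np.flatnonzero(labels == c)
    X = np.vstack([data[p][0] for p in members])
    y = np.concatenate([data[p][1] for p in members])
    models[int(c)] = SVR(kernel="rbf", C=10.0).fit(X, y)

def predict_for(participant, X):
    """Predict target angles using the model of the participant's cluster."""
    return models[int(labels[participant])].predict(X)

print(labels)
print(predict_for(0, data[0][0])[:3])
```

A new driver would first be assigned to a cluster from a small calibration set and then served by that cluster's model, which is one way to read the summary's remark that personalization remains effective with small data sizes.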

Methodology

Driving simulator (within-subject experiment)

Sample size: 39

Provenance

The full processing record for this entry. Every stage of this paper's journey through the pipeline is logged — what ran, with which tool and model, how many attempts it took, and when it last completed. Discovered via openalex_abstract on 2026-05-08 (4 acquisition events logged).

| Stage | Outcome | Tool | Model | Prompt | Attempts | Completed |
|---|---|---|---|---|---|---|
| discover | success | | | | 1 | 2026-05-07 |
| archive | success | unpaywall | | | 5 | 2026-06-06 |
| extract | success | cached | | | 3 | 2026-06-10 |
| clean | success | clean | | | 1 | 2026-06-04 |
| chunk | success | chunk | | | 1 | 2026-06-04 |
| embed | success | embed | Qwen/Qwen3-Embedding-8B | | 1 | 2026-06-04 |
| enrich | success | openalex | | | 2 | 2026-05-08 |
| promote | success | | | | 1 | 2026-05-07 |
| summarize | success | llm | qwen3.6-27b-prismaquant | summ-v5 | 2 | 2026-06-10 |
| tag | success | vector_similarity | | | 15 | 2026-06-11 |
| verify | success | | | | 2 | 2026-06-10 |

Summary generated by qwen3.6-27b-prismaquant on 2026-06-10; verification: verified.

Topics

Ranked by relevance to this paper.