Ensemble Learning for Fusion of Multiview Vision with Occlusion and Missing Information: Framework and Evaluations with Real-World Data and Applications in Driver Hand Activity Recognition

Greer, Ross; Trivedi, Mohan M. · 2023 · arXiv (Cornell University)

archive: archived · pipeline: cataloged · verified


Summary

This paper addresses the challenge of robust driver state monitoring in autonomous vehicles, specifically focusing on classifying driver hand activity and held objects. The motivation stems from the high risk of crashes associated with driver distraction and the limitations of current steering-wheel sensors, which cannot distinguish between specific hand activities or detect hands off the wheel. The authors identify a critical gap in existing multi-view sensor fusion research: most methods assume complete data from all sensors, whereas real-world applications suffer from "irregular redundancy," where views are intermittently missing due to occlusion, noise, or sensor failure. The study proposes a framework for ensemble learning that handles these missing data instances to improve prediction reliability.

The methodology employs a multi-camera system with four infrared cameras positioned at the steering wheel, rearview mirror, dashboard center, and dashboard driver side. The pipeline involves detecting the driver using Faster-RCNN, extracting 2D pose keypoints using HRNet, and cropping images around the wrists. These crops are fed into Convolutional Neural Networks (ResNet-50 backbones) to classify hand locations (e.g., steering wheel, lap) and held objects (e.g., phone, beverage). To handle missing views, the authors use single imputation, replacing missing frames with zero-valued images. The study compares single-view models against four ensemble strategies: naive voting, weighted majority voting, Bayesian model combination, and late fusion. Late fusion combines feature maps from parallel CNNs via fully connected layers, allowing the model to learn relationships between views. Experiments were conducted on a dataset of 19 subjects, comprising approximately 81,000 frames for hand location and 128,000 frames for held object classification.
The results demonstrate that late fusion outperforms single-view models, even the best-placed single camera, in estimating hand positions and held objects within the training group. Furthermore, the multi-camera late fusion framework achieved the best average performance in cross-group validation, indicating superior generalization to unseen drivers. Late fusion also surpassed the weighted majority voting and Bayesian model combination schemes. The study highlights that while individual camera views vary in data availability and utility, the fusion method leverages redundant and supplemental information across views to maintain high accuracy despite frequent occlusions.

The significance of this work lies in its contribution to safety-critical systems for autonomous driving, particularly during control transitions where driver readiness is paramount. By demonstrating that late fusion can robustly handle irregular redundancy and missing data, the paper provides a generalized framework applicable to other multi-modal sensing tasks beyond driver monitoring. The findings suggest that multi-view ensemble learning offers a more reliable solution than single-camera systems or simple voting mechanisms, enhancing the ability of intelligent vehicles to monitor driver states continuously and accurately in real-world conditions.
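
The two voting baselines named in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `weighted_vote`, the view names, the label strings, and the per-view weights are all hypothetical, and the choice to skip a missing view (rather than let it cast a vote) is an assumption made for illustration.

```python
from collections import defaultdict

def weighted_vote(predictions, weights=None):
    """Combine per-view class predictions by (weighted) majority vote.

    predictions: dict mapping view name -> predicted label, or None when
                 that view is missing (occluded, noisy, or failed).
    weights:     optional dict mapping view name -> vote weight.
                 With weights=None every available view counts equally,
                 which corresponds to naive voting.
    Returns the winning label, or None if no view is available.
    """
    tally = defaultdict(float)
    for view, label in predictions.items():
        if label is None:
            continue  # assumption: a missing view casts no vote
        tally[label] += 1.0 if weights is None else weights[view]
    if not tally:
        return None
    # Iterate labels in sorted order so ties resolve deterministically.
    return max(sorted(tally), key=lambda label: tally[label])

# Hypothetical frame: the rearview camera is missing.
views = {
    "steering_wheel": "wheel",
    "rearview": None,
    "dash_center": "lap",
    "dash_driver": "lap",
}
# Hypothetical weights, e.g. derived from per-view validation accuracy.
view_weights = {
    "steering_wheel": 0.9,
    "rearview": 0.6,
    "dash_center": 0.4,
    "dash_driver": 0.4,
}

naive_label = weighted_vote(views)
weighted_label = weighted_vote(views, view_weights)
```

In this toy frame, naive voting follows the two dashboard cameras, while the weighted vote follows the single more trusted view. Neither scheme can learn interactions between views, which is the gap late fusion is meant to address.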

Key finding

A late-fusion ensemble learning approach combining parallel convolutional neural networks outperforms single-camera models and other ensemble voting schemes in recognizing driver hand activities and held objects, even when facing occlusion and missing data from multiple camera views.
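
As a rough illustration of late fusion with zero imputation, the sketch below replaces each missing view with a zero-valued image, runs every view through a feature extractor, concatenates the features, and applies one fully connected layer. It is a toy under stated assumptions, not the paper's model: `toy_backbone` is a hypothetical two-feature stand-in for a ResNet-50 backbone, images are flat pixel lists, and the layer weights are hand-picked rather than learned.

```python
def toy_backbone(image):
    """Hypothetical stand-in for a CNN backbone: maps an image
    (flat list of pixel values) to a 2-element feature vector."""
    return [sum(image) / len(image), max(image)]

def late_fusion_logits(images, fc_weights, fc_bias, n_pixels):
    """Fuse per-view features and score each class.

    images:     one entry per camera view; None marks a missing view.
    fc_weights: one weight row per class, each as long as the fused vector.
    fc_bias:    one bias per class.
    n_pixels:   size of the zero-valued image used for imputation.
    """
    fused = []
    for image in images:
        if image is None:
            image = [0.0] * n_pixels  # single imputation with a zero image
        fused.extend(toy_backbone(image))
    # One fully connected layer over the concatenated features.
    return [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(fc_weights, fc_bias)
    ]

def predict(logits):
    """Index of the highest-scoring class."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Four views, two of them missing; two classes; hand-picked weights.
images = [[1, 1, 1, 1], None, [0, 2, 0, 2], None]
fc_weights = [[1.0] * 8, [0.0] * 8]
fc_bias = [0.0, 1.0]

logits = late_fusion_logits(images, fc_weights, fc_bias, n_pixels=4)
predicted_class = predict(logits)
```

Because the fully connected layer sees all views at once, a trained version could in principle learn how much to trust each view and how to treat the all-zero pattern of a missing one, which per-view voting cannot do.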

Methodology

Lab experiment

Sample size: 19 subjects

Provenance

The full processing record for this entry. Every stage of this paper's journey through the pipeline is logged — what ran, with which tool and model, how many attempts it took, and when it last completed. Discovered via author_sweep_intake on 2026-05-28.

Stage	Outcome	Tool	Model	Prompt	Attempts	Completed
discover	success	author_sweep	—	—	2	2026-05-28
archive	success	canonical_url	—	—	1	2026-06-04
extract	success	cached	—	—	3	2026-06-10
clean	success	clean	—	—	1	2026-06-04
chunk	success	chunk	—	—	1	2026-06-04
embed	success	embed	Qwen/Qwen3-Embedding-8B	—	1	2026-06-04
enrich	success	—	—	—	1	2026-05-28
promote	success	—	—	—	1	2026-06-04
summarize	success	llm	qwen3.6-27b-prismaquant	summ-v5	2	2026-06-10
tag	success	vector_similarity	—	—	15	2026-06-11
verify	success	—	—	—	2	2026-06-10

Summary generated by qwen3.6-27b-prismaquant on 2026-06-10; verification: verified.

Topics

Ranked by relevance to this paper.

distraction detection algorithms