Ensemble Learning for Fusion of Multiview Vision with Occlusion and Missing Information: Framework and Evaluations with Real-World Data and Applications in Driver Hand Activity Recognition

Greer, Ross; Trivedi, Mohan M. · 2023 · arXiv (Cornell University)

DOI: 10.48550/arxiv.2301.12592

archive: archived pipeline: cataloged verified

Summary

This paper addresses the challenge of robust driver state monitoring in autonomous vehicles, specifically focusing on classifying driver hand activity and held objects. The motivation stems from the high risk of crashes associated with driver distraction and the limitations of current steering-wheel sensors, which cannot distinguish between specific hand activities or detect where the hands are once they leave the wheel. The authors identify a critical gap in existing multi-view sensor fusion research: most methods assume complete data from all sensors, whereas real-world applications suffer from "irregular redundancy," where views are intermittently missing due to occlusion, noise, or sensor failure. The study proposes an ensemble learning framework that handles these missing data instances to improve prediction reliability.

The methodology employs a multi-camera system with four infrared cameras positioned at the steering wheel, rearview mirror, dashboard center, and dashboard driver side. The pipeline detects the driver using Faster R-CNN, extracts 2D pose keypoints using HRNet, and crops images around the wrists. These crops are fed into Convolutional Neural Networks (ResNet-50 backbones) to classify hand locations (e.g., steering wheel, lap) and held objects (e.g., phone, beverage). To handle missing views, the authors use single imputation, replacing missing frames with zero-valued images. The study compares single-view models against four ensemble strategies: naive voting, weighted majority voting, Bayesian model combination, and late fusion. Late fusion combines feature maps from parallel CNNs via fully connected layers, allowing the model to learn relationships between views. Experiments were conducted on a dataset of 19 subjects, comprising approximately 81,000 frames for hand location and 128,000 frames for held object classification.
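The two voting strategies named above can be illustrated with a minimal sketch. This is not the authors' code: the camera names, class labels, and weights are hypothetical, and it assumes that a missing view simply abstains from the vote, which the summary does not specify (the paper's stated handling of missing frames is zero-valued image imputation).

```python
from collections import defaultdict
from typing import Dict, Optional

# Hypothetical structure: view name -> predicted label, or None if the view is missing.
Predictions = Dict[str, Optional[str]]


def weighted_vote(preds: Predictions, weights: Dict[str, float]) -> Optional[str]:
    """Weighted majority vote. Assumption: missing views (None) abstain."""
    totals: Dict[str, float] = defaultdict(float)
    for view, label in preds.items():
        if label is not None:
            totals[label] += weights.get(view, 0.0)
    if not totals:
        return None  # every view was missing for this frame
    # On a tie, max() returns the label that was tallied first.
    return max(totals, key=totals.get)


def naive_vote(preds: Predictions) -> Optional[str]:
    """Unweighted majority vote: every available view counts equally."""
    return weighted_vote(preds, {view: 1.0 for view in preds})


# Illustrative frame: the steering-wheel camera is occluded.
frame = {
    "steering_wheel": None,
    "rearview": "lap",
    "dash_center": "wheel",
    "dash_driver": "lap",
}
# Hypothetical per-view weights, e.g. derived from validation accuracy.
view_weights = {
    "steering_wheel": 0.9,
    "rearview": 0.2,
    "dash_center": 0.7,
    "dash_driver": 0.3,
}

naive_result = naive_vote(frame)
weighted_result = weighted_vote(frame, view_weights)
```

With these made-up numbers the two schemes disagree: two of the three available views vote "lap", but the single higher-weighted view outweighs them. Neither scheme can learn relationships between views, which is the capability the summary attributes to late fusion.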
The results demonstrate that late fusion outperforms single-view models, even the best-placed single camera, in estimating hand positions and objects within the training group. Furthermore, the multi-camera late fusion framework achieved the best average performance in cross-group validation, indicating superior generalization to unseen drivers. The fusion approach also surpassed the weighted majority voting and Bayesian model combination schemes. The study highlights that while individual camera views vary in data availability and utility, the fusion method effectively leverages redundant and supplemental information across views to maintain high accuracy despite frequent occlusions.

The significance of this work lies in its contribution to safety-critical systems for autonomous driving, particularly during control transitions where driver readiness is paramount. By demonstrating that late fusion can robustly handle irregular redundancy and missing data, the paper provides a generalized framework applicable to other multi-modal sensing tasks beyond driver monitoring. The findings suggest that multi-view ensemble learning offers a more reliable solution than single-camera systems or simple voting mechanisms, enhancing the ability of intelligent vehicles to monitor driver states continuously and accurately in real-world conditions.

Key finding

A late-fusion ensemble learning approach combining parallel convolutional neural networks outperforms single-camera models and other ensemble voting schemes in recognizing driver hand activities and held objects, even when facing occlusion and missing data from multiple camera views.

Methodology

lab_experiment

Sample size: 19

Provenance

The full processing record for this entry. Every stage of this paper's journey through the pipeline is logged — what ran, with which tool and model, how many attempts it took, and when it last completed. Discovered via author_sweep_intake on 2026-05-28.

Stage      Outcome  Tool               Model                    Prompt   Attempts  Completed
discover   success  author_sweep       -                        -        2         2026-05-28
archive    success  canonical_url      -                        -        1         2026-06-04
extract    success  cached             -                        -        3         2026-06-10
clean      success  clean              -                        -        1         2026-06-04
chunk      success  chunk              -                        -        1         2026-06-04
embed      success  embed              Qwen/Qwen3-Embedding-8B  -        1         2026-06-04
enrich     success  -                  -                        -        1         2026-05-28
promote    success  -                  -                        -        1         2026-06-04
summarize  success  llm                qwen3.6-27b-prismaquant  summ-v5  2         2026-06-10
tag        success  vector_similarity  -                        -        15        2026-06-11
verify     success  -                  -                        -        2         2026-06-10

Summary generated by qwen3.6-27b-prismaquant on 2026-06-10; verification: verified.

Topics

Ranked by relevance to this paper.