Driver Activity Classification Using Generalizable Representations from Vision-Language Models

Greer, Ross; Andersen, Mathias Viborg; Møgelmose, Andreas; Trivedi, Mohan M. · 2024 · VBN Forskningsportal (Aalborg Universitet)

DOI: 10.48550/arxiv.2404.14906

Summary

This paper addresses driver activity classification, a critical component of road safety in both driver assistance systems and control transitions in autonomous vehicles. The authors observe that traditional data-driven methods often generalize poorly: they overfit to driver identity (such as facial features or skin tone) and adapt badly to open-set environments where novel activities or unseen drivers appear. To mitigate these issues, the study leverages generalizable representations from pretrained vision-language models, specifically contrastively learned embeddings that encode semantic meaning rather than raw pixel values. The aim is robust, interpretable, and adaptable driver monitoring without extensive fine-tuning for new subjects.

The proposed method is a Semantic Representation Late Fusion Neural Network (SRLF-Net). The architecture processes synchronized video frames from three in-cabin camera perspectives (dashboard, rear-view, and side-view). Each frame is encoded by a pretrained CLIP Vision Transformer encoder into a 768-dimensional embedding. The embeddings pass through independent fully-connected encoders and are then fused in a deep fully-connected network that predicts class probabilities. To further improve generalizability and reduce overfitting to visual artifacts, the authors introduce an order-based augmentation strategy: the order of the three camera views is randomized during training, forcing the model to rely on consistent semantic features rather than view-specific pixel configurations. In post-processing, a mode filter smooths predictions over time, accounting for the natural duration of driver activities.
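
The order-based augmentation and the mode filter described in the summary can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the centered window, the window size, and the tie-breaking rule are all assumptions, since the summary does not specify them.

```python
import random
from collections import Counter


def shuffle_view_order(view_embeddings, rng=random):
    """Return the per-view embeddings in a random order.

    Hypothetical helper: applied per training sample so the fused network
    cannot rely on a fixed camera position in its input.
    """
    shuffled = list(view_embeddings)  # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    return shuffled


def mode_filter(labels, window=5):
    """Replace each label with the most common label in a centered window.

    Assumptions (not from the paper): an odd window size, truncation at the
    sequence edges, and ties resolved in favor of the original label.
    """
    half = window // 2
    smoothed = []
    for i, current in enumerate(labels):
        neighborhood = labels[max(0, i - half): i + half + 1]
        counts = Counter(neighborhood)
        best = max(counts.values())
        if counts[current] == best:
            # The original label is already a mode of the window: keep it.
            smoothed.append(current)
        else:
            # Otherwise take the first label in the window that is a mode.
            smoothed.append(next(l for l in neighborhood if counts[l] == best))
    return smoothed
```

With a window of 3, an isolated "phone" prediction inside a run of "drive" frames is overwritten by "drive", which is the smoothing effect the summary attributes to the mode filter.
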
The model was evaluated on the Naturalistic Driving Action Recognition Dataset from the AI City Challenge, which contains approximately 62 hours of footage from 69 participants performing 16 distinct activities, including normal driving, phone use, eating, and reaching. The evaluation used 7-fold cross-validation with strict subject separation between training and test sets to assess generalizability. SRLF-Net achieved an average accuracy of 71.64% across all classes, well above random selection. Analysis revealed a bias toward the majority class ("Normal Forward Driving"), which made up 59.01% of the data. When the normal driving class was excluded to evaluate performance on distracting activities alone, the model achieved 70.06% accuracy, with the mode filter raising this figure from 63.66%. Phone calls and hand-on-head gestures were the most accurately classified distraction types.

The study concludes that vision-language representations are a promising avenue for driver monitoring, balancing accuracy with interpretability through natural language descriptors. According to the authors, foundation model encodings let the system generalize better to unseen drivers and activities than traditional pixel-based methods, and they support more robust, open-set capable monitoring systems whose outputs can be explained directly in language. Future work includes integrating temporal models such as LSTMs, exploring open-set novelty detection, and evaluating the method on datasets with inconsistent camera views to improve real-world applicability.
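
The subject-separated cross-validation mentioned in the summary can be illustrated as follows. This is a sketch under stated assumptions: the round-robin assignment of shuffled subjects to folds, the seed, and the function names are hypothetical, and the paper's actual fold assignment is not given here.

```python
import random


def subject_folds(subject_ids, n_folds=7, seed=0):
    """Partition the unique subject IDs into n_folds disjoint groups.

    Assumption: subjects are shuffled once and dealt round-robin into folds.
    """
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    return [subjects[i::n_folds] for i in range(n_folds)]


def split_indices(subject_ids, test_subjects):
    """Return (train, test) sample indices for one fold.

    Every sample from a held-out subject goes to the test side, so no
    driver appears in both training and test data.
    """
    held_out = set(test_subjects)
    train = [i for i, s in enumerate(subject_ids) if s not in held_out]
    test = [i for i, s in enumerate(subject_ids) if s in held_out]
    return train, test
```

Here `subject_ids` holds one subject label per sample (for example, per frame). With 69 subjects and 7 folds, this scheme yields six folds of 10 subjects and one fold of 9.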

Key finding

The proposed method, SRLF-Net, uses vision-language representations to achieve an average accuracy of 71.64% across 16 driver activity classes under subject-separated 7-fold cross-validation, well above random selection.

Methodology

dataset

Sample size: 69 participants

Provenance

The full processing record for this entry. Every stage of this paper's journey through the pipeline is logged — what ran, with which tool and model, how many attempts it took, and when it last completed. Discovered via author_sweep_intake on 2026-05-28.

Stage	Outcome	Tool	Model	Prompt	Attempts	Completed
discover	success	author_sweep	—	—	2	2026-05-28
archive	success	canonical_url	—	—	1	2026-06-04
extract	success	cached	—	—	3	2026-06-10
clean	success	clean	—	—	1	2026-06-04
chunk	success	chunk	—	—	1	2026-06-04
embed	success	embed	Qwen/Qwen3-Embedding-8B	—	1	2026-06-04
enrich	success	—	—	—	1	2026-05-28
promote	success	—	—	—	1	2026-06-04
summarize	success	llm	qwen3.6-27b-prismaquant	summ-v5	2	2026-06-10
tag	success	vector_similarity	—	—	15	2026-06-11
verify	success	—	—	—	2	2026-06-10

Summary generated by qwen3.6-27b-prismaquant on 2026-06-10; verification: verified.

Information type

What kind of knowledge this paper contributes, grouped by family — independent of topic (what it is about) and method (how it was studied).

Methodological Resource: tool software
Theoretical Contribution: computational model, conceptual framework