Driver Activity Classification Using Generalizable Representations from Vision-Language Models
DOI: 10.48550/arxiv.2404.14906
archive: archived pipeline: cataloged verified
Summary
This paper addresses the challenge of driver activity classification, a critical component for ensuring road safety in both driver assistance systems and autonomous vehicle control transitions. The authors identify that traditional data-driven methods often struggle with generalization due to overfitting to specific driver identities (such as facial features or skin tone) and the difficulty of adapting to open-set environments where novel activities or unseen drivers appear. To mitigate these issues, the study proposes a novel approach leveraging generalizable representations from pretrained vision-language models, specifically using contrastively-learned embeddings that encode semantic meaning rather than raw pixel values. This method aims to provide robust, interpretable, and adaptable driver monitoring without requiring extensive fine-tuning for new subjects.
The proposed method employs a Semantic Representation Late Fusion Neural Network (SRLF-Net). The architecture processes synchronized video frames from three distinct in-cabin camera perspectives (dashboard, rear-view, and side-view). Each frame is encoded using a pretrained CLIP Vision Transformer encoder to generate 768-dimensional embeddings. These embeddings are passed through independent fully-connected encoders and then fused into a deep fully-connected network to predict class probabilities. To further enhance generalizability and reduce overfitting to visual artifacts, the authors introduce an order-based augmentation strategy, where the sequence of the three camera views is randomized during training, forcing the model to rely on consistent semantic features rather than view-specific pixel configurations. Post-processing involves a mode filter to smooth predictions over time, accounting for the natural duration of driver activities.
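The late-fusion forward pass, the view-order augmentation, and the mode filter described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the hidden size (`HIDDEN_DIM`), the single linear layer per view, the random weights standing in for trained ones, the random vectors standing in for CLIP embeddings, and the filter window are all hypothetical choices made for readability.

```python
import math
import random
from collections import Counter

EMBED_DIM = 768    # CLIP embedding size stated in the summary
NUM_VIEWS = 3      # dashboard, rear-view, side-view
NUM_CLASSES = 16   # activity classes in the dataset
HIDDEN_DIM = 8     # assumption: tiny hidden size, for illustration only


def make_linear(n_in, n_out, rng):
    """Random weights and zero bias; a stand-in for trained parameters."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(layer, x):
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    peak = max(x)
    exps = [math.exp(v - peak) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def shuffle_views(view_embeddings, rng):
    """Order-based augmentation: return the views in a random order."""
    shuffled = list(view_embeddings)
    rng.shuffle(shuffled)
    return shuffled


def forward(view_embeddings, encoders, head):
    """Late fusion: encode each view separately, concatenate, classify."""
    fused = []
    for encoder, embedding in zip(encoders, view_embeddings):
        fused.extend(relu(linear(encoder, embedding)))
    return softmax(linear(head, fused))


def mode_filter(labels, window=3):
    """Replace each label with the most common label in a centered window."""
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        neighborhood = labels[max(0, i - half): i + half + 1]
        smoothed.append(Counter(neighborhood).most_common(1)[0][0])
    return smoothed


rng = random.Random(0)
encoders = [make_linear(EMBED_DIM, HIDDEN_DIM, rng) for _ in range(NUM_VIEWS)]
head = make_linear(NUM_VIEWS * HIDDEN_DIM, NUM_CLASSES, rng)

# Random vectors stand in for the CLIP embeddings of one frame triplet.
frame = [[rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
         for _ in range(NUM_VIEWS)]
probs = forward(shuffle_views(frame, rng), encoders, head)

# A single-frame glitch (class 3) inside a run of class 0 is smoothed away.
smoothed = mode_filter([0, 0, 3, 0, 0, 5, 5, 5, 5], window=3)
```

Because the views are shuffled before they reach the per-slot encoders, no encoder can assume it always receives the same camera, which is the intuition behind the augmentation. The mode filter then removes isolated single-frame predictions that are shorter than any plausible activity.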
The model was evaluated on the Naturalistic Driving Action Recognition Dataset from the AI City Challenge, which contains approximately 62 hours of footage from 69 participants performing 16 distinct activities, including normal driving, phone use, eating, and reaching. The evaluation utilized a 7-fold cross-validation scheme with strict subject separation between training and testing sets to assess generalizability. The SRLF-Net achieved an average accuracy of 71.64% across all classes, significantly outperforming random selection. Analysis revealed a bias toward the majority class ("Normal Forward Driving"), which constituted 59.01% of the data. When excluding the normal driving class to evaluate performance on distracting activities alone, the model achieved 70.06% accuracy, with the mode filter contributing a substantial improvement from 63.66%. Phone calls and hand-on-head gestures were the most accurately classified distraction types.
The study concludes that vision-language representations offer a promising avenue for driver monitoring by providing a balance of accuracy and interpretability through natural language descriptors. The use of foundation model encodings allows the system to generalize better to unseen drivers and activities compared to traditional pixel-based methods. The authors suggest that this approach facilitates the development of more robust, open-set capable monitoring systems that can be directly explained via language. Future work includes integrating temporal models like LSTMs, exploring open-set novelty detection, and evaluating the method on datasets with inconsistent camera views to further enhance real-world applicability.
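The evaluation protocol described above (subject-separated 7-fold cross-validation, plus accuracy with the majority class excluded) can be illustrated with a short sketch. The round-robin fold assignment, the helper names, and the toy labels are assumptions made for illustration; the summary does not say how the authors assigned subjects to folds or exactly how the excluded-class accuracy was computed.

```python
def subject_folds(subject_ids, k=7):
    """Assign every distinct subject to one of k folds (round-robin)."""
    subjects = sorted(set(subject_ids))
    return {subject: index % k for index, subject in enumerate(subjects)}


def split_by_fold(subject_ids, fold_of, test_fold):
    """Return (train, test) sample indices; a subject never spans both."""
    train, test = [], []
    for i, subject in enumerate(subject_ids):
        (test if fold_of[subject] == test_fold else train).append(i)
    return train, test


def accuracy(y_true, y_pred, exclude_class=None):
    """Fraction correct, optionally dropping samples of one true class."""
    kept = [(t, p) for t, p in zip(y_true, y_pred) if t != exclude_class]
    if not kept:
        raise ValueError("no samples left after excluding the class")
    return sum(t == p for t, p in kept) / len(kept)


# Hypothetical data: 69 subjects with two samples each.
subject_ids = [f"subject_{n:02d}" for n in range(69) for _ in range(2)]
fold_of = subject_folds(subject_ids, k=7)
train_idx, test_idx = split_by_fold(subject_ids, fold_of, test_fold=0)
train_subjects = {subject_ids[i] for i in train_idx}
test_subjects = {subject_ids[i] for i in test_idx}

# Toy labels: class 0 plays the role of the majority "normal driving" class.
NORMAL = 0
y_true = [0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 1, 0]
overall = accuracy(y_true, y_pred)                  # 4 of 5 correct
distraction_only = accuracy(y_true, y_pred, NORMAL)  # 1 of 2 correct
```

Splitting by subject rather than by frame is what makes the reported accuracy a measure of generalization to unseen drivers. The toy labels also show why the excluded-class figure matters: a model that leans toward the majority class can score well overall while doing noticeably worse on the distraction classes.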
Key finding
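The proposed SRLF-Net, built on vision-language representations, achieves an average accuracy of 71.64% across 16 driver activity classes under subject-separated 7-fold cross-validation, well above random selection. With the majority "Normal Forward Driving" class excluded, accuracy on distracting activities is 70.06% after mode filtering, up from 63.66% without it.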
Methodology
Dataset: Naturalistic Driving Action Recognition Dataset (AI City Challenge)
Sample size: 69 participants
Provenance
The full processing record for this entry. Every stage of this paper's journey through the pipeline is logged: what ran, with which tool and model, how many attempts it took, and when it last completed. Discovered via author_sweep_intake on 2026-05-28.
| Stage | Outcome | Tool | Model | Prompt | Attempts | Completed |
|---|---|---|---|---|---|---|
| discover | success | author_sweep | — | — | 2 | 2026-05-28 |
| archive | success | canonical_url | — | — | 1 | 2026-06-04 |
| extract | success | cached | — | — | 3 | 2026-06-10 |
| clean | success | clean | — | — | 1 | 2026-06-04 |
| chunk | success | chunk | — | — | 1 | 2026-06-04 |
| embed | success | embed | Qwen/Qwen3-Embedding-8B | — | 1 | 2026-06-04 |
| enrich | success | — | — | — | 1 | 2026-05-28 |
| promote | success | — | — | — | 1 | 2026-06-04 |
| summarize | success | llm | qwen3.6-27b-prismaquant | summ-v5 | 2 | 2026-06-10 |
| tag | success | vector_similarity | — | — | 15 | 2026-06-11 |
| verify | success | — | — | — | 2 | 2026-06-10 |
Summary generated by qwen3.6-27b-prismaquant on 2026-06-10; verification: verified.
Information type
What kind of knowledge this paper contributes, grouped by family — independent of topic (what it is about) and method (how it was studied).
- Methodological Resource: tool software
- Theoretical Contribution: computational model, conceptual framework