Evaluating Vision-Language Models for Zero-Shot Detection, Classification, and Association of Motorcycles, Passengers, and Helmets

Choi, Lucas; Greer, Ross · 2024 · Unknown

DOI: 10.1109/vtc2024-fall63153.2024.10757944

archive: archived · pipeline: cataloged · verified

Summary

This study evaluates the efficacy of the OWLv2 vision-language foundation model for zero-shot detection and classification of motorcycles, passengers, and helmet-wearing statuses. Motivated by the high mortality rates associated with motorcycle accidents, particularly in regions like India where helmet compliance is inconsistent, the research aims to develop automated systems for traffic safety enforcement and infrastructure-to-vehicle communication. The authors address the challenge of incomplete and biased training datasets, which often lack instances of specific classes such as child passengers or second passengers wearing helmets, by leveraging zero-shot learning capabilities that do not require task-specific training data.

The methodology employs a cascaded detection algorithm using the CVPR AI City Challenge dataset, consisting of 100 videos captured by infrastructure-mounted cameras. The process begins with OWLv2 detecting motorcycles, followed by the detection of human occupants within expanded bounding boxes around each motorcycle. Helmet status is determined by detecting helmets within cropped person instances. Due to OWLv2’s inability to accurately distinguish seating positions (e.g., driver vs. passenger), the authors integrated a supervised learning component using a modified AlexNet for seat classification. This hybrid approach combines zero-shot detection for motorcycles, persons, and helmets with supervised classification for seating positions. To address severe class imbalance in the dataset, the AlexNet was trained with adjusted class weights.

Results indicate that OWLv2 achieved an average precision of 0.4122 for motorcycle detection and 0.3561 for person detection. For helmet classification, the model achieved an average precision of 0.5324, outperforming a naive classifier that always predicts helmet usage. The AlexNet seat classifier achieved 95.17% accuracy on the validation set, though it struggled with the underrepresented child passenger class. The study highlights significant challenges in real-world conditions, including low-resolution data, poor visibility due to night lighting or fog, and overlapping bounding boxes that confuse the model. Precision-recall curves demonstrate a trade-off between precision and recall, necessitating careful threshold tuning.

The findings suggest that while zero-shot learning offers a promising avenue for handling unseen classes and reducing reliance on extensive annotated datasets, current foundation models face limitations in complex, noisy environments. The research underscores the potential of integrating vision-language models with supervised methods for robust traffic safety applications. Future work should focus on enhancing model robustness through pre-training on diverse datasets, improving preprocessing techniques, and refining localization capabilities to better handle occlusions and low-visibility scenarios. This approach could significantly advance automated traffic enforcement and vehicle safety systems.
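The cascade described in the summary (motorcycles first, then occupants inside an expanded motorcycle box, then helmets inside each person crop) can be sketched roughly as follows. This is an illustrative outline, not the authors' code: the `Detector` callable is a hypothetical stand-in for an OWLv2 wrapper, and the 20% expansion margin and the "any helmet box means helmeted" rule are assumptions, since the summary does not state the actual values or decision rule.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels

# Hypothetical detector interface (an assumption, not a real library API):
# takes a text query and a search region, returns boxes in full-image coordinates.
# A real system would wrap a zero-shot detector such as OWLv2 behind this callable.
Detector = Callable[[str, Box], List[Box]]


def expand_box(box: Box, margin: float, width: float, height: float) -> Box:
    """Grow a box by `margin` (fraction of its size) on each side, clamped to the image."""
    x0, y0, x1, y1 = box
    dx, dy = (x1 - x0) * margin, (y1 - y0) * margin
    return (max(0.0, x0 - dx), max(0.0, y0 - dy),
            min(width, x1 + dx), min(height, y1 + dy))


def cascade(detect: Detector, width: float, height: float,
            margin: float = 0.2) -> List[Dict]:
    """Motorcycles -> persons inside the expanded box -> helmet inside each person crop."""
    results = []
    full_image: Box = (0.0, 0.0, width, height)
    for moto in detect("motorcycle", full_image):
        # Riders usually extend above the motorcycle box, hence the expansion.
        region = expand_box(moto, margin, width, height)
        riders = []
        for person in detect("person", region):
            # Assumed rule: a rider counts as helmeted if any helmet box is found.
            has_helmet = len(detect("helmet", person)) > 0
            riders.append({"box": person, "helmet": has_helmet})
        results.append({"motorcycle": moto, "riders": riders})
    return results
```

Associating each person with the motorcycle whose expanded box produced the detection is what ties riders to vehicles in this sketch; seat position (driver vs. passenger) would still need the separate supervised classifier the summary mentions.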

Key finding

The OWLv2 vision-language model achieved an average precision of 0.5324 for zero-shot helmet detection, demonstrating potential for traffic safety applications despite limitations caused by data noise and class imbalance.
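For readers unfamiliar with the metric, average precision summarizes the precision-recall trade-off across all confidence thresholds. The sketch below computes it as the step-wise area under the precision-recall curve. It is a minimal illustration under stated assumptions: the entry does not say which AP protocol the paper used (matching rule, interpolation), and the function names and toy scores are hypothetical.

```python
from typing import List, Tuple


def precision_recall_points(scored: List[Tuple[float, bool]],
                            num_positives: int) -> List[Tuple[float, float]]:
    """Return (recall, precision) after each prediction, in descending score order.

    `scored` holds (confidence, is_correct) pairs; `num_positives` is the number
    of ground-truth positives, including any the model never predicted.
    """
    if num_positives <= 0:
        raise ValueError("num_positives must be positive")
    points = []
    true_pos = 0
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    for rank, (_, is_correct) in enumerate(ranked, start=1):
        true_pos += int(is_correct)
        points.append((true_pos / num_positives, true_pos / rank))
    return points


def average_precision(scored: List[Tuple[float, bool]], num_positives: int) -> float:
    """Step-wise area under the precision-recall curve (no interpolation)."""
    ap, prev_recall = 0.0, 0.0
    for recall, precision in precision_recall_points(scored, num_positives):
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Toy data (made up for illustration): two true helmets, four scored predictions.
toy = [(0.9, True), (0.8, False), (0.7, True), (0.6, False)]
toy_ap = average_precision(toy, num_positives=2)
```

Lowering the confidence threshold walks further down the ranked list, which can only raise recall but typically lowers precision; this is the trade-off that makes threshold tuning necessary.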

Methodology

dataset

Sample size: 100 videos

Provenance

The full processing record for this entry. Every stage of this paper's journey through the pipeline is logged: what ran, with which tool and model, how many attempts it took, and when it last completed. Discovered via author_sweep_intake on 2026-05-28.

| Stage | Outcome | Tool | Model | Prompt | Attempts | Completed |
| discover | success | author_sweep | | | 2 | 2026-05-28 |
| archive | success | canonical_url | | | 1 | 2026-06-06 |
| extract | success | cached | | | 3 | 2026-06-10 |
| clean | success | clean | | | 1 | 2026-06-04 |
| chunk | success | chunk | | | 1 | 2026-06-04 |
| embed | success | embed | Qwen/Qwen3-Embedding-8B | | 1 | 2026-06-04 |
| enrich | success | | | | 1 | 2026-05-28 |
| promote | success | | | | 1 | 2026-06-04 |
| summarize | success | llm | qwen3.6-27b-prismaquant | summ-v5 | 2 | 2026-06-10 |
| tag | success | vector_similarity | | | 15 | 2026-06-11 |
| verify | success | | | | 2 | 2026-06-10 |

Summary generated by qwen3.6-27b-prismaquant on 2026-06-10; verification: verified.

Topics

Ranked by relevance to this paper.