Can Vision-Language Models Understand and Interpret Dynamic Gestures from Pedestrians? Pilot Datasets and Exploration Towards Instructive Nonverbal Commands for Cooperative Autonomous Vehicles

Bossen, Tonko Emil Westerhof; Møgelmose, Andreas; Greer, Ross · 2025 · Unknown

DOI: 10.1109/cvprw67362.2025.00464

archive: archived · pipeline: cataloged · verified

Summary

This study investigates whether state-of-the-art vision-language models (VLMs) can accurately interpret dynamic pedestrian traffic gestures in a zero-shot setting, a capability critical for safe cooperative autonomous driving. The authors address a gap in existing research: VLMs struggle to distinguish passive pedestrian movement from intentional, instructive nonverbal commands.

To evaluate this, the researchers created and publicly released two custom datasets. “Acted Traffic Gestures” (ATG) features controlled videos of a single actor performing specific commands such as “Stop,” “Reverse,” and “Hail.” “Instructive Traffic Gestures In-The-Wild” (ITGI) comprises real-world dashcam footage of police officers directing traffic. The ATG dataset includes expert-generated captions and ground-truth labels that serve as baselines for evaluation.

The experimental design evaluates three VLMs (VideoLLaMA2, VideoLLaMA3, and Qwen2) using three distinct evaluation methods: embedded caption similarity, gesture classification, and pose sequence reconstruction. The models were tested on 8-frame video samples under several prompting strategies, ranging from blank prompts to prompts that specify the autonomous vehicle context and objectives.

The primary finding is that current VLMs perform poorly at understanding traffic gestures. In the caption similarity evaluation, model scores averaged below 0.59, lower than the expert baseline of 0.70. In the classification task, F1 scores ranged from 0.14 to 0.39, compared to an expert baseline of 0.70. The models frequently hallucinated gestures or misinterpreted simple body positions as complex commands, for example identifying a walking pedestrian as waving. Pose reconstruction showed some potential but was not reliable without further refinement.

These findings demonstrate that existing foundation models are not yet robust enough for safety-critical applications involving human-vehicle interaction. The study highlights that VLMs fail to capture the semantic intent behind gestures, often confusing physical posture with communicative instruction. This underscores the need for specialized datasets and further research into intent prediction rather than mere motion recognition. By providing the ATG and ITGI datasets and open-source code, the authors establish a benchmark for future work aimed at enabling autonomous vehicles to safely interpret and respond to human nonverbal commands.
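
The embedded caption similarity evaluation mentioned in the summary can be illustrated as cosine similarity between the embedding of a model-generated caption and the embedding of an expert caption. The sketch below is a minimal illustration and not the authors' code: the function names and the toy three-dimensional vectors are hypothetical, and a real pipeline would obtain embeddings from a sentence-embedding model, which the summary does not name.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch when a vector is all zeros
    return dot / (norm_a * norm_b)

def mean_caption_similarity(pairs):
    """Average cosine similarity over (model_embedding, reference_embedding) pairs."""
    if not pairs:
        raise ValueError("need at least one caption pair")
    return sum(cosine_similarity(m, r) for m, r in pairs) / len(pairs)

# Toy vectors standing in for real sentence embeddings (hypothetical values).
model_caption = [1.0, 1.0, 0.0]
expert_caption = [1.0, 0.0, 1.0]
score = cosine_similarity(model_caption, expert_caption)  # approximately 0.5
```

Under this reading, a score near 1.0 means the model caption is semantically close to the expert caption, so the reported averages below 0.59 indicate captions that only partly match the expert descriptions.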

Key finding

Current state-of-the-art vision-language models demonstrate poor performance in zero-shot interpretation of pedestrian traffic gestures, with classification F1 scores ranging from 0.14 to 0.39 compared to an expert baseline of 0.70.
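
To make the F1 figures concrete, here is a small sketch of how an F1 score over gesture classes can be computed. It assumes macro averaging (an unweighted mean of per-class F1); the summary does not state which averaging scheme the paper uses. The function names and the four-clip example are hypothetical, and only the class names “Stop,” “Reverse,” and “Hail” come from the summary.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class, computed from true/false positives and false negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 (macro averaging is an assumption here)."""
    if not labels:
        raise ValueError("need at least one label")
    return sum(per_class_f1(y_true, y_pred, lab) for lab in labels) / len(labels)

# Hypothetical ground truth and model predictions for four clips.
labels = ["Stop", "Reverse", "Hail"]
y_true = ["Stop", "Stop", "Reverse", "Hail"]
y_pred = ["Stop", "Hail", "Reverse", "Hail"]
result = macro_f1(y_true, y_pred, labels)  # 7/9, roughly 0.78
```

In this toy example one “Stop” clip is mistaken for “Hail,” which lowers both the recall of “Stop” and the precision of “Hail.” Scores of 0.14 to 0.39 imply far more frequent confusions than this.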

Methodology

dataset

Provenance

The full processing record for this entry. Every stage of this paper's journey through the pipeline is logged — what ran, with which tool and model, how many attempts it took, and when it last completed. Discovered via author_sweep_intake on 2026-05-28.

| Stage | Outcome | Tool | Model | Prompt | Attempts | Completed |
|---|---|---|---|---|---|---|
| discover | success | author_sweep | | | 2 | 2026-05-28 |
| archive | success | unpaywall | | | 2 | 2026-06-04 |
| extract | success | cached | | | 3 | 2026-06-10 |
| clean | success | clean | | | 1 | 2026-06-04 |
| chunk | success | chunk | | | 1 | 2026-06-04 |
| embed | success | embed | Qwen/Qwen3-Embedding-8B | | 1 | 2026-06-04 |
| enrich | success | | | | 1 | 2026-05-28 |
| promote | success | | | | 1 | 2026-06-04 |
| summarize | success | llm | qwen3.6-27b-prismaquant | summ-v5 | 2 | 2026-06-10 |
| tag | success | vector_similarity | | | 15 | 2026-06-11 |
| verify | success | | | | 2 | 2026-06-10 |

Summary generated by qwen3.6-27b-prismaquant on 2026-06-10; verification: verified.

Topics

Ranked by relevance to this paper.