Evaluating Multimodal Vision-Language Model Prompting Strategies for Visual Question Answering in Road Scene Understanding
DOI: 10.1109/wacvw65960.2025.00115
archive: archived; pipeline: cataloged, verified
Summary
This paper evaluates NVIDIA’s Vision-Language Model (ViLA) for Visual Question Answering (VQA) in autonomous driving, using the MAPLM-QA dataset. The research addresses the challenge of understanding complex traffic scenes from multimodal data: high-resolution panoramic images and rendered Bird’s-Eye View (BEV) depictions of LiDAR 3D point clouds. The authors aim to determine how effectively VLMs can extract actionable insights from these inputs to support real-time decision-making in self-driving vehicles.
The study uses a subset of 100 frames from the MAPLM-QA dataset, which provides question-answer pairs for four multiple-choice categories: lane counting (LAN), intersection recognition (INT), quality assessment of point cloud data (QLT), and scene recognition (SCN). The experimental design focuses on prompt engineering strategies to enhance ViLA’s performance. The authors tested four prompting approaches: a baseline with direct questions only, initial concise prompts with basic definitions, detailed and lengthy prompts providing extensive visual cues, and a final hybrid approach combining concise definitions with detailed visual instructions. Performance was measured using frame-level accuracy (FRM) and question-level accuracy (QNS), alongside per-category accuracy.
The results show that prompt engineering strongly affects model performance. The baseline approach yielded poor results, with a frame accuracy of 0% and low scores in lane counting (29%) and intersection recognition (3%). The introduction of detailed prompts raised overall question accuracy to 64.25% and frame accuracy to 13%. The model performed best in quality assessment (83%) and scene recognition (82%), indicating strong capabilities in identifying general road conditions and data clarity. Conversely, it struggled with spatial reasoning tasks, achieving only 36% accuracy in lane counting and 56% in intersection recognition. Analysis revealed that the model often defaulted to dominant answer choices, such as predicting "Very Clear" for quality or "Crossroad" for intersections, rather than performing nuanced visual analysis.
The findings highlight both the potential and the limitations of current Vision-Language Models in traffic scene understanding. While ViLA shows promise in assessing data quality and general scene types, it lacks the fine-grained spatial reasoning required for precise lane counting and complex intersection identification. The study concludes that VLMs can serve as a foundation for scalable traffic analysis, but that further research is needed to address biases in answer selection and to improve the model's ability to interpret intricate spatial relationships within multimodal inputs. This work establishes a benchmark for future efforts to integrate VLMs into robust autonomous driving systems.
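The two headline metrics can be illustrated with a minimal sketch. It assumes that QNS is the fraction of all individual answers that are correct and that FRM counts a frame as correct only when every one of its questions is answered correctly; the summary does not spell out either definition, so both are assumptions. The function names and the toy data are hypothetical. The last lines check that the reported per-category accuracies (36%, 56%, 83%, 82%) average to the reported 64.25%, which holds if each frame contributes one question per category.

```python
from typing import Dict, List

# Category codes as named in the summary.
CATEGORIES = ("LAN", "INT", "QLT", "SCN")

# One dict per frame: category code -> whether the model's answer was correct.
FrameResult = Dict[str, bool]


def question_accuracy(results: List[FrameResult]) -> float:
    """QNS (assumed definition): share of all individual answers that are correct."""
    total = sum(len(frame) for frame in results)
    correct = sum(sum(frame.values()) for frame in results)
    return correct / total if total else 0.0


def frame_accuracy(results: List[FrameResult]) -> float:
    """FRM (assumed definition): share of frames with every question correct."""
    if not results:
        return 0.0
    return sum(all(frame.values()) for frame in results) / len(results)


def category_accuracy(results: List[FrameResult], category: str) -> float:
    """Accuracy for a single category such as "LAN"."""
    answered = [frame[category] for frame in results if category in frame]
    return sum(answered) / len(answered) if answered else 0.0


# Hypothetical toy data (four frames), not taken from the paper.
toy_frames: List[FrameResult] = [
    {"LAN": True, "INT": True, "QLT": True, "SCN": True},
    {"LAN": False, "INT": True, "QLT": True, "SCN": True},
    {"LAN": False, "INT": False, "QLT": True, "SCN": True},
    {"LAN": True, "INT": False, "QLT": False, "SCN": True},
]

print("QNS:", question_accuracy(toy_frames))
print("FRM:", frame_accuracy(toy_frames))
for code in CATEGORIES:
    print(code, category_accuracy(toy_frames, code))

# Consistency check on the figures reported in the summary (percent).
reported = {"LAN": 36, "INT": 56, "QLT": 83, "SCN": 82}
reported_mean = sum(reported.values()) / len(reported)
print("Mean of reported category accuracies:", reported_mean)
```

Under these definitions FRM is a much stricter metric than QNS, since one wrong answer in any category fails the whole frame. That is consistent with the gap between the reported 13% frame accuracy and 64.25% question accuracy.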
Key finding
Detailed prompt engineering significantly improved ViLA's accuracy in scene recognition and quality assessment tasks but failed to resolve its difficulties with lane counting and intersection recognition.
Methodology
simulation_modeling
Sample size: 100 frames
Provenance
The full processing record for this entry. Every stage of this paper's journey through the pipeline is logged — what ran, with which tool and model, how many attempts it took, and when it last completed. Discovered via author_sweep_intake on 2026-05-28.
| Stage | Outcome | Tool | Model | Prompt | Attempts | Completed |
|---|---|---|---|---|---|---|
| discover | success | author_sweep | — | — | 2 | 2026-05-28 |
| archive | success | canonical_url | — | — | 1 | 2026-06-06 |
| extract | success | cached | — | — | 3 | 2026-06-10 |
| clean | success | clean | — | — | 1 | 2026-06-04 |
| chunk | success | chunk | — | — | 1 | 2026-06-04 |
| embed | success | embed | Qwen/Qwen3-Embedding-8B | — | 1 | 2026-06-04 |
| enrich | success | — | — | — | 1 | 2026-05-28 |
| promote | success | — | — | — | 1 | 2026-06-04 |
| summarize | success | llm | qwen3.6-27b-prismaquant | summ-v5 | 2 | 2026-06-10 |
| tag | success | vector_similarity | — | — | 15 | 2026-06-11 |
| verify | success | — | — | — | 2 | 2026-06-10 |
Summary generated by qwen3.6-27b-prismaquant on 2026-06-10; verification: verified.
Topics
Ranked by relevance to this paper.