Robust Scenario Mining Assisted by Multimodal Semantics
DOI: 10.1109/iccvw69036.2025.00191
archive: archived; pipeline: cataloged, verified
Summary
This paper addresses the challenge of robust scenario mining in autonomous driving datasets, specifically targeting limitations in the RefAV framework, which uses Large Language Models (LLMs) to translate natural-language queries into executable code for retrieving driving scenarios. The authors identify three primary failures in existing LLM-based approaches: reliance on the quality of upstream 3D multi-object tracking data, the lack of direct linkage between natural-language descriptions and raw RGB images, and runtime errors or semantic inaccuracies in LLM-generated code. To mitigate these issues, the study proposes a multimodal semantics-assisted method that enhances retrieval precision and robustness.

The proposed methodology introduces a dual-branch architecture comprising an image-semantic branch and a text-semantic branch. First, a coarse-grained filtering stage utilizes a CLIP encoder to compute cosine similarity between keyword embeddings extracted from natural-language queries and feature embeddings of objects detected in raw RGB frames via YOLOv8. This step isolates a candidate subset of 3D tracklets, effectively pruning the search space before detailed analysis. Second, the system employs a Fault-Tolerant Iterative Code Generation (FT-ICG) mechanism, where the LLM generates executable scripts composed of atomic functions. If the code fails during execution, the error message is fed back to the LLM to refine the script iteratively, up to five times. Additionally, specialized prompt engineering is applied to clarify parameter semantics for functions describing complex spatial relationships, reducing misinterpretations such as swapping subject and reference objects.

Experiments were conducted on the Argoverse 2 dataset using various LLMs, including Qwen2.5-VL-7B and Gemini 2.5 Flash, and evaluated across two distinct 3D tracking pipelines (Le3DE2E and TransFusion). The results demonstrate consistent improvements over the baseline RefAV method. For instance, using Qwen2.5-VL-7B with the Le3DE2E tracker, the proposed method achieved a HOTA-Temporal score of 44.54 compared to 33.27 for the baseline, alongside significant gains in TS-F1 and Log-F1 metrics. Similar performance enhancements were observed with Gemini 2.5 Flash and the TransFusion tracker, confirming that the multimodal filtering and fault-tolerant mechanisms effectively reduce false positives and execution failures.

The significance of this work lies in its ability to bridge the gap between abstract natural-language queries and concrete sensor data, offering a more reliable pipeline for autonomous vehicle validation. By integrating visual semantics directly into the retrieval process and enhancing the robustness of LLM code generation, the method reduces computational overhead and improves the fidelity of scenario discovery. This approach addresses critical hurdles in automated testing, such as "factual hallucination" in spatial reasoning and dependency on imperfect upstream tracking, thereby advancing the state of the art in data-driven scenario mining for safety-critical systems.
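The two mechanisms described in the summary, coarse cosine-similarity filtering and the error-feedback repair loop, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The embeddings are plain lists standing in for CLIP features, and `coarse_filter`, `run_with_repair`, the `generate` callable, the track ids, and the 0.25 threshold are all hypothetical names and values chosen for illustration. The sketch also assumes that "up to five times" means five generation attempts in total, which the summary does not specify.

```python
import math
import traceback
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def coarse_filter(
    keyword_embs: Sequence[Sequence[float]],
    object_embs: Dict[str, Sequence[float]],
    threshold: float = 0.25,  # assumed value, not taken from the paper
) -> List[str]:
    """Keep the ids of objects whose embedding matches at least one keyword."""
    return [
        track_id
        for track_id, emb in object_embs.items()
        if max((cosine(k, emb) for k in keyword_embs), default=0.0) >= threshold
    ]


def run_with_repair(
    generate: Callable[[str, Optional[str]], str],
    query: str,
    max_attempts: int = 5,
) -> Tuple[dict, int]:
    """Generate a script, run it, and feed any error back to the generator.

    generate(query, feedback) is a hypothetical wrapper around an LLM call;
    feedback is None on the first attempt and a traceback string afterwards.
    Returns the script's namespace and the number of attempts used.
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        source = generate(query, feedback)
        namespace: dict = {}
        try:
            # Generated code is untrusted; a real system should sandbox this.
            exec(source, namespace)
            return namespace, attempt
        except Exception:
            feedback = traceback.format_exc()
    raise RuntimeError(f"script still failing after {max_attempts} attempts")
```

In use, the filter would run first to shrink the candidate set, and the repair loop would then run the generated script over the surviving candidates. Passing the full traceback back, instead of only the exception message, gives the generator the failing line as well as the error type.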
Key finding
The proposed multimodal semantics-assisted scenario mining method achieves consistent improvements in retrieval metrics, including HOTA-T and Log-F1, across multiple LLMs and tracking pipelines compared to the baseline RefAV framework.
Methodology
dataset
Provenance
The full processing record for this entry. Every stage of this paper's journey through the pipeline is logged — what ran, with which tool and model, how many attempts it took, and when it last completed. Discovered via author_sweep_intake on 2026-05-28.
| Stage | Outcome | Tool | Model | Prompt | Attempts | Completed |
|---|---|---|---|---|---|---|
| discover | success | author_sweep | — | — | 2 | 2026-05-28 |
| archive | success | canonical_url | — | — | 12 | 2026-06-06 |
| extract | success | cached | — | — | 3 | 2026-06-10 |
| clean | success | clean | — | — | 1 | 2026-06-04 |
| chunk | success | chunk | — | — | 1 | 2026-06-04 |
| embed | success | embed | Qwen/Qwen3-Embedding-8B | — | 1 | 2026-06-04 |
| enrich | success | — | — | — | 1 | 2026-05-28 |
| promote | success | — | — | — | 1 | 2026-06-04 |
| summarize | success | llm | qwen3.6-27b-prismaquant | summ-v5 | 2 | 2026-06-10 |
| tag | success | vector_similarity | — | — | 15 | 2026-06-11 |
| verify | success | — | — | — | 2 | 2026-06-10 |
Summary generated by qwen3.6-27b-prismaquant on 2026-06-10; verification: verified.
Topics
Ranked by relevance to this paper.