Robust Scenario Mining Assisted by Multimodal Semantics

Chen, Yifei; Greer, Ross · 2025 · Unknown

DOI: 10.1109/iccvw69036.2025.00191

archive: archived · pipeline: cataloged · verified

Summary

This paper addresses the challenge of robust scenario mining in autonomous driving datasets, specifically targeting limitations in the RefAV framework, which uses Large Language Models (LLMs) to translate natural-language queries into executable code for retrieving driving scenarios. The authors identify three primary failures in existing LLM-based approaches: reliance on the quality of upstream 3D multi-object tracking data, the lack of direct linkage between natural-language descriptions and raw RGB images, and runtime errors or semantic inaccuracies in LLM-generated code. To mitigate these issues, the study proposes a multimodal semantics-assisted method that enhances retrieval precision and robustness.

The proposed methodology introduces a dual-branch architecture comprising an image-semantic branch and a text-semantic branch. First, a coarse-grained filtering stage uses a CLIP encoder to compute cosine similarity between keyword embeddings extracted from natural-language queries and feature embeddings of objects detected in raw RGB frames via YOLOv8. This step isolates a candidate subset of 3D tracklets, pruning the search space before detailed analysis. Second, the system employs a Fault-Tolerant Iterative Code Generation (FT-ICG) mechanism, where the LLM generates executable scripts composed of atomic functions. If the code fails during execution, the error message is fed back to the LLM to refine the script iteratively, up to five times. Additionally, specialized prompt engineering is applied to clarify parameter semantics for functions describing complex spatial relationships, reducing misinterpretations such as swapping subject and reference objects.

Experiments were conducted on the Argoverse 2 dataset using various LLMs, including Qwen2.5-VL-7B and Gemini 2.5 Flash, and evaluated across two distinct 3D tracking pipelines (Le3DE2E and TransFusion). The results demonstrate consistent improvements over the baseline RefAV method. For instance, using Qwen2.5-VL-7B with the Le3DE2E tracker, the proposed method achieved a HOTA-Temporal score of 44.54 compared to 33.27 for the baseline, alongside significant gains in TS-F1 and Log-F1 metrics. Similar performance enhancements were observed with Gemini 2.5 Flash and the TransFusion tracker, confirming that the multimodal filtering and fault-tolerant mechanisms effectively reduce false positives and execution failures.

The significance of this work lies in its ability to bridge the gap between abstract natural-language queries and concrete sensor data, offering a more reliable pipeline for autonomous vehicle validation. By integrating visual semantics directly into the retrieval process and enhancing the robustness of LLM code generation, the method reduces computational overhead and improves the fidelity of scenario discovery. This approach addresses critical hurdles in automated testing, such as "factual hallucination" in spatial reasoning and dependency on imperfect upstream tracking, thereby advancing the state of the art in data-driven scenario mining for safety-critical systems.
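
The two mechanisms the summary describes, coarse filtering by cosine similarity and the error-feedback retry loop, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the embeddings are toy 2-D vectors standing in for CLIP outputs, the 0.5 threshold is an assumed value the summary does not state, "up to five times" is read here as five attempts in total, and `coarse_filter`, `generate_with_retries`, `fake_llm`, and `run_script` are hypothetical names.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def coarse_filter(
    keyword_embs: Sequence[Sequence[float]],
    object_embs: Dict[str, Sequence[float]],
    threshold: float = 0.5,  # assumed value, not taken from the paper
) -> List[str]:
    """Keep ids of objects whose embedding is close to any keyword embedding."""
    return [
        obj_id
        for obj_id, emb in object_embs.items()
        if any(cosine(emb, kw) >= threshold for kw in keyword_embs)
    ]


def generate_with_retries(
    generate: Callable[[str, Optional[str]], str],
    run: Callable[[str], object],
    query: str,
    max_attempts: int = 5,  # assumption: five attempts in total
) -> Tuple[object, int]:
    """Ask for a script, run it, and feed any error message back on failure."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        script = generate(query, feedback)
        try:
            return run(script), attempt
        except Exception as exc:  # any runtime failure becomes feedback
            feedback = f"{type(exc).__name__}: {exc}"
    raise RuntimeError(
        f"no runnable script after {max_attempts} attempts; last error: {feedback}"
    )


def fake_llm(query: str, feedback: Optional[str]) -> str:
    """Toy stand-in for an LLM: broken script first, fixed one after feedback."""
    return "1/0" if feedback is None else "40 + 2"


def run_script(script: str) -> object:
    """Toy stand-in for the executor: evaluates an arithmetic expression."""
    return eval(script, {"__builtins__": {}})


if __name__ == "__main__":
    candidates = coarse_filter(
        [[1.0, 0.0]],
        {"track_1": [0.9, 0.1], "track_2": [0.0, 1.0]},
    )
    print(candidates)
    print(generate_with_retries(fake_llm, run_script, "toy query"))
```

In this toy run only `track_1` survives the filter, and the retry loop succeeds on its second attempt once the first error message is passed back. A real pipeline would replace the toy vectors with encoder outputs and the toy executor with a sandboxed script runner.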

Key finding

The proposed multimodal semantics-assisted scenario mining method achieves consistent improvements in retrieval metrics, including HOTA-T and Log-F1, across multiple LLMs and tracking pipelines compared to the baseline RefAV framework.

Methodology

dataset

Provenance

The full processing record for this entry. Every stage of this paper's journey through the pipeline is logged: what ran, with which tool and model, how many attempts it took, and when it last completed. Discovered via author_sweep_intake on 2026-05-28.

Stage | Outcome | Tool | Model | Prompt | Attempts | Completed
discover | success | author_sweep | - | - | 2 | 2026-05-28
archive | success | canonical_url | - | - | 12 | 2026-06-06
extract | success | cached | - | - | 3 | 2026-06-10
clean | success | clean | - | - | 1 | 2026-06-04
chunk | success | chunk | - | - | 1 | 2026-06-04
embed | success | embed | Qwen/Qwen3-Embedding-8B | - | 1 | 2026-06-04
enrich | success | - | - | - | 1 | 2026-05-28
promote | success | - | - | - | 1 | 2026-06-04
summarize | success | llm | qwen3.6-27b-prismaquant | summ-v5 | 2 | 2026-06-10
tag | success | vector_similarity | - | - | 15 | 2026-06-11
verify | success | - | - | - | 2 | 2026-06-10

Summary generated by qwen3.6-27b-prismaquant on 2026-06-10; verification: verified.

Topics

Ranked by relevance to this paper.