Robust Scenario Mining Assisted by Multimodal Semantics

Chen, Yifei; Greer, Ross · 2025 · Unknown

DOI: 10.1109/iccvw69036.2025.00191

archive: archived · pipeline: cataloged · verified

Summary

This paper addresses the challenge of robust scenario mining in autonomous driving datasets, specifically targeting limitations in the RefAV framework, which uses Large Language Models (LLMs) to translate natural-language queries into executable code for retrieving driving scenarios. The authors identify three primary failures in existing LLM-based approaches: reliance on the quality of upstream 3D multi-object tracking data, the lack of direct linkage between natural-language descriptions and raw RGB images, and runtime errors or semantic inaccuracies in LLM-generated code. To mitigate these issues, the study proposes a multimodal semantics-assisted method that enhances retrieval precision and robustness.

The proposed methodology introduces a dual-branch architecture comprising an image-semantic branch and a text-semantic branch. First, a coarse-grained filtering stage utilizes a CLIP encoder to compute cosine similarity between keyword embeddings extracted from natural-language queries and feature embeddings of objects detected in raw RGB frames via YOLOv8. This step isolates a candidate subset of 3D tracklets, effectively pruning the search space before detailed analysis. Second, the system employs a Fault-Tolerant Iterative Code Generation (FT-ICG) mechanism, where the LLM generates executable scripts composed of atomic functions. If the code fails during execution, the error message is fed back to the LLM to refine the script iteratively, up to five times. Additionally, specialized prompt engineering is applied to clarify parameter semantics for functions describing complex spatial relationships, reducing misinterpretations such as swapping subject and reference objects.

Experiments were conducted on the Argoverse 2 dataset using various LLMs, including Qwen2.5-VL-7B and Gemini 2.5 Flash, and evaluated across two distinct 3D tracking pipelines (Le3DE2E and TransFusion). The results demonstrate consistent improvements over the baseline RefAV method. For instance, using Qwen2.5-VL-7B with the Le3DE2E tracker, the proposed method achieved a HOTA-Temporal score of 44.54 compared to 33.27 for the baseline, alongside significant gains in TS-F1 and Log-F1 metrics. Similar performance enhancements were observed with Gemini 2.5 Flash and the TransFusion tracker, confirming that the multimodal filtering and fault-tolerant mechanisms effectively reduce false positives and execution failures.

The significance of this work lies in its ability to bridge the gap between abstract natural-language queries and concrete sensor data, offering a more reliable pipeline for autonomous vehicle validation. By integrating visual semantics directly into the retrieval process and enhancing the robustness of LLM code generation, the method reduces computational overhead and improves the fidelity of scenario discovery. This approach addresses critical hurdles in automated testing, such as "factual hallucination" in spatial reasoning and dependency on imperfect upstream tracking, thereby advancing the state of the art in data-driven scenario mining for safety-critical systems.
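
The two mechanisms described in the summary, the cosine-similarity coarse filter and the error-feedback retry loop, can be illustrated with a minimal Python sketch. This is not the authors' implementation. Plain lists of floats stand in for CLIP embeddings, and the names `coarse_filter`, `run_with_feedback`, the `threshold` value, the `generate` callable, and the convention that a generated script stores its answer in a variable called `result` are all assumptions made for illustration. The limit of five is read here as five total attempts, which the summary does not state precisely.

```python
import math
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def coarse_filter(
    keyword_embs: Sequence[Sequence[float]],
    tracklet_embs: Dict[str, Sequence[float]],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[str]:
    """Keep tracklet ids whose embedding matches at least one keyword embedding."""
    return [
        tid
        for tid, emb in tracklet_embs.items()
        if any(cosine_similarity(kw, emb) >= threshold for kw in keyword_embs)
    ]


def run_with_feedback(
    generate: Callable[[str, Optional[str]], str],  # hypothetical LLM wrapper
    query: str,
    max_attempts: int = 5,
) -> Tuple[Optional[Any], int]:
    """Generate a script, execute it, and feed any error back to the generator.

    Returns (value of the script's `result` variable, attempts used),
    or (None, max_attempts) if every attempt raised an exception.
    """
    error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        script = generate(query, error)
        namespace: Dict[str, Any] = {}
        try:
            # Assumption: running in-process is acceptable for this sketch.
            # Untrusted generated code should be sandboxed in practice.
            exec(script, namespace)
            return namespace.get("result"), attempt
        except Exception as exc:
            # The error text becomes part of the next generation request.
            error = f"{type(exc).__name__}: {exc}"
    return None, max_attempts
```

In this sketch the filter runs first, and only the surviving tracklet ids would be passed to the generated script. A real pipeline would obtain the embeddings from a CLIP encoder applied to query keywords and to image crops of detected objects, and `generate` would call the chosen LLM with the query plus the previous error message.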

Key finding

The proposed multimodal semantics-assisted scenario mining method achieves consistent improvements in retrieval metrics, including HOTA-T and Log-F1, across multiple LLMs and tracking pipelines compared to the baseline RefAV framework.

Methodology

dataset

Provenance

The full processing record for this entry. Every stage of this paper's journey through the pipeline is logged — what ran, with which tool and model, how many attempts it took, and when it last completed. Discovered via author_sweep_intake on 2026-05-28.

Stage	Outcome	Tool	Model	Prompt	Attempts	Completed
discover	success	author_sweep	—	—	2	2026-05-28
archive	success	canonical_url	—	—	12	2026-06-06
extract	success	cached	—	—	3	2026-06-10
clean	success	clean	—	—	1	2026-06-04
chunk	success	chunk	—	—	1	2026-06-04
embed	success	embed	Qwen/Qwen3-Embedding-8B	—	1	2026-06-04
enrich	success	—	—	—	1	2026-05-28
promote	success	—	—	—	1	2026-06-04
summarize	success	llm	qwen3.6-27b-prismaquant	summ-v5	2	2026-06-10
tag	success	vector_similarity	—	—	15	2026-06-11
verify	success	—	—	—	2	2026-06-10

Summary generated by qwen3.6-27b-prismaquant on 2026-06-10; verification: verified.

Topics

Ranked by relevance to this paper.

generative ai voice assistants