Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety

Shriram, S.; Perisetla, Srinivasa; Keskar, Aryan; Krishnaswamy, Harsha; Bossen, Tonko Emil Westerhof; Møgelmose, Andreas; Greer, Ross · 2025 · Unknown

DOI: 10.1109/case58245.2025.11163861

archive: archived · pipeline: cataloged · verified

Summary

This paper addresses the challenge of detecting novel, out-of-label hazardous objects in autonomous driving, a critical safety issue where traditional models fail due to reliance on predefined object categories. The authors propose a multi-agent vision-language system that integrates Vision-Language Models (VLMs) and Large Language Models (LLMs) to identify and describe anomalies (such as debris, animals, or erratic pedestrians) without prior training on these specific hazards. To support this research, the authors introduce COOOLER, an extended benchmark dataset derived from the COOOL challenge, featuring 200 video clips with denoised footage, watermark removal, and human-annotated hazard descriptions.
The methodology employs a dual-track pipeline. Track 1 uses a VLM (OmniVLM) to generate detailed scene descriptions, which are then processed by an LLM (GPT-4o-mini) to extract and rank potential hazards, creating a Ranked Hazards Set (RHS). Track 2 uses a different VLM (ViLA) to comprehensively extract all objects in the scene, which are filtered by an LLM to produce an All Elements Set (AES). These two sets are cross-referenced by the LLM to identify common hazardous objects, resulting in a Critical Object Set (COS). Finally, the system verifies these hazards using OpenAI’s CLIP model to match visual snippets from bounding boxes with textual descriptions, ensuring precise localization.
The study evaluates the system using the COOOLER dataset, employing cosine similarity to measure semantic alignment between predicted hazard descriptions and human-annotated ground truth. The authors define a successful detection as a cosine similarity score above 0.80. The reported results indicate a Best Exact Semantic Match (BESM) of 0.3922 and a Similarity Average Metric (SAM) of 0.3922. These metrics reflect the system's ability to capture the essence of hazardous events, though the scores highlight the difficulty of achieving exact semantic alignment in open-set scenarios.
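The success criterion described above (cosine similarity above 0.80 between a predicted and a ground-truth hazard description) can be illustrated with a minimal sketch. It assumes the two descriptions have already been converted to embedding vectors. The summary does not say which embedding model the authors use, nor how BESM and SAM are aggregated, so neither is reproduced here. The function names and the toy 3-dimensional vectors are hypothetical.

```python
import math

# Threshold taken from the summary: a detection counts as successful
# when the cosine similarity exceeds 0.80.
SUCCESS_THRESHOLD = 0.80


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def is_successful_detection(pred_vec, truth_vec, threshold=SUCCESS_THRESHOLD):
    """Hypothetical helper: apply the 0.80 threshold to one prediction."""
    return cosine_similarity(pred_vec, truth_vec) > threshold


# Toy vectors standing in for real sentence embeddings (illustrative only).
predicted = [1.0, 0.0, 0.0]
ground_truth = [1.0, 1.0, 0.0]

score = cosine_similarity(predicted, ground_truth)
print(round(score, 4), is_successful_detection(predicted, ground_truth))
```

With these toy vectors the score is 1/√2 ≈ 0.7071, which falls below the threshold, so the prediction would not count as a successful detection.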
The significance of this work lies in its demonstration of zero-shot hazard detection capabilities, offering a pathway to improve autonomous vehicle safety in unpredictable environments. By bridging vision-language reasoning with improved localization, the system enhances the ability of autonomous agents to recognize and explain real-world anomalies. The release of the COOOLER dataset and associated tools provides the field with a standardized benchmark for evaluating open-set hazard detection, addressing limitations in existing datasets that lack temporal reasoning and diverse anomaly annotations.
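The cross-referencing step in the dual-track pipeline (combining the Ranked Hazards Set and the All Elements Set into a Critical Object Set) can be sketched as follows. In the paper this step is performed by an LLM. The sketch below swaps in a simple normalized string match purely for illustration, so it is not the authors' method. The function names and example labels are hypothetical.

```python
def normalize(label):
    """Lowercase a label and collapse repeated whitespace."""
    return " ".join(label.lower().split())


def critical_object_set(ranked_hazards, all_elements):
    """Keep ranked hazards that also appear among all scene elements.

    Stand-in for the LLM cross-referencing step: matching here is exact
    after normalization, and the RHS rank order is preserved.
    """
    element_keys = {normalize(e) for e in all_elements}
    result = []
    for hazard in ranked_hazards:
        key = normalize(hazard)
        if key in element_keys and key not in result:
            result.append(key)
    return result


# Hypothetical outputs of Track 1 (RHS) and Track 2 (AES).
rhs = ["Deer on road", "fallen tree branch", "oncoming truck"]
aes = ["road", "deer on  road", "guardrail", "Fallen tree branch"]

cos = critical_object_set(rhs, aes)
print(cos)
```

An exact match is far stricter than an LLM's judgment: it would miss paraphrases such as "deer" versus "animal crossing", which is presumably one reason a language model is used for this step.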

Key finding

The proposed multi-agent vision-language system demonstrates the potential for zero-shot hazardous object detection but reveals current limitations in semantic alignment and contextual reasoning when evaluated against human-annotated ground truth.

Methodology

simulation_modeling

Sample size: 200

Provenance

The full processing record for this entry. Every stage of this paper's journey through the pipeline is logged — what ran, with which tool and model, how many attempts it took, and when it last completed. Discovered via author_sweep_intake on 2026-05-28.

Stage | Outcome | Tool | Model | Prompt | Attempts | Completed
discover | success | author_sweep | | | 2 | 2026-05-28
archive | success | unpaywall | | | 2 | 2026-06-04
extract | success | cached | | | 3 | 2026-06-10
clean | success | clean | | | 1 | 2026-06-04
chunk | success | chunk | | | 1 | 2026-06-04
embed | success | embed | Qwen/Qwen3-Embedding-8B | | 1 | 2026-06-04
enrich | success | | | | 1 | 2026-05-28
promote | success | | | | 1 | 2026-06-04
summarize | success | llm | qwen3.6-27b-prismaquant | summ-v5 | 2 | 2026-06-10
tag | success | vector_similarity | | | 15 | 2026-06-11
verify | success | | | | 2 | 2026-06-10

Summary generated by qwen3.6-27b-prismaquant on 2026-06-10; verification: verified.

Topics

Ranked by relevance to this paper.