Explaining Autonomous Driving Actions with Visual Question Answering

Atakishiyev, Shahin; Salameh, Mohammad; Babiker, Housam; Goebel, Randy · 2023 · Crossref

DOI: 10.1109/itsc57777.2023.10421901

archive: archived · pipeline: cataloged · verified

Summary

This paper addresses the need for explainability in autonomous driving systems, motivated by safety concerns, regulatory requirements such as the EU’s GDPR, and the need for transparency in AI decision-making. While deep learning has advanced end-to-end driving capabilities, the "black box" nature of these models hinders trust and legal accountability. The authors propose a Visual Question Answering (VQA) framework that provides causal, natural-language explanations for the actions taken by self-driving vehicles, connecting computer vision and natural language processing to interpret real-time driving decisions.

The methodology is a three-step process within the CARLA simulation environment. First, a Deep Deterministic Policy Gradient (DDPG) reinforcement learning agent is trained to navigate simulated towns, and video logs of its driving behavior are collected. The state space includes vehicle velocity, lateral distance, and yaw angle, with rewards shaped to encourage lane adherence and collision avoidance. Second, the authors extract frames corresponding to five action categories: going straight, turning left, turning right, turning left at a T-junction, and turning right at a T-junction. They manually annotate these frames with question-answer pairs that justify the chosen action based on visual evidence (e.g., "Why is the car turning right?" answered with "Because the road is bending to the right"). Third, they fine-tune a pre-trained VGG-19 based VQA model on this dataset. The model takes an image frame and a text question as input and predicts the most probable causal answer from a set of candidates.

The study evaluates the framework’s ability to generalize to unseen driving scenes. The dataset comprises 250 annotated training frames and 100 test frames collected from two different simulated towns. The results indicate that the VQA mechanism can correctly identify and justify the ego vehicle’s actions in novel scenarios. For instance, the model assigns high probability scores to correct causal explanations, such as road curvature or the absence of obstacles as reasons for specific maneuvers. The empirical findings suggest that connecting vision and natural language allows the decisions of reinforcement learning agents to be rationalized in an intelligible manner.

The significance of this work lies in presenting the first empirical study on using VQA to explain autonomous driving actions. It contributes a novel dataset of image-question-answer triplets and demonstrates that VQA can serve as an effective tool for interpreting the temporal decisions of self-driving cars. By providing transparent, causal justifications for vehicle behavior, this approach supports driving safety, regulatory compliance, and user trust. The authors conclude that the framework offers a viable path toward more rigorous and interpretable autonomous driving systems, and they suggest further development of VQA models for real-world deployment.
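
The answer-selection step in the summary (choosing the most probable causal answer from a set of candidates) can be illustrated with a minimal sketch. It assumes the fine-tuned model emits one raw score per candidate answer and that the scores are normalized with a softmax; the summary does not state the exact scoring mechanism. The names `softmax`, `rank_answers`, `CANDIDATES`, and `HYPOTHETICAL_SCORES` are illustrative, the score values are invented, and only the first candidate answer is quoted from the summary.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_answers(candidates, scores):
    """Pair each candidate answer with its probability, most probable first."""
    if len(candidates) != len(scores):
        raise ValueError("need exactly one score per candidate answer")
    probs = softmax(scores)
    return sorted(zip(candidates, probs), key=lambda pair: pair[1], reverse=True)


# Question (from the summary): "Why is the car turning right?"
# Only the first answer is quoted from the summary; the rest are hypothetical.
CANDIDATES = [
    "Because the road is bending to the right",
    "Because the road is bending to the left",
    "Because the road ahead is clear",
]

# Invented scores standing in for the output of a fine-tuned VQA model.
HYPOTHETICAL_SCORES = [2.0, 0.5, 1.0]

ranked = rank_answers(CANDIDATES, HYPOTHETICAL_SCORES)
best_answer, best_prob = ranked[0]
print(f"{best_answer} (p = {best_prob:.2f})")
```

In a real pipeline the scores would come from the model's joint image and question encoding rather than a hard-coded list; the ranking logic stays the same.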

Provenance

The full processing record for this entry. Every stage of this paper's journey through the pipeline is logged: what ran, with which tool and model, how many attempts it took, and when it last completed.

Stage	Outcome	Tool	Model	Prompt	Attempts	Completed
discover	success	Crossref	—	—	1	2026-06-20
archive	success	semantic_scholar	—	—	6	2026-06-26
extract	success	cached	—	—	2	2026-06-26
clean	success	clean	—	—	1	2026-06-20
chunk	success	chunk	—	—	1	2026-06-20
embed	success	embed	Qwen/Qwen3-Embedding-8B	—	1	2026-06-20
enrich	success	openalex	—	—	1	2026-06-20
promote	success	—	—	—	1	2026-06-20
summarize	success	llm	qwen3.6-27b-prismaquant	summ-v5	1	2026-06-26
tag	success	vector_similarity	—	—	6	2026-06-20
verify	success	—	—	—	1	2026-06-26

Summary generated by qwen3.6-27b-prismaquant on 2026-06-26; verification: verified.

Topics

Ranked by relevance to this paper.

generative ai voice assistants

Information type

What kind of knowledge this paper contributes, grouped by family — independent of topic (what it is about) and method (how it was studied).

Theoretical Contribution: computational model