CoCAtt: A Cognitive-Conditioned Driver Attention Dataset (Supplementary Material)
URL: http://arxiv.org/abs/2207.04028v1
Abstract
The task of driver attention prediction has drawn considerable interest among researchers in robotics and the autonomous vehicle industry. Driver attention prediction can play an instrumental role in mitigating and preventing high-risk events, like collisions and casualties. However, existing driver attention prediction models neglect the distraction state and intention of the driver, which can significantly influence how they observe their surroundings. To address these issues, we present a new driver attention dataset, CoCAtt (Cognitive-Conditioned Attention). Unlike previous driver attention datasets, CoCAtt includes per-frame annotations that describe the distraction state and intention of the driver. In addition, the attention data in our dataset is captured in both manual and autopilot modes using eye-tracking devices of different resolutions. Our results demonstrate that incorporating the above two driver states into attention modeling can improve the performance of driver attention prediction. To the best of our knowledge, this work is the first to provide autopilot attention data. Furthermore, CoCAtt is currently the largest and the most diverse driver attention dataset in terms of autonomy levels, eye tracker resolutions, and driving scenarios. CoCAtt is available for download at https://cocatt-dataset.github.io.
Summary
Supplementary material for CoCAtt, a cognitive-conditioned driver-attention dataset. It describes the dataset distribution, the noisy behavior of webcam-based eye tracking relative to a GP3 eye tracker (a center-shift bias and a broader spatial distribution), and a coarse-to-fine gaze-calibration network that fuses raw webcam gaze with RGB scene features via a CNN+LSTM. The supplement also details unconditioned, multi-branch, and modified-CondConv attention-prediction architectures, in which conditional-convolution kernel weights depend on driver state (e.g., distraction); CondConv is injected at multiple layers, and dropout is raised to 0.7 to mitigate overfitting.
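The supplement describes the calibration network only at a block-diagram level. The following is a minimal PyTorch sketch of the coarse-to-fine idea; the class name, layer sizes, and the residual-correction head are assumptions, and only the CNN+LSTM fusion of raw webcam gaze with RGB scene features comes from the source.

```python
import torch
import torch.nn as nn

class GazeCalibrationNet(nn.Module):
    """Minimal sketch of a coarse-to-fine webcam-gaze calibration network.

    Hypothetical layer sizes; the supplement only states that raw webcam
    gaze is fused with RGB scene features via a CNN+LSTM.
    """

    def __init__(self, hidden_size=128):
        super().__init__()
        # Coarse stage: small CNN encoder for each RGB scene frame.
        self.scene_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> (B*T, 32)
        )
        # Temporal fusion of scene features with the raw 2D webcam gaze.
        self.lstm = nn.LSTM(32 + 2, hidden_size, batch_first=True)
        # Fine stage: regress a 2D gaze correction per frame (assumed design).
        self.head = nn.Linear(hidden_size, 2)

    def forward(self, frames, raw_gaze):
        # frames: (B, T, 3, H, W); raw_gaze: (B, T, 2) in normalized coords.
        b, t = frames.shape[:2]
        feats = self.scene_encoder(frames.flatten(0, 1)).view(b, t, -1)
        fused, _ = self.lstm(torch.cat([feats, raw_gaze], dim=-1))
        # Predict a residual correction on top of the coarse raw estimate.
        return raw_gaze + self.head(fused)
```

A residual head is one natural reading of the "coarse-to-fine" framing: the raw webcam gaze acts as the coarse estimate, and the network learns the correction that removes the center-shift bias and spatial noise.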
Key finding
Conditioning attention models on driver state via modified CondConv layers (whose routing function selects among expert kernel weights based on that state) improves driver-attention prediction, while a coarse-to-fine calibration network mitigates the center-shift bias and noise of low-cost webcam eye tracking.
Methodology
Supplementary methods document for the CoCAtt dataset and its baseline architectures: visualizations of webcam vs. eye-tracker gaze heatmaps, an AlexNet+LSTM attention backbone, and a modified CondConv layer with N=4 expert weights and a 1D driver-state input to its routing function.
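Since the modified CondConv is described only in prose (N=4 expert kernels, with a 1D driver-state input to the routing function), here is a hedged PyTorch sketch; the class name, sigmoid routing, weight initialization, and grouped-convolution trick are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DriverStateCondConv2d(nn.Module):
    """Sketch of the modified CondConv layer: N=4 expert kernels whose
    mixture weights are routed from a low-dimensional driver-state input,
    rather than from pooled image features as in the original CondConv.
    """

    def __init__(self, in_ch, out_ch, kernel_size, num_experts=4, state_dim=1):
        super().__init__()
        # One conv kernel per expert: (N, out_ch, in_ch, k, k).
        self.experts = nn.Parameter(
            torch.randn(num_experts, out_ch, in_ch, kernel_size, kernel_size) * 0.01
        )
        # Routing function: 1D driver state -> per-expert mixture weights.
        self.router = nn.Linear(state_dim, num_experts)

    def forward(self, x, driver_state):
        # x: (B, in_ch, H, W); driver_state: (B, state_dim), e.g. a distraction flag.
        b = x.size(0)
        routing = torch.sigmoid(self.router(driver_state))  # (B, N)
        # Mix expert kernels per example: (B, out_ch, in_ch, k, k).
        kernels = torch.einsum('bn,noikl->boikl', routing, self.experts)
        # Grouped conv so each sample in the batch uses its own mixed kernel.
        out = F.conv2d(
            x.reshape(1, -1, *x.shape[2:]),           # (1, B*in_ch, H, W)
            kernels.reshape(-1, *kernels.shape[2:]),  # (B*out_ch, in_ch, k, k)
            groups=b,
            padding=self.experts.shape[-1] // 2,
        )
        return out.view(b, -1, *out.shape[2:])
```

Injected at multiple layers of the AlexNet+LSTM backbone, such a layer lets a binary distraction flag (or an intention code) select a different effective kernel per frame; per the supplement, dropout is raised to 0.7 to keep the added expert capacity from overfitting.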
Quality score: 5 / 5