Behavior-based Predictive Safety Analytics - Pilot Study [supporting datasets]
archive: archived pipeline: cataloged verified
Summary
This record describes the dataset supporting the report "Behavior-based Predictive Safety Analytics – Pilot Study," funded by the U.S. Department of Transportation and preserved by the Virginia Tech Transportation Institute. The research objective was to develop statistical models that predict individual driver crash involvement, and to determine whether driving style, demographic information, and behavioral history are reliable predictors of safety outcomes.
The analysis used a subset of data from the Strategic Highway Research Program 2 (SHRP2) Naturalistic Driving Study. To be included, a participant had to have taken part in SHRP2 data collection for at least seven months and to have driven more than 1,000 miles during the designated study period. For each included driver, a six-month interval (months 2–7 of data collection) was extracted to calculate driving style measures and assess crash or near-crash involvement. Questionnaire data collected before the start of SHRP2 data collection were also retrieved to capture driver behaviors and risk perception. Applying these criteria produced a final dataset of 2,800 drivers, representing 3.91 million trips, 27.16 million miles of driving, and 0.69 million driving hours.
The dataset is structured at the driver level with continuous variables. It combines questionnaire factors on driver behaviors and risk perception; exposure metrics based on time, hours, and trips; crash-related data; and driver behavior variables mined from the six-month study period. The data package consists of an Excel file containing the processed data and a PDF data dictionary that defines each variable; "NA" denotes missing values.
The dataset is categorized under Engineering, with keywords including Crash, Near Crash, Driver Behavior Questionnaire, Crash Rate, and Driver Behaviors. Its value lies in providing a naturalistic driving dataset that links behavioral history with safety outcomes. Because the data are publicly available through the Virginia Tech Transportation Institute repository and the National Transportation Library, other researchers can study how individual driving styles and demographic factors influence crash risk, and can validate and extend statistical models for predicting crash involvement.
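The inclusion criteria and the "NA" convention described in the summary can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record type, field names, and sample drivers below are hypothetical and do not come from the data dictionary, and the study's actual processing code is not part of this record.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds taken from the summary text.
MIN_MONTHS = 7               # at least seven months of SHRP2 data collection
MIN_MILES = 1000.0           # more than 1,000 miles in the study period
STUDY_MONTHS = range(2, 8)   # months 2-7 of data collection, inclusive


@dataclass
class DriverRecord:
    """Hypothetical per-driver record; field names are illustrative only."""
    driver_id: str
    months_in_study: int
    miles_in_window: float


def is_included(driver: DriverRecord) -> bool:
    """Apply the two inclusion criteria stated in the summary."""
    return (driver.months_in_study >= MIN_MONTHS
            and driver.miles_in_window > MIN_MILES)


def parse_value(cell: str) -> Optional[float]:
    """Read one cell of a continuous variable, mapping "NA" to None."""
    text = cell.strip()
    return None if text == "NA" else float(text)


# Made-up sample drivers, used only to exercise the filter.
drivers = [
    DriverRecord("D001", months_in_study=6, miles_in_window=5000.0),
    DriverRecord("D002", months_in_study=12, miles_in_window=4200.0),
    DriverRecord("D003", months_in_study=9, miles_in_window=1000.0),
]
included = [d.driver_id for d in drivers if is_included(d)]
print(included)
```

In this sketch D001 fails the seven-month requirement and D003 fails because the mileage threshold is strict ("more than 1,000 miles"), so only D002 passes.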
Key finding
The compiled driver-level dataset covers 2,800 drivers, 3.91 million trips, 27.16 million miles, and 0.69 million driving hours, drawn from a six-month window per driver (months 2–7 of data collection).
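For a rough sense of scale, the totals above can be turned into per-driver averages. The sketch below is back-of-the-envelope arithmetic on the rounded totals in this record, not figures reported by the study; the variable names are illustrative.

```python
# Totals as stated in this record (rounded in the source).
DRIVERS = 2_800
TRIPS = 3.91e6
MILES = 27.16e6
HOURS = 0.69e6

# Average exposure per driver over the six-month window.
per_driver = {
    "trips": TRIPS / DRIVERS,
    "miles": MILES / DRIVERS,
    "hours": HOURS / DRIVERS,
}

# Derived averages across all trips.
miles_per_trip = MILES / TRIPS
miles_per_hour = MILES / HOURS

for name, value in per_driver.items():
    print(f"{name} per driver: {value:,.1f}")
print(f"miles per trip: {miles_per_trip:.2f}")
print(f"miles per driving hour: {miles_per_hour:.1f}")
```

This works out to roughly 1,400 trips, 9,700 miles, and 246 driving hours per driver, with an average trip of about 7 miles. Because the source totals are rounded, these averages are approximate.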
Methodology
dataset
Sample size: 2,800 drivers
Provenance
The full processing record for this entry. Every stage of this paper's journey through the pipeline is logged — what ran, with which tool and model, how many attempts it took, and when it last completed. Discovered via bulk_ingest_rosap on 2026-05-23 (7 acquisition events logged).
| Stage | Outcome | Tool | Model | Prompt | Attempts | Completed |
|---|---|---|---|---|---|---|
| discover | success | rosap | — | — | 2 | 2026-05-23 |
| archive | success | — | — | — | 1 | 2026-05-23 |
| extract | success | cached | — | — | 3 | 2026-06-10 |
| clean | success | — | — | — | 1 | 2026-06-01 |
| chunk | success | — | — | — | 1 | 2026-06-01 |
| embed | success | — | — | — | 1 | 2026-06-02 |
| enrich | success | — | — | — | 1 | 2026-05-23 |
| promote | success | — | — | — | 1 | 2026-05-23 |
| summarize | success | llm | qwen3.6-27b-prismaquant | summ-v5 | 4 | 2026-06-10 |
| tag | success | vector_similarity | — | — | 19 | 2026-06-11 |
| verify | partial | — | — | — | 3 | 2026-06-10 |
Summary generated by qwen3.6-27b-prismaquant on 2026-06-10; verification: verified_with_issues.
Topics
Ranked by relevance to this paper.
- telematics crash prediction
- naturalistic crash near crash
- incidence prevalence
- sex gender
- induced exposure
- exposure measurement
Information type
What kind of knowledge this paper contributes, grouped by family — independent of topic (what it is about) and method (how it was studied).
- Empirical Findings: crash risk outcomes
- Methodological Resource: dataset resource, tool software