Behavior-based Predictive Safety Analytics – Pilot Study

Engström, Johan; Miller, Andrew; Huang, Wenyan; Soccolich, Susan A.; Machiani, Sahar Ghanipoor; Jahangiri, Arash; Dreger, Felix; de Winter, Joost · 2019 · ROSA P / Safety through Disruption (Safe-D) University Transportation Center (UTC)

archive: archived · pipeline: cataloged · verified

Summary

This pilot study investigates the feasibility of developing statistical models to predict individual driver crash involvement based on driving style, demographic, and behavioral history variables. The research addresses the longstanding challenge of identifying high-risk drivers before accidents occur, a capability with significant applications in fleet safety management and insurance. While previous naturalistic driving studies focused on specific risky tasks, this project aims to map enduring personal factors and persistent driving patterns to crash risk. The study was designed to establish a foundation for future comprehensive research by testing predictive analytics on large-scale naturalistic data.

The researchers utilized a subset of the Second Strategic Highway Research Program (SHRP2) dataset, comprising 2,458 drivers who drove over 27 million miles during a six-month study period. The analysis focused on enduring personal factors, including demographics (age, gender), self-reported driving history (violations, crashes), personality traits (sensation seeking, risk perception), and operationalized driving styles. Driving style was measured by the frequency of six kinematic events (hard starts, stops, and turns), calculated using specific g-force thresholds optimized via the Akaike Information Criterion. The dependent variables were binary indicators of crash involvement and crash/near-crash (CNC) involvement. The study employed logistic regression and random forest classifiers to predict these outcomes, evaluating model performance through recall, precision, and accuracy metrics.

The results confirmed the presence of differential crash involvement, demonstrating that a small proportion of drivers accounts for the majority of risk. Specifically, approximately 25% of drivers accounted for 80% of CNC events, aligning with the Pareto principle. Significant associations were found between enduring personal factors and crash involvement. Both logistic regression and random forest classifiers proved relatively successful in predicting CNC involvement based on individual characteristics. However, the models’ ability to specifically predict actual crash involvement was more limited compared to their performance with the broader CNC category. The study also identified that commercial data sources, such as those from Lytx and SmartDrive, could potentially support larger-scale academic research if anonymization and legal constraints are addressed.

The significance of this work lies in validating the concept that individual driver characteristics can be used to predict safety outcomes, providing a basis for behavior-based predictive safety analytics. The findings suggest that while predicting general safety-critical events is feasible, predicting specific crashes remains challenging with current methods and data granularity. The study highlights the need for larger datasets and refined modeling techniques to improve predictive accuracy. By establishing a conceptual framework and demonstrating proof-of-concept models, the project lays the groundwork for future efforts to identify risky drivers proactively, potentially enhancing safety interventions in commercial fleets and personal insurance contexts.
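
The differential-involvement result (roughly 25% of drivers accounting for 80% of CNC events) is a concentration statistic. As a minimal sketch of how such a figure could be computed from per-driver event counts: the function name `share_of_events_from_top_drivers` and the counts below are hypothetical illustrations, not the study's code or data.

```python
import math


def share_of_events_from_top_drivers(event_counts, top_fraction):
    """Fraction of all events attributable to the highest-count `top_fraction` of drivers."""
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    total = sum(event_counts)
    if total == 0:
        return 0.0
    # Rank drivers from most to fewest events and keep the top slice.
    ranked = sorted(event_counts, reverse=True)
    n_top = max(1, math.ceil(len(ranked) * top_fraction))
    return sum(ranked[:n_top]) / total


# Hypothetical CNC counts for 12 drivers (made-up values, not SHRP2 data).
cnc_counts = [9, 7, 0, 0, 1, 0, 2, 0, 0, 1, 0, 0]
share = share_of_events_from_top_drivers(cnc_counts, 0.25)
print(f"Top 25% of drivers account for {share:.0%} of events")  # prints 90%
```

In this toy sample the top three drivers carry 18 of 20 events. Applied to real per-driver counts, the same calculation would yield the kind of proportion reported in the summary.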

Key finding

Logistic regression and random forest classifiers were relatively successful in predicting crash/near-crash (CNC) involvement based on individual characteristics, but the ability to specifically predict involvement in crashes was more limited.
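
The summary states that model performance was evaluated with recall, precision, and accuracy. The sketch below shows how these three metrics are computed for a binary outcome such as crash involvement. The labels are hypothetical, and the example illustrates only a general property of rare outcomes (accuracy can look high while recall and precision stay modest); it is not a reproduction of the study's results.

```python
def binary_metrics(y_true, y_pred):
    """Recall, precision, and accuracy for binary labels (1 = involved, 0 = not involved)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged drivers
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed drivers
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly cleared drivers
    return {
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
    }


# Hypothetical sample: 20 drivers, 2 of them crash-involved.
# The hypothetical model flags one of the two, plus one false alarm.
y_true = [1, 1] + [0] * 18
y_pred = [1, 0, 1] + [0] * 17
metrics = binary_metrics(y_true, y_pred)
print(metrics)  # recall 0.5, precision 0.5, accuracy 0.9
```

Because positives are rare in this toy sample, accuracy is 0.9 even though only half of the involved drivers are identified, which is one reason recall and precision are reported alongside accuracy.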

Methodology

naturalistic

Sample size: 2,458 drivers

Provenance

The full processing record for this entry. Every stage of this paper's journey through the pipeline is logged — what ran, with which tool and model, how many attempts it took, and when it last completed. Discovered via bulk_ingest_rosap on 2026-05-23 (6 acquisition events logged).

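| Stage | Outcome | Tool | Model | Prompt | Attempts | Completed |
| --- | --- | --- | --- | --- | --- | --- |
| discover | success | rosap | | | 2 | 2026-05-23 |
| archive | success | | | | 1 | 2026-05-23 |
| extract | success | cached | | | 2 | 2026-06-10 |
| clean | success | | | | 1 | 2026-06-01 |
| chunk | success | | | | 1 | 2026-06-01 |
| embed | success | | | | 1 | 2026-06-02 |
| enrich | success | | | | 1 | 2026-05-23 |
| promote | success | | | | 1 | 2026-05-23 |
| summarize | success | llm | qwen3.6-27b-prismaquant | summ-v5 | 3 | 2026-06-10 |
| tag | success | vector_similarity | | | 19 | 2026-06-11 |
| verify | success | | | | 2 | 2026-06-10 |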

Summary generated by qwen3.6-27b-prismaquant on 2026-06-10; verification: verified.
