Multidisciplinary Initiative to Create and Integrate Realistic Artificial Datasets
archive: archived pipeline: cataloged verified
Summary
This research addresses the limitations of traditional crash prediction models, which typically focus on crash frequencies and rates rather than the underlying cause-and-effect relationships that drive safety outcomes. Because real-world data often lacks known causal structures, it is difficult to evaluate whether a model accurately captures these relationships. To resolve this, the Federal Highway Administration’s Exploratory Advanced Research Program sponsored a project to develop a framework for generating Realistic Artificial Datasets (RADs). These synthetic datasets mimic known causal relationships between contributing factors and crashes, allowing researchers to rigorously test how well statistical and machine-learning models reflect true causality. The initiative specifically targeted diamond interchanges, focusing on ramp terminal left-turn crashes and speed change lane crashes, facilities that are overrepresented in fatal and injury crashes but suffer from a lack of accurate, granular crash data.

The researchers developed a three-step framework to generate RADs. First, they identified contributing factors (such as roadway geometry, traffic volume, and driver characteristics) using observed data from Washington and Missouri. Second, they established cause-effect relationships by synthesizing information from the Highway Safety Manual and the Crash Modification Factor Clearinghouse. Third, they generated crash counts using a hierarchical Poisson approach, adjusting distributions to match observed site characteristics and categorizing crashes by severity.

To evaluate model performance, the team created a rubric scoring models on six criteria, including prediction accuracy and model inference. Three independent teams, unaware of the RAD generation procedures, applied statistical and machine-learning models to the datasets. Additionally, the team developed web-based software containing 196 pregenerated datasets and a custom request generator to facilitate broader research use.

The results indicated that machine-learning models outperformed statistical models, particularly in the model inference criterion. The machine-learning models produced Crash Modification Factors closer to the true values used to generate the RADs, demonstrating a superior ability to capture nonlinear relationships between independent variables and crash frequency. Performance was notably higher for speed change lane datasets compared to ramp terminal datasets.

Beyond tabular data, the researchers created virtual reality simulation testbeds using safety-critical events from the SHRP2 Naturalistic Driving Study. These testbeds reconstructed 114 left-turn and 310 speed change lane events in 3D environments, offering interactive views for evaluating behavioral and roadway countermeasures.

The significance of this work lies in providing a standardized method for validating safety models against known causal truths, a capability previously unavailable with real-world data. The RAD framework and associated software enable researchers to compare modeling approaches objectively and reduce the effort required to prepare data for analysis. Furthermore, the virtual reality testbeds offer an engaging platform for human factors research, driver education, and the evaluation of countermeasures like in-vehicle alert systems. The generic nature of the framework allows for future application to other facilities with data scarcity, such as work zones and bicycle lanes, supporting the U.S. Department of Transportation’s goal of zero roadway fatalities through improved data-driven safety analysis.
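To make the third generation step more concrete, the following is a minimal sketch of how crash counts with a known causal structure might be simulated. It is not the study's procedure. It assumes a gamma-mixed Poisson hierarchy, and every name and number in it (`TRUE_CMFS`, `expected_crashes`, `generate_rad`, the base rate, the traffic exponent, the dispersion) is hypothetical and chosen only for illustration. Severity categorization is omitted.

```python
import math
import random

# Hypothetical "true" CMFs for illustration only (not values from the study).
TRUE_CMFS = {"has_left_turn_lane": 0.80, "has_lighting": 0.90}


def poisson_sample(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def expected_crashes(site, base_rate=2.0, aadt_exponent=0.6):
    """Mean crash count: a baseline scaled by traffic volume, then
    multiplied by the CMF of each treatment present at the site."""
    mean = base_rate * (site["aadt"] / 10_000) ** aadt_exponent
    for feature, cmf in TRUE_CMFS.items():
        if site[feature]:
            mean *= cmf
    return mean


def generate_rad(n_sites, seed=0, dispersion=2.0):
    """Simulate site records whose crash counts follow the assumed structure."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_sites):
        site = {
            "aadt": rng.uniform(2_000, 30_000),
            "has_left_turn_lane": rng.random() < 0.5,
            "has_lighting": rng.random() < 0.5,
        }
        # Site-level random effect with mean 1 (gamma shape k, scale 1/k).
        site_effect = rng.gammavariate(dispersion, 1.0 / dispersion)
        site["crashes"] = poisson_sample(rng, expected_crashes(site) * site_effect)
        rows.append(site)
    return rows
```

Because the multipliers in `TRUE_CMFS` are fixed by the generator, any model fitted to the output of `generate_rad` can be checked against them, which is the property that real crash data cannot offer.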
Key finding
Machine-learning models outperformed statistical models in accurately inferring the assumed cause-and-effect relationships within the generated realistic artificial datasets.
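One simple way to express "closer to the true values" is a mean absolute relative error between the Crash Modification Factors a model implies and those used to generate the data. The sketch below is an assumption made for illustration, not the study's scoring rubric. The function name `cmf_recovery_error` and all numbers are hypothetical.

```python
def cmf_recovery_error(true_cmfs, estimated_cmfs):
    """Mean absolute relative error between estimated and true CMFs.

    Both arguments map a factor name to its CMF. Lower is better.
    """
    errors = [
        abs(estimated_cmfs[name] - true_value) / true_value
        for name, true_value in true_cmfs.items()
    ]
    return sum(errors) / len(errors)


# Hypothetical values, for illustration only.
true_cmfs = {"left_turn_lane": 0.80, "lighting": 0.90}
model_a = {"left_turn_lane": 0.82, "lighting": 0.88}
model_b = {"left_turn_lane": 0.95, "lighting": 1.00}

error_a = cmf_recovery_error(true_cmfs, model_a)
error_b = cmf_recovery_error(true_cmfs, model_b)
```

Under this measure, the model with the smaller error (here `model_a`) has recovered the assumed cause-and-effect structure more faithfully.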
Methodology
modeling
Provenance
The full processing record for this entry. Every stage of this paper's journey through the pipeline is logged — what ran, with which tool and model, how many attempts it took, and when it last completed. Discovered via bulk_ingest_rosap on 2026-05-23 (6 acquisition events logged).
| Stage | Outcome | Tool | Model | Prompt | Attempts | Completed |
|---|---|---|---|---|---|---|
| discover | success | rosap | — | — | 2 | 2026-05-23 |
| archive | success | — | — | — | 1 | 2026-05-23 |
| extract | success | cached | — | — | 2 | 2026-06-10 |
| clean | success | — | — | — | 1 | 2026-06-01 |
| chunk | success | — | — | — | 1 | 2026-06-01 |
| embed | success | — | — | — | 1 | 2026-06-02 |
| enrich | success | — | — | — | 1 | 2026-05-23 |
| promote | success | — | — | — | 1 | 2026-05-23 |
| summarize | success | llm | qwen3.6-27b-prismaquant | summ-v5 | 3 | 2026-06-10 |
| tag | success | vector_similarity | — | — | 19 | 2026-06-11 |
| verify | success | — | — | — | 2 | 2026-06-10 |
Summary generated by qwen3.6-27b-prismaquant on 2026-06-10; verification: verified.
Information type
What kind of knowledge this paper contributes, grouped by family — independent of topic (what it is about) and method (how it was studied).
- Empirical Findings: crash risk outcomes
- Methodological Resource: dataset resource
- Theoretical Contribution: computational model