SimCoachCorpus: A naturalistic dataset with language and trajectories for embodied teaching

Sumner, Emily; Gopinath, Deepak; Dees, Laporsha; Gomez, P.R.C.; Cui, Xiongyi; Silva, Andrew; Costa, Jean; Morgan, Allison; Schrum, Mariah; Chen, Tiffany L. Bhattacharjee; Balachandran, Avinash; Rosman, Guy · 2025 · ArXiv.org

DOI: 10.48550/arxiv.2509.14548

archive: archived pipeline: cataloged verified

Summary

This paper introduces SimCoachCorpus, a dataset designed to address the scarcity of longitudinal data capturing embodied motor skill acquisition through verbal instruction. While AI has advanced in educational domains, datasets linking language with physical action and long-term learning dynamics remain limited, particularly in high-performance contexts like motorsports. To fill this gap, the authors collected data from 29 participants driving a race car simulator for approximately 90 minutes. The study used a controlled experimental design with two conditions: 15 participants received personalized, one-on-one coaching from a professional driving instructor, while 14 participants practiced without coaching. The dataset synchronizes dense vehicle state and control data, track maps, and cone landmarks with concurrent verbal coaching transcripts, terminal feedback after each lap, and participant self-reports on cognitive load and emotional state.

Analysis of the dataset reveals significant differences in learning outcomes between the coached and self-practice groups. Coached participants improved their lap times by over 35%, compared to less than 20% for self-practice participants. Coached drivers also adhered significantly better to the racing line and spent significantly less time out of bounds, indicating more consistent and controlled driving. Coached participants reported higher levels of enjoyment and more positive emotional states, though no significant differences were found in cognitive load.

Linguistic analysis of the coaching transcripts identified distinct instruction categories, such as throttle, steering, and braking, which correlated strongly with specific track locations, such as sharp turns. Student compliance with instructions varied by category, with lateral position instructions receiving the lowest compliance rates.

The paper demonstrates the utility of SimCoachCorpus for machine learning applications, including in-context learning and imitation learning. Using large language models, the authors showed that providing concurrent feedback and segment-level metrics allows for the generation of specific, grounded terminal feedback. They also developed a multi-task learning model capable of imitating coach instructions and predicting student trajectories, achieving competitive performance metrics.

The authors conclude that SimCoachCorpus provides a unique resource for investigating motor learning dynamics, linguistic phenomena in teaching, and the development of computational models for automated coaching. By capturing the relationship between verbal instruction and physical performance over time, the dataset supports research into AI systems that can assist in embodied learning domains, extending beyond driving to areas like sports, rehabilitation, and surgical training.
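As a rough illustration of the lap-time comparison reported above, the sketch below computes a per-participant percent improvement and averages it by group. The definition used here (reduction from the first lap to the best lap) is an assumption, since this summary does not state how the paper defines improvement. The function name, data layout, and lap times are all hypothetical and are not drawn from the dataset.

```python
from statistics import mean


def lap_time_improvement(lap_times):
    """Percent reduction from the first lap to the best (shortest) lap.

    Assumed definition for illustration only; the paper may use another.
    """
    if not lap_times:
        raise ValueError("lap_times must not be empty")
    first = lap_times[0]
    best = min(lap_times)
    return 100.0 * (first - best) / first


# Hypothetical lap times in seconds, one list per participant.
# These numbers are made up and do not come from SimCoachCorpus.
sessions = {
    "coached": [[100.0, 80.0, 64.0], [120.0, 90.0, 78.0]],
    "self_practice": [[100.0, 90.0, 85.0], [110.0, 99.0, 93.5]],
}

# Mean percent improvement per group.
group_means = {
    group: mean(lap_time_improvement(laps) for laps in participants)
    for group, participants in sessions.items()
}

for group, value in group_means.items():
    print(f"{group}: {value:.1f}% mean improvement")
```

Averaging per-participant percentages, rather than pooling raw lap times, keeps slower and faster drivers on a comparable scale; a real analysis would also need a significance test, which this sketch omits.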

Key finding

Participants receiving personalized coaching in a driving simulator demonstrated significantly greater performance improvements and higher positive emotional states compared to those practicing without instruction.

Methodology

simulator

Sample size: 29

Provenance

The full processing record for this entry. Every stage of this paper's journey through the pipeline is logged — what ran, with which tool and model, how many attempts it took, and when it last completed. Discovered via author_sweep_intake on 2026-05-28.

| Stage | Outcome | Tool | Model | Prompt | Attempts | Completed |
|---|---|---|---|---|---|---|
| discover | success | author_sweep | | | 2 | 2026-05-28 |
| archive | success | canonical_url | | | 1 | 2026-06-03 |
| extract | success | cached | | | 5 | 2026-06-10 |
| clean | success | clean | | | 3 | 2026-06-03 |
| chunk | success | chunk | | | 3 | 2026-06-03 |
| embed | success | embed | Qwen/Qwen3-Embedding-8B | | 1 | 2026-06-03 |
| enrich | skipped | | | | 2 | 2026-06-03 |
| promote | success | | | | 1 | 2026-06-03 |
| summarize | success | llm | qwen3.6-27b-prismaquant | summ-v5 | 2 | 2026-06-10 |
| tag | success | vector_similarity | | | 15 | 2026-06-11 |
| verify | success | | | | 2 | 2026-06-10 |

Summary generated by qwen3.6-27b-prismaquant on 2026-06-10; verification: verified.
