Datasets and Competitions

Our research team creates large, publicly available datasets. We lead analyses of these and other datasets using NLP techniques, and we host competitions to attract worldwide talent to learning analytics. We also use large datasets to fine-tune and domain-adapt language models to inform educational interventions.


PIILO: The Learning Agency Lab - PII Data Detection

The dataset consists of approximately 22,000 essays in which surrogate identifiers replace the original PII. It is intended for PII annotation: 70% of the essays are reserved for testing, and the use of external data for training is encouraged.

KeystrokeEssayScoring: Linking Writing Processes to Writing Quality

The dataset comprises about 5,000 logs of user inputs, such as keystrokes and mouse clicks, recorded during the composition of an essay.

StudentLLMEssay: LLM - Detect AI Generated Text

The dataset comprises about 10,000 essays, some written by students and some generated by a variety of large language models (LLMs).

CLASSE: The CommonLit Augmented Student Summary Evaluation corpus

This dataset contains around 24,000 student summaries from grades 3-12, scored on content and wording, for a competition aiming to predict similar scores on unseen topics.

CLEAR: The CommonLit EAse of Readability Corpus

CLEAR provides unique readability scores for ~5,000 excerpts leveled for 3rd-12th grade readers, along with each excerpt's publication year, genre, and other metadata.

PERSUADE: Persuasive Essays for Rating, Selecting, and Understanding Argumentative and Discourse Elements corpus

PERSUADE is an open-source corpus comprising over 25,000 essays annotated for argumentative and discourse elements and relationships between these elements. The corpus includes holistic quality scores for each essay and for each argumentative and discourse element. Lastly, PERSUADE includes detailed demographic information for the writers.

Kaggle competition 2021: Feedback Prize - Evaluating Student Writing

ELLIPSE: English Language Learner Insight, Proficiency and Skills Evaluation Corpus

ELLIPSE comprises ~7,000 essays written by English Language Learners (ELLs). The essays were written in response to 29 independent prompts that required no background knowledge on the part of the writer. Individual difference information is available for each essay, including economic status, gender, grade level (8-12), and race/ethnicity. Each essay was scored by two normed human raters for English language proficiency, with an overall proficiency score and analytic scores for cohesion, syntax, vocabulary, phraseology, grammar, and conventions.


CommonLit Readability Prize - Creating a Readability Algorithm
Feedback Prize - Evaluating Student Writing
Feedback Prize - Predicting Effective Arguments
Feedback Prize - English Language Learning