Datasets and Competitions

Our research team creates large, publicly available datasets. We lead analyses of these and other datasets using NLP techniques, and we host competitions to attract worldwide talent to learning analytics. We also use large datasets to fine-tune and domain-adapt language models to inform educational interventions.

Datasets

Automatic Essay Scoring 2.0 The Learning Agency Lab - Automated Essay Scoring 2.0

The competition dataset comprises about 24,000 student-written argumentative essays. Each essay was scored on a scale of 1 to 6. The goal of the competition is to predict the score an essay received from its text.

corpus
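Predictions on an ordinal 1 to 6 scale like the one above are often evaluated with an agreement statistic rather than plain accuracy. The following is a minimal sketch of quadratic weighted kappa in plain Python. The choice of metric is an assumption made for illustration, not something stated in the dataset description, and the function name and parameters are hypothetical.

```python
from collections import Counter


def quadratic_weighted_kappa(actual, predicted, min_score=1, max_score=6):
    """Agreement between two lists of integer scores (hypothetical helper).

    Returns 1.0 for perfect agreement, about 0.0 for chance-level agreement,
    and negative values for worse-than-chance agreement.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one scored essay is required")
    for score in list(actual) + list(predicted):
        if not min_score <= score <= max_score:
            raise ValueError(f"score {score} is outside the expected range")

    n_levels = max_score - min_score + 1
    total = len(actual)

    # Observed counts for each (actual, predicted) pair and per-rater histograms.
    observed = Counter(zip(actual, predicted))
    actual_hist = Counter(actual)
    predicted_hist = Counter(predicted)

    weighted_observed = 0.0
    weighted_expected = 0.0
    for a in range(min_score, max_score + 1):
        for p in range(min_score, max_score + 1):
            # Penalty grows with the squared distance between the two scores.
            weight = (a - p) ** 2 / (n_levels - 1) ** 2
            weighted_observed += weight * observed[(a, p)]
            # Expected count if the two score lists were independent.
            weighted_expected += weight * actual_hist[a] * predicted_hist[p] / total

    if weighted_expected == 0:
        # Both lists contain the same single score: treat as perfect agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected
```

For example, `quadratic_weighted_kappa([1, 2, 3], [1, 2, 3])` gives `1.0`, while swapping the two extreme scores, as in `quadratic_weighted_kappa([1, 6], [6, 1])`, gives `-1.0`. Because the penalty is quadratic, predicting a 2 for an essay scored 6 costs far more than predicting a 5.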

PIILO The Learning Agency Lab - PII Data Detection

The dataset consists of approximately 22,000 essays in which surrogate identifiers replace the original PII. It is intended for PII annotation: 70% of the essays are reserved for testing, and the use of external data for training is encouraged.

corpus

KeystrokeEssayScoring Linking Writing Processes to Writing Quality

The dataset comprises about 5,000 logs of user inputs, such as keystrokes and mouse clicks, recorded during the composition of an essay.

corpus

StudentLLMEssay LLM - Detect AI Generated Text

The dataset comprises about 10,000 essays, some written by students and some generated by a variety of large language models (LLMs).

corpus

CLASSE The CommonLit Augmented Student Summary Evaluation corpus

This dataset contains around 24,000 summaries written by students in grades 3-12, each scored on content and wording. It was prepared for a competition that aims to predict such scores for summaries on unseen topics.

corpus

CLEAR The CommonLit EAse of Readability Corpus

CLEAR provides unique readability scores for ~5,000 excerpts leveled for 3rd-12th grade readers, along with each excerpt's year of publication, genre, and other metadata.

corpus

PERSUADE Persuasive Essays for Rating, Selecting, and Understanding Argumentative and Discourse Elements corpus

PERSUADE is an open-source corpus comprising over 25,000 essays annotated for argumentative and discourse elements and the relationships between these elements. The corpus includes holistic quality scores for each essay and for each argumentative and discourse element. Lastly, PERSUADE includes detailed demographic information for the writers. Kaggle competition (2021): Feedback Prize - Evaluating Student Writing.

corpus

ELLIPSE English Language Learner Insight, Proficiency and Skills Evaluation Corpus

ELLIPSE comprises ~7,000 essays written by English Language Learners (ELLs). The essays were written on 29 independent prompts that required no background knowledge on the part of the writer. Individual difference information is available for each essay, including economic status, gender, grade level (8-12), and race/ethnicity. Each essay was scored by two normed human raters for English language proficiency, with an overall proficiency score and analytic scores for cohesion, syntax, vocabulary, phraseology, grammar, and conventions.

corpus