Datasets and Competitions
Our research team creates large, publicly available datasets. We lead analyses of these and other datasets using NLP techniques, and we host competitions to attract worldwide talent to learning analytics. We also use large datasets to fine-tune and domain-adapt language models that inform educational interventions.
Datasets
CLASSE
The CommonLit Augmented Student Summary Evaluation corpus
This dataset contains around 24,000 student summaries from grades 3-12, each scored on content and wording. It was built for a competition in which the aim is to predict these scores for summaries of unseen topics.
CLEAR
The CommonLit EAse of Readability Corpus
CLEAR provides unique readability scores for ~5,000 excerpts leveled for 3rd-12th grade readers, along with each excerpt’s year of publication, genre, and other metadata.
PERSUADE
Persuasive Essays for Rating, Selecting, and Understanding Argumentative and Discourse Elements corpus
PERSUADE is an open-source corpus comprising over 25,000 essays annotated for argumentative and discourse elements and relationships between these elements. The corpus includes holistic quality scores for each essay and for each argumentative and discourse element. Lastly, PERSUADE includes detailed demographic information for the writers.
Kaggle competition 2021: Feedback Prize - Evaluating Student Writing
ELLIPSE
English Language Learner Insight, Proficiency and Skills Evaluation Corpus
ELLIPSE comprises ~7,000 essays written by English Language Learners (ELLs). The essays were written in response to 29 independent prompts that required no background knowledge on the part of the writer. Individual difference information is available for each essay, including economic status, gender, grade level (8-12), and race/ethnicity. Each essay was scored for English language proficiency by two normed human raters, who assigned an overall proficiency score and analytic scores for cohesion, syntax, vocabulary, phraseology, grammar, and conventions.