Dataset Catalog

Title Platform/Publisher Dataset Name Description URL Year Data Formats
iSnap - Introductory Programming DataShop iSnap - Introductory Programming iSnap logs all student actions to a remote database, including any interactions with the user interface and coding area. It also logs complete snapshots of students’ code after each edit, allowing for complete replay of a student’s actions within the environment. Link 2017 txt, ProgSnap
Scratch Dataset GitHub Scratch Dataset A dataset of 250K recent Scratch projects from 100K different authors scraped from the Scratch project repository. We processed the projects' source code and metadata to encode them into a database that facilitates querying and further analysis Link 2017 JSON
ShortAnswersIDSV Harvard Dataverse ShortAnswersIDSV This data set contains exam questions and answers from an introductory course to computer science Link 2022 tsv
Supplementary data of user study DataverseNO Supplementary data for study Supplementary data for study: Challenges Faced by Teaching Assistants in Computer Science Education Across Europe. This data includes the themes, sub-themes, codes and exemplary quotes from the analysis of reflection essays for the study "Challenges Faced by TAs in CS Education Across Europe". Link 2021 tsv, txt
CSEDM 2019 Data Challenge DataShop CSEDM 2019 Data Challenge The dataset used in the challenge comes from a study of novice Python programmers working with the ITAP intelligent tutoring system. For more information on the original experiment, see . There are 89 total students represented, and they worked on 38 problems over time. The study lasted over 7 weeks. The students could attempt the problems in any order, though there was a default order. Students could attempt the problem any number of times, receiving feedback from test cases each time, and they could also request hints from ITAP (though access was limited for students, depending on the week and their experimental condition). The dataset itself contains a record for each attempt and hint request that students made while working. Link 2019 CSV
CodeWorkout data Spring 2019 DataShop CodeWorkout data Spring 2019 CodeWorkout data from coding exercises, Spring 2019. Link 2019 CSV
Supplementary data for study: Study Behaviors and Educational Design DataverseNO Supplementary data for study: Study Behaviors and Educational Design Supplementary data for study: Understanding the Relation Between Study Behaviors and Educational Design (Study 1). It has been identified that the first-year experience is crucial to student motivation and throughput of study programs; therefore, it is interesting to look at the state of the art of computer science study programs in Norway. This data is part of a PhD project and relates to Study 1. In this study we present a survey and study of the number of undergraduate computer science programs in Norway and map their characteristics in order to gather an up-to-date overview of the selection of programs. Through a systematic review of all Norwegian undergraduate programs using data from national databases we have found that there are 12 institutions offering 56 different programs in Norway in 2018. The study showed that the characteristics of these programs vary, that is, the amount of computer science courses during the first year, the number of students, admission requirements, student satisfaction and time commitment. This article presents these findings along with an analysis of what characteristics impact the students’ contentment and learning experience. Link 2021 CSV, txt
Code Hunt GitHub Code Hunt Code Hunt is a serious education game which has been played by over 140,000 students and enthusiasts over the past year. In the process we have collected over 1.5M programs, which we can link to specific users at specific levels of expertise. We hope that researchers will embark on research into the data, discovering how coders code and how technology can be used to make the process more accurate and less painful. Although there has been research on how students code in the past, Microsoft Research is offering a unique opportunity to conduct research on large, common data sets. This preview data set contains the programs written by students (only) worldwide during a contest over 48 hours. There are approximately 250 users, 24 puzzles and about 13,000 programs. Link 2015 cs, Java, py
2019 CS1 Keystroke Data Harvard Dataverse 2019 CS1 Keystroke Data Keystroke data collected from CS1 student participants during 2019 at Utah State University. See readme.txt for detailed information. This dataset has undergone deidentification, though it is possible, being a complex, temporal, and ephemeral dataset, that identifying keystrokes may have been missed. Ethical use of this dataset includes avoiding attempts at reconstructing identities. That said, if researchers discover anything identifiable in the data, they are encouraged to contact the dataset authors (john.edwards@usu.edu). Link 2019 CSV
2021 CS1 Keystroke Data Harvard Dataverse 2021 CS1 Keystroke Data Keystroke data collected from CS1 student participants during the fall 2021 semester at Utah State University. See readme.txt for detailed information. This dataset has undergone deidentification, though it is possible, being a complex, temporal, and ephemeral dataset, that identifying keystrokes may have been missed. Ethical use of this dataset includes avoiding attempts at reconstructing identities. That said, if researchers discover anything identifiable in the data, they are encouraged to contact the dataset authors (john.edwards@usu.edu). Link 2022 CSV, PDF, py, tsv
CloudCoder GitHub CloudCoder CloudCoder is an open source web-based programming exercise system (inspired by CodingBat). It is designed to make it easy for instructors of introductory programming courses to assign short exercises to students for skills development and assessment. Currently, exercises in C/C++, Java, Python, and Ruby are supported. Link 2013 N/A
OLI Introductory Programming with Media DataShop OLI Introductory Programming with Media OLI Introductory Programming with Media - Fall 2010 Link 2010 Datashop
KC Modeling for Programming DataShop KC Modeling for Programming Step-by-step analysis of students solving introductory programming questions in Python Link 2016 rtf
Fall 2019 use of OpenDSA Formal Languages eTextbook DataShop Fall 2019 use of OpenDSA Formal Languages eTextbook Student use of the eTextbook, student perceptions, and performance on exams. Link 2019 CSV
E-learning Design Course Instances DataShop E-learning Design Course Instances E-learning Design Course Instances Link 2022 txt, Datashop, CSV
CodeBench CodeBench CodeBench CodeBench is a Programming Online Judge developed by the Institute of Computing (IComp) of the Federal University of Amazonas, Brazil. Through CodeBench, teachers can provide lists of programming exercises to their students, who in turn must develop solutions for each exercise through an embedded IDE. Once a student submits source code for a given exercise, the system instantly notifies the student whether their solution is correct or not. CodeBench automatically logs all actions performed by students in the embedded IDE during their attempts to solve the proposed exercises. This dataset contains all logs collected from CS1 students from 2016 to 2022. Link 2023 data
A Systematic Literature Review dataset Zenodo A Systematic Literature Review dataset How Creatively Are We Teaching and Assessing Creativity in Computing Education: A Systematic Literature Review Link 2021 xlsx
METRECC Africa 2020 data Apollo - University of Cambridge Repository METRECC Africa 2020 data This file includes the responses from the 58 study participants to the survey questions on demographics, years of teaching experience, qualifications, classroom time, topics covered in computer science teaching, capacity in terms of support and resources available and barriers experienced for professional development Link 2022 xls, txt, CSV
FalconCode FalconCode FalconCode FalconCode, a collection of over 1.5 million Python programs from over two thousand undergraduate students, captures over five semesters' worth of code samples from our introduction to computing course, which is taken by every student regardless of their academic major. Link 2022 0
IDE Action Log Dataset from a CS1 MOOC Zenodo IDE Action Log Dataset from a CS1 MOOC This is a dataset containing Integrated Development Environment (IDE) logs from an introductory programming MOOC. The dataset contains information on when actions in the IDE were performed in relation to deadlines over the different parts of the course. One exceptional aspect of the dataset is that part of the logs has been gathered at the keystroke level, allowing for fine-grained insight into the learning process. In addition to the IDE logs themselves, the dataset has information on whether students included in the data passed the course. This can facilitate further research that analyzes how time-related behavior relates to performance in introductory programming courses. Link 2017 CSV
Conventional vs a constructionist-Scratch programming instructions Mendeley Data Conventional vs a constructionist-Scratch programming instructions. The Conventional versus a constructionist-Scratch programming instructions and students achievements in higher education CS1 classes. Link 2022 xlsx
Dataset: Recursive problem solving in the online learning environment CodingBat by computer science students DZHW Dataset: Recursive problem solving in the online learning environment CodingBat by computer science students The data package has been gathered within the scope of the dissertation of the data provider, Natalie Kiesler. It focuses on informative feedback, its design, and its implementation in basic programming education Link 2017 xlsx, txt, PDF
Programming steps working group at ITiCSE'22 GitHub Programming steps working group at ITiCSE'22 The data is from an online introductory programming course using the Dart language. The students have varied backgrounds and study remotely. The course is available at https://fitech101.aalto.fi/fitech101/introduction-to-programming/ in Finnish. For the same reason, the programs in this data are likely to include, e.g., variable and function names in Finnish. The published data includes 2 assignments (see TaskDescription-A*.txt), with 5 students in assignment A79 and 20 students in assignment A81. Link 2022 CSV
Concept Map for Cybersecurity Courses GitLab Concept Map for Cybersecurity Courses Concept Map for Cybersecurity Courses Link 2019 cmap
Cybersecurity Literature Review Zenodo Cybersecurity Literature Review This paper discusses trends and implications for further research in cybersecurity education. Link 2019 xlsx, CSV
Distributed System Syllabi Zenodo Distributed System Syllabi The authors attempt to map 51 offerings of distributed systems courses from different schools to two popular curriculum initiatives. Link 2020 CSV
Supplementary materials for the paper Hyperstyle: A Tool for Assessing the Code Quality of Solutions to Programming Assignments Zenodo Supplementary materials for the paper Hyperstyle: A Tool for Assessing the Code Quality of Solutions to Programming Assignments Supplementary materials for the paper Hyperstyle: A Tool for Assessing the Code Quality of Solutions to Programming Assignments Link 2022 CSV
Group Work in Learning Programming DZHW Group Work in Learning Programming The research project "Digital Programming in Teams" (DiP-iT) investigates how collaborative learning in computer science studies can be didactically developed and supported with digital tools. The project focuses on the use and implementation of learning analytics methods. The DiP-iT project aims to develop didactic and technical support in computer science studies for learning to program in teams. In order to achieve this goal, the sub-study "Gruppenarbeit beim Programmieren lernen" (GAPL; engl: Group work in Learning Programming) first took stock of the initial situation at the three locations of the joint universities (TU Bergakademie Freiberg, Otto-von-Guericke University Magdeburg and Humboldt University Berlin). Interviews with lecturers (data set 1) and students (data set 2) of the three participating universities were conducted to determine the initial situation. It was analyzed to what extent cooperative and collaborative methods are already used in teaching to learn programming. Subsequently, it can be deduced, among other things, how collaborative programming learning can be supported in the future (technically, didactically, organizationally, etc.), where unused potentials lie and which obstacles must be considered in an implementation. The lecturers hold basic courses on learning to program in computer science and were asked, for example, about the extent to which they use group work in their courses and what opportunities and risks they see in group work. The students of computer science have already participated in basic courses on learning to program in computer science or are participating in them in the semester of the interview. Among other things, they were asked about the extent to which they used group work in the course, how they experienced group work, and what opportunities and risks they see in group work. Link 2023 docx
Discovering Misconceptions in formal methods using ITS OSF Discovering Misconceptions in formal methods using ITS In this data repository we store the data for the paper Discovering and quantifying misconceptions in formal methods using intelligent tutoring systems. The paper describes a quantitative study to analyze candidates for misconceptions at modeling with propositional logic, modal logic, and first-order logic. For this, we use data from the intelligent tutoring system Iltis (https://iltis.cs.tu-dortmund.de). This repository includes the following data: for each type of logic, which statements were to be modeled and which operators they contained (see this page), for each statement, its difficulty and popularity (see this page), and the data for the analyses discussed in the paper, that is the difficulty of single linguistic operators and typical mistakes (see this page and this page). Link 2022 CSV
CS1QA GitHub CS1QA Repository for CS1QA: A Dataset for assisting Code-based Question Answering in an Introductory Programming Course, published at NAACL 2022. The annotated data can be found in this repository under the folder data. Due to the size of the unannotated chat and code data, we are unable to upload them to GitHub. If you want to get the data, please email the author at changyoon.lee@kaist.ac.kr. The authors have made their best efforts to anonymize the data. However, there might be some personal or sensitive information left unfound in the dataset. By using the dataset, you agree to not abuse the personal or sensitive information that might be present in the dataset, and also to notify the authors at changyoon.lee@kaist.ac.kr once you come across such data. Link 2022 JSON, py, sh, txt
Artifacts of FSE-2017 paper on an Intelligent Tutoring System for Programming GitHub Artifacts of FSE-2017 paper on an Intelligent Tutoring System for Programming In our ESEC/FSE-17 paper titled A Feasibility Study of Using Automated Program Repair for Introductory Programming Assignments, we apply four state-of-the-art automated program repair (APR) tools to student programs collected from an Introductory C Programming course (CS-101) offered at the Indian Institute of Technology Kanpur (IIT-K). To overcome the low repair rate of APR tools (because student programs are often severely incorrect), we introduce a new repair policy and strategy tailored to programming tutoring (described in Section 6 of our paper), which is implemented in our toolchain. In this repository, we share the artifacts we used in our study. Our artifacts consist of (1) a dataset containing student programs, (2) the toolchain, and (3) user study materials (we conducted a user study with students and teaching assistants to see the feasibility of using APR tools). Link 2017 CSV, txt, c
Dataset of Program Source Codes Solving Unique Programming Exercises Generated by Digital Teaching Assistant Zenodo Dataset of Program Source Codes Solving Unique Programming Exercises Generated by Digital Teaching Assistant The programming exercises were automatically generated by the Digital Teaching Assistant (DTA) system that automates a massive Python programming course at MIREA – Russian Technological University (RTU MIREA). Source codes of the small programs grouped by the type of the solved task can be used for benchmarking source code classification and clustering algorithms. Moreover, the data can be used for training intelligent program synthesizers, or benchmarking mutation testing frameworks, and more applications are yet to be discovered. This dataset is a supplementary material for a paper entitled Dataset of Program Source Codes Solving Unique Programming Exercises Generated by Digital Teaching Assistant submitted to the MDPI Data journal. Link 2023 CSV
Dataset for the evaluation of student-level outcomes of a primary school Computer Science curricular reform Zenodo Dataset for the evaluation of student-level outcomes of a primary school Computer Science curricular reform Student learning and perception data from three studies with respectively 1384, 2433 and 1644 grade 3-6 students (ages 7-11) and their 83, 142 and 95 teachers. Link 2023 CSV
Unravelling the numerical and spatial underpinnings of computational thinking: a pre-registered replication study OSF Unravelling the numerical and spatial underpinnings of computational thinking: a pre-registered replication study Unravelling the numerical and spatial underpinnings of computational thinking: a pre-registered replication study Link 2022 CSV, xlsx

This dataset catalog is a compilation of open-source datasets in computing education, curated by the "Where is the data? Finding and reusing datasets in computing education" CompEd '23 working group. The working group aims to make research data more accessible and to encourage open data practices in the computing education research (CER) community. For more information, please refer to the working group's paper: Kiesler, Natalie, John Impagliazzo, Katarzyna Biernacka, Amanpreet Kapoor, Zain Kazmi, Sujeeth Goud Ramagoni, Aamod Sane, Keith Tran, Shubbhi Taneja, and Zihan Wu. "Where's the Data? Exploring Datasets in Computing Education." In Proceedings of the ACM Conference on Global Computing Education Vol 2, pp. 209-210. 2023.
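
For readers who want to search the catalog programmatically, the following is a minimal sketch, not part of the working group's materials. It assumes a handful of rows from the table above have been transcribed by hand into Python records. The `CatalogEntry` class, the `CATALOG` list, and the `find_entries` helper are hypothetical names introduced only for illustration; the titles, platforms, years, and formats are copied from the catalog rows.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class CatalogEntry:
    """One catalog row, reduced to the columns used for filtering (hypothetical type)."""
    title: str
    platform: str
    year: int
    formats: Tuple[str, ...]


# Illustrative subset, transcribed by hand from the catalog table.
CATALOG: List[CatalogEntry] = [
    CatalogEntry("Scratch Dataset", "GitHub", 2017, ("JSON",)),
    CatalogEntry("ShortAnswersIDSV", "Harvard Dataverse", 2022, ("tsv",)),
    CatalogEntry("CSEDM 2019 Data Challenge", "DataShop", 2019, ("CSV",)),
    CatalogEntry("2021 CS1 Keystroke Data", "Harvard Dataverse", 2022,
                 ("CSV", "PDF", "py", "tsv")),
    CatalogEntry("CS1QA", "GitHub", 2022, ("JSON", "py", "sh", "txt")),
]


def find_entries(catalog: List[CatalogEntry],
                 data_format: Optional[str] = None,
                 min_year: Optional[int] = None) -> List[CatalogEntry]:
    """Return entries that offer data_format (case-insensitive) and are not older than min_year."""
    wanted = data_format.lower() if data_format else None
    results = []
    for entry in catalog:
        # Skip entries that do not list the requested format.
        if wanted is not None and wanted not in [f.lower() for f in entry.formats]:
            continue
        # Skip entries published before the requested year.
        if min_year is not None and entry.year < min_year:
            continue
        results.append(entry)
    return results


if __name__ == "__main__":
    for entry in find_entries(CATALOG, data_format="csv", min_year=2019):
        print(f"{entry.year}  {entry.title} ({entry.platform})")
```

Matching formats case-insensitively is a deliberate choice here, because the catalog mixes spellings such as `CSV`, `tsv`, and `xlsx`.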