Dataset Catalog

Title	Platform/Publisher	Dataset Name	Description	URL	Year	Data Formats
iSnap - Introductory Programming	DataShop	iSnap - Introductory Programming	iSnap logs all student actions to a remote database, including any interactions with the user interface and coding area. It also logs complete snapshots of students’ code after each edit, allowing for complete replay of a student’s actions within the environment.	Link	2017	txt, ProgSnap
Scratch Dataset	GitHub	Scratch Dataset	A dataset of 250K recent Scratch projects from 100K different authors scraped from the Scratch project repository. We processed the projects' source code and metadata to encode them into a database that facilitates querying and further analysis	Link	2017	JSON
ShortAnswersIDSV	Harvard Dataverse	ShortAnswersIDSV	This data set contains exam questions and answers from an introductory course to computer science	Link	2022	tsv
Supplementary data of user study	DataverseNO	Supplementary data for study	Supplementary data for study: Challenges Faced by Teaching Assistants in Computer Science Education Across Europe. This data includes the themes, sub-themes, codes and exemplary quotes from the analysis of reflection essays for the study "Challenges Faced by TAs in CS Education Across Europe".	Link	2021	tsv, txt
CSEDM 2019 Data Challenge	DataShop	CSEDM 2019 Data Challenge	The dataset used in the challenge comes from a study of novice Python programmers working with the ITAP intelligent tutoring system. There are 89 total students represented, and they worked on 38 problems over time. The study lasted over 7 weeks. The students could attempt the problems in any order, though there was a default order. Students could attempt the problem any number of times, receiving feedback from test cases each time, and they could also request hints from ITAP (though access was limited for students, depending on the week and their experimental condition). The dataset itself contains a record for each attempt and hint request that students made while working.	Link	2019	CSV
CodeWorkout data Spring 2019	DataShop	CodeWorkout data Spring 2019	CodeWorkout data from coding exercises, Spring 2019	Link	2019	CSV
Supplementary data for study: Study Behaviors and Educational Design	DataverseNO	Supplementary data for study: Study Behaviors and Educational Design	Supplementary data for study: Understanding the Relation Between Study Behaviors and Educational Design (Study 1). It has been identified that the first-year experience is crucial to student motivation and throughput of study programs, therefore it is interesting to look at the state of the art of computer science study programs in Norway. This data is part of a PhD project and relates to Study 1. In this study we present a survey and study of the number of undergraduate computer science programs in Norway and map their characteristics in order to gather an up-to-date overview of the selection of programs. Through a systematic review of all Norwegian undergraduate programs using data from national databases we have found that there are 12 institutions offering 56 different programs in Norway in 2018. The study showed that the characteristics of these programs vary, that is, the amount of computer science courses during the first year, the number of students, admission requirements, student satisfaction and time commitment. This article presents these findings along with an analysis of what characteristics impact the students’ contentment and learning experience.	Link	2021	CSV, txt
Code Hunt	GitHub	Code Hunt	Code Hunt is a serious education game which has been played by over 140,000 students and enthusiasts over the past year. In the process we have collected over 1.5M programs, which we can link to specific users at specific levels of expertise. We hope that researchers will embark on research into the data, discovering how coders code and how technology can be used to make the process more accurate and less painful. Although there has been research on how students code in the past, Microsoft Research is offering a unique opportunity to conduct research on large, common data sets. This preview data set contains the programs written by students (only) worldwide during a contest over 48 hours. There are approximately 250 users, 24 puzzles and about 13,000 programs.	Link	2015	cs, Java, py
2019 CS1 Keystroke Data	Harvard Dataverse	2019 CS1 Keystroke Data	Keystroke data collected from CS1 student participants during 2019 at Utah State University. See readme.txt for detailed information. This dataset has undergone deidentification, though it is possible, being a complex, temporal, and ephemeral dataset, that identifying keystrokes may have been missed. Ethical use of this dataset includes avoiding attempts at reconstructing identities. That said, if researchers discover anything identifiable in the data, they are encouraged to contact the dataset authors (john.edwards@usu.edu).	Link	2019	CSV
2021 CS1 Keystroke Data	Harvard Dataverse	2021 CS1 Keystroke Data	Keystroke data collected from CS1 student participants during fall 2021 semester at Utah State University. See readme.txt for detailed information. This dataset has undergone deidentification, though it is possible, being a complex, temporal, and ephemeral dataset, that identifying keystrokes may have been missed. Ethical use of this dataset includes avoiding attempts at reconstructing identities. That said, if researchers discover anything identifiable in the data, they are encouraged to contact the dataset authors (john.edwards@usu.edu).	Link	2022	CSV, PDF, py, tsv
CloudCoder	GitHub	CloudCoder	CloudCoder is an open source web-based programming exercise system (inspired by CodingBat). It is designed to make it easy for instructors of introductory programming courses to assign short exercises to students for skills development and assessment. Currently, exercises in C/C++, Java, Python, and Ruby are supported.	Link	2013	N/A
OLI Introductory Programming with Media	DataShop	OLI Introductory Programming with Media	OLI Introductory Programming with Media - Fall 2010	Link	2010	DataShop
KC Modeling for Programming	DataShop	KC Modeling for Programming	Step-by-step analysis of students solving introductory programming questions in Python	Link	2016	rtf
Fall 2019 use of OpenDSA Formal Languages eTextbook	DataShop	Fall 2019 use of OpenDSA Formal Languages eTextbook	Student use of the eTextbook, student perceptions, and performance on exams	Link	2019	CSV
E-learning Design Course Instances	DataShop	E-learning Design Course Instances	E-learning Design Course Instances	Link	2022	txt, DataShop, CSV
CodeBench	CodeBench	CodeBench	CodeBench is a Programming Online Judge developed by the Institute of Computing (IComp) of the Federal University of Amazonas, Brazil. Through CodeBench, teachers can provide lists of programming exercises to their students, who in turn must develop solutions for each exercise through an embedded IDE. Once a student submits source code for a given exercise, the system instantly notifies the student whether their solution is correct. CodeBench automatically logs all actions performed by students in the embedded IDE during their attempts to solve the proposed exercises. This dataset contains all logs collected from CS1 students from 2016 to 2022.	Link	2023	data
A Systematic Literature Review dataset	Zenodo	A Systematic Literature Review dataset	How Creatively Are We Teaching and Assessing Creativity in Computing Education: A Systematic Literature Review	Link	2021	xlsx
METRECC Africa 2020 data	Apollo - University of Cambridge Repository	METRECC Africa 2020 data	This file includes the responses from the 58 study participants to the survey questions on demographics, years of teaching experience, qualifications, classroom time, topics covered in computer science teaching, capacity in terms of support and resources available and barriers experienced for professional development	Link	2022	xls, txt, CSV
FalconCode	FalconCode	FalconCode	FalconCode, a collection of over 1.5 million Python programs from over two thousand undergraduate students, captures over five semesters' worth of code samples from our introduction to computing course, which is taken by every student regardless of their academic major.	Link	2022	0
IDE Action Log Dataset from a CS1 MOOC	Zenodo	IDE Action Log Dataset from a CS1 MOOC	This is a dataset containing Integrated Development Environment (IDE) logs from an introductory programming MOOC. The dataset contains information on when actions in the IDE were performed in relation to deadlines over the different parts of the course. One exceptional aspect of the dataset is that part of the logs have been gathered at the keystroke level, allowing for fine-grained insight into the learning process. In addition to the IDE logs themselves, the dataset has information on whether students included in the data passed the course. This can facilitate further research that analyzes how time-related behavior relates to performance in introductory programming courses.	Link	2017	CSV
Conventional vs a constructionist-Scratch programming instructions	Mendeley Data	Conventional vs a constructionist-Scratch programming instructions	Conventional versus constructionist-Scratch programming instruction and students' achievements in higher education CS1 classes.	Link	2022	xlsx
Dataset: Recursive problem solving in the online learning environment CodingBat by computer science students	DZHW	Dataset: Recursive problem solving in the online learning environment CodingBat by computer science students	The data package has been gathered within the scope of the dissertation of the data provider, Natalie Kiesler. It focuses on informative feedback, its design, and its implementation in basic programming education	Link	2017	xlsx, txt, PDF
Programming steps working group at ITiCSE'22	GitHub	Programming steps working group at ITiCSE'22	The data is from an online introductory programming course using the Dart language. The students have varied backgrounds and study at a distance. The course is available at https://fitech101.aalto.fi/fitech101/introduction-to-programming/ in Finnish. For the same reason, the programs in this data are likely to include, e.g., variable and function names in Finnish. The published data includes 2 assignments (see TaskDescription-A.txt): 5 students in the A79 assignment and 20 students in the A81 assignment.	Link	2022	CSV
Concept Map for Cybersecurity Courses	GitLab	Concept Map for Cybersecurity Courses	Concept Map for Cybersecurity Courses	Link	2019	cmap
Cybersecurity Literature Review	Zenodo	Cybersecurity Literature Review	This paper discusses trends and implications for further research in cybersecurity education.	Link	2019	xlsx, CSV
Distributed System Syllabi	Zenodo	Distributed System Syllabi	The authors map 51 offerings of distributed systems courses from different schools to two popular curriculum initiatives.	Link	2020	CSV
Supplementary materials for the paper Hyperstyle: A Tool for Assessing the Code Quality of Solutions to Programming Assignments	Zenodo	Supplementary materials for the paper Hyperstyle: A Tool for Assessing the Code Quality of Solutions to Programming Assignments	Supplementary materials for the paper Hyperstyle: A Tool for Assessing the Code Quality of Solutions to Programming Assignments	Link	2022	CSV
Group Work in Learning Programming	DZHW	Group Work in Learning Programming	The research project "Digital Programming in Teams" (DiP-iT) investigates how collaborative learning in computer science studies can be didactically developed and supported with digital tools. The project focuses on the use and implementation of learning analytics methods. The DiP-iT project aims to develop didactic and technical support in computer science studies for learning to program in teams. In order to achieve this goal, the sub-study "Gruppenarbeit beim Programmieren lernen" (GAPL; engl: Group work in Learning Programming) first took stock of the initial situation at the three locations of the joint universities (TU Bergakademie Freiberg, Otto-von-Guericke University Magdeburg and Humboldt University Berlin). Interviews with lecturers (data set 1) and students (data set 2) of the three participating universities were conducted to determine the initial situation. It was analyzed to what extent cooperative and collaborative methods are already used in teaching to learn programming. Subsequently, it can be deduced, among other things, how collaborative programming learning can be supported in the future (technically, didactically, organizationally, etc.), where unused potentials lie and which obstacles must be considered in an implementation. The lecturers hold basic courses on learning to program in computer science and were asked, for example, about the extent to which they use group work in their courses and what opportunities and risks they see in group work. The students of computer science have already participated in basic courses on learning to program in computer science or are participating in them in the semester of the interview. Among other things, they were asked about the extent to which they used group work in the course, how they experienced group work, and what opportunities and risks they see in group work.	Link	2023	docx
Discovering Misconceptions in formal methods using ITS	OSF	Discovering Misconceptions in formal methods using ITS	In this data repository we store the data for the paper Discovering and quantifying misconceptions in formal methods using intelligent tutoring systems. The paper describes a quantitative study to analyze candidates for misconceptions in modeling with propositional logic, modal logic, and first-order logic. For this, we use data from the intelligent tutoring system Iltis (https://iltis.cs.tu-dortmund.de). This repository includes the following data: for each type of logic, which statements were to be modeled and which operators they contained; for each statement, its difficulty and popularity; and the data for the analyses discussed in the paper, that is, the difficulty of single linguistic operators and typical mistakes.	Link	2022	CSV
CS1QA	GitHub	CS1QA	Repository for CS1QA: A Dataset for assisting Code-based Question Answering in an Introductory Programming Course, published at NAACL 2022. The annotated data can be found in this repository under the folder data. Due to the size of the unannotated chat and code data, we are unable to upload them to GitHub. If you want to get the data, please email the author at changyoon.lee@kaist.ac.kr. The authors have made their best efforts to anonymize the data. However, there might be some personal or sensitive information left unfound in the dataset. By using the dataset, you agree to not abuse the personal or sensitive information that might be present in the dataset, and also notify the authors at changyoon.lee@kaist.ac.kr once you come across such data.	Link	2022	JSON, py, sh, txt
Artifacts of FSE-2017 paper on an Intelligent Tutoring System for Programming	GitHub	Artifacts of FSE-2017 paper on an Intelligent Tutoring System for Programming	In our ESEC/FSE-17 paper titled A Feasibility Study of Using Automated Program Repair for Introductory Programming Assignments, we apply four state-of-the-art automated program repair (APR) tools to student programs collected from an Introductory C Programming course (CS-101) offered at the Indian Institute of Technology Kanpur (IIT-K). To overcome the low repair rate of APR tools (because student programs are often severely incorrect), we introduce a new repair policy and strategy tailored to programming tutoring (described in Section 6 of our paper), which is implemented in our toolchain. In this repository, we share the artifacts we used in our study. Our artifacts consist of (1) a dataset containing student programs, (2) a toolchain, and (3) user study materials (we conducted a user study with students and teaching assistants to assess the feasibility of using APR tools).	Link	2017	CSV, txt, c
Dataset of Program Source Codes Solving Unique Programming Exercises Generated by Digital Teaching Assistant	Zenodo	Dataset of Program Source Codes Solving Unique Programming Exercises Generated by Digital Teaching Assistant	The programming exercises were automatically generated by the Digital Teaching Assistant (DTA) system that automates a massive Python programming course at MIREA – Russian Technological University (RTU MIREA). Source codes of the small programs grouped by the type of the solved task can be used for benchmarking source code classification and clustering algorithms. Moreover, the data can be used for training intelligent program synthesizers, or benchmarking mutation testing frameworks, and more applications are yet to be discovered. This dataset is a supplementary material for a paper entitled Dataset of Program Source Codes Solving Unique Programming Exercises Generated by Digital Teaching Assistant submitted to the MDPI Data journal.	Link	2023	CSV
Dataset for the evaluation of student-level outcomes of a primary school Computer Science curricular reform	Zenodo	Dataset for the evaluation of student-level outcomes of a primary school Computer Science curricular reform	Student learning and perception data from three studies with respectively 1384, 2433 and 1644 grade 3-6 students (ages 7-11) and their 83, 142 and 95 teachers.	Link	2023	CSV
Unravelling the numerical and spatial underpinnings of computational thinking: a pre-registered replication study	OSF	Unravelling the numerical and spatial underpinnings of computational thinking: a pre-registered replication study	Unravelling the numerical and spatial underpinnings of computational thinking: a pre-registered replication study	Link	2022	CSV, xlsx