May 24, 2021 By Victoria Grdina
Computer science and engineering assistant professor Robert Dyer and his students, Samuel W. Flint and Jigyasa Chauhan, were honored with a 2021 Distinguished Paper Award.
Dyer and his students received the award at the 18th International Conference on Mining Software Repositories (MSR), which was held virtually this year, for their work investigating the use of time-based data.
The paper, titled "Escaping the Time Pit: Pitfalls and Guidelines for using Time-Based Git Data", was co-authored by Samuel W. Flint, Jigyasa Chauhan, and Robert Dyer, and will be presented by Flint.
The paper investigates how time-based data has been used over the course of the 16 years of the MSR conference. Time data appears in artifacts such as project metadata, Git repositories and commits, issue reports, and pull requests. The paper first surveys 690 prior research papers published at MSR to see how researchers have used time-based data in the past, finding that at least one-third of the papers used this kind of data. Because the survey showed that Git commit data from GitHub is the most commonly used kind of time data, the paper then investigates potential problems with Git commit data. Using the Boa infrastructure, the paper looks for problematic Git commit timestamps: invalid timestamps (e.g., the number 0 or -1), timestamps that appear to be too old (before version control was commonly used) or in the future, and commits that appear to have a date older than the parent commit's date. Overall, the investigation found thousands of potentially bad commit timestamps from hundreds of projects, often the result of automated tool use. The paper then recommends guidelines to help future researchers filter their data and avoid these potentially invalid timestamps, such as filtering out data from before January 2014 and removing a small set of projects that contain a large amount of bad data.
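To make the kinds of checks described above concrete, here is a minimal Python sketch. It is not the authors' code (the paper's analysis was run on Boa), and the `Commit` class, the `suspicious_reasons` function, and the exact cutoff handling are hypothetical names and choices made only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional

# Assumed cutoff for illustration, based on the January 2014 guideline above.
CUTOFF = datetime(2014, 1, 1, tzinfo=timezone.utc)


@dataclass
class Commit:
    """Hypothetical minimal commit record; timestamps are Unix epoch seconds."""
    sha: str
    timestamp: int
    parent_timestamp: Optional[int] = None


def suspicious_reasons(commit: Commit, now: datetime) -> List[str]:
    """Return the reasons, if any, that a commit timestamp looks suspicious."""
    reasons = []
    if commit.timestamp <= 0:
        # Values such as 0 or -1 are treated as invalid.
        reasons.append("invalid")
    elif commit.timestamp < CUTOFF.timestamp():
        reasons.append("before-cutoff")
    if commit.timestamp > now.timestamp():
        reasons.append("future")
    if (commit.parent_timestamp is not None
            and commit.timestamp < commit.parent_timestamp):
        reasons.append("older-than-parent")
    return reasons


def ts(year: int) -> int:
    """Helper: epoch seconds for January 1 of the given year (UTC)."""
    return int(datetime(year, 1, 1, tzinfo=timezone.utc).timestamp())


now = datetime(2021, 5, 24, tzinfo=timezone.utc)
commits = [
    Commit("a", 0),
    Commit("b", ts(2020)),
    Commit("c", ts(2030)),
    Commit("d", ts(2019), parent_timestamp=ts(2020)),
    Commit("e", ts(2010)),
]
# Keep only commits with no suspicious timestamp properties.
clean = [c.sha for c in commits if not suspicious_reasons(c, now)]
```

In this sketch only commit `b` survives the filter. A real study would also need to decide how to handle whole projects dominated by bad data, which the paper addresses separately.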
Dyer is a member of ESQuaReD, the Laboratory for Empirically-based Software Quality Research and Development. Flint, a co-author of the paper, is a first-year Ph.D. student working in the field of empirical software engineering. Chauhan, also a co-author, is a first-year master's student working in the field of empirical software engineering and data mining.
View the full presentation, paper preprint, conference schedule, and the paper's dataset.
The International Conference on Mining Software Repositories (MSR) is the premier international forum for researchers and practitioners from academia, industry, and government to present, discuss, and debate the most recent ideas, experiences, and challenges in mining software repository data, such as that found on GitHub.