GitDelver Enterprise Dataset (GDED): An Industrial Closed-source Dataset for Socio-Technical Research

Abstract

Conducting socio-technical software engineering research on closed-source software is difficult as most organizations do not want to give access to their code repositories. Most experiments and publications therefore focus on open-source projects, which only provides a partial view of software development communities. Yet, closing the gap between open and closed source software industries is essential to increase the validity and applicability of results stemming from socio-technical software engineering research. We contribute to this effort by sharing our work in a large company counting 4,800 employees. We mined 101 repositories and produced the GDED dataset containing socio-technical information about 106,216 commits, 470,940 file modifications and 3,471,556 method modifications from 164 developers during the last 13 years, using various programming languages. For that, we used GitDelver, an open-source tool we developed on top of Pydriller, and anonymized and scrambled the data to comply with legal and corporate requirements. Our dataset can be used for various purposes and provides information about code complexity, self-admitted technical debt, bug fixes, as well as temporal information. We also share our experience regarding the processing of sensitive data to help other organizations making datasets publicly available to the research community.

Type
Publication
19th International Conference on Mining Software Repositories (MSR ‘22)
Nicolas Riquet
Nicolas Riquet
PhD Student

I am the head of Software Engineering Productivity at Le Forem, a public service for employment and vocational training in Wallonia, Belgium, where I lead a team of 21 employees and consultants. My research goal is to develop a holistic approach to socio-technical debt by designing a framework for helping developers and managers to address software debt and prioritize mitigating actions.

Xavier Devroey
Xavier Devroey
Assistant Professor of Software Testing

My research goal is to to ease software testing by exploring new paths to achieve a high level of automation for test case design, generation, selection, and prioritization. My main research interests include search-based and model-based software testing, test suite augmentation, DevOps, and variability-intensive systems.

Benoît Vanderose
Benoît Vanderose
Assistant Professor of Software Engineering