Mining Ultra-Large-Scale Software Repositories

Consider answering a question such as "what is the average number of changed files per revision (the churn rate) for all projects?" Answering this question ordinarily requires knowledge of (at a minimum):

  • how to mine project metadata,
  • how to locate code repositories,
  • how to access those code repositories,
  • how to write additional filtering code,
  • how to write the controller logic,
  • ...

Solving this task in Boa is much easier:

# what are the churn rates for all projects
p: Project = input;
counts: output mean[string] of int;

visit(p, visitor {
	before node: Revision -> counts[p.id] << len(node.files);
});

First, we declare the input to be of type Project and give it the alias p. Next, we declare the output variable counts, which accepts values of type int, is indexed by a string, and computes the mean of all values it receives. Then we visit the input data, traversing each code repository and each revision in those repositories. Finally, whenever the traversal reaches a Revision, we send the number of files changed in that revision to the output variable, indexed by the project's id. The resulting mean per project is its churn rate.
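The same pattern generalizes to other metrics by changing the aggregator or the value that is emitted. As a rough sketch (our own illustrative example, not taken from the Boa documentation; the name revCounts is arbitrary), using a sum aggregator and emitting 1 per revision counts the revisions in each project instead:

# how many revisions does each project have? (illustrative sketch)
p: Project = input;
revCounts: output sum[string] of int;

visit(p, visitor {
	before node: Revision -> revCounts[p.id] << 1;
});

In both cases the output variable is write-only from the program's perspective: values sent to it with << are combined by the declared aggregator, which is what lets Boa evaluate the query over many projects in parallel.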

Boa has over 1,400 registered users from 36 countries, and it has been used in over 50 research papers and 15 theses.