# ESEC/FSE 2023 Tutorial: Performing Large-scale Mining Studies, From Start to Finish

### Authors: Robert Dyer and Samuel W. Flint

This is a half-day tutorial on how to use Boa and Boa's study template to
perform a software repository mining study, from start to finish.  By the
end of the tutorial, users should be able to use the Boa infrastructure,
write Boa queries, download and analyze the output from Boa queries, and
publish their full study as a replication package on Zenodo.

### Helpful Links

* The Boa website:
  * https://boa.cs.iastate.edu/boa
* The Boa Programming Guide:
  * https://boa.cs.iastate.edu/docs/index.php
* Download Visual Studio Code:
  * https://code.visualstudio.com/download
* Download the Boa extension for VS Code:
  * https://marketplace.visualstudio.com/items?itemName=Boa.boalang
* The slides for the tutorial:
  * https://go.unl.edu/fse22-slides

## Hands-on Task 0: Boa+VS Code

The goal of this task is to get you familiar with interacting with Boa, using
the VS Code extension.  The first step is to [install Visual Studio
Code](https://code.visualstudio.com/download), if you have not already done so.
Once you have VS Code installed, be sure to [install the Boa
extension](https://marketplace.visualstudio.com/items?itemName=Boa.boalang).
You can do this by going to the extensions tab and searching for "boalang".
Make sure to also install the Python packages we rely on, with `pip3 install -r
requirements.txt` in the tutorial directory.

Once the extension is installed, you can view your list of Boa jobs by opening
the Explorer panel.  There should be a "Boa: Recent Jobs" view.  Hit the
refresh button.  The IDE extension will ask you for your Boa username and
password.  It stores both of these, so you should not need to authenticate
again.

You will also want to edit the file [`.env`](.env).  This file stores some
environment variables, including access names/tokens for the Boa and Zenodo
APIs.  First, add your Boa user to the environment:

```py
BOA_API_USER='<USERNAME_HERE>'
```

Next you can add your Boa password, though note this is **not secure** and we
generally do not recommend it!  You can add the password to the environment:

```py
BOA_API_PW='<PASSWORD_HERE>'
```

A better way to store your password is using your OS keyring.  You can install
the `keyring` Python package (`pip3 install keyring`) and then add your
password to the keyring by running `keyring set boaapi <username>` and
providing the password.  Then, as long as your keyring is unlocked, the scripts
should not prompt you for any credentials.
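
The lookup order described above (password from `.env` if present, otherwise
the OS keyring) can be sketched roughly as follows.  This is only an
illustration and not the study template's actual code: `parse_env` and
`find_password` are hypothetical helper names, and the keyring fallback
assumes the `keyring` package is installed and a backend is available.

```python
def parse_env(text):
    """Parse KEY='value' lines from .env-style text into a dict."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def find_password(env, username):
    """Return the password from env, else from the OS keyring, else None."""
    if env.get("BOA_API_PW"):
        return env["BOA_API_PW"]
    try:
        import keyring  # third-party; only needed for the keyring fallback
    except ImportError:
        return None
    # 'boaapi' is the service name used in the `keyring set` command above.
    return keyring.get_password("boaapi", username)


env = parse_env("BOA_API_USER='alice'\nBOA_API_PW='s3cret'\n")
print(env["BOA_API_USER"], find_password(env, env["BOA_API_USER"]) is not None)
```

Leaving `BOA_API_PW` out of `.env` is what makes the keyring path take effect
in this sketch.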

Now that you should have a working environment, let's try to run a Boa query and
view the output of the query.

Open the files [`boa/queries/task0/part1.boa`](boa/queries/task0/part1.boa) and
[`boa/queries/task0/part2.boa`](boa/queries/task0/part2.boa).  You can run each
of these by using the "Download Output" button in
[`study-config.json`](study-config.json) for each output file.

----

## Looking at the Study Template

This work is based on the [Boa Study
Template](https://github.com/boalang/study-template).  Documentation about the
study template is available in [README.md](README.md).

----

## Hands-on Task 1: A Simple Boa Query

Start by opening the file
[`boa/queries/task1/part1.boa`](boa/queries/task1/part1.boa).

Try to understand what the query is doing.  What are the key components?  What
is the logic?  Run the query.  What is the output?

You'll notice that there is no output.  This is because the dataset used does
not have any suspiciously old commits.  Create a new dataset in
[`study-config.json`](study-config.json) that points to `2019 October/GitHub
(medium)` and change the query to use this.  Try running the query again.

Now, copy [`boa/queries/task1/part1.boa`](boa/queries/task1/part1.boa) to
[`boa/queries/task1/part2.boa`](boa/queries/task1/part2.boa).  Modify the
visitor to only count commits which have at least one source file.  If you want
to run the new query, you'll need to make sure that you add a new entry to
`queries` in [`study-config.json`](study-config.json).  If you have an issue
downloading [`data/txt/task1/part2.txt`](data/txt/task1/part2.txt) because
`jsonschema` cannot be found, comment out line 15 of [`Makefile`](Makefile).

**Hint:** Use `exists` to check if there is a source file in the visitor.

----

## Hands-on Task 2: Adding a Second Query

Let's start by opening
[`boa/queries/task2/task2.boa`](boa/queries/task2/task2.boa).  We have four
tasks to complete, but we will do them in an indirect order.

The first thing we should do is decide what the output should look like.  We're
already provided with an analysis script,
[`analyses/task2.py`](analyses/task2.py), which tells us what is expected.  What
should we output?

Now that we've decided on output, how should we generate it?  What do we want to
look at?  We need snapshots (how do those work?) and only valid Java files.  We
also need to look at only ENUMs: how are they modeled?  Finally, at what level
do we need to count?

We need a visitor for `input`, and will only want the latest revision of each
file.  For this, use the `snapshot` snippet.

Enums are modeled as Declarations, so we need to visit all Declarations, and if
it is an enum (`TypeKind.ENUM`), count it.

Finally, we need to count at the *file* level.  To do this, before files, we
should reset count to 0, and after files, if count is positive, output it.

This brings a slight problem: where is `count` defined?  We should make it global.
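
To make the reset/count/emit logic concrete, here is a rough Python analogue.
It is not Boa code and does not use Boa's types: `Declaration`, `SourceFile`,
and `count_enums_per_file` are made-up names that only mimic the shape of the
traversal, with `count` shared across the whole visit the way a global would
be in the query.

```python
from dataclasses import dataclass, field


@dataclass
class Declaration:
    kind: str  # e.g. "ENUM" or "CLASS"
    nested: list = field(default_factory=list)


@dataclass
class SourceFile:
    name: str
    declarations: list


def count_enums_per_file(files):
    """Return {file name: enum count} for files with at least one enum."""
    output = {}
    count = 0  # shared by the whole traversal, like a global in the query

    def visit(decl):
        nonlocal count
        if decl.kind == "ENUM":
            count += 1
        for child in decl.nested:  # nested declarations are visited too
            visit(child)

    for f in files:
        count = 0  # "before" the file: reset
        for decl in f.declarations:
            visit(decl)
        if count > 0:  # "after" the file: emit only if positive
            output[f.name] = count
    return output


files = [
    SourceFile("A.java", [Declaration("CLASS", [Declaration("ENUM")]),
                          Declaration("ENUM")]),
    SourceFile("B.java", [Declaration("CLASS")]),
]
print(count_enums_per_file(files))  # {'A.java': 2}
```

Note that `B.java` produces no output at all, which is why the check for a
positive count happens after each file rather than once at the end.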

Let's go ahead and run the analysis now.

The Boa query runs, but the analysis fails: we're missing a necessary input,
`dupes.csv`!  This is generated from a Boa query (which is provided); we just
need to tell [`study-config.json`](study-config.json) how to build it.  Add the
following to `queries`:

```json
    "hashes.txt": {
      "query": "queries/hashes.boa",
      "dataset": "java",
      "processors": {
        "gendupes.py": {
          "output": "data/txt/dupes.txt",
          "cacheclean": [
            "*-deduped.parquet"
          ],
          "csv": "dupes.csv"
        }
      }
    },
```

Now try running the analysis again!

----

## Hands-on Task 3: Analyzing Boa Outputs with Pandas

We've written a Boa query, but now we want to analyze the results.  We're going
to be (partially) replicating the results of an ICSE 2014 study.  The goal is to
use Python to generate this table:

![Table from ICSE 2014](boa/queries/task3/task3goal.png)

Let's start by opening the file [`analyses/task3.py`](analyses/task3.py).  This
is a skeleton.  The TODOs are provided in the file, but we'll work through them
in a slightly different order.  It will be helpful to look at the other analyses
for several parts of this.

### TODO 0

Declare an analysis in `analyses` in the
[`study-config.json`](study-config.json).  What files are necessary?

**Hint:** Look at described queries to determine the correct inputs.

### TODO 1

We need to get data about totals into Python.  Loading data is best accomplished
with the `get_df` function from `common.df`.  Should it be deduped?  But we'll
also need to provide some information and ensure the structure of the dataframe
makes sense.  It may be helpful to use `drop` and `names` arguments to `get_df`,
to ensure the structure is correct.

We'll then need to get the total number of projects and files.  Try using
`df.loc` to do so.
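
If you have not used `df.loc` before, here is a small sketch of label-based
lookup.  The dataframe is a hand-built stand-in with hypothetical labels
(`projects`, `files`, `value`); the frame you get back from `get_df` will have
whatever structure you gave it.

```python
import pandas as pd

# Hypothetical stand-in for the totals output; real labels may differ.
totals = pd.DataFrame(
    {"value": [1200, 45000]},
    index=pd.Index(["projects", "files"], name="measure"),
)

# .loc[row_label, column_label] returns a single scalar.
total_projects = int(totals.loc["projects", "value"])
total_files = int(totals.loc["files", "value"])
print(total_projects, total_files)
```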

### TODO 2

Now we need the task 3 data:  using `get_df`, what arguments need to be passed?
Should we drop a column?

### TODO 6

Now, let's output the dataframe as a LaTeX table.  We aren't done, of course,
but we can show *something*.  There are two things:

 1. Highlight row names and column names, and
 2. Output the table to a file.
 
To do the first, we'll want to get and modify a styler object.

For the second, let's use the `save_table` function from `common.tables`.

### TODO 4

The output table is quite long.  If we want to make it more readable, and closer
to the example output, we'll need to pivot it.

This can be done using `pd.pivot_table`.
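
As a quick illustration of what pivoting does, the sketch below turns a long
table into a wide one.  The column names (`feature`, `measure`, `count`) and
values are invented for the example and are not the task's real data.

```python
import pandas as pd

# Hypothetical long-format data: one row per (feature, measure) pair.
long_df = pd.DataFrame({
    "feature": ["Enums", "Enums", "Generics", "Generics"],
    "measure": ["projects", "files", "projects", "files"],
    "count": [10, 40, 25, 300],
})

# One row per feature, one column per measure.
wide_df = pd.pivot_table(long_df, values="count", index="feature",
                         columns="measure", aggfunc="sum")
print(wide_df)
```

The resulting row and column labels come out sorted, which is usually not the
order you want in a paper.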

### TODO 5

Column names and row names are out of order and difficult to read.  How can we
fix this?

**Hint:** Use `.rename()` and `.reindex()`.
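
A minimal sketch of the two calls together, on a made-up frame with
hypothetical labels: `.reindex()` picks the order using the existing labels,
and `.rename()` then swaps in the display names.

```python
import pandas as pd

# Hypothetical wide table whose rows and columns are in the wrong order.
wide_df = pd.DataFrame(
    {"files": [300, 40], "projects": [25, 10]},
    index=["generics", "enums"],
)

tidy = (
    wide_df
    # Reorder rows and columns by their current labels.
    .reindex(index=["enums", "generics"], columns=["projects", "files"])
    # Then replace the labels with human-readable names.
    .rename(index={"enums": "Enums", "generics": "Generics"},
            columns={"projects": "Projects", "files": "Files"})
)
print(tidy)
```

Reindexing before renaming keeps the ordering lists in terms of the raw
labels; doing it the other way around also works if the lists use the new
names.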

### TODO 3

All we see are the raw values.  How can we change them?

**Hint:** Use `.apply()` to change values in a column.

**Hint:** Pass in a function that takes a Series, and returns a new, updated
series.

**Hint:** Think carefully about the logic behind how new values are determined.
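
To show the mechanics only, here is a sketch of `.apply()` with a function
that receives one column as a Series and returns a new Series.  The totals and
the percentage formatting are assumptions made for this example; the rule you
need for the actual table may be different.

```python
import pandas as pd

counts = pd.DataFrame({"projects": [10, 25], "files": [40, 300]},
                      index=["Enums", "Generics"])
totals = {"projects": 100, "files": 1000}  # hypothetical totals


def as_percent(column):
    """Take one column (a Series) and return a new Series of strings."""
    share = column / totals[column.name] * 100
    return share.map(lambda pct: f"{pct:.1f}%")


# With the default axis=0, the function is called once per column.
formatted = counts.apply(as_percent)
print(formatted)
```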

----

## Hands-on Task 4: Publishing Replication Packages (Zenodo)

Conferences are moving more towards open science.  ESEC/FSE's research track
has an open science policy that states: "The research track of ESEC/FSE has
introduced an open science policy. Openness in science is key to fostering
scientific progress via transparency, reproducibility, and replicability. The
steering principle is that all research results should be accessible to the
public, if possible, and that empirical studies should be reproducible. In
particular, we actively support the adoption of open data and open source
principles and encourage all contributing authors to disclose (anonymized and
curated) data to increase reproducibility and replicability."

This is a welcome advance and researchers should embrace it with open arms!
Yet it does add some additional burden on researchers submitting technical
papers, as they need to be careful with tracking all artifacts used, ensuring
the data is allowed to be made public (and properly anonymized), and spending
some effort to make replication packages that are well organized and easy for
future researchers to navigate.

In this part of the tutorial, we aim to show researchers that, if they have used
Boa and Boa's study template to design their research study, they can easily
publish their research artifacts to Zenodo.

Boa's study template has built-in support for generating replication packages
(archived as a series of ZIP files).  It also has support for interacting with
Zenodo to publish those packages in a manner that also supports the
double-blind peer review process.

### Obtain Zenodo API Access

The first step is to make a user on Zenodo.  For purposes of the tutorial, we
are going to utilize Zenodo's sandbox server so none of the artifact(s) we
publish are permanently published.  This allows us to delete them later if we
wish, unlike an actual Zenodo record that is permanent once published.  We also
need to generate an access token so we can use Zenodo's API.  Follow these
steps:

1. Go to: https://sandbox.zenodo.org/
2. Make a user (you can't use OAuth logins - make an actual user!)
3. Go to your user’s "Applications" settings
4. Click "+New token" to create a new Personal Access Token
5. Give it a name - any name works, but maybe "Boa" since we are using it from
   Boa's study template
6. Select the first two scopes: `deposit:actions` and `deposit:write`
7. Click "Create"
8. Copy the generated token somewhere for later - you can’t access it again!

Now that you have an access token for Zenodo, you can tell the study template
about it and start using it.  Edit the [`.env`](.env) file to put your token:

```py
ZENODO_API_TOKEN='<paste here>'
ZENODO_API_ENDPOINT='https://sandbox.zenodo.org'
```

Now the scripts should be able to communicate with/manage your Zenodo record!
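
For the curious, the sketch below shows roughly how these two values fit
together in a call to Zenodo's deposit API.  It only builds the request and
never sends it, and it is not the study template's own code:
`build_deposit_request` is a hypothetical helper.

```python
from urllib.request import Request


def build_deposit_request(endpoint, token):
    """Build (but do not send) a request listing the user's depositions."""
    url = endpoint.rstrip("/") + "/api/deposit/depositions"
    # Zenodo accepts the personal access token as a Bearer token.
    return Request(url, headers={"Authorization": f"Bearer {token}"})


req = build_deposit_request("https://sandbox.zenodo.org", "<paste here>")
print(req.full_url)
```

Because the endpoint is read from `.env`, switching from the sandbox to the
real Zenodo server later should only require changing `ZENODO_API_ENDPOINT`.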

### Edit Zenodo Metadata

Before you upload your artifact, you need to provide the metadata associated
with it.  To do so, first edit the [`.zenodo.json`](.zenodo.json) file.  This
is the file that will contain all of your Zenodo record's metadata.

This file can be shared and by default is stored in Git.  If one does not
exist, the first time you run the command one will be generated and it will
stop processing to allow you time to edit it.  This file contains the metadata
for your record, including things like the title, description, creators, and
license info.  By default we selected CC-BY-4.0 as the license, but feel
free to change it if needed.

For a double-blinded submission, you will want to ensure the creators are
listed as anonymous and the access rights set to "open":

```json
    "creators": [
        {
            "affiliation": "Anonymous",
            "name": "Anonymous"
        }
    ],
```

After your paper is published, you can update the metadata and re-run the
script to have it publish with your actual author name(s).  For now, these two
settings (anonymous names + open access) ensure you are properly blinding the
record while your paper is under review.
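
If you want a quick sanity check before uploading, something like the sketch
below could flag metadata that is not yet blinded.  `blinding_problems` is a
hypothetical helper, and it assumes the metadata is a flat JSON object with
`creators` and an `access_right` key (the field name Zenodo's deposit metadata
uses); adjust it if your `.zenodo.json` is structured differently.

```python
import json


def blinding_problems(metadata):
    """Return a list of reasons the metadata is not ready for blind review."""
    problems = []
    creators = metadata.get("creators", [])
    if not creators:
        problems.append("no creators listed")
    for creator in creators:
        for key in ("name", "affiliation"):
            if creator.get(key) != "Anonymous":
                problems.append(
                    f"creator {key} is not anonymous: {creator.get(key)!r}")
    # Assumed key name; see the lead-in above.
    if metadata.get("access_right") != "open":
        problems.append("access_right is not 'open'")
    return problems


sample = json.loads("""
{
    "creators": [{"affiliation": "Anonymous", "name": "Anonymous"}],
    "access_right": "open"
}
""")
print(blinding_problems(sample))  # []
```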

For more details on the metadata JSON format used, see this link:
https://developers.zenodo.org/#representation

### Deposit the Artifacts

Once the metadata is properly set, you can upload your artifact to Zenodo.  To
do so, first run:

> `make package`

This will generate the ZIP files that contain your artifact.  By default, the
template will generate three zip files: `replication-pkg.zip` that contains all
queries, scripts, build system, etc., but **no data**, `data.zip` that contains
the raw TXT files from the queries, and `data-cache.zip` that contains the
cached Parquet files.  We generate three zips as some of the data can be quite
large, so this provides people the option to just inspect the queries/scripts
without downloading the data.  If they just want to re-run the scripts and
re-generate the figures/tables they can use the cache (generally much
smaller/faster).  And if they want to use the package for their own research
they can still download the raw files.  Note that even without the data files,
the `jobs.json` file is included and would allow them to download the query
output(s) via running `make`!

Once the ZIPs are generated, you can upload to Zenodo with `make zenodo`.  Note
that once a record is created, running this command again will simply
**update** the existing record's metadata and re-upload the files.  It does not
make a new record on every run.

At this point, the record exists on Zenodo, has the metadata you provided, and
all of the ZIP files we created.  But it is not yet published and thus not
visible to anyone!  You need to log into the Zenodo website, verify the record
looks ok, and then manually click the publish button to publish it.

### Un-blinding After Paper Acceptance

Once your paper is accepted, you need to un-blind your record.  This involves
simply updating the metadata to put all author information in, instead of
listing as anonymous.  Re-edit the [`.zenodo.json`](.zenodo.json) file to
update the author information and then re-run `make zenodo`.  This will update
just the metadata - the published files *will not change*.

Be sure to note the DOI of your Zenodo record and cite it in your paper's
camera-ready version (if you have not already done so)!

# Solutions

Here we provide links to Gists that contain the solution to each task.  Some
solutions require editing multiple files, so be sure to grab the entire thing!

## Task 1

The solution may be found at https://gist.github.com/swflint/0ac4124e4869f475d832258b773aef34

## Task 2

The solution may be found at https://gist.github.com/swflint/7e6eb09a2b46d425447043f8173dd419

## Task 3

The solution may be found at https://gist.github.com/swflint/c8a28d41a13f0e3c7aed6eafd2317d3d
