This README is for the artifact of the research paper "Revisiting Test-Case Prioritization on Long-Running Test Suites" (ISSTA 2024). The "Getting Started" section provides a quick walkthrough of the artifact's general functionality (e.g., downloading and extracting data from more builds, running TCP techniques) using one of the evaluated projects as an example. To use the full dataset we previously collected, please refer to the "Detailed Description" section.
Required OS: Linux
Create a new conda environment and install artifact requirements:
# create a new conda environment
conda create -n lrts python=3.9 -y
conda activate lrts
# install python deps
pip install -r requirements.txt
# install R deps
sudo apt update
sudo apt install r-base r-base-dev -y
R -e "install.packages('agricolae',dependencies=TRUE, repos='http://cran.rstudio.com/')"
Go to the `./artifact` folder to start running the artifact by following the steps below.
We will use one of the evaluated projects, `activemq`, to walk through the general functionality of the artifact. Go to `const.py`, locate the variable `PROJECTS`, and comment out all projects in `PROJECTS` except `ACTIVEMQ`.
We need a valid GitHub API token to query some build data from GitHub. Before running the artifact, please follow the official GitHub documentation to obtain an API token, and put the token in `self.tokens` in `token_pool.py`.
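As a sanity check before collecting data, a token can be validated against GitHub's rate-limit endpoint. The sketch below is illustrative only: the helper names are our own, and the artifact's `token_pool.py` may manage tokens differently.

```python
# Hypothetical helpers for validating a GitHub personal access token.
# Names (github_auth_headers, check_token) are our own, not from the artifact.
import urllib.error
import urllib.request


def github_auth_headers(token: str) -> dict:
    # GitHub's REST API accepts a personal access token in the Authorization header.
    return {
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github+json",
    }


def check_token(token: str) -> bool:
    """Return True if GitHub's rate-limit endpoint accepts the token."""
    req = urllib.request.Request(
        "https://api.github.com/rate_limit",
        headers=github_auth_headers(token),
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False
```

Calling `check_token("<your token>")` before launching the collection scripts can save a long run that would otherwise fail on authentication.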
To collect data (e.g., test reports, logs, metadata) of more PR builds from the evaluated project, run:
# run scripts to collect raw data for more builds
./collect_builds.sh
# gather metadata of the collected builds
python build_dataset.py
# create the dataset with confounding test labels
python extract_filtered_test_result.py
Running `collect_builds.sh` creates these folders:

- `metadata/`: CSV-formatted metadata of each collected PR build, per project
- `prdata/`: per-PR metadata in JSON
- `shadata/`: code change info per build
- `testdata/`: test reports and stdout logs per build
- `history/`: cloned codebase for the project
- `processed_test_result/`: parsed test results; each test-suite run is stored in a `csv.zip`

It will also generate `metadata/dataset_init.csv`, which lists metadata for all collected PR builds.
Running `python build_dataset.py` creates a metadata CSV for the collected dataset (`metadata/dataset.csv`), where each row is a test-suite run (unique by its `<project, pr_name, build_id, stage_id>` tuple). Note that the newly collected builds may all be passing and have no failed tests to evaluate (see `num_fail_class` in the generated CSV). In this case, please try another project.
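To quickly check whether the collected builds contain any failures, one can scan the `num_fail_class` column mentioned above. A minimal sketch (the helper name is hypothetical; only the column name comes from the description above):

```python
# Count test-suite runs that have at least one failed test class,
# based on the num_fail_class column in metadata/dataset.csv.
import csv


def count_failing_runs(rows) -> int:
    """rows: iterable of dicts (e.g., from csv.DictReader over dataset.csv)."""
    return sum(1 for row in rows if int(row["num_fail_class"]) > 0)


# Example usage against the generated metadata:
# with open("metadata/dataset.csv") as f:
#     print(count_failing_runs(csv.DictReader(f)))
```

If the count is zero, the collected builds are all passing and another project should be tried, as noted above.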
Running `python extract_filtered_test_result.py` creates processed test results in `csv.zip` format, in which failures of inspected flaky tests, frequently failing tests, and the first failure of a test are labeled.
This artifact also provides code that implements and runs TCP techniques in the paper on the collected build data.
To extract test features (e.g., test duration), go to the `./evaluation` folder and run:
./extract_test_features.sh
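As a rough illustration of a test-duration feature (e.g., for a quickest-time-first ordering), one could average each test's historical runtime. This is a hypothetical sketch, not the artifact's actual implementation behind `extract_test_features.sh`:

```python
# Hypothetical sketch: average historical duration per test class,
# computed from (test_class, duration_seconds) pairs of past runs.
from collections import defaultdict


def avg_duration_per_test(runs):
    """runs: iterable of (test_class, duration_seconds) tuples."""
    total = defaultdict(float)
    count = defaultdict(int)
    for test, duration in runs:
        total[test] += duration
        count[test] += 1
    return {test: total[test] / count[test] for test in total}
```

A time-based TCP technique could then sort tests ascending by this feature so that short tests run first.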
Then, to get data for information retrieval TCP, go to `./evaluation/information_retrieval` and run:
python extract_ir_body.py
python extract_ir_score.py
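For intuition, information-retrieval TCP typically ranks tests by textual similarity between the code change and each test's source. The sketch below uses a simple Jaccard score over whitespace tokens; the artifact's actual scoring in `extract_ir_score.py` may differ:

```python
# Hypothetical IR-TCP sketch: rank tests by token overlap with the code change.
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the whitespace-token sets of two strings."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def rank_tests(change_text: str, test_bodies: dict) -> list:
    """test_bodies: {test_name: source text}; most similar tests come first."""
    return sorted(
        test_bodies,
        key=lambda name: jaccard(change_text, test_bodies[name]),
        reverse=True,
    )
```

Tests whose bodies share more tokens with the code change are prioritized earlier in the suite.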
To get data for supervised learning TCP (e.g., training-testing split, models), run:
python extract_ml_data.py
To get data for reinforcement learning TCP, go to `./evaluation/reinforcement_learning` and run:
python extract_rl_data.py
To evaluate TCP techniques on the collected data, run:
python eval_main.py
Evaluation results will be saved as `eval_outcome/[dataset_version]/[project_name]/[tcp_technique_name].csv.zip`, in which the columns are: project, TCP technique, PR name, build id, stage id, run seed, [metric_value_1], [metric_value_2], ..., [metric_value_n].
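To summarize an outcome file, one can average a metric per technique. A hedged sketch, assuming the CSV inside the `csv.zip` uses column names like `tcp_technique` and `apfd` (the real headers may differ):

```python
# Hypothetical helpers for aggregating evaluation outcome files.
# Column names (tcp_technique, apfd) are assumptions, not confirmed headers.
import csv
import io
import zipfile
from collections import defaultdict
from statistics import mean


def load_outcome(path: str) -> list:
    """Read the single CSV stored inside a .csv.zip archive as a list of dicts."""
    with zipfile.ZipFile(path) as zf:
        with zf.open(zf.namelist()[0]) as f:
            return list(csv.DictReader(io.TextIOWrapper(f, encoding="utf-8")))


def mean_metric_by_technique(rows, metric_col: str) -> dict:
    """Average one metric column per TCP technique."""
    per_tech = defaultdict(list)
    for row in rows:
        per_tech[row["tcp_technique"]].append(float(row[metric_col]))
    return {tech: mean(values) for tech, values in per_tech.items()}
```

This kind of aggregation (mean metric per technique across runs) is what the analysis scripts below report in table form.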
There are three automatically generated `[dataset_version]`s: `d_nofilter` (corresponding to LRTS-All), `d_jira_stageunique_freqfail` (LRTS-DeConf), and `d_first_jira_stageunique` (LRTS-FirstFail).
We provide the evaluation outcome data in the artifact so that one can reproduce the results from the paper within a reasonable runtime. If you have run `eval_main.py`, which overwrites the evaluation outcome data in the current repo, please run `git restore .` to restore the provided data.
The steps below produce the tables (Tables 8-10) and figure (Figure 2) in the "Evaluation" section of the paper, as well as the dataset summary table and figure (Table 2 and Figure 1).
Go to the `./artifact/analysis_paper/` folder and run:
# produce the plot that shows distribution of APFD(c) values for all techniques (Figure 2)
# figure is saved to figures/eval-LRTS-DeConf.pdf
python plot_eval_outcome.py
# produce
# 1. the table that compares the basic TCP techniques versus hybrid TCP (Table 8)
# 2. the table that shows controlled experiment results on IR TCP (Table 9)
# 3. the table that compares basic TCP techniques across all dataset versions (Table 10)
# tables are printed to stdout in csv format
python table_eval_outcome.py
# produce
# 1. dataset summary table (Table 2)
# 2. CDF plot that shows distributions of test suite duration (hours) and size per project (Figure 1)
# results will be saved to dataset_viz/
python viz_dataset.py
To download and use the full dataset we collected, please refer to the descriptions below.
LRTS is the first extensive dataset for test-case prioritization (TCP) focused on long-running test suites.
LRTS has 100K+ test-suite runs from 30K+ recent CI builds with real test failures, from recent codebases of 10 popular, large-scale, multi-PL, multi-module, open-source software projects: ActiveMQ, Hadoop, HBase, Hive, Jackrabbit Oak, James, Kafka, Karaf, Log4j 2, TVM.
- 108,366 test-suite runs from 32,199 CI builds
- 49,674 failed test-suite runs (with at least one test failure) from 22,763 CI builds
- Build history span: 2020 to 2024
- Average test-suite run duration: 6.75 hours, with at least 75% of the runs lasting over 2 hours
- Average number of executed test classes per run: 980
- Average number of failed test classes per failed run: 5
Go to this link to download the processed LRTS. It contains the metadata of the dataset, test results at the test-class level, and code change data for each test-suite run. We are actively looking for online storage to host the raw version, which takes ~100 GB.
The `artifact` folder contains our code for downloading more builds from the listed projects, our TCP technique implementations, and experiment scripts. To run our scripts on the processed dataset above, please refer to the instructions in `artifact/README.md`.