This README is for the artifact of the research paper "Revisiting Test-Case Prioritization on Long-Running Test Suites" (ISSTA 2024). The "Getting Started" section provides a quick walkthrough of the artifact's general functionality (e.g., downloading and extracting data from more builds, running TCP techniques) using one of the evaluated projects as an example. To use the full dataset we previously collected, please refer to the "Detailed Description" section.
Required OS: Linux
Create a new conda environment and install artifact requirements:
# create a new conda environment
conda create -n lrts python=3.9 -y
conda activate lrts
# install python deps
pip install -r requirements.txt
# install R deps
sudo apt update
sudo apt install r-base r-base-dev -y
R -e "install.packages('agricolae',dependencies=TRUE, repos='http://cran.rstudio.com/')"
Go to the `./artifact` folder to start running the artifact by following the steps below.
We will use one of the evaluated projects, `activemq`, to walk through the general functionality of the artifact. Go to `const.py`, locate the variable `PROJECTS`, and comment out all projects in `PROJECTS` except `ACTIVEMQ`.
We need a valid GitHub API token to query some build data from GitHub. Before running the artifact, please follow the official GitHub documentation to obtain an API token, and put the token in `self.tokens` in `token_pool.py`.
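As a sanity check before collecting data, a token can be validated against GitHub's rate-limit endpoint. The sketch below is illustrative only: the helper names are our own, and the artifact's `token_pool.py` may manage tokens differently.

```python
# Hypothetical helpers for validating a GitHub personal access token.
# Names (github_auth_headers, check_token) are our own, not from the artifact.
import urllib.error
import urllib.request


def github_auth_headers(token: str) -> dict:
    # GitHub's REST API accepts a personal access token in the Authorization header.
    return {
        "Authorization": f"token {token}",
        "Accept": "application/vnd.github+json",
    }


def check_token(token: str) -> bool:
    """Return True if GitHub's rate-limit endpoint accepts the token."""
    req = urllib.request.Request(
        "https://api.github.com/rate_limit",
        headers=github_auth_headers(token),
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False
```

Calling `check_token("<your token>")` before launching the collection scripts can save a long run that would otherwise fail on authentication.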
To collect data (e.g., test reports, logs, metadata) of more PR builds from the evaluated project, run:
# run scripts to collect raw data for more builds
./collect_builds.sh
# gather metadata of the collected builds
python build_dataset.py
# create the dataset with confounding test labels
python extract_filtered_test_result.py
Running `collect_builds.sh` creates these folders:

- `metadata/`: CSV-formatted metadata of each collected PR build, per project
- `prdata/`: per-PR metadata in JSON
- `shadata/`: code change info per build
- `testdata/`: test reports and stdout logs per build
- `history/`: cloned codebase for the project
- `processed_test_result/`: parsed test results; each test-suite run is stored in a `csv.zip`

It will also generate `metadata/dataset_init.csv`, which lists metadata for all collected PR builds.
Running `python build_dataset.py` creates a metadata CSV for the collected dataset (`metadata/dataset.csv`), where each row is a test-suite run (unique by its `<project, pr_name, build_id, stage_id>` tuple). Note that the newly collected builds may all be passing and have no failed tests to evaluate (see `num_fail_class` in the generated CSV). In this case, please try another project.
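To quickly check whether the collected builds contain any failures, one can scan the `num_fail_class` column mentioned above. A minimal sketch (the helper name is hypothetical; only the column name comes from the description above):

```python
# Count test-suite runs that have at least one failed test class,
# based on the num_fail_class column in metadata/dataset.csv.
import csv


def count_failing_runs(rows) -> int:
    """rows: iterable of dicts (e.g., from csv.DictReader over dataset.csv)."""
    return sum(1 for row in rows if int(row["num_fail_class"]) > 0)


# Example usage against the generated metadata:
# with open("metadata/dataset.csv") as f:
#     print(count_failing_runs(csv.DictReader(f)))
```

If the count is zero, the collected builds are all passing and another project should be tried, as noted above.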
Running `python extract_filtered_test_result.py` creates processed test results in `csv.zip` format, in which failures of inspected flaky tests, frequently failing tests, and the first failure of a test are labeled.
This artifact also provides code that implements and runs TCP techniques in the paper on the collected build data.
To extract test features (e.g., test duration), go to the `./evaluation` folder and run:
./extract_test_features.sh
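As a rough illustration of a test-duration feature (e.g., for a quickest-time-first ordering), one could average each test's historical runtime. This is a hypothetical sketch, not the artifact's actual implementation behind `extract_test_features.sh`:

```python
# Hypothetical sketch: average historical duration per test class,
# computed from (test_class, duration_seconds) pairs of past runs.
from collections import defaultdict


def avg_duration_per_test(runs):
    """runs: iterable of (test_class, duration_seconds) tuples."""
    total = defaultdict(float)
    count = defaultdict(int)
    for test, duration in runs:
        total[test] += duration
        count[test] += 1
    return {test: total[test] / count[test] for test in total}
```

A time-based TCP technique could then sort tests ascending by this feature so that short tests run first.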
Then, to get data for information retrieval TCP, go to `./evaluation/information_retrieval` and run:
python extract_ir_body.py
python extract_ir_score.py
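For intuition, information-retrieval TCP typically ranks tests by textual similarity between the code change and each test's source. The sketch below uses a simple Jaccard score over whitespace tokens; the artifact's actual scoring in `extract_ir_score.py` may differ:

```python
# Hypothetical IR-TCP sketch: rank tests by token overlap with the code change.
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the whitespace-token sets of two strings."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def rank_tests(change_text: str, test_bodies: dict) -> list:
    """test_bodies: {test_name: source text}; most similar tests come first."""
    return sorted(
        test_bodies,
        key=lambda name: jaccard(change_text, test_bodies[name]),
        reverse=True,
    )
```

Tests whose bodies share more tokens with the code change are prioritized earlier in the suite.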
To get data for supervised learning TCP (e.g., training-testing split, models), run:
python extract_ml_data.py
To get data for reinforcement learning TCP, go to `./evaluation/reinforcement_learning` and run:
python extract_rl_data.py
To evaluate TCP techniques on the collected data, run:
python eval_main.py
Evaluation results will be saved as `eval_outcome/[dataset_version]/[project_name]/[tcp_technique_name].csv.zip`, in which the columns are: project, TCP technique, PR name, build id, stage id, run seed, [metric_value_1], [metric_value_2], ..., [metric_value_n].
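To summarize an outcome file, one can average a metric per technique. A hedged sketch, assuming the CSV inside the `csv.zip` uses column names like `tcp_technique` and `apfd` (the real headers may differ):

```python
# Hypothetical helpers for aggregating evaluation outcome files.
# Column names (tcp_technique, apfd) are assumptions, not confirmed headers.
import csv
import io
import zipfile
from collections import defaultdict
from statistics import mean


def load_outcome(path: str) -> list:
    """Read the single CSV stored inside a .csv.zip archive as a list of dicts."""
    with zipfile.ZipFile(path) as zf:
        with zf.open(zf.namelist()[0]) as f:
            return list(csv.DictReader(io.TextIOWrapper(f, encoding="utf-8")))


def mean_metric_by_technique(rows, metric_col: str) -> dict:
    """Average one metric column per TCP technique."""
    per_tech = defaultdict(list)
    for row in rows:
        per_tech[row["tcp_technique"]].append(float(row[metric_col]))
    return {tech: mean(values) for tech, values in per_tech.items()}
```

This kind of aggregation (mean metric per technique across runs) is what the analysis scripts below report in table form.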
There are three automatically generated `[dataset_version]`s: `d_nofilter` (corresponding to LRTS-All), `d_jira_stageunique_freqfail` (LRTS-DeConf), and `d_first_jira_stageunique` (LRTS-FirstFail).
We provide the evaluation outcome data in the artifact so that one can reproduce the results from the paper within a reasonable runtime. If you have run `eval_main.py`, which overwrites the evaluation outcome data in the current repo, please run `git restore .` to restore the provided data.
The steps below produce the tables (Tables 8-10) and figure (Figure 2) in the "Evaluation" section of the paper, as well as the dataset summary table and figure (Table 2 and Figure 1).
Go to the `./artifact/analysis_paper/` folder and run:
# produce the plot that shows distribution of APFD(c) values for all techniques (Figure 2)
# figure is saved to figures/eval-LRTS-DeConf.pdf
python plot_eval_outcome.py
# produce
# 1. the table that compares the basic TCP techniques versus hybrid TCP (Table 8)
# 2. the table that shows controlled experiment results on IR TCP (Table 9)
# 3. the table that compares basic TCP techniques across all dataset versions (Table 10)
# tables are printed to stdout in csv format
python table_eval_outcome.py
# produce
# 1. dataset summary table (Table 2)
# 2. CDF plot that shows distributions of test suite duration (hours) and size per project (Figure 1)
# results will be saved to dataset_viz/
python viz_dataset.py
To download and use the full dataset we collected, please refer to the descriptions below.
LRTS is the first extensive dataset for test-case prioritization (TCP) focused on long-running test suites.
LRTS has 100K+ test-suite runs from 30K+ recent CI builds with real test failures, from recent codebases of 10 popular, large-scale, multi-PL, multi-module, open-source software projects: ActiveMQ, Hadoop, HBase, Hive, Jackrabbit Oak, James, Kafka, Karaf, Log4j 2, TVM.
- 108,366 test-suite runs from 32,199 CI builds
- 49,674 failed test-suite runs (with at least one test failure) from 22,763 CI builds
- Build history span: 2020 to 2024
- Average test-suite run duration: 6.75 hours, with at least 75% of the runs lasting over 2 hours
- Average number of executed test classes per run: 980
- Average number of failed test classes per failed run: 5
Go to this link to download the processed LRTS. It contains the metadata of the dataset, test results at the test-class level, and code change data for each test-suite run. We are actively looking for online storage to host the raw version, which takes ~100 GB.
The `artifact` folder contains our code for downloading more builds from the listed projects, our TCP technique implementations, and experiment scripts. To run our scripts on the processed dataset above, please refer to the instructions in `artifact/README.md`.