Merge pull request #8 from svandenhoek/new

Removed obsolete code & updated README
molgenis · Feb 13, 2020 · 70c3311
2 parents 5c26b44 + a1ef6c6
commit 70c3311

Showing 6 changed files with 92 additions and 448 deletions.
diff --git a/README.md b/README.md
@@ -1,76 +1,24 @@
 # vibe-suppl
-This repo contains supplemental files regarding the Java application found [here][vibe]. Note that these are in no way
-needed to use the vibe tool, but were used to generate additional information (such as benchmarking). They were created
-with the assumption that they are used exactly in the way they are meant to be used, so while certain checks/validations
-might be present, using these scripts in the wrong way might result in weird behavior.
+This repo contains supplemental files regarding the Java application found [here][vibe]. Note that these are in no way needed to use the vibe tool, but were used to generate additional information (such as benchmarking). They were created with the assumption that they are used exactly in the way they are meant to be used, so while certain checks/validations might be present, using these scripts in the wrong way might result in weird behavior.
 
-## Benchmarking
+## Paper
+
+Please refer to the `README.md` at https://zenodo.org/record/3662470 for the exact commits used for the benchmarking. There, all required files for [PaperPlots.R](benchmarking_results_processing/PaperPlots.R) can be found as well.
 
-### Scripts
-
-There are several benchmarking scripts available with some generic code used by multiple benchmarks in a separate file.
-An explanation on how to run them can be found below. In general, the `Runner` scripts run the benchmark while the
-`FileGenerator` script (if available) formats the `Runner` output to a more usable format. Some exceptions are present,
-such as for vibe where there is a `ParallelBashScriptsGenerator` instead. So please refer to
-<a href="#running-the-benchmarks">this section</a> for more information regarding running the individual benchmarks.
-
-* __`AmelieApiOutputGenerator.py`__
-    * __Info:__ Connects to `https://amelie.stanford.edu/api/` to retrieve the gene scores for each set of HPO terms
-    available in the benchmark data. As the genes of interest should be entered manually and there is a limit in the
-    number of entered genes, the [complete HGNC dataset][hgnc_complete]
-    is used and divided over multiple separate requests so that all genes get a score. As the scores are only sorted
-    per request, a sort on all genes is done prior to file writing.
-* __`AmelieBenchmarkRunner.py`__
-    * __Info:__ Converts the output from `AmelieApiOutputGenerator.py` for usage in `BenchmarkResultsProcessor.R`.
-* __`BenchmarkFileHpoConverter.py`__
-    * __Info:__ A script to convert a benchmark file containing HPO names in the fifth column to a benchmark file with
-     HPO codes in the fifth column. Should not be needed for running existing benchmarks, but is supplied as a
-     convenience script in case benchmarks are created that cannot use `BenchmarkGenerics.py` but do need HPO codes as
-     input. 
-* __`BenchmarkGenerics.py`__
-    * __Info:__ Contains methods used in multiple scripts.
-    * __Important:__ This script should not be run independently. If Python scripts are moved (for example to a server
-    to run the benchmarks there), be sure to include this file within the same directory.
-* __`BenchmarkResultsProcessor.R`__
-    * __Info:__ Creates plots from the benchmark data.
-* __`GeneNetworkBenchmarkFileGenerator.py`__
-    * __Info:__ Converts the output from `GeneNetworkBenchmarkRunner.py` for usage in `BenchmarkResultsProcessor.R`.
-* __`GeneNetworkBenchmarkRunner.py`__
-    * __Info:__ Connects to the API from `https://www.genenetwork.nl/` to retrieve the prioritized genes based on input
-    phenotypes.
-* __`PhenomizerBenchmarkFileGenerator.py`__
-    * __Info:__ Converts the output from `PhenomizerBenchmarkRunner.py` for usage in `BenchmarkResultsProcessor.R`.
-* __`PhenomizerBenchmarkRunner.py`__
-    * __Info:__ Uses the [query_phenomizer][query_phenomizer] python tool to process all benchmark data.
-    * __Important:__ [query_phenomizer][query_phenomizer] needs to be installed on the system. Additionally, an account
-    is needed for running [query_phenomizer][query_phenomizer].
-* __`PhenotipsBenchmarkRunner.py`__
-    * __Info:__ Uses the API of Phenotips to upload the benchmark dataset and then download the results.
-    * __Important:__ A Phenotips instance that can be connected to is required. Please refer to the
-    [Phenotips download page][phenotips_download] for more information.
-* __`VibeBenchmarkFileGenerator.py`__
-    * __Info:__ Converts the output from `VibeBenchmarkParallelBashScriptsGenerator.py` for usage in `BenchmarkResultsProcessor.R`.
-* __`VibeBenchmarkParallelBashScriptsGenerator.py`__
-    * __Info:__ Generates bash files used for benchmarking (by using a limit of runs per file). Note that for each
-    created bash script a separate TDB is needed. Please refer to the documentation in the script itself for more
-    information.
-    * __Important:__ As each VIBE instance needs a separate database, please refer to the information in the script
-    itself for how to prepare for the benchmarking correctly.
-* __`VibeSimpleOutputFilesMerger.sh`__
-    * __Info:__ Merges the output generated by the scripts which were created using 
-    `VibeBenchmarkBashScriptsGenerator.py`. 
+## Benchmarking
 
 ### Data
 
 There are several files used among these scripts. These include:
-* benchmark_data.tsv
+* [benchmark_data.tsv](https://zenodo.org/record/3662470/files/benchmark_data-hgnc_symbol.tsv)
     * A dataset with the first column being an ID and the fourth column 1 or more phenotypes separated
     by a comma (the phenotype names should exist within the [Human Phenotype Ontology][hpo_obo]) .
 * [hp.obo][hpo_obo]
-    * The Human Phenotype Ontology used for combining/converting phenotype names with their HPO ID.
+    * The Human Phenotype Ontology used for combining/converting phenotype names with their HPO ID. Note that `benchmark_data.tsv` was made compatible with release 2018-03-08 specifically.
 * [hgnc_complete_set.txt][hgnc_complete]
-    * The HUGO Gene Nomenclature Committee file containing information about genes (primarily used to generate a list
-    containing all genes).
+    * The HUGO Gene Nomenclature Committee file containing information about genes (primarily used to generate a list containing all genes).
+* [benchmark_file_conversion_data.tsv](https://www.genenames.org/cgi-bin/download/custom?col=gd_hgnc_id&col=gd_app_sym&col=gd_prev_sym&col=md_eg_id&col=gd_pub_eg_id&status=Approved&status=Entry%20Withdrawn&hgnc_dbtag=on&order_by=gd_hgnc_id&format=text&submit=submit)
+  * A file generated through [genenames.org](https://www.genenames.org/) that contains HGNC gene symbols with their previous symbols and their NCBI gene IDs.
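
As an illustration of how a file like `benchmark_file_conversion_data.tsv` could be used, the sketch below builds a lookup from gene symbols (approved and previous) to NCBI gene IDs. It is not the code of `BenchmarkFileGeneSymbolToIdConverter.py`. It assumes the column order implied by the `col=` parameters in the download URL (HGNC ID, approved symbol, previous symbols, two NCBI gene ID columns) and assumes previous symbols are comma-separated. The function name and the sample rows are hypothetical placeholders, not real gene data.

```python
import csv
import io


def build_symbol_to_ncbi_id(tsv_text):
    """Map approved and previous gene symbols to NCBI gene IDs.

    Assumed column order (taken from the download URL, not verified):
    HGNC ID, approved symbol, previous symbols, NCBI ID (supplied by NCBI),
    NCBI ID (curated). Approved symbols win over previous symbols on clashes.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader, None)  # skip the header row
    approved = {}
    previous = {}
    for row in reader:
        # Pad short rows so unpacking always sees five fields.
        _, symbol, prev_symbols, id_supplied, id_curated = (row + [""] * 5)[:5]
        ncbi_id = id_supplied or id_curated
        if not symbol or not ncbi_id:
            continue  # nothing to map for this entry
        approved[symbol] = ncbi_id
        for old in prev_symbols.split(","):
            old = old.strip()
            if old:
                previous.setdefault(old, ncbi_id)
    # Later keys override earlier ones, so approved symbols take precedence.
    return {**previous, **approved}


# Hypothetical placeholder rows (not real HGNC entries).
SAMPLE = (
    "HGNC ID\tApproved symbol\tPrevious symbols\tNCBI supplied\tNCBI curated\n"
    "HGNC:0\tGENE_A\tOLD_A1, OLD_A2\t111\t111\n"
    "HGNC:1\tGENE_B\t\t222\t\n"
    "HGNC:2\tGENE_C\tGENE_A\t\t\n"
    "HGNC:3\tGENE_D\tGENE_B\t444\t444\n"
)
MAPPING = build_symbol_to_ncbi_id(SAMPLE)
```

Letting approved symbols override previous ones is a design choice of this sketch: a symbol that was once used for another gene then resolves to the gene that currently carries it.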
 
 ### Running the benchmarks
 
@@ -86,17 +34,63 @@ There are several files used among these scripts. These include:
     python3 AmelieBenchmarkFileGenerator.py amelie_output/ amelie_results.tsv
     ```
 
-#### Gene Network
+3. Convert the HGNC gene symbols to NCBI gene IDs:
 
-1. Run benchmark:
     ```
-    python3 GeneNetworkBenchmarkRunner.py hp.obo benchmark_data.tsv genenetwork_output/
+    python3 BenchmarkFileGeneSymbolToIdConverter.py amelie_results.tsv benchmark_file_conversion_data.tsv 1> amelie.log 2> amelie.err
     ```
 
+#### Exomiser
+
+**IMPORTANT:** The Exomiser team supplied a custom `.jar` file to run this benchmark without requiring a `.vcf` file. Exomiser has not yet made a public release of it. This custom `.jar` is, however, based on the exomiser-rest-prioritiser module of the Exomiser open-source code (release 12.1.0).
+
+##### hiPHIVE
+
+1. Run benchmark:
+
+   ```
+   python3 ExomiserBenchmarkRunner.py hp.obo benchmark_data.tsv hiphive hiphive_output/
+   ```
+
 2. Process benchmark output:
-    ```
-    python3 GeneNetworkBenchmarkFileGenerator.py genenetwork_output/ genenetwork_results.tsv
-    ```
+
+   ```
+   python3 ExomiserBenchmarkFileGenerator.py hiphive_output/ hiphive_results.tsv
+   ```
+
+3. Convert the HGNC gene symbols to NCBI gene IDs:
+
+   ```
+   python3 BenchmarkFileGeneSymbolToIdConverter.py hiphive_results.tsv benchmark_file_conversion_data.tsv 1> hiphive.log 2> hiphive.err
+   ```
+
+##### PhenIX
+
+1. Run benchmark:
+
+   ```
+   python3 ExomiserBenchmarkRunner.py hp.obo benchmark_data.tsv phenix phenix_output/
+   ```
+
+2. Process benchmark output:
+
+   ```
+   python3 ExomiserBenchmarkFileGenerator.py phenix_output/ phenix_results.tsv
+   ```
+
+3. Convert the HGNC gene symbols to NCBI gene IDs:
+
+   ```
+   python3 BenchmarkFileGeneSymbolToIdConverter.py phenix_results.tsv benchmark_file_conversion_data.tsv 1> phenix.log 2> phenix.err
+   ```
+
+#### GADO
+
+We used the stand-alone command-line version of GADO (v1.0.1), available at: https://github.com/molgenis/systemsgenetics/wiki/GADO-Command-line. We accepted all automatically suggested alternative HPO terms in cases where the supplied HPO term could not be used. We used the prediction matrix `hpo_predictions_sigOnly_spiked_01_02_2018`. The output was also converted to NCBI gene IDs through the following:
+
+```
+python3 BenchmarkFileGeneSymbolToIdConverter.py gado_results.tsv benchmark_file_conversion_data.tsv 1> gado.log 2> gado.err
+```
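
The conversion command has the same shape for every tool, with only the tool name changing in the results, log and error file names. A minimal sketch of deriving all three names from one tool name is shown below, which helps avoid mismatched file names when the command is copied between tools. The helper names `conversion_job` and `run_conversion` are hypothetical and not part of this repository. The script name and its two positional arguments are taken from the commands above.

```python
import subprocess

CONVERTER = "BenchmarkFileGeneSymbolToIdConverter.py"
CONVERSION_DATA = "benchmark_file_conversion_data.tsv"


def conversion_job(tool):
    """Return (argv, log_path, err_path) for one tool's conversion step."""
    argv = ["python3", CONVERTER, f"{tool}_results.tsv", CONVERSION_DATA]
    return argv, f"{tool}.log", f"{tool}.err"


def run_conversion(tool):
    """Run the conversion, mirroring the `1> tool.log 2> tool.err` redirections."""
    argv, log_path, err_path = conversion_job(tool)
    with open(log_path, "w") as log, open(err_path, "w") as err:
        return subprocess.run(argv, stdout=log, stderr=err).returncode
```

With this sketch, `run_conversion("gado")` would issue the same command as the one shown above for GADO, assuming the script and both input files are in the working directory.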
 
 #### Phenomizer
 
@@ -116,15 +110,36 @@ There are several files used among these scripts. These include:
     ```
     python3 PhenomizerBenchmarkFileGenerator.py phenomizer_output/ phenomizer_results.tsv
     ```
+
+4. Convert the HGNC gene symbols to NCBI gene IDs:
+
+    ```
+    python3 BenchmarkFileGeneSymbolToIdConverter.py phenomizer_results.tsv benchmark_file_conversion_data.tsv 1> phenomizer.log 2> phenomizer.err
+    ```
 
 #### Phenotips
 
-1. Install [phenotips][phenotips_download].
+**IMPORTANT**: As of January 2020, Phenotips does not offer a stand-alone downloadable solution anymore and requires a paid cloud subscription to be used ([source](https://phenotips.com/blog/new-year-new-website.html)). While the [GitHub repo](https://github.com/phenotips/phenotips) is currently still online, it seems uncertain whether it will still be updated and the easy-to-use `.dmg` as offered on the old website is not available anymore. Therefore, this benchmark is deemed obsolete.
 
-2. Run benchmark:
-    ```
-    python3 PhenotipsBenchmarkRunner.py http://localhost:8080/ username hp.obo benchmark_data.tsv phenotips_results.tsv
-    ```
+#### PubCaseFinder
+
+1. Run benchmark:
+
+   ```
+   python3 PubCaseFinderBenchmarkRunner.py hp.obo benchmark_data.tsv pubcasefinder_output/
+   ```
+
+2. Process benchmark output:
+
+   ```
+   python3 PubCaseFinderBenchmarkFileGenerator.py pubcasefinder_output/ pubcasefinder_results.tsv
+   ```
+
+3. Convert the HGNC gene symbols to NCBI gene IDs:
+
+   ```
+   python3 BenchmarkFileGeneSymbolToIdConverter.py pubcasefinder_results.tsv benchmark_file_conversion_data.tsv 1> pubcasefinder.log 2> pubcasefinder.err
+   ```
 
 #### Vibe
 
@@ -164,16 +179,15 @@ There are several files used among these scripts. These include:
 
 7. Process benchmark output:
     ```
-    python3 VibeBenchmarkFileGenerator.py results/ vibe_results.tsv
+    python3 VibeBenchmarkFileGenerator.py results/ vibe_results.tsv none
     ```
 
-
-
 [vibe]:https://github.com/molgenis/vibe
-[vibe_preperations]:https://github.com/molgenis/vibe/#preparations
+[vibe_preperations]:https://github.com/molgenis/vibe/#quickstart
 [hgnc_complete]:http://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt
 [query_phenomizer]:https://github.com/svandenhoek/query_phenomizer
 [phenotips_download]:https://phenotips.org/Download
 
 [hpo_obo_current]:http://purl.obolibrary.org/obo/hp.obo
-[hpo_obo]:https://raw.githubusercontent.com/obophenotype/human-phenotype-ontology/2f6309173883d5d342849388c74bd986a2c0092c/hp.obo
+[hpo_obo]:https://raw.githubusercontent.com/obophenotype/human-phenotype-ontology/2f6309173883d5d342849388c74bd986a2c0092c/hp.obo
+
diff --git a/benchmarking/GADOBenchmarkReadme b/benchmarking/GADOBenchmarkReadme
diff --git a/benchmarking/GeneNetworkBenchmarkFileGenerator.py b/benchmarking/GeneNetworkBenchmarkFileGenerator.py