Variants Removed but Not Filtered #531

Closed
SophieS9 opened this issue Nov 17, 2023 · 9 comments

@SophieS9

SophieS9 commented Nov 17, 2023

Hi Exomiser Team,

Firstly, apologies if this is documented somewhere and I've missed it! I have a scenario where a WGS VCF is run through exomiser and the scores are collated from the TSV files and passed into an in-house database.

We have a scenario where we get updated phenotype information on a patient and want to re-rank the variants via Exomiser, but not the whole VCF, just a small subset of variants which pass internal filters so that it's fast.

I'm making a small VCF on the fly of these variants and passing to Exomiser, but only 30/69 variants are being analysed. This VCF has a dummy header and a dummy "QUAL" score and "INFO" column, but all other values are taken from the original vcf. When looking at the HTML, it's not clear why they aren't being analysed as it suggests that only 30 variants were input. All of these variants were analysed by Exomiser when in the WGS VCF. The config yaml is also set to look at 1000 variants.

image

I'd like to know why 39 of the variants are being excluded if possible?! I'm wondering if it's how I make my on the fly VCF. From looking at the stdout, it says variants are failing the frequency, pathogenicity and inheritance filters. However my config is set to keep all frequency and non pathogenic, and all genotypes are 0/1.

image

I've attached the vcf and config yaml (both changed to txt for upload). Running Exomiser 13.1.0 with dataset 2209_hg38.

14777_exomiser.txt
14777_exomiser_template.txt
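
For anyone wanting to reproduce the "on the fly" VCF described above, here is a minimal sketch in Python. It is not the script used in this issue: the function name `write_subset_vcf`, the tuple layout, the example coordinates and the choice of `PASS` for the FILTER column are all assumptions made for illustration. QUAL and INFO are dummy values, as described in the post.

```python
import io

VCF_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def write_subset_vcf(variants, sample_id, out):
    """Write a minimal single-sample VCF (hypothetical helper).

    `variants` is an iterable of (chrom, pos, ref, alt, genotype) tuples
    taken from the original VCF. QUAL ("100") and INFO (".") are dummies.
    """
    out.write("##fileformat=VCFv4.2\n")
    out.write('##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n')
    out.write("\t".join(VCF_COLUMNS + [sample_id]) + "\n")
    # Sort by chromosome name (lexically) and position so records are grouped.
    for chrom, pos, ref, alt, gt in sorted(variants, key=lambda v: (v[0], v[1])):
        record = [chrom, str(pos), ".", ref, alt, "100", "PASS", ".", "GT", gt]
        out.write("\t".join(record) + "\n")


# Example with made-up coordinates; write to a real file in practice.
buffer = io.StringIO()
write_subset_vcf(
    [("chr2", 200, "A", "G", "0/1"), ("chr1", 100, "C", "T", "0/1")],
    "SAMPLE_1",
    buffer,
)
print(buffer.getvalue())
```

Counting the non-header lines of the generated file before passing it to Exomiser gives a baseline to compare against the number of variants reported in the output.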

@SophieS9
Author

SophieS9 commented Nov 21, 2023

In case anyone comes across this, I realised that the issue was with the prioritisation filter. The line in the config is
priorityScoreFilter: {priorityType: HIPHIVE_PRIORITY, minPriorityScore: 0.501}

This removes genes whose hiPhive priority score falls below the 0.501 threshold, which we use for speed in a whole-genome run. As per the manual:
"priorityScoreFilter:
Running the prioritizer followed by a priorityScoreFilter will remove genes which are least likely to contribute to the phenotype defined in hpoIds, this will dramatically reduce the time and memory required to analyze a genome. 0.501 is a good compromise to select good phenotype matches and the best protein-protein interactions hits using the hiPhive prioritizer. PriorityType can be one of HIPHIVE_PRIORITY, PHIVE_PRIORITY, PHENIX_PRIORITY, OMIM_PRIORITY, EXOMEWALKER_PRIORITY. Example priorityScoreFilter: {priorityType: HIPHIVE_PRIORITY, minPriorityScore: 0.501}"

As a general suggestion, it may be helpful to document this more clearly in the HTML output so that it can clearly be seen how many variants have been removed for this reason.
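
To illustrate why a gene-level score cut-off removes variants regardless of the frequency and pathogenicity settings, here is a small Python sketch. It is not Exomiser code: the gene names, scores and variant IDs are made up, and treating the threshold as inclusive (`>=`) is an assumption.

```python
MIN_PRIORITY_SCORE = 0.501  # value from the config line quoted above

# Hypothetical gene scores and variant-to-gene assignments, for illustration only.
gene_scores = {"GENE_A": 0.92, "GENE_B": 0.50, "GENE_C": 0.61}
variant_genes = {"var1": "GENE_A", "var2": "GENE_B", "var3": "GENE_B", "var4": "GENE_C"}


def apply_priority_score_filter(gene_scores, variant_genes, min_score):
    """Keep only variants whose gene scores at least `min_score`."""
    # Assumption: the comparison is inclusive; the real filter may differ.
    passed_genes = {gene for gene, score in gene_scores.items() if score >= min_score}
    kept = [var for var, gene in variant_genes.items() if gene in passed_genes]
    removed = [var for var, gene in variant_genes.items() if gene not in passed_genes]
    return passed_genes, kept, removed


passed_genes, kept, removed = apply_priority_score_filter(
    gene_scores, variant_genes, MIN_PRIORITY_SCORE
)
print(f"genes passed: {sorted(passed_genes)}")
print(f"variants kept: {kept}, variants removed: {removed}")
```

In this toy data one gene fails the cut-off and takes both of its variants with it, which is the same pattern as a small VCF losing a large share of its variants to a filter that counts genes.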

@julesjacobsen
Contributor

julesjacobsen commented Nov 22, 2023

Hi @SophieS9 thanks for the suggestion and the detailed bug report. There are a few issues here:

  1. As you discovered, using the PriorityScoreFilter will act as a very broad filter and is not what you want to use here (why did you add it?). The standard exome analysis setup or --preset EXOME will run these steps:
steps: [
    failedVariantFilter: { },
    variantEffectFilter: {
      remove: [
          FIVE_PRIME_UTR_EXON_VARIANT,
          FIVE_PRIME_UTR_INTRON_VARIANT,
          THREE_PRIME_UTR_EXON_VARIANT,
          THREE_PRIME_UTR_INTRON_VARIANT,
          NON_CODING_TRANSCRIPT_EXON_VARIANT,
          NON_CODING_TRANSCRIPT_INTRON_VARIANT,
          CODING_TRANSCRIPT_INTRON_VARIANT,
          UPSTREAM_GENE_VARIANT,
          DOWNSTREAM_GENE_VARIANT,
          INTERGENIC_VARIANT,
          REGULATORY_REGION_VARIANT
      ]
    },
    frequencyFilter: { maxFrequency: 2.0 },
    pathogenicityFilter: { keepNonPathogenic: true },
    inheritanceFilter: { },
    omimPrioritiser: { },
    hiPhivePrioritiser: { }
]

and this is probably what you want, tweaking the frequency filters. Note that the frequencyFilter will be applied before the inheritanceFilter, so its maxFrequency should be higher than any of the MOI-dependent cut-offs. Note also that the steps are run in the order defined in the list, and Exomiser will re-order them if the inheritance filter is placed before any other filters, so stick with this order.

  2. The PriorityScoreFilter operates on genes so the number reported will be genes, not variants. This is confusing as the number will be higher than the number of variants. Sorry.

  3. The number of variants that failed a filter is not reported in the HTML. This was an artefact from a long time ago when the failed variant count wasn't tracked internally, so it couldn't be reported when run in analysisMode: PASS_ONLY. It is displayed in the log, as you've seen, so I'll fix this in the HTML.

  4. If you want to know where the variants are failing, use analysisMode: FULL and the failed filters will be displayed in the FILTER column of the variants.tsv file.

  5. This is more serious: the config you were using (thanks for including this), for some reason, was not annotating the variants with the frequency or pathogenicity scores, so only the default scores based on the variant effect were being used. This should not be happening and I'll investigate the issue asap. To mitigate this, use the test-analysis-exome.yml file from the examples directory as the template (or use the YAML from point 1 in exactly that format) and you should see the data appearing.
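
As a rough way to summarise the FILTER column mentioned above after a run with analysisMode: FULL, something like the following Python sketch could be used. Only the FILTER column name comes from this thread. The other column names, the filter labels in the sample data and the `;` separator for multiple failed filters are assumptions, so check them against your own variants.tsv.

```python
import csv
import io
from collections import Counter


def count_filter_values(tsv_file, separator=";"):
    """Tally the values in the FILTER column of a tab-separated file."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    counts = Counter()
    for row in reader:
        # Assumption: multiple failed filters are joined by `separator`.
        for name in row["FILTER"].split(separator):
            counts[name] += 1
    return counts


# Hypothetical rows standing in for a real variants.tsv file.
sample_tsv = io.StringIO(
    "ID\tGENE_SYMBOL\tFILTER\n"
    "v1\tGENE_A\tPASS\n"
    "v2\tGENE_B\tfreq;path\n"
    "v3\tGENE_B\tfreq\n"
)
for name, count in count_filter_values(sample_tsv).most_common():
    print(f"{name}\t{count}")
```

With a real file, replace the `io.StringIO` object with `open("path/to/variants.tsv", newline="")`.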

julesjacobsen reopened this Nov 22, 2023

@SophieS9
Author

Thanks for your reply @julesjacobsen. This was super helpful! I quite simply overlooked the prioritization filter as I used the config we use for a WGS analysis (where we want the filter for speed). In this scenario, I don't need it anymore. Thanks for the feedback on the config too, I've made those changes and looks like the annotations are now correct. From what I could tell, the main difference was the use of spaces in between the curly brackets?

@julesjacobsen
Contributor

@SophieS9 I thought that the issue might originally have been due to the difference in YAML style, however it turns out there was a bug in the way the steps were being run so that the frequency and pathogenicity filters would not end up with data to filter on. I've opened a new issue to explain this - #534.

Both the reporting of the priority score filter pass/fail counts (still only for genes) and the missing annotations will be fixed in the 13.4.0 release which will come out once tested.

In the meantime, you should review your analysis scripts and their output. If this config is one you have been using frequently, you should see that the annotations for frequency and pathogenicity are missing and you will have a lot of unfiltered data. Mostly this will mean that missense SNVs are not properly scored for potential pathogenicity, and there will be no frequency filtering, which will further affect all variant scores. Known ClinVar variants will still be prioritised, assuming you're using the ClinVar whitelist.

@josephhalstead

josephhalstead commented Dec 6, 2023

image

Is it still not working properly or am I misunderstanding? The exomiser guy says it should be filtering variants?

As in, this says none are failing the filters.

@SophieS9
Author

SophieS9 commented Dec 8, 2023

@josephhalstead I think this may be a consequence of the HTML filtering summary not being updated (and also being on a per-gene basis). As an example from the same sample, here is the HTML:

MicrosoftTeams-image (1)

And here is the log printed to standard out:
MicrosoftTeams-image

And it has removed variants based on the frequency score in this case.

@julesjacobsen
Contributor

@josephhalstead sorry, I've been distracted and forgot to start the release process for this. It will be fixed in 13.4.0 and I'll aim to get this out next week, although this is a bit close to Christmas... @SophieS9 is correct in what she says.

julesjacobsen added a commit that referenced this issue Jan 31, 2024
…ent in logs and HTML output.

Also fixed bug where frequency and pathogenicity filters would not be provided with data when run after the initial variant load & filter step.
Moved analysis.FilterStats to new filters.FilterResultsCounter
Add new FilterResultCount data class
Add AnalysisResults.filterResultCounts field
Add new FilterRunner.filterCounts and FilterRunner.logFilterResult methods
Remove brittle logic for FilterStats from AbstractAnalysisRunner
Add Filterable.failedFilter method to enable tracking of both passed and failed filters (previously only passed was exposed)
@julesjacobsen
Contributor

@SophieS9, @josephhalstead this is fixed in Exomiser v14.0.0

@josephhalstead

Thank you!
