GitHub - NCBI-Hackathons/HapPyNet: Haplotype Block Based Dimensionality Reduction for Complex Variant-Disease Associations

Association of Haplotype blocks to phenotypes (in Python) using a neural network machine learning method

A tool to test the association of variants in haplotype blocks to phenotypes. This tool takes variants (VCF format) called by any technology, such as Exome, WGS, RNASeq or SNP genotyping arrays, and generates association test results.

Slides from our presentation at UCSC NCBI Hackathon

Install

Clone the repo
- git clone https://github.com/NCBI-Hackathons/HapPyNet.git
Install dependencies (varies depending on input)
- Details here and here

Usage

Generate SNP count matrix (Number of SNPs per LD block)
See README here
Run a neural net to classify samples into disease vs. normal
See README here

Method

Call variants using any platform (RNASeq, Exome, Whole Genome or SNP Arrays)
Group variants by haplotype blocks to compute SNP load in each haplotype block
Classify samples into disease vs. normal based on SNP load (number of SNPs per LD block) using a TensorFlow classifier
Associate haplotypes with phenotype. As of Apr 2018, this is NOT implemented
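
The grouping step above can be sketched as follows. This is a minimal illustration rather than the repository's own code: the function name `snp_load`, the tuple layouts, and the toy coordinates are all hypothetical. It assumes blocks are non-overlapping, half-open BED-style intervals and that variant positions have already been converted to 0-based coordinates (VCF positions are 1-based).

```python
from bisect import bisect_right
from collections import defaultdict


def snp_load(blocks, variants):
    """Count variants per LD block.

    blocks:   iterable of (chrom, start, end), half-open and non-overlapping
    variants: iterable of (chrom, pos), 0-based positions (an assumption)
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in blocks:
        by_chrom[chrom].append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()
    # Sorted block starts per chromosome, used for binary search.
    starts = {c: [s for s, _ in iv] for c, iv in by_chrom.items()}

    counts = {(c, s, e): 0 for c, iv in by_chrom.items() for s, e in iv}
    for chrom, pos in variants:
        if chrom not in by_chrom:
            continue  # variant on a chromosome with no blocks
        i = bisect_right(starts[chrom], pos) - 1
        if i >= 0:
            start, end = by_chrom[chrom][i]
            if pos < end:  # skip variants that fall in a gap between blocks
                counts[(chrom, start, end)] += 1
    return counts


# Toy data, made up for illustration only.
blocks = [("chr1", 0, 1000), ("chr1", 1000, 2500), ("chr2", 0, 500)]
variants = [("chr1", 10), ("chr1", 999), ("chr1", 1000), ("chr2", 700), ("chrX", 5)]
load = snp_load(blocks, variants)
```

Running this once per sample and stacking the resulting counts gives one row per sample and one column per LD block, which is the shape of matrix a classifier would consume.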

Data sources

LD Blocks: Non-overlapping LD blocks derived from 1KG data (hg19) were obtained from: Approximately independent linkage disequilibrium blocks in human populations, Bioinformatics. 2016 Jan 15; 32(2): 283–285, doi: 10.1093/bioinformatics/btv546. These regions were mapped to GRCh38 using NCBI's online remapping tool, with merge fragments turned ON so that each LD block is not fragmented.
RNASeq samples: The initial training set of healthy and disease samples was obtained from SRA. The disease sample selection query was: (AML) AND "Homo sapiens"[orgn:__txid9606] NOT ChIP-Seq. The list of SRR samples used is provided here

RNASeq Variant Calling Pipeline

RNASeq sample reads were aligned using HiSat2
Variants were called using GATK version 4.0.3.0 and quality filtered at read depth of 50 and genotype quality of 90
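
As an illustration of the quality filter, here is a minimal sketch that checks a single-sample VCF record against the two thresholds. It assumes the thresholds are inclusive minimums and that depth and genotype quality are stored in the standard `DP` and `GQ` FORMAT fields; the function name `passes_filter` and the example records are hypothetical, and this is not the pipeline's actual filtering command.

```python
MIN_DP = 50  # read depth threshold from the text (assumed to be a minimum)
MIN_GQ = 90  # genotype quality threshold from the text (assumed to be a minimum)


def passes_filter(vcf_line, min_dp=MIN_DP, min_gq=MIN_GQ):
    """Return True if a single-sample VCF data line meets both thresholds."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return False  # no FORMAT/sample columns present
    keys = fields[8].split(":")    # FORMAT column, e.g. GT:AD:DP:GQ:PL
    values = fields[9].split(":")  # first sample column
    sample = dict(zip(keys, values))
    try:
        dp = int(sample["DP"])
        gq = int(sample["GQ"])
    except (KeyError, ValueError):
        return False  # missing field or "." placeholder
    return dp >= min_dp and gq >= min_gq


# Made-up records for illustration only.
good = "chr1\t12345\t.\tA\tG\t812.6\t.\t.\tGT:AD:DP:GQ:PL\t0/1:30,32:62:99:840,0,790"
shallow = "chr1\t22345\t.\tC\tT\t95.1\t.\t.\tGT:AD:DP:GQ:PL\t0/1:6,6:12:99:120,0,115"
missing = "chr1\t32345\t.\tG\tA\t.\t.\t.\tGT:DP:GQ\t./.:.:."
kept = [line for line in (good, shallow, missing) if passes_filter(line)]
```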

Machine Learning

We trained a 4-layer neural network classifier using TensorFlow with leave-one-out cross-validation.
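
To show what leave-one-out cross-validation does, the sketch below holds out each sample in turn, fits on the rest, and scores the held-out prediction. To stay dependency-free it swaps in a nearest-centroid classifier as a stand-in for the TensorFlow network, so it illustrates only the evaluation loop, not the model used here. All function names and the toy SNP-load matrix are hypothetical.

```python
def leave_one_out(n_samples):
    """Yield (train_indices, held_out_index) for every sample."""
    for held_out in range(n_samples):
        train = [i for i in range(n_samples) if i != held_out]
        yield train, held_out


def fit_centroids(X, y):
    """Stand-in model: the mean feature vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids, row):
    """Return the label of the closest class centroid (squared Euclidean)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(row, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def loocv_accuracy(X, y):
    """Fraction of samples predicted correctly when each is held out once."""
    correct = 0
    for train, held_out in leave_one_out(len(X)):
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        correct += predict(model, X[held_out]) == y[held_out]
    return correct / len(X)


# Toy SNP-load matrix (rows = samples, columns = LD blocks), made up.
X = [[1, 0, 2], [2, 1, 1], [1, 1, 2], [9, 8, 7], [8, 9, 9], [9, 9, 8]]
y = ["normal", "normal", "normal", "disease", "disease", "disease"]
accuracy = loocv_accuracy(X, y)
```

With n samples the model is refit n times, which is affordable for a small training set and makes use of every sample for both training and evaluation.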

Results

Our classifier, trained on our test AML and normal samples, showed 99% cross-validated accuracy!

Next steps

Rerun on a large set of samples, with demographics- and batch-controlled normals
Explore standard differential gene expression methods from Bioconductor
Explore other normalization methods for Haplotype length and number of SNPs in samples
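
One possible normalization of the kind mentioned in the last item, offered only as a sketch: divide each block's SNP count by the block length in kilobases, then by the sample's total SNP count, so that long blocks and variant-rich samples do not dominate. This is an assumption about what such a normalization could look like, not something the project has implemented; `normalize_load` and the toy numbers are hypothetical.

```python
def normalize_load(counts, block_lengths):
    """Scale per-block SNP counts by block length (kb) and sample total.

    counts:        {block_id: number of SNPs in that block for one sample}
    block_lengths: {block_id: block length in base pairs}
    """
    total = sum(counts.values())
    normalized = {}
    for block, n in counts.items():
        per_kb = n / (block_lengths[block] / 1000)
        # Guard against samples with no called variants at all.
        normalized[block] = per_kb / total if total else 0.0
    return normalized


# Toy values, made up: b2 is three times longer and has three times the SNPs,
# so both blocks end up with the same normalized load.
counts = {"b1": 10, "b2": 30}
lengths = {"b1": 1000, "b2": 3000}
norm = normalize_load(counts, lengths)
```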

