Association of Haplotype blocks to phenotypes (in Python) using a neural network machine learning method
A tool to test the association of variants in haplotype blocks with phenotypes. It takes variants (VCF format) called by any technology, such as Exome, WGS, RNASeq or SNP Genotyping Arrays, and generates association test results.
Slides from our presentation at UCSC NCBI Hackathon
- Clone the repo
git clone https://github.com/NCBI-Hackathons/HapPyNet.git
- Install dependencies (varies depending on input)
- Generate SNP count matrix (Number of SNPs per LD block)
See README here
- Run a neural net to classify samples into disease vs. normal
See README here
- Call variants using any platform (RNASeq, Exome, Whole Genome or SNP Arrays)
- Group variants by haplotype blocks to compute SNP load in each haplotype block
- Classify samples into disease vs. normal, based on SNP load (number of SNPs per LD block), using a TensorFlow classifier
- Associate haplotypes with phenotype. As of Apr 2018, this is NOT implemented
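The grouping step in the list above (SNP load per haplotype block) can be sketched as follows. This is a minimal illustration, not the code in this repo: the function names `build_block_index` and `count_snps_per_block` are hypothetical, the blocks and variants are toy values, and it assumes blocks are non-overlapping half-open intervals in the same coordinate system as the variant positions.

```python
from bisect import bisect_right
from collections import defaultdict


def build_block_index(blocks):
    """Group (chrom, start, end) blocks by chromosome, sorted by start."""
    index = defaultdict(list)
    for chrom, start, end in blocks:
        index[chrom].append((start, end))
    for chrom in index:
        index[chrom].sort()
    return index


def count_snps_per_block(variants, index):
    """Return {(chrom, start, end): SNP count} for non-overlapping blocks."""
    counts = {(c, s, e): 0 for c, spans in index.items() for s, e in spans}
    starts = {c: [s for s, _ in spans] for c, spans in index.items()}
    for chrom, pos in variants:
        if chrom not in index:
            continue  # variant on a chromosome with no LD blocks
        # Last block whose start is <= pos (binary search)
        i = bisect_right(starts[chrom], pos) - 1
        if i >= 0:
            s, e = index[chrom][i]
            if s <= pos < e:  # assumption: half-open intervals
                counts[(chrom, s, e)] += 1
    return counts


# Toy data for illustration only
blocks = [("chr1", 100, 200), ("chr1", 200, 350), ("chr2", 50, 150)]
variants = [("chr1", 120), ("chr1", 199), ("chr1", 200), ("chr1", 400), ("chr2", 60)]
snp_load = count_snps_per_block(variants, build_block_index(blocks))
```

Each sample would yield one such dictionary, and stacking the counts across samples gives the SNP count matrix used for classification.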
- LD Blocks: Non-overlapping LD blocks derived from 1KG data (hg19) were obtained from: Approximately independent linkage disequilibrium blocks in human populations, Bioinformatics. 2016 Jan 15; 32(2): 283–285, doi: 10.1093/bioinformatics/btv546. These regions were mapped to GRCh38 using NCBI's online remapping tool, with merge fragments turned ON so that each LD block is not fragmented
- RNASeq samples: The initial training set of healthy and disease samples was obtained from SRA. The disease sample selection query was:
(AML) AND "Homo sapiens"[orgn:__txid9606] NOT ChIP-Seq
The list of SRR samples used is provided here
- RNASeq sample reads were aligned using HiSat2
- Variants were called using GATK version 4.0.3.0 and quality filtered at read depth of 50 and genotype quality of 90
- We trained a classifier with a 4-layer neural network using TensorFlow, with leave-one-out cross-validation.
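The quality filter mentioned in the list above (read depth of 50, genotype quality of 90) could be applied to single-sample VCF records roughly as follows. This is a sketch under stated assumptions, not the project's filtering code: it treats both thresholds as minimums, reads `DP` and `GQ` from the sample column, and uses a hypothetical helper `passes_filter` with made-up records.

```python
MIN_DEPTH = 50  # read depth threshold from the text, assumed to be a minimum
MIN_GQ = 90     # genotype quality threshold from the text, assumed to be a minimum


def passes_filter(vcf_line, min_depth=MIN_DEPTH, min_gq=MIN_GQ):
    """Return True if a single-sample VCF record meets both thresholds."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return False  # no FORMAT/sample columns present
    keys = fields[8].split(":")    # FORMAT column, e.g. GT:AD:DP:GQ
    values = fields[9].split(":")  # first (only) sample column
    sample = dict(zip(keys, values))
    try:
        return int(sample["DP"]) >= min_depth and int(sample["GQ"]) >= min_gq
    except (KeyError, ValueError):
        return False  # missing or "." values fail the filter


# Made-up records for illustration only
records = [
    "chr1\t120\t.\tA\tG\t500\t.\t.\tGT:AD:DP:GQ\t0/1:30,40:70:99",
    "chr1\t300\t.\tC\tT\t80\t.\t.\tGT:AD:DP:GQ\t0/1:10,12:22:99",
    "chr2\t60\t.\tG\tA\t300\t.\t.\tGT:AD:DP:GQ\t1/1:0,64:64:75",
]
kept = [r for r in records if passes_filter(r)]
```

Only the first record survives here: the second fails on depth and the third on genotype quality.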
Our classifier model, trained on our test AML and normal samples, showed 99% cross-validated accuracy.
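For readers unfamiliar with leave-one-out cross-validation, the procedure can be sketched as below. To keep the example dependency-free, a simple nearest-centroid classifier stands in for the TensorFlow network; it is not the model used in this project. The function names and the toy SNP-load values are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, sample):
    """Stand-in classifier: predict the label whose mean vector is closest."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return min(centroids, key=lambda label: sq_dist(centroids[label], sample))


def leave_one_out_accuracy(features, labels, predict=nearest_centroid_predict):
    """Hold out each sample once, train on the rest, and score the prediction."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)


# Toy SNP counts per LD block (rows = samples), for illustration only
features = [[12, 3, 7], [11, 4, 6], [13, 2, 8], [2, 9, 1], [3, 10, 2], [1, 8, 1]]
labels = ["disease", "disease", "disease", "normal", "normal", "normal"]
accuracy = leave_one_out_accuracy(features, labels)
```

With N samples the model is refit N times, which is affordable for small cohorts and uses every sample for both training and evaluation. Any classifier with the same `predict(train_x, train_y, sample)` shape could be passed in place of the stand-in.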
- Rerun on a large set of samples, with demographics and batch controlled normals
- Explore standard differential gene expression methods from Bioconductor
- Explore other normalization methods for Haplotype length and number of SNPs in samples