BioScripts

INTRODUCTION

Scripts for bioinformatics processing and analysis. Supplement to other more useful tools like bcftools (http://samtools.github.io/bcftools/), vcftools (https://vcftools.github.io/index.html) and etc. Most scripts were designed to support "pipe-in" and "pipe-out".

All PERL scripts are written under PERL5.

INSTALL

Add the PATH to "MyPerl" folder in PERL5LIB environment, e.g. "export PERL5LIB=${PATH_to_/site_perl}/MyPerl:$PERL5LIB", or copy it to the site_perl folder which present in PERL5LIB, or just copy it to the same folder where you run the script

USAGE

Simply type "perl certain_script.pl" or "perl certain_script.pl -h" for details of each option

File Process

fasta_process.pl

Query, extract and processing fasta sequences.

Note: some options could be combined but have priority orders, for example extract and sort could be run in a single step, while sort and extract will not work; break it into two or more steps under these situations.

Extract sequences

Query a single gene

  echo "${gene_id}" | fasta_process.pl --fasta all.seq --query - --rows 0 > gene.fa

Query multiple genes

  fasta_process.pl --rows 0 --query query_genes.list --fasta all_genes.fasta > query_genes.fas

Extract a single region

  echo "${query_region}" | fasta_process.pl --rows 0 1 2 --subset 1 2 --query - \
  		--fasta genome.fasta > query.fa

Extract multiple sequences from the genome

  fasta_process.pl --rows 0 1 2 --query regions.list --fasta genome.fasta --subset 1 2 > query_regions.fas

Extract flanking sequence of each variant site and replace the nucleotide at the variant site with the alternative allele

  perl -ne 'next if (/\#/); my @line = split /\t/; my $ex_start = $line[1]-75; my $ex_end = $line[1]+75;
      print "$line[0]\t$ex_start\t$ex_end\t76\t$line[3]\t$line[4]\n";' example.vcf | \
      fasta_process.pl --query - --fasta genome.fasta --rows 0 1 2 3 4 5 --subset 1 2 \
      --replace 3,4,5 > vars.ex75.replaced.fas

Do statistics of sequences

Count triplet contents

  fasta_process.pl --fasta genome.fasta --count-nucl triplet > triplets.csv

Count triplet contents in query sequences

  fasta_process.pl --fasta genome.fasta --query query.regions \
      --rows 1 3 4 0 2 --subset 3 4 --count-nucl triplet > query.triplets.csv

Extract di/tir-nucleotide contents in tabular format

  awk 'BEGIN{OFS="\t"} !/\#/ {print $1,$2-1,$2;}' example.vcf | \
      fasta_process.pl --query - --fasta genome.fasta \
      --rows 0 1 2 --subset 1 2 --out-format tabular | sed 's/\_/\t/g' > nt2-1.csv

Sequence manipulation

Translate nucleotides to proteins and remove final "*"

  fasta_process.pl --fasta cds.fasta --translate --wordwrap 60 | \
  	perl -ne 's/\*$//; next if(/^\s+$/); print;' > protein.fasta

Split multiple-sequences file into multiple single-sequence files
```
  fasta_process.pl --fasta multiple.fasta --split
```

Sort fasta file by a user defined order, fasta file could also given from a pipe

  cat *.fasta | fasta_process.pl --fasta - --out-order orders.list > sorted.fasta

Reverse complement sequences

  fasta_process.pl --fasta example.fasta --reverse --complement > rc.fasta

Filtering sequences

Filtering fasta file by length

  fasta_process.pl --fasta example.fasta --lower 100 --upper 2000 > len100_2000.fasta

Filtering fasta file by id

  fasta_process.pl --fasta example.fasta --match "scaffold|contig" > chromosome.fasta

rename_fasta.pl

Rename sequences ids in fasta format file

Rename sequences ids in list

  rename_fasta.pl --fasta origin.fas --word-wrap 80 --refer rename_ids.txt > renamed.fasta

Example about the rename_ids.txt:

  #old_id	new_id     
  seq1		human_seq1
  seq2		human_seq2
  ...

convert_fastq_quality.pl

Convert fastq encodings

Convert Sanger encoding to Illumina 1.5+

  zcat reads.fq.gz | convert_fastq_quality.pl -i - -c sanger2illumina | gzip -c > reads.converted.fq.gz

vcf_process.pl

VCF format file related processing.

This script does quite a lot things, including filtering, combining, clustering and etc., seems I put too many functions here ... However, since the VCF format generated from different caller varies, this script was manily tailored for vcf file generated from GATK (UnifiedGenotyper or HaplotypeCaller, http://www.broadinstitute.org/gatk/), some functions require the AD (allele depth) field, so it may not perform very well for VCF files generated from other caller.

Filtering variants

Filtering by depth, this only mark samples with depth failed this criteria as missing, but will not filter the whole locus
```
  vcf_process.pl --vcf example.vcf.gz -min-sample-depth 10 --max-sample-depth 80 > depth_flt.vcf
```
Filtering by depth and number of missing allele calls, first check depth, then count all missing calls include those failed the depth criteria
```
  vcf_process.pl --vcf example.vcf.gz -min-sample-depth 10 --max-sample-depth 80 --max-missing 8 > flt.vcf
```

Specify the depth for each sample in a file, the overall criteria will still be effective if some sample were not specified

  vcf_process.pl --vcf example.vcf.gz -min-sample-depth 10 --max-sample-depth 80 \
  	--depth-file sample_depth.txt > depth_flt.vcf

Specify some samples as natural homozygous sample (e.g. inbred lines), others would be treated as heterozygous, filtering heterozygous sites in homozygous samples (denoted as "pseudo-heterozygosity" here, mostly raised from mapping errors due to duplications)
```
  vcf_process.pl --vcf example.vcf.gz --homo-samples sample1 sample2 --max-pseudo-het 0 > flt.vcf
```

Filtering by reference/non-reference sample counts, distinguish homozygous and heterozygous samples

  vcf_process.pl --vcf example.vcf.gz --homo-samples sample1 sample2 \
  	--min-hom-ref 5 --min-het-ref 4 --max-hom-missing 5 > flt.vcf

Screen out rare alleles (allele with sample frequency less than the specified value)
```
  vcf_process.pl --vcf example.vcf.gz --rare-only 3 > rare.vcf
```

Note: some filtering criteria have priority orders, do check the results after filtering!

Genotype manipulation

vcf_process.pl use the non-reference allele depth ratio (NRADR, reads support reference allele / all reads covered) to test whether the initial genotyping was really accurate, genotypes failed these criteria could be re-genotyped or set as missing, require AD fields (also add support for NR,NV tags generated from caller like Platypus, but less tested)

For homozygous samples, no heterozygous genotypes should be expected, NRADR should be near zero (reference homozygous) or near 100% (alternative homozygous), considering the sequencing errors, a conserved range could be 5%~95% for high coverage data (above 20x)
```
  vcf_process.pl --vcf hc.vcf.gz --default-sample-type hom --regenotype-hom 0.05 \
  	--gt-diff-as-missing > genotype.flt.vcf 
```
For heterozygous samples, we need two values, one for homozygous genotypes (same as used for homozygous samples), another for heterozygous genotypes (usually expect 50%, newly arised mutations could vary), 30%~70% maybe ok for a reliable heterozgyous call
```
  vcf_process.pl --vcf hc.vcf.gz --regenotype-het 0.05,0.3 --gt-diff-as-missing > genotype.flt.vcf 
```

Contain both homozygous and heterozygous samples

  vcf_process.pl --vcf hc.vcf.gz --homo-samples sample1 sample2 --regenotype-hom 0.05 \
  	--regenotype-het 0.05,0.3 --gt-diff-as-missing > genotype.flt.vcf

Collecting statistics and metrics of variants

Collect variants metrics, mainly designed for GATK callers

  vcf_process.pl --vcf hc.vcf.gz --out-metrics \
      --metrics DP MQ MQ0 BaseQRankSum ClippingRankSum MQRankSum ReadPosRankSum InbreedingCoeff FS SOR \
      > hc.metrics.csv

Metrics after filtering

  vcf_process.pl --vcf hc.vcf.gz \
      --quality 50 --min-alleles 2 --max-alleles 2 --min-sample-depth 10 --max-missing 14 | \
      vcf_process.pl --vcf - --out-metrics \
      --metrics DP MQ MQ0 BaseQRankSum ClippingRankSum MQRankSum ReadPosRankSum InbreedingCoeff FS SOR \
      > hc.flt.metrics.csv

Collect genotype infos, require the AD field

  vcf_process.pl --vcf hc.vcf.gz \
      --quality 50 --stats-only --out-genotype-stats --ref-depth --var-type snp > hc.snp.gts.csv

Generate statistics for each locus

  vcf_process.pl --vcf snp.vcf.gz --stats-only --out-locus-stats > snp.locus_stats.csv

Count base changes for all bi-allelic heterozygous sites

  vcf_process.pl --vcf example.vcf.gz \
      --min-alleles 2 --max-alleles 2 --base-changes --GT-types "0/1" > het.changes.csv

Generate statistics for distances between adjacent variants

  vcf_process.pl --vcf snp.vcf.gz --stat-var-dist --source-tag GT > snp.dist.csv

Summary of results generated from GATK DiagnoseTargets (https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_diagnostics_diagnosetargets_DiagnoseTargets.php)
```
  vcf_process.pl --vcf diagnose.vcf --sum-diagnose > diagnose.stats.csv
```

Get variant sequence context (experimental)

  vcf_process.pl --vcf snp.vcf.gz --check-context --fasta genome.fasta > snp.context.vcf

Notes for context checking:

1) Only bi-allelic loci is supported while analysis sequence context, multi-alleles need to be 
breaked first;
2) Extension here is different for SNPs and INDELs, e.g. upstream 5bp and downstream 5bp for SNPs, 
while only downstream 10bp for INDELs, thus the INDELs are assumed to be already left aligned

Clustering variants

Use vcf_process.pl to clustering markers (genetically linked regions). The clustering function is used to identify genome blocks through certain type of markers. This was done by fisrt search for the reliable seeds (segments with consecutive markers of the same type and pass the criteria, the "seeding" stage), then merge adjacent seeds with same type to form blocks (the "extension" stage), the boundary between blocks of different type was determined according to the markers present between two blocks or use the middle point while no more markers present. The "seeding-and-extension" algorithm was borrowed from "Wijnker, E. et al. The genomic landscape of meiotic crossovers and gene conversions in Arabidopsis thaliana. eLife 2, e01426 (2013)", which used for identify recombinat blocks.

Example about the input vcf file. The source type could be genotypes (use GT field) or other user defined types (e.g. ancestral status) an example of markers.vcf.gz could be:

  #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample
  chr01   161     .       C       A       54.12   .       .       GT:SC   0/0:A/A
  chr01   431     .       C       T       44.2    .       .       GT:SC   0/0:A/A
  chr01   1641    .       G       A       64.15   .       .       GT:SC   1/1:B/B
  chr01   4165    .       C       A       34.31   .       .       GT:SC   1/1:B/B
  ...

No clustering, just output blocks with consecutive markers with same source type,

  vcf_process.pl --vcf markers.vcf.gz --out-blocks --source-tag "SC" > markers.blocks.csv

Clustering

  vcf_process.pl --vcf markers.vcf.gz --out-blocks --source-tag "SC" --fill-gaps \
      --min-frag-length 10000 --min-frag-markers 25 > markers.blocks.l10km25.csv

Use paintGenomeBlocks.pl to visually compare two results

  awk 'BEGIN{OFS="\t";} !/\#/ {$1 = $1"-original"; print;}' markers.blocks.csv | \
      cat markers.blocks.l10km25.csv - | \
      paintGenomeBlocks.pl --input - \
      --width 1600 --height 3000 --thickness 10 --chrom-distance 20 --block-distance 2 \
      --output markers.blocks.cmp --length reference_genome.fasta.fai \
      --colors "type1:strong_red2;B:strong_blue2" --sort-blocks sample-original sample --format png

Combining vcf files

Combine two vcf files according to the "CHROM" and "POS" fields

  vcf_process.pl --vcf hc.vcf --secondary-vcf ug.vcf --combine-rows 0 1 \
      --primary-tag HC --secondary-tag UG --intersect-tag "UG+HC" > combined.vcf

Combine two vcf files according to the "CHROM", "POS" and "ALT" fields, if the "ALT" field differ, there will be two records in combined vcf file

  vcf_process.pl --vcf hc.vcf --secondary-vcf ug.vcf --combine-rows 0 1 4 \
      --primary-tag HC --secondary-tag UG --intersect-tag "UG+HC" > combined.vcf

Combine two vcf files according to the "CHROM" and "POS" fields, but if the "ALT" field differ, write the "ALT" info of secondary file into "SDIFF" field

  vcf_process.pl --vcf hc.vcf --secondary-vcf ug.vcf --combine-rows 0 1 --compare-row 4 \
      --primary-tag HC --secondary-tag UG --intersect-tag "UG+HC" > combined.vcf

downsample_vcf.pl

Downsampling vcf file.

Random pick 100 sites from a VCF file

  downsample_vcf.pl -i input.vcf.gz -s 100 -t 100 -o rand

gff2fasta.pl

Extract sequences (mRNA or CDS) from gff3 file to fasta file.

Extract mRNA sequences

  gff2fasta.pl --gff example.gff --seqs example.fasta --features mRNA > example.mRNA.fasta

Extract CDS sequences

  gff2fasta.pl --gff example.gff --seqs example.fasta --features cds > example.cds.fasta

gff2tables.pl

Convert gff to tabular format

intervals2bed

Convert intervals to bed format, e.g. chr01:1-1000 -> chr01 0 1000

maskedSEQ2bed.pl

Extract masked (lowercase and Ns) regions in bed format, quite slow ...

bam2fasta.pl

Convert SAM/BAM to FASTA/Q format

Extract unmapped reads in BAM file and send it to blast+

  bam2fasta.pl -b example.bam -s "-f 4" | \
      		blastall -p blastn -d genome.fasta -e 1e-70 -a 1 -m 8 -F F -n T -o unmapped.blast.csv

sam2fastq.pl

Convert SAM/BAM to FASTQ format

Extract paired reads

  sam2fastq.pl -i example.sam -o prefix

extract_bam_pairs.pl

Extract reads in pairs from SAM/BAM file

Extract reads mapped to region "chr03 4000 5000"

  echo "chr03 4000 5000" | \
      extract_bam_pairs.pl --input - --extend 2000 --samtools "-f2 -F3852 -q20" \
      --min-insert 450 --max-insert 550 --max-clipping 2 --format fastq \
      --bam sample1.bam sample2.bam --output extracted

Note: Difference between bam2fasta.pl, sam2fastq.pl and extract_bam_pairs.pl

	bam2fasta.pl ignore paired infos;
	sam2fastq.pl convert in pairs;
	extract_bam_pairs.pl extract all reads overlap certain regions and search for its pairs.

They can be actually integrated, but why 3 scripts? Because I forget the previous one when I started writing a new one, and finally I got three ...

extract_split_seqs.pl

Extract multiple fragments of a sequence and join the fragments into a new sequence

Extract Pfam domains (use protein positions to extract CDS sequences), expected format for “pfam_domain_pos.csv”: "seq_id chr1:start1-end1;chr1:start2-end2"
```
  extract_split_seqs.pl --query pfam_domain_pos.csv --convert-pos pro2cds --fasta cds.fasta > extract.fasta
```

Calculation

calc_vcf_diversity.pl

Calculate within- or between-groups diversities from a vcf file, an alternative choice of "vcftools --window-pi".

Calculate pairwise diversity in non-overlapping windows

  calc_vcf_diversity.pl --vcf test.vcf.gz --window-size 500000 --no-group --no-within > pairs_divs.csv

Calculate pairwise diversity in specified regions

  calc_vcf_diversity.pl --vcf test.vcf.gz --intervals test.regions --no-within --no-group > pairs_divs.csv

Calculate within and between group diversity in non-overlapping windows

  calc_vcf_diversity.pl --vcf test.vcf.gz --window-size 500000 \
  	--group test.sample_panel --all-pairs  --threads 2 > group_divs.csv

Calculate within and between group diversity in specified regions

  calc_vcf_diversity.pl --vcf test.vcf.gz --intervals test.regions \
  		--group test.sample_panel --all-pairs --threads 2 > group_divs.csv

Note: values calculated directly from VCF file tends to underestimate the true diversity as it divide all sites rather than informative sites, one workaround is to use a three step calculation which also get the number of informative sites

First count all informative sites (this requires the VCF file also contain the non-variant allele infos)
```
  count_vcf_pairs.pl --vcf sites.vcf.gz --intervals example.regions > pair_infos.csv
```

Second, calculate differences between each pairs

  calc_vcf_diversity.pl --vcf snp.vcf.gz --no-group --no-within --intervals example.regions > pair_diffs.csv

Finally, obtain the adjusted diversity by diffs / informative_sites

  calc_pop_divs.pl --query pair_diffs.csv --subject pair_infos.csv --min-info-perc 50 > pairs_divs.mean.csv

count_vcf_pairs.pl

Accompany script of calc_vcf_diversity.pl to count informative sites

calc_pop_divs.pl

Post-calculation script of results from calc_vcf_diversity.pl and count_vcf_pairs.pl

slide_windows_count.pl

Count by slide windows

Count in 100kb windows, excluding scaffolds

  slide_windows_count.pl -i example.csv -w 100000 -l genome.fa.fai -e scaffold > example.100k.csv

Count with different types

  slide_windows_count.pl -i example.csv -w 100000 -l genome.fa.fai -t > example.100k.csv

Note: This script use two different algorithms while counting non-overlapping and overlapping (when a step size is specified) windows. The first algorithm is relatively fast and does not require too much memory as it count on-the-fly while reading file; however, count non-overlapping windows needs first read the entire file, and the subsequent counting rely on the grep function is not quite efficient, thus could be very slow and memory-intensive while process large files.

Visualization

paintGenomeBlocks.pl

Plot blocks along each chromosome.

Plot recombinat blocks with centromeres

  cat centromere.csv blocks.csv | \
      paintGenomeBlocks.pl --input - --rows 0 1 2 3 4 --split-pairs \
      --width 1600 --height 12000 --thickness 10 --chrom-distance 20 --block-distance 2 \
      --output blocks --length genome.fa.fai --format svg \
      --colors "A:strong_red2;B:strong_blue2;Centromere:grey" --sort-blocks Centromere sample1 sample2

Plot variants (just extend it to a "block")

  awk 'BEGIN{OFS="\t"} !/\#/ {print "Example",$1,"SNP",$2-10,$2+10;}' example.vcf | \
      paintGenomeBlocks.pl --input - \
      --width 800 --height 600 --thickness 20 --chrom-distance 20 --block-distance 2 \
      --output example.snp --length genome.fa.fai --colors "SNP:red"

ggplot-windows-count.R

Plot distributions with ggplot2

Prepare examples

  bgzip -dc example1.vcf.gz | slide_windows_count.pl --input - --window-size 200000 > example1.w200k.csv
  bgzip -dc example2.vcf.gz | slide_windows_count.pl --input - --window-size 200000 > example2.w200k.csv

Reformat

  cat example1.w200k.csv | \
      awk 'BEGIN{OFS="\t"} !/##/ {if(/#/){print "Sample\tCHROM\tINTERVAL\tBIN_ID\tCOUNT";}
      	else{print "Example1",$1,$2,$3,$4;}}' \
      > example.w200k.txt
  
  cat example2.w200k.csv | \
      awk 'BEGIN{OFS="\t"} !/#/ {print "Example2",$1,$2,$3,$4;}' \
      >> example.w200k.txt

Plot

  Rscript ggplot-windows-count.R example.w200k.txt example.w200k.tiff

ggplot-windows-multi.R

Alternative to ggplot-windows-count.R, designed for different input format

Prepare examples

  echo -e "chr01\t122\tSNP\nchr01\t142\tSNP\nchr01\t254\tINDEL\n" | \
      slide_windows_count.pl -i - -w 500 -t | cut -f 4 --complement | \
      perl -ne 'next if (/^\#\#/); s/\#//g; print;' \
      > counts.w500.txt

Plot

  Rscript ggplot-windows-multi.R counts.w500.txt counts.w500.tiff

ggplot-liner-fit.R

Liner fit with ggplot2

plot-scatter_with_lines.R

Scatter plot

Miscellaneous

reference_align.pl

Mapping with clustalw2(http://www.clustal.org/clustal2/) or muscle(http://www.drive5.com/muscle/).

Align sequences to a reference sequence, this was done by 2 steps:

Step1: align the first sequence to the reference sequence, and get a expanded reference sequence with gaps inserted, then align the second sequence to the new reference sequence, iterate this process to generate a reference sequence expand all query sequences;

Step2: build a consensus sequence with nucleotide replaced in reference sequence, and re-align all sequences to the consensus sequence.

Align all subsequent sequences to the first sequence use clustalw2
```
  reference_align.pl -i example.fa > example.aln.fas
```

Align with muscle

  reference_align.pl -i example.fa -a muscle > example.aln.fas

parallel_baseml.pl

Wrapper to run PAML (baseml) in parallel

Run all models in a shot

  parallel_baseml.pl --input aln.fas --model 0 1 2 3 4 5 6 7 8 --align-length 100 > aln.csv

parallel_codeml.pl

Wrapper to run PAML (codeml) in parallel

Run with multiple threads

  parallel_codeml.pl --ref ref.tbl --samples samples.tbl --query query.tsv --threads 6 > query.codeml.csv

ascii2num

Convert ASCII characters to Numbers

Mainly used to inspect Fastq qualities
```
  ascii2num "AAAA"
```

map_records.pl

Mapping records from subject file to query file

Map records based on the 1st column, merge duplicates to a single line

  map_records.pl --rows1 0 --rows2 0 --merge-dups --query query.list --subject subject.txt > mapped.csv

These scripts are free softwares; you can redistribute it and/or modify it under the same terms as Perl itself.

Name		Name	Last commit message	Last commit date
Latest commit History 83 Commits
site_perl/MyPerl		site_perl/MyPerl
.gitattributes		.gitattributes
README.md		README.md
ascii2num		ascii2num
bam2fasta.pl		bam2fasta.pl
calc_pop_divs.pl		calc_pop_divs.pl
calc_vcf_diversity.pl		calc_vcf_diversity.pl
convert_fastq_quality.pl		convert_fastq_quality.pl
convert_trf.pl		convert_trf.pl
count_vcf_pairs.pl		count_vcf_pairs.pl
detect_mutations.pl		detect_mutations.pl
downsample_vcf.pl		downsample_vcf.pl
extract_bam_pairs.pl		extract_bam_pairs.pl
extract_split_seqs.pl		extract_split_seqs.pl
fasta_process.pl		fasta_process.pl
gff2fasta.pl		gff2fasta.pl
gff2tables.pl		gff2tables.pl
ggplot-liner-fit.R		ggplot-liner-fit.R
ggplot-windows-count.R		ggplot-windows-count.R
ggplot-windows-multi.R		ggplot-windows-multi.R
intervals2bed		intervals2bed
map_pos2intervals.pl		map_pos2intervals.pl
map_records.pl		map_records.pl
maskedSEQ2bed.pl		maskedSEQ2bed.pl
paintGenomeBlocks.pl		paintGenomeBlocks.pl
parallel_baseml.pl		parallel_baseml.pl
parallel_codeml.pl		parallel_codeml.pl
parseSEQs.pl		parseSEQs.pl
plot-scatter_with_lines.R		plot-scatter_with_lines.R
reference_align.pl		reference_align.pl
rename_fasta.pl		rename_fasta.pl
sam2fastq.pl		sam2fastq.pl
subtotal_stats.pl		subtotal_stats.pl
vcf2tables.pl		vcf2tables.pl
vcf_process.pl		vcf_process.pl

wl13/BioScripts

Folders and files

Latest commit

History

Repository files navigation