Steve Bond edited this page Oct 28, 2015 · 5 revisions

--clean_seq, -cs

Description

Removes all non-sequence characters from the input, including spaces, numbers, gap characters (e.g., '-'), and stop characters (e.g., '*'). Passing the word 'strict' also replaces ambiguous/degenerate characters in nucleotide sequences with 'N'.

Nucleotide sequences: ATGCURYWSMKHBVDNX will be retained. If 'strict' is specified, only ATGCXNU will be retained.

Protein sequences: ACDEFGHIKLMNPQRSTVWXY will be retained. The 'strict' argument has no effect on protein sequences.
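
The default filtering described above can be pictured as a whitelist pass over each sequence. The following Python snippet is only an illustrative sketch, not SeqBuddy's actual implementation: the names `NUCLEOTIDE_KEEP`, `PROTEIN_KEEP`, and `clean_sequence` are hypothetical, and the case-insensitive comparison is an assumption, since this page does not say how lower-case residues are treated.

```python
# Illustrative sketch only; names are hypothetical, not part of SeqBuddy.
# The two alphabets are copied from the description on this page.
NUCLEOTIDE_KEEP = set("ATGCURYWSMKHBVDNX")
PROTEIN_KEEP = set("ACDEFGHIKLMNPQRSTVWXY")


def clean_sequence(seq, alphabet):
    """Return seq with every character outside `alphabet` dropped.

    Assumption: the comparison is case-insensitive and the original
    case of retained characters is preserved.
    """
    return "".join(ch for ch in seq if ch.upper() in alphabet)


# Gaps, whitespace, digits, and stop characters are all discarded.
print(clean_sequence("MP-LHT 12*", PROTEIN_KEEP))         # MPLHT
print(clean_sequence("ATG---RYN TAG*", NUCLEOTIDE_KEEP))  # ATGRYNTAG
```

A whitelist is used here rather than a list of characters to strip because the description defines the behaviour in terms of what is retained.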

Arguments

'strict' ( exact string )

Optional. By default, ambiguous nucleotide characters (i.e., the degenerate alphabet) are retained, but these can cause issues for some downstream analyses. Include the word 'strict' to replace ambiguous characters with a single unified character ('N' by default).

Replacement character ( char )

Optional. If 'N' is not the desired replacement character for degenerate residues, specify a different one.
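
How the two arguments interact can be sketched in Python as follows. This is a hedged illustration, not the tool's real code: `clean_strict`, `UNAMBIGUOUS`, and `DEGENERATE` are hypothetical names. The sketch is modelled on Usage examples 3 and 4 on this page, where every symbol other than A, T, G, C (and, by assumption, U) ends up as the replacement character. Note that the description lists X and N as retained under 'strict', whereas those examples show them mapped to the replacement character; the sketch follows the examples.

```python
# Illustrative sketch only; names are hypothetical, not part of SeqBuddy.
UNAMBIGUOUS = set("ATGCU")         # kept as-is in strict mode
DEGENERATE = set("RYWSMKHBVDNX")   # collapsed to the replacement character


def clean_strict(seq, replacement="N"):
    """Strict cleaning of a nucleotide sequence.

    Unambiguous bases are kept, degenerate symbols become `replacement`,
    and everything else (gaps, digits, '*', whitespace) is dropped.
    Assumption: matching is case-insensitive.
    """
    out = []
    for ch in seq:
        upper = ch.upper()
        if upper in UNAMBIGUOUS:
            out.append(ch)
        elif upper in DEGENERATE:
            out.append(replacement)
    return "".join(out)


print(clean_strict("RRRRCAAC"))        # NNNNCAAC
print(clean_strict("MGTACG---CCAG"))   # NGTACGCCAG
print(clean_strict("XATTTN", "X"))     # XATTTX
```

The second positional argument corresponds to the optional replacement character, so `clean_strict(seq, "X")` mirrors `-cs strict X`.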

Examples

Input file: Mle-Panx_align.fa

>Mle-Panxα1
MYWIFEICQEIKRAQSCRKFAIDGPFDWTNRIIMPTLMVICCFLQTFTFMFGSNISCIGF
EKLERNFVEEYCWTQGIYTSKAAYNMP-LHTPYPGIAPCVPEYDPVTQKYWLPCG----V
EEEDKAYHLWYQWVPFYFLAVAVGYYLPFLILKGSKLHQVKPLITYLMNQRNLETDPNHL
VGKLSHWIFRQLVYSRFAATSTIRMYWHDWGLVLLVCSVKILYLTVSLIHLFATAKMFHI
GNWFTYGIMFARR---SNSHTTHVKDVFFPKMVACKIETWSFTGKNHLHGMCVLALNVMN
QYLFLIVWYVNVIIIFLNSISCIYTIVKFCSPNIVHHRIVNSSSLDDHHDFTRMFGYVGP
SGRIILAKMSEHMPGYMLKQVAKKVTEKIDIENEKNRGRAPTIKFTKVNGQPSELARQPL
MHLNALMLGMVPQNLPEPKIQNIQRSQKKVRFLV*
>Mle-Panxα11
M--LISSLVQFSRLSPFKEITIDDGWDQLNRSFMFVLMVICGTIVTVRQHTGNIISCNGF
TKYDGSFSEDYCWTQGLYTIREAYHVSDVNVPYPGV---IPEEIPLCLGDNC---DKLAN
SNTTRVYHLWYQWIPFYFWLASAAFFLPYLIYKRYGFGDIKPLIHMLYNPLDGDEGVKAD
SEKASIWLYHRFS-IYMNEHSMYANFMERHGIGILVIAIKVMYLIISVLLMVMTAMMFEL
ADFKQYGIVWAQQWPDPPANVTGIKDLLFPKMVACEIKRWGPTGLEDENGMCVLAPNVIN
QYIFLILWWALVFTIVSNVFNVLAGVIRIVFIYGSYRRMLASAFLRDDPHYKKVYYKIGT
SGRVILNMLAASISPTCFQEIMNNVCPRLIRAHVSKKGRNLGDD----------------
------------PLL*-------------------

Usage example 1

$: sb Mle-Panx_align.fa -cs

Output

>Mle-Panxα1
MYWIFEICQEIKRAQSCRKFAIDGPFDWTNRIIMPTLMVICCFLQTFTFMFGSNISCIGF
EKLERNFVEEYCWTQGIYTSKAAYNMPLHTPYPGIAPCVPEYDPVTQKYWLPCGVEEEDK
AYHLWYQWVPFYFLAVAVGYYLPFLILKGSKLHQVKPLITYLMNQRNLETDPNHLVGKLS
HWIFRQLVYSRFAATSTIRMYWHDWGLVLLVCSVKILYLTVSLIHLFATAKMFHIGNWFT
YGIMFARRSNSHTTHVKDVFFPKMVACKIETWSFTGKNHLHGMCVLALNVMNQYLFLIVW
YVNVIIIFLNSISCIYTIVKFCSPNIVHHRIVNSSSLDDHHDFTRMFGYVGPSGRIILAK
MSEHMPGYMLKQVAKKVTEKIDIENEKNRGRAPTIKFTKVNGQPSELARQPLMHLNALML
GMVPQNLPEPKIQNIQRSQKKVRFLV
>Mle-Panxα11
MLISSLVQFSRLSPFKEITIDDGWDQLNRSFMFVLMVICGTIVTVRQHTGNIISCNGFTK
YDGSFSEDYCWTQGLYTIREAYHVSDVNVPYPGVIPEEIPLCLGDNCDKLANSNTTRVYH
LWYQWIPFYFWLASAAFFLPYLIYKRYGFGDIKPLIHMLYNPLDGDEGVKADSEKASIWL
YHRFSIYMNEHSMYANFMERHGIGILVIAIKVMYLIISVLLMVMTAMMFELADFKQYGIV
WAQQWPDPPANVTGIKDLLFPKMVACEIKRWGPTGLEDENGMCVLAPNVINQYIFLILWW
ALVFTIVSNVFNVLAGVIRIVFIYGSYRRMLASAFLRDDPHYKKVYYKIGTSGRVILNML
AASISPTCFQEIMNNVCPRLIRAHVSKKGRNLGDDPLL

Input file: ambiguous_cds.fa

>ML47742a.
ATGTTAGACATACTTTCAAAGTTTTGCTGAGTTACTCCTTTTAAAGGTATAACGATAGAT
RRRRRRRRRRRRCAACTCAATCGGAGTTTTATGTTCGTCCTGCTCGTTGTCATGGGAACG
YCTGTCACTGTCCGGCAATACACCGGCAGTGTCATCAGTTGTGACGGCTTCAAAAAGTTT
WGATCCACTTTTGCGGAGGATTACTGTTGTCCCCAGGGACTGTACACAGTTTTAGAAGGA
SATGAACCAGTCAGACTCAAGTTCCCTTACCCAGGCCTCCTTCCAGACGAGGCACCACCC
MGTACGACGGTACGAGGTTAAAGT------------------CCAGACCCTGATCAGTTG
KTGTCACCGACGCGGATATCCCACCTATGGTACCAGTGGGTCCCTTTTTACTTCTGGTTG
HCGGCTGCTGCCTTCTTCATGCCCTACCTTCTGTACA------TTGGCATGGGAGATATC
BAGCCTCTCGTGAG------ACACAATCCAGTAGAATCAGACCAGGAGTTAAAGAAGATG
VCAGACAAGGCTGCAACATGGCTGTTCTACAAGTTTGACCTGTACATGAGCGAACAGTCG
DTCCTAGCAAGTCTCACCAGAAAACACGGTCTTGGTCTATCCATGGTCTTTGTAAAGATC
NTATACGCCGCAGTGTCGTTCGGGTGTTTCCTCCTGACCGCTGAGATGTTCTCAATTGGA
XATTTTAAAACCTATGGATCAGAATGGATCAAGAAGTTAAAGTTGGAAGATAATCTAGCT
TAG---------------------------------------------------------

Usage example 2

$: sb ambiguous_cds.fa -cs

Output

>ML47742a.
ATGTTAGACATACTTTCAAAGTTTTGCTGAGTTACTCCTTTTAAAGGTATAACGATAGAT
RRRRRRRRRRRRCAACTCAATCGGAGTTTTATGTTCGTCCTGCTCGTTGTCATGGGAACG
YCTGTCACTGTCCGGCAATACACCGGCAGTGTCATCAGTTGTGACGGCTTCAAAAAGTTT
WGATCCACTTTTGCGGAGGATTACTGTTGTCCCCAGGGACTGTACACAGTTTTAGAAGGA
SATGAACCAGTCAGACTCAAGTTCCCTTACCCAGGCCTCCTTCCAGACGAGGCACCACCC
MGTACGACGGTACGAGGTTAAAGTCCAGACCCTGATCAGTTGKTGTCACCGACGCGGATA
TCCCACCTATGGTACCAGTGGGTCCCTTTTTACTTCTGGTTGHCGGCTGCTGCCTTCTTC
ATGCCCTACCTTCTGTACATTGGCATGGGAGATATCBAGCCTCTCGTGAGACACAATCCA
GTAGAATCAGACCAGGAGTTAAAGAAGATGVCAGACAAGGCTGCAACATGGCTGTTCTAC
AAGTTTGACCTGTACATGAGCGAACAGTCGDTCCTAGCAAGTCTCACCAGAAAACACGGT
CTTGGTCTATCCATGGTCTTTGTAAAGATCNTATACGCCGCAGTGTCGTTCGGGTGTTTC
CTCCTGACCGCTGAGATGTTCTCAATTGGAXATTTTAAAACCTATGGATCAGAATGGATC
AAGAAGTTAAAGTTGGAAGATAATCTAGCTTAG

Usage example 3

$: sb ambiguous_cds.fa -cs strict

Output

>ML47742a.
ATGTTAGACATACTTTCAAAGTTTTGCTGAGTTACTCCTTTTAAAGGTATAACGATAGAT
NNNNNNNNNNNNCAACTCAATCGGAGTTTTATGTTCGTCCTGCTCGTTGTCATGGGAACG
NCTGTCACTGTCCGGCAATACACCGGCAGTGTCATCAGTTGTGACGGCTTCAAAAAGTTT
NGATCCACTTTTGCGGAGGATTACTGTTGTCCCCAGGGACTGTACACAGTTTTAGAAGGA
NATGAACCAGTCAGACTCAAGTTCCCTTACCCAGGCCTCCTTCCAGACGAGGCACCACCC
NGTACGACGGTACGAGGTTAAAGTCCAGACCCTGATCAGTTGNTGTCACCGACGCGGATA
TCCCACCTATGGTACCAGTGGGTCCCTTTTTACTTCTGGTTGNCGGCTGCTGCCTTCTTC
ATGCCCTACCTTCTGTACATTGGCATGGGAGATATCNAGCCTCTCGTGAGACACAATCCA
GTAGAATCAGACCAGGAGTTAAAGAAGATGNCAGACAAGGCTGCAACATGGCTGTTCTAC
AAGTTTGACCTGTACATGAGCGAACAGTCGNTCCTAGCAAGTCTCACCAGAAAACACGGT
CTTGGTCTATCCATGGTCTTTGTAAAGATCNTATACGCCGCAGTGTCGTTCGGGTGTTTC
CTCCTGACCGCTGAGATGTTCTCAATTGGANATTTTAAAACCTATGGATCAGAATGGATC
AAGAAGTTAAAGTTGGAAGATAATCTAGCTTAG

Usage example 4

$: sb ambiguous_cds.fa -cs strict X

Output

>ML47742a.
ATGTTAGACATACTTTCAAAGTTTTGCTGAGTTACTCCTTTTAAAGGTATAACGATAGAT
XXXXXXXXXXXXCAACTCAATCGGAGTTTTATGTTCGTCCTGCTCGTTGTCATGGGAACG
XCTGTCACTGTCCGGCAATACACCGGCAGTGTCATCAGTTGTGACGGCTTCAAAAAGTTT
XGATCCACTTTTGCGGAGGATTACTGTTGTCCCCAGGGACTGTACACAGTTTTAGAAGGA
XATGAACCAGTCAGACTCAAGTTCCCTTACCCAGGCCTCCTTCCAGACGAGGCACCACCC
XGTACGACGGTACGAGGTTAAAGTCCAGACCCTGATCAGTTGXTGTCACCGACGCGGATA
TCCCACCTATGGTACCAGTGGGTCCCTTTTTACTTCTGGTTGXCGGCTGCTGCCTTCTTC
ATGCCCTACCTTCTGTACATTGGCATGGGAGATATCXAGCCTCTCGTGAGACACAATCCA
GTAGAATCAGACCAGGAGTTAAAGAAGATGXCAGACAAGGCTGCAACATGGCTGTTCTAC
AAGTTTGACCTGTACATGAGCGAACAGTCGXTCCTAGCAAGTCTCACCAGAAAACACGGT
CTTGGTCTATCCATGGTCTTTGTAAAGATCXTATACGCCGCAGTGTCGTTCGGGTGTTTC
CTCCTGACCGCTGAGATGTTCTCAATTGGAXATTTTAAAACCTATGGATCAGAATGGATC
AAGAAGTTAAAGTTGGAAGATAATCTAGCTTAG
