File formats and conversions
Important formats
  How
  Fasta
  Raw/Peptide
  Tab
How
  1. One or more entries
     1. First line
        1. Length of sequence (6 digits, right-aligned)
        2. Name of sequence
     2. Next lines
        1. Sequence, usually 80 characters per line
     3. Last lines
        1. Assignments of the positions in the sequence
How file
   553 ATP0_BOVIN_1E79.C
MLSVRVAAAVARALPRRAGLVSKNALGSSFIAARNLHASNSRLQKTGTAEVSSILEERILGADTSVDLEETGRVLSIGDG
IARVHGLRNVQAEEMVEFSSGLKGMSLNLEPDNVGVVVFGNDKLIKEGDIVKRTGAIVDVPVGEELLGRVVDALGNAIDG
KGPIGSKARRRVGLKAPGIIPRISVREPMQTGIKAVDSLVPIGRGQRELIIGDRQTGKTSIAIDTIINQKRFNDGTDEKK
KLYCIYVAIGQKRSTVAQLVKRLTDADAMKYTIVVSATASDAAPLQYLAPYSGCSMGEYFRDNGKHALIIYDDLSKQAVA
YRQMSLLLRRPPGREAYPGDVFYLHSRLLERAAKMNDAFGGGSLTALPVIETQAGDVSAYIPTNVISITDGQIFLETELF
YKGIRPAINVGLSVSRVGSAAQTRAMKQVAGTMKLELAQYREVAAFAQFGSDLDAATQQLLSRGVRLTELLKQGQYSPMA
IEEQVAVIYAGVRGYLDKLEPSKITKFENAFLSHVISQHQALLSKIRTDGKISEESDAKLKEIVTNFLAGFEA
SS.TTTEEEEEEEETT
EEEEEE.TT.BTTEEEEETTS.EEEEEEE.SS.EEEEESS.GGG..TT.EEEEEEEESEEE.SGGGTT.EE.TTS.B.SS
S.....S.EEETT.....STTB....SB...S.HHHHHHS..BTT.B.EEEESTTSSHHHHHHHHHHHTHHHHSSS.GGG..EEEEEEES..HHHHHHHHHHHHHHT.GGGEEEEEE.TTS.HHHHHHHHHHHHHHHHHHHHTT.EEEEEEETHHHHHHH
HHHHHHHTT....GGGS.TTHHHHHHHHHTT..BB.GGGTS.EEEEEEEEE.STT.TTSHHHHHHHTTSSEEEEE.HHHH
HHT.SS.B.TTT.EESSGGGGS.HHHHHHHTTHHHHHHHHHHHHHHHTT.....HHHHHHHHHHHHHHHHT...SS....
HHHHHHHHHHHHTSTTTTS.GGGHHHHHHHHHHHHHHH.HHHHHHHHHHTS..HHHHHHHHHHHHHHHHHHH.
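The How layout described above (a header line with length and name, then sequence lines, then assignment lines) can be read with a few lines of Python. This is only a minimal sketch, not the reader used at CBS: the names `parse_how` and `_read_block` are hypothetical, and it assumes that the assignment lines together contain exactly one character per sequence position, with blanks preserved.

```python
def _read_block(it, length, keep_spaces):
    """Collect lines from iterator `it` until `length` characters are read."""
    chunks, total = [], 0
    while total < length:
        line = next(it, None)
        if line is None:
            raise ValueError("file ended inside an entry")
        line = line.rstrip("\r\n")
        if not keep_spaces:
            line = line.strip()
        chunks.append(line)
        total += len(line)
    return "".join(chunks)[:length]


def parse_how(lines):
    """Yield (name, sequence, assignment) for each entry in How-format lines."""
    it = iter(lines)
    for header in it:
        parts = header.strip().split(None, 1)
        if not parts:
            continue  # skip blank lines between entries
        length = int(parts[0])  # the right-aligned length field
        name = parts[1].strip() if len(parts) > 1 else ""
        sequence = _read_block(it, length, keep_spaces=False)
        # Assumption: blanks inside assignment lines are meaningful, so keep them.
        assignment = _read_block(it, length, keep_spaces=True)
        yield name, sequence, assignment
```

Usage: `for name, seq, ss in parse_how(open("ss_sub300.how")): ...`. The length field drives the parsing, so the reader does not depend on lines being exactly 80 characters wide.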
Fasta
  1. One or more entries
     1. First line
        1. The character ">"
        2. The name
        3. Optional description (not read by all readers)
     2. Rest of lines
        1. The sequence, usually wrapped at a fixed number of characters per line
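As an illustration of the Fasta layout above, the following sketch splits a file into entries. The function name `parse_fasta` is hypothetical; the only assumptions are that the name is the first whitespace-delimited word after ">" and that everything after it on the header line is the optional description.

```python
def parse_fasta(lines):
    """Yield (name, description, sequence) for each Fasta entry."""
    name, description, chunks = None, "", []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if name is not None:
                yield name, description, "".join(chunks)
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            description = header[1].strip() if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            # Sequence lines may be wrapped at any width; join them all.
            chunks.append(line.strip())
    if name is not None:
        yield name, description, "".join(chunks)
```

Because the sequence lines are simply concatenated, the reader accepts any line width.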
Raw/peptide
  Short sequences
  One peptide per line
Tab format
  1. One or more entries
     1. One entry per line
     2. Tab delimited fields
        1. Name
        2. Sequence
        3. Assignments/features
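The Tab format above maps directly onto a join and a split on the tab character. The sketch below uses the hypothetical helper names `to_tab` and `from_tab` and assumes that the third field (assignments/features) may be absent.

```python
def to_tab(name, sequence, assignment=""):
    """Return one Tab-format line (without newline) for an entry."""
    fields = [name, sequence]
    if assignment:
        fields.append(assignment)
    return "\t".join(fields)


def from_tab(line):
    """Split one Tab-format line into (name, sequence, assignment)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a name and a sequence")
    assignment = fields[2] if len(fields) > 2 else ""
    return fields[0], fields[1], assignment
```

One entry per line makes this format convenient for tools such as `grep`, `cut` and `sort`.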
Converters
  Saco_convert
    – From/To
        How
        Fasta
        Tab
  Makefsa
    – Raw peptides to fasta peptides
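To show what a raw-peptide to Fasta conversion involves, here is a small sketch. It does not reproduce how Makefsa actually names or formats its output; the function `peptides_to_fasta` and the `pep1`, `pep2`, ... naming scheme are assumptions made only for this illustration.

```python
def peptides_to_fasta(lines, prefix="pep"):
    """Turn one-peptide-per-line input into Fasta text.

    Entry names are generated as prefix + running number (hypothetical scheme).
    """
    out = []
    count = 0
    for line in lines:
        peptide = line.strip()
        if not peptide:
            continue  # ignore empty lines
        count += 1
        out.append(f">{prefix}{count}\n{peptide}\n")
    return "".join(out)
```

Peptides are short, so each sequence is written on a single line without wrapping.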
Databases at CBS
Databases - ready for BLAST
  SwissProt
  PDB
  GenBank
  nr
    – Non-redundant set of proteins from the above plus TREMBL, PIR and others
  sptr_nrdb
    – Non-redundant set of proteins from SwissProt and TREMBL
BLAST routines - single search
  (aa = amino acid, nt = nucleotide; database type first, query type second)
  blastp
    – aadb aaquery
  blastn
    – ntdb ntquery
  blastx
    – aadb ntquery
  tblastn
    – ntdb aaquery
  tblastx
    – ntdb ntquery
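The database/query combinations listed above can be written as a small lookup table, which makes it easy to check which routine fits a given pair of inputs. The names `BLAST_PROGRAMS` and `programs_for` are hypothetical; the table itself only restates the list.

```python
# program -> (database type, query type); "aa" = amino acid, "nt" = nucleotide
BLAST_PROGRAMS = {
    "blastp": ("aa", "aa"),
    "blastn": ("nt", "nt"),
    "blastx": ("aa", "nt"),
    "tblastn": ("nt", "aa"),
    "tblastx": ("nt", "nt"),
}


def programs_for(db_type, query_type):
    """Return the BLAST routines that accept this database/query pair."""
    return sorted(
        program
        for program, (db, query) in BLAST_PROGRAMS.items()
        if (db, query) == (db_type, query_type)
    )
```

For example, `programs_for("aa", "nt")` gives `["blastx"]`, while a nucleotide query against a nucleotide database matches both `blastn` and `tblastx`.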
Blastpgp - iterative blast
  Repetitive searches with an AA query through an AA database
  Results in hits plus an optional position-specific scoring matrix
The actual search
  Query is a single file in FASTA format
  Custom databases must first be formatted from sets in FASTA format
    – Use the setdb program for protein sequence databases (i.e., blastp and blastx)
    – Use the pressdb program for nucleotide sequence databases (i.e., blastn and tblastn)
    – Use formatdb for blastpgp (psiblast)
Exercises
Conversion exercise
  Convert the file A1.rsee.test to fasta format
  Convert the file ss_sub300.how to fasta format
Blast
  Take the first entry in ss_sub300.how and blastp it against ss_sub300.how and PDB
  Make a position-specific scoring matrix for the entry using psiblast and nr, and save the profile as both a binary and a readable matrix
  Use the binary matrix to search against PDB and ss_sub300.how