From Pairwise Alignment to Database Similarity Search.

The cucumber bacterium: a guide for the worried Israeli. Traveling to Germany? Be careful with vegetables, wash everything thoroughly and follow the rules of hygiene. Back from there? If you feel unwell, see a doctor. And what are the chances that the deadly infection will reach Israel? The authorities are doing everything to prevent it. All the answers about the bacterium that is frightening the world… Why was this E. coli in the cucumbers so dangerous? Shiga toxin – a unique protein in E. coli O104, which is highly pathogenic.

Discover the function of a new sequence: new sequence (function unknown) → search a sequence database → similar sequence ≈ similar function.

Why do we care to align sequences? Sequences that are similar probably have the same function.

Searching databases for similar sequences. Naïve solution: use an exact algorithm to compare each sequence in the database to the query. Is this reasonable? How much time will it take to calculate?

Complexity for genomes: The human genome contains 3 × 10^9 base pairs. –Searching an mRNA against the human genome requires ~10^13 cells. –Even efficient exact algorithms will be extremely slow when performed millions of times, even with parallel computing.
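A back-of-the-envelope sketch of where the ~10^13 figure comes from. The mRNA length (3,000 nt) and the throughput of 10^9 dynamic-programming cells per second are illustrative assumptions, not values from the slides.

```python
GENOME_LENGTH = 3e9        # human genome size in base pairs (from the slide)
MRNA_LENGTH = 3_000        # assumed typical mRNA length in nucleotides
CELLS_PER_SECOND = 1e9     # assumed throughput of an exact aligner


def dp_cells(query_len, target_len):
    """Number of cells in the dynamic-programming matrix of an exact alignment."""
    return query_len * target_len


cells = dp_cells(MRNA_LENGTH, GENOME_LENGTH)
hours = cells / CELLS_PER_SECOND / 3600
print(f"{cells:.1e} cells, about {hours:.1f} hours per query")
```

Under these assumptions a single query already costs hours, which is why repeating the search millions of times is impractical.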

So what can we do?

Searching databases Solution: Use a heuristic (approximate) algorithm

Heuristic strategy: Remove regions that are not useful for meaningful alignments. Preprocess the database into a new data structure to enable fast access.

What sequences to remove? Low-complexity sequences (e.g. AAAAAAAAAAA, ATATATATATATA) and transposable elements (LINEs, SINEs).

Low Complexity Sequences: What's wrong with them? They produce artificially high-scoring alignments. So what do we do? We apply low-complexity masking to the database and the query sequence.
Before masking: TCGATCGTATATATACGGGGGGTA
After masking:  TCGATCGNNNNNNNNCNNNNNNTA
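The masking step can be illustrated with a toy regular expression that replaces simple repeats with N. This is only a sketch: the repeat thresholds (a single base repeated at least 6 times, or a dinucleotide repeated at least 3 times) are assumptions chosen for illustration, and production tools use dedicated filters such as DUST or SEG rather than a regex.

```python
import re

# Toy pattern (assumption): a single base repeated >= 6 times,
# or a dinucleotide unit repeated >= 3 times.
SIMPLE_REPEAT = re.compile(r"([ACGT])\1{5,}|([ACGT]{2})\2{2,}")


def mask_simple_repeats(seq):
    """Replace every simple repeat in a DNA sequence with a run of N."""
    return SIMPLE_REPEAT.sub(lambda m: "N" * len(m.group(0)), seq)


print(mask_simple_repeats("TCGATCGTATATATACGGGGGGTA"))
```

On the slide's sequence this masks the TATATATA and GGGGGG stretches and leaves the rest untouched.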

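Low Complexity Sequences: Complexity is calculated as:
$$K = \frac{1}{L} \log_N \left( \frac{L!}{\prod_i n_i!} \right)$$
where $N = 4$ for DNA (4 bases), $L$ is the length of the sequence, and $n_i$ is the number of occurrences of residue $i$ in the sequence.
For the sequence GGGG: $L! = 4 \times 3 \times 2 \times 1 = 24$; $n_G = 4$, $n_C = 0$, $n_A = 0$, $n_T = 0$; $\prod_i n_i! = 24 \times 1 \times 1 \times 1 = 24$; $K = \frac{1}{4} \log_4 (24/24) = 0$.
For the sequence CTGA: $L! = 4 \times 3 \times 2 \times 1 = 24$; $n_G = 1$, $n_C = 1$, $n_A = 1$, $n_T = 1$; $\prod_i n_i! = 1 \times 1 \times 1 \times 1 = 1$; $K = \frac{1}{4} \log_4 (24/1) = 0.573$.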
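The complexity formula K = (1/L) log_N(L! / Π n_i!) can be checked against the two worked examples with a few lines of Python. The function name and its default alphabet size are choices made for this sketch.

```python
import math
from collections import Counter


def complexity(seq, alphabet_size=4):
    """K = (1/L) * log_N( L! / prod(n_i!) ) for a sequence of length L."""
    length = len(seq)
    counts = Counter(seq)  # residues that do not occur contribute 0! = 1
    denominator = math.prod(math.factorial(n) for n in counts.values())
    # The multinomial coefficient is an integer, so integer division is exact.
    arrangements = math.factorial(length) // denominator
    return math.log(arrangements, alphabet_size) / length


print(round(complexity("GGGG"), 3))  # 0.0
print(round(complexity("CTGA"), 3))  # 0.573
```

A homopolymer gives the minimum value of 0, while a sequence using every base equally often gives the highest value for its length.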

Heuristic strategy: Remove low-complexity regions that are not useful for meaningful alignments. Preprocess the database into a new data structure to enable fast access.

BLAST: Basic Local Alignment Search Tool (Altschul et al., 1990). General idea: a good alignment contains subsequences of absolute identity. –First, identify very short, almost exact matches. –Next, the best short hits from the first step are extended to longer regions of similarity. –Finally, the best hits are optimized using the Smith-Waterman algorithm.

BLAST (Protein Sequence Example)
1. Search the database for matching word pairs (score > T)
Example: …FSGTWYA…
A list of matching words:
FSG SGT GTW TWY WYA
YSG TGT ATW SWY WFA
FTG SVT GSW TWF WYS
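A minimal sketch of the word-list step: split the query into overlapping words, then keep every candidate word whose score against a query word exceeds T. The scoring function here is a toy identity score (+5 match, −2 mismatch) assumed purely for illustration; BLAST itself scores words with a substitution matrix such as BLOSUM62, which is how non-identical words like YSG or FTG can still pass the threshold.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def words(seq, w=3):
    """All overlapping words of length w in the query."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def toy_score(a, b, match=5, mismatch=-2):
    """Toy identity scoring (assumption); real BLAST uses a substitution matrix."""
    return sum(match if x == y else mismatch for x, y in zip(a, b))


def neighborhood(word, T, score=toy_score):
    """Every word of the same length whose score against `word` is > T."""
    candidates = ("".join(c) for c in product(AMINO_ACIDS, repeat=len(word)))
    return [cand for cand in candidates if score(word, cand) > T]


print(words("FSGTWYA"))
print(len(neighborhood("FSG", T=7)))
```

With this toy scoring and T = 7, the neighborhood of FSG contains the word itself plus every word that differs at exactly one position.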

BLAST (Protein Sequence Example)
1. Search the database for matching word pairs (score > T)
2. Extend word pairs as much as possible, i.e., as long as the total score increases
Result: High-scoring Segment Pairs (HSPs)
THEFIRSTLINIHFSGTWYAAMESIRPATRICKREAD
INVIEIAFDGTWTCATTNAMHEWASNINETEEN
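The extension step can be sketched as an ungapped extension in both directions from a seed word. Two things here are assumptions made for the sketch: the toy identity scoring (+1/−1), and the `x_drop` tolerance, which lets the extension look a few residues past a mismatch before giving up instead of stopping at the first score decrease.

```python
def extend_seed(q, s, qi, si, k, match=1, mismatch=-1, x_drop=2):
    """Ungapped extension of the seed q[qi:qi+k] / s[si:si+k].

    Returns (query_start, subject_start, length, score) of the best segment pair.
    """
    def pair(a, b):
        return match if a == b else mismatch

    best = run = sum(pair(q[qi + j], s[si + j]) for j in range(k))

    # Extend to the right, remembering the extent that gave the best score.
    right, j = 0, 0
    while qi + k + j < len(q) and si + k + j < len(s):
        run += pair(q[qi + k + j], s[si + k + j])
        j += 1
        if run > best:
            best, right = run, j
        elif best - run > x_drop:
            break

    # Extend to the left, starting again from the best score so far.
    run, left, j = best, 0, 0
    while qi - 1 - j >= 0 and si - 1 - j >= 0:
        run += pair(q[qi - 1 - j], s[si - 1 - j])
        j += 1
        if run > best:
            best, left = run, j
        elif best - run > x_drop:
            break

    return qi - left, si - left, left + k + right, best


q_start, s_start, length, score = extend_seed("AAAGTWYCCC", "TTTGTWYGGG", 3, 3, 3)
print("AAAGTWYCCC"[q_start:q_start + length], score)
```

The same function can be applied to the slide's two sequences by seeding at the shared word GTW (position 15 in the first sequence, position 9 in the second, counting from 0).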

BLAST
3. Try to connect HSPs by aligning the sequences in between them:
THEFIRSTLINIHFSGTWYAA____M_ESIRPATRICKREAD
INVIEIAFDGTWTCATTNAMHEW___ASNINETEEN
The Gapped BLAST algorithm allows several segments that are separated by short gaps to be connected into one alignment.

How to interpret a BLAST search: The score is a measure of the similarity of the query to the sequence shown. How do we know if the score is significant? -Statistical significance -Biological significance

How to interpret a BLAST search: For each BLAST score we can calculate an E-value. The expect value (E-value) is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. An E-value is related to a probability value p (p-value).

BLAST E-value:
$$E = K m n e^{-\lambda s}$$
–Increases linearly with the length of the query sequence. –Increases linearly with the length of the database. –Decreases exponentially with the score of the alignment. –K, λ: statistical parameters dependent upon the scoring system and background residue frequencies. m = length of query; n = length of database; s = score.
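The three dependencies listed above match the Karlin-Altschul form E = K·m·n·e^(−λs), and the related p-value is p = 1 − e^(−E). The sketch below evaluates both; the default K and λ are illustrative values assumed for this example, since the real parameters depend on the scoring system and residue frequencies.

```python
import math


def e_value(s, m, n, K=0.134, lam=0.318):
    """Expected number of chance alignments with score >= s.

    K and lam are illustrative assumptions, not values from the slides.
    """
    return K * m * n * math.exp(-lam * s)


def p_value(e):
    """Probability of at least one chance hit: p = 1 - exp(-E)."""
    return -math.expm1(-e)


e = e_value(s=60, m=250, n=1e8)
print(f"E = {e:.3g}, p = {p_value(e):.3g}")
```

For small E the two numbers are nearly identical, which is why BLAST reports can treat E-values below about 0.01 as if they were probabilities.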

From raw scores to bit scores: Bit scores S' are normalized and are comparable across different databases. The E value corresponding to a given bit score is:
$$E = m n \, 2^{-S'}$$
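The normalization behind the bit score is S' = (λS − ln K) / ln 2, which folds K and λ into the score so that E = m·n·2^(−S'). The sketch below checks that this gives the same E-value as the raw-score form K·m·n·e^(−λS); the K and λ values are again illustrative assumptions.

```python
import math


def bit_score(raw, K=0.134, lam=0.318):
    """S' = (lam * S - ln K) / ln 2; K and lam are illustrative assumptions."""
    return (lam * raw - math.log(K)) / math.log(2)


def e_from_bits(bits, m, n):
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


raw, m, n = 60, 250, 1e8
bits = bit_score(raw)
from_raw = 0.134 * m * n * math.exp(-0.318 * raw)
print(f"{bits:.1f} bits, E = {e_from_bits(bits, m, n):.3g} (raw-score form: {from_raw:.3g})")
```

Because K and λ no longer appear in the final formula, bit scores from searches with different scoring systems can be compared directly.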

What is a good E-value? (Rule of thumb) E values below 0.00001 almost always indicate that the sequences are homologues. Larger E values can represent homologues as well. In general, whether an E-value is biologically significant depends on the size of the database that is searched. Sometimes a real match has an E value > 1. Sometimes a short exact match and a long, less exact match have similar E values.

How to interpret a BLAST search: The score is a measure of the similarity of the query to the sequence shown. How do we know if the score is significant? -Statistical significance -Biological significance

Treating Gaps in BLAST
>Human DNA
CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA
>Human mRNA
CATGCGACTGACATCGATCATA
Sometimes corrections to the model are needed to infer biological significance.

Gap Scores: Standard solution: affine gap model
$$w_x = g + r(x - 1)$$
$w_x$: total gap penalty; $g$: gap open penalty; $r$: gap extend penalty; $x$: gap length. –Once-off cost for opening a gap. –Lower cost for extending the gap. –Changes required to the algorithm.
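The affine model w_x = g + r(x − 1) is easy to compare with a linear penalty in code. The gap-open and gap-extend values below are illustrative assumptions, not parameters taken from the slides.

```python
def affine_gap_penalty(x, g=11, r=1):
    """w_x = g + r * (x - 1): g opens the gap, r pays for each further position."""
    if x <= 0:
        return 0
    return g + r * (x - 1)


def linear_gap_penalty(x, g=11):
    """For comparison: every gap position costs the full open penalty."""
    return g * x


for length in (1, 5, 20):
    print(length, affine_gap_penalty(length), linear_gap_penalty(length))
```

Under the affine model one long gap, such as an intron missing from an mRNA, is penalized far less than many scattered short gaps of the same total length.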

Significance of Gapped Alignments: Gapped alignments use the same statistics, but λ and K cannot be easily estimated. Empirical estimates and gap scores are determined by looking at random alignments.

BLAST is a family of programs. Query: DNA or protein. Database: DNA or protein.

Choose the BLAST program
Program   Input     Database   Frame comparisons
blastn    DNA       DNA        1
blastp    protein   protein    1
blastx    DNA       protein    6
tblastn   protein   DNA        6
tblastx   DNA       DNA        36
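The counts 1, 6 and 36 follow from six-frame translation: a DNA sequence compared at the protein level is translated in 6 reading frames, so translating both query and database gives 6 × 6 comparisons. A small lookup table makes this explicit; the dictionary layout and function name are hypothetical, introduced only for this sketch.

```python
# name: (query type, database type, query translated?, database translated?)
PROGRAMS = {
    "blastn":  ("DNA", "DNA", False, False),
    "blastp":  ("protein", "protein", False, False),
    "blastx":  ("DNA", "protein", True, False),
    "tblastn": ("protein", "DNA", False, True),
    "tblastx": ("DNA", "DNA", True, True),
}


def frame_comparisons(program):
    """Number of reading-frame combinations the program has to compare."""
    _, _, query_translated, db_translated = PROGRAMS[program]
    return (6 if query_translated else 1) * (6 if db_translated else 1)


for name in PROGRAMS:
    print(name, frame_comparisons(name))
```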

BLAST Example
>tr|Q47644|Q47644_ECOLX Orf protein
MKKMFIAVLFALVSVNAMAADCAKGKIEF
SKYNEDNTFTVKVSGREYWTNRWNLQPL
LQSAQLTGMTVTIISNTCSSGSGFAQVKFN
Function: ????

Alignment between the Orf protein and a Shiga toxin from pathogenic bacteria

What about harder cases? (Figure labels: retinol-binding protein, odorant-binding protein, apolipoprotein D; RBP4, PAEP)

Assessing whether proteins are functionally homologous: RBP4 and PAEP have a low bit score, an E value of 0.49 and 24% identity, but they are indeed homologous. PAEP: pregnancy protein. RBP4: retinol-binding protein.
