
1 Database Similarity Search

2 Why do we care to align sequences? Sequences that are similar probably have the same function.

3 Discover the function of a new sequence: new sequence (function unknown) ≈ similar sequence in the sequence database, which suggests a similar function.


5 Searching databases for similar sequences. Naïve solution: use an exact algorithm to compare each sequence in the database to the query. Is this reasonable? How much time will it take to calculate?

6 Complexity for genomes. The human genome contains 3 × 10^9 base pairs. –Searching an mRNA against the human genome requires ~10^12 dynamic-programming cells. –Even efficient exact algorithms will be extremely slow when performed millions of times, even with parallel computing.
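The ~10^12 figure can be checked with a quick calculation. The sketch below assumes an mRNA query of about 1,000 nucleotides (the slides do not give a length) and counts one dynamic-programming cell per query/genome position pair; `dp_cells` is a hypothetical helper name.

```python
GENOME_LENGTH = 3 * 10**9   # base pairs in the human genome (from the slide)
MRNA_LENGTH = 1_000         # assumed query length; not stated in the slides


def dp_cells(query_len: int, target_len: int) -> int:
    """Number of cells an exact dynamic-programming alignment must fill."""
    return query_len * target_len


cells = dp_cells(MRNA_LENGTH, GENOME_LENGTH)
print(f"{cells:.1e} cells for one query")
```

Under that assumption a single query already needs on the order of 10^12 cells, before multiplying by the number of queries.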

7 So what can we do?

8 Searching databases Solution: Use a heuristic (approximate) algorithm

9 Heuristic strategy: Reduce the search space. Remove regions that are not useful for meaningful alignments. Perform efficient search strategies. Preprocess the database into a new data structure to enable fast access.

10 Heuristic strategy: Reduce the search space. Remove regions that are not useful for meaningful alignments. Preprocess the database into a new data structure to enable fast access.

11 What sequences to remove? Low-complexity sequences (junk?): AAAAAAAAAAA, ATATATATATATA, transposable elements. 53% of the genome is repetitive DNA.

12 Low-complexity sequences. What's wrong with them? * They are not informative. * They produce artificially high-scoring alignments. So what do we do? We apply low-complexity masking to the database and the query sequence. Mask: TCGATCGTATATATACGGGGGGTA → TCGATCGNNNNNNNNCNNNNNNTA
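The masking example on this slide can be reproduced with a toy filter. This is only a sketch under a simplifying assumption: it treats a region as low-complexity if it is a single nucleotide repeated at least 6 times or a dinucleotide repeated at least 3 times. Production BLAST uses dedicated filters (DUST for nucleotides, SEG for proteins), which this does not implement; `mask_low_complexity` is a hypothetical name.

```python
import re

# Assumed toy definition of "low complexity":
#   - one base repeated 6 or more times, or
#   - a 2-base unit repeated 3 or more times.
_LOW_COMPLEXITY = re.compile(r"([ACGT])\1{5,}|([ACGT]{2})\2{2,}")


def mask_low_complexity(seq: str) -> str:
    """Replace every low-complexity run with N characters of the same length."""
    return _LOW_COMPLEXITY.sub(lambda m: "N" * len(m.group(0)), seq.upper())


masked = mask_low_complexity("TCGATCGTATATATACGGGGGGTA")
print(masked)
```

The thresholds are arbitrary here; they were chosen only so that the TATATATA and GGGGGG runs from the slide are masked while the flanking sequence is kept.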

13 Heuristic strategy: Remove low-complexity regions that are not useful for meaningful alignments. Perform efficient search strategies. Preprocess the database into a new data structure to enable fast access.

14 BLAST: Basic Local Alignment Search Tool. General idea: a good alignment contains subsequences of high identity (local alignment): ACGCCCGGGAGCGC CTGGGCGTATAGCCC –First, identify (as efficiently as possible) short, almost exact matches. –Next, extend them to longer regions of similarity. –Finally, optimize the alignment using an exact algorithm. (Altschul et al. 1990)

15 DNA/RNA vs. protein alphabet. DNA (4): A T G C. RNA (4): A U G C. Protein (20): ACDEFGHIKLMNPQRSTVWY. Nucleotide mismatches: A/T = A/G ... Amino acid mismatches: A/G >> A/W ... Why is it different?

16 The 20 Amino Acids

17 A W G

18 Scoring system for amino acid mismatches
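A substitution matrix is one common way to encode such a scoring system. The sketch below uses a few entries from BLOSUM62 as an illustration; the slides do not name a matrix, so that choice is an assumption, and `substitution_score` is a hypothetical helper. It covers only the three residues A, G and W.

```python
# Small excerpt of BLOSUM62 (assumed matrix; the slides do not specify one).
BLOSUM62_EXCERPT = {
    ("A", "A"): 4, ("G", "G"): 6, ("W", "W"): 11,
    ("A", "G"): 0, ("A", "W"): -3, ("G", "W"): -2,
}


def substitution_score(a: str, b: str) -> int:
    """Look up the score for aligning residue a with residue b (symmetric)."""
    key = (a, b) if (a, b) in BLOSUM62_EXCERPT else (b, a)
    return BLOSUM62_EXCERPT[key]


# Replacing alanine with glycine is penalized less than with tryptophan.
print(substitution_score("A", "G"), substitution_score("A", "W"))
```

Unlike a flat nucleotide match/mismatch score, each amino acid pair gets its own value, reflecting how often the substitution is observed between related proteins.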

19 BLAST: Basic Local Alignment Search Tool. General idea: a good alignment contains subsequences of high identity (local alignment): ACGCCCGGGAGCGC CTGGGCGTATAGCCC –First, identify (as efficiently as possible) short, almost exact matches. –Next, extend them to longer regions of similarity. –Finally, optimize the alignment using an exact algorithm. (Altschul et al. 1990)

20 BLAST (protein sequence example). First, identify (as efficiently as possible) short, almost exact matches between the query sequence and the database. Query sequence: …FSGTWYA… Words of length 3: FSG, SGT, GTW, TWY, WYA
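Splitting a sequence into overlapping words is a sliding window of fixed width. A minimal sketch, with `words` as a hypothetical helper name:

```python
def words(seq: str, k: int = 3) -> list[str]:
    """Return all overlapping words (k-mers) of length k, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


query_words = words("FSGTWYA")
print(query_words)
```

A sequence of length L yields L - k + 1 words, so the 7-residue fragment from the slide gives 5 words of length 3.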

21 BLAST: Preprocessing of the database.
Seq 1: FSGTWYA → FSG, SGT, GTW, TWY, WYA
Seq 2: FDRTSYV → FDR, DRT, RTS, TSY, SYV
Seq 3: SWRTYVA → SWR, WRT, RTY, TYV, YVA
…
BAG OF WORDS (each word is listed with similar words and points to the sequences that contain it, e.g. Seq 1, Seq 102, Seq 3546):
FSG, YSG, FTG, …
SGT, TGT, SVT, …
GTW, ATW, GSW, …
TWY, SWY, TWF, …
WYA, WFA, WYS, …
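The preprocessing step can be sketched as an inverted index from each word to the sequences (and positions) where it occurs. This is a simplified illustration using the three example sequences; it indexes exact words only and leaves out the similar (neighborhood) words shown in the bag of words. `build_word_index` is a hypothetical name.

```python
from collections import defaultdict


def build_word_index(database: dict[str, str], k: int = 3) -> dict:
    """Map every word of length k to a list of (sequence name, position)."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


database = {"Seq 1": "FSGTWYA", "Seq 2": "FDRTSYV", "Seq 3": "SWRTYVA"}
index = build_word_index(database)
print(index["GTW"])   # where does the word GTW occur?
```

Building the index costs one pass over the database, after which each query word is a dictionary lookup instead of a scan of every sequence.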

22 BLAST. Query sequence: …FSGTWYA… Words of length 3: FSG, SGT, GTW, TWY, WYA… DATABASE words: FSG SGT GTW TWY WYA YSG TGT ATW SWY WFA FTG SVT GSW TWF WYS… SEQ N: INVIEIAFDGTWTCATTNAMHEWASNINETEEN
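Looking up the query words in a database sequence can be sketched as follows. As a simplification, this version reports exact word matches only, whereas the slide also lists similar words (YSG, TGT, ATW, …) as acceptable hits. `find_seed_hits` is a hypothetical name.

```python
def find_seed_hits(query: str, target: str, k: int = 3) -> list[tuple[str, int, int]]:
    """Return (word, query position, target position) for each exact shared word."""
    # Index the target once so every query word is a dictionary lookup.
    positions: dict[str, list[int]] = {}
    for t in range(len(target) - k + 1):
        positions.setdefault(target[t:t + k], []).append(t)

    hits = []
    for q in range(len(query) - k + 1):
        word = query[q:q + k]
        for t in positions.get(word, []):
            hits.append((word, q, t))
    return hits


hits = find_seed_hits("FSGTWYA", "INVIEIAFDGTWTCATTNAMHEWASNINETEEN")
print(hits)
```

Each hit is a seed: a short match whose coordinates in both sequences tell the next step where to start extending.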

23 BLAST: Basic Local Alignment Search Tool. General idea: a good alignment contains subsequences of high identity (local alignment): ACGCCCGGGAGCGC CTGGGCGTATAGCCC –First, identify (as efficiently as possible) short, almost exact matches. –Next, extend them to longer regions of similarity. –Finally, optimize the alignment using an exact algorithm. (Altschul et al. 1990)

24 BLAST. 2. Extend word pairs as much as possible, i.e., as long as the total score increases, to obtain High-scoring Segment Pairs (HSPs). Q: FIRSTLINIHFSGTWYAAMESIRPATRICKREAD D: INVIEIAFDGTWTCATTNAMHEWASNINETEEN (Q = query sequence, D = sequence in database.) 3. Finally, optimize the alignment using an exact algorithm.
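The extension step can be sketched as an ungapped extension in both directions from a seed. Several things here are assumptions for illustration: a flat match/mismatch score (+2/-1) stands in for a real substitution matrix, and extension stops once the running score falls more than `x_drop` below the best score seen, which is one way to make "as long as the total score increases" tolerant of isolated mismatches. The function names are hypothetical.

```python
def extend_one_way(query, target, qi, ti, step, match, mismatch, x_drop):
    """Extend from (qi, ti) in direction step; return (best gain, best length)."""
    best_gain = gain = 0
    best_len = length = 0
    while 0 <= qi < len(query) and 0 <= ti < len(target):
        gain += match if query[qi] == target[ti] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain > x_drop:
            break  # score has dropped too far below the best; give up
        qi += step
        ti += step
    return best_gain, best_len


def extend_seed(query, target, q, t, k=3, match=2, mismatch=-1, x_drop=3):
    """Grow an exact seed of length k into an ungapped HSP.

    Returns (query start, target start, length, score).
    """
    seed_score = k * match
    right_gain, right_len = extend_one_way(query, target, q + k, t + k, +1,
                                           match, mismatch, x_drop)
    left_gain, left_len = extend_one_way(query, target, q - 1, t - 1, -1,
                                         match, mismatch, x_drop)
    return (q - left_len, t - left_len,
            left_len + k + right_len,
            seed_score + left_gain + right_gain)


Q = "FIRSTLINIHFSGTWYAAMESIRPATRICKREAD"
D = "INVIEIAFDGTWTCATTNAMHEWASNINETEEN"
# Seed: the word GTW at query position 12 and database position 9.
q_start, d_start, length, score = extend_seed(Q, D, 12, 9)
print(Q[q_start:q_start + length], D[d_start:d_start + length], score)
```

With these assumed scores the seed grows leftwards into the segment pair INIHFSGTW / IEIAFDGTW, while nothing is gained by extending to the right.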

25 Running BLAST to predict the function of a new protein: >Arrestin protein (C. elegans) MFIANNCMPQFRWEDMPTTQINIVLAEPRCMAGEFFNAKVLLDSSDPDTVVHSFCAEIKG IGRTGWVNIHTDKIFETEKTYIDTQVQLCDSGTCLPVGKHQFPVQIRIPLNCPSSYESQF GSIRYQMKVELRASTDQASCSEVFPLVILTRSFFDDVPLNAMSPIDFKDEVDFTCCTLPF GCVSLNMSLTRTAFRIGESIEAVVTINNRTRKGLKEVALQLIMKTQFEARSRYEHVNEKK LAEQLIEMVPLGAVKSRCRMEFEKCLLRIPDAAPPTQNYNRGAGESSIIAIHYVLKLTAL PGIECEIPLIVTSCGYMDPHKQAAFQHHLNRSKAKVSKTEQQQRKTRNIVEENPYFR


28 How to interpret a BLAST score: The score is a measure of the similarity of the query to the sequence shown. How do we know if the score is significant? -Statistical significance -Biological significance

29 How to interpret a BLAST search: For each BLAST score we can calculate an expectation value (E-value). The E-value is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search (page 105).


31 BLAST E-value: E = K·m·n·e^(−λs). It increases linearly with the length of the query sequence, increases linearly with the length of the database, and decreases exponentially with the score of the alignment. –K, λ: statistical parameters dependent upon the scoring system and background residue frequencies; m = length of query; n = length of database; s = score.
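These dependencies correspond to the standard Karlin-Altschul form E = K·m·n·e^(−λs). The sketch below evaluates it directly; the default values of `K` and `lam` are illustrative placeholders only (assumptions, since the real values depend on the scoring system), and `e_value` is a hypothetical helper name.

```python
import math


def e_value(score: float, query_len: int, db_len: int,
            K: float = 0.134, lam: float = 0.318) -> float:
    """Expected number of chance alignments scoring at least `score`.

    K and lam are placeholder values for illustration, not parameters
    taken from the slides.
    """
    return K * query_len * db_len * math.exp(-lam * score)


base = e_value(score=50, query_len=200, db_len=10**6)
longer_query = e_value(score=50, query_len=400, db_len=10**6)
higher_score = e_value(score=60, query_len=200, db_len=10**6)
print(base, longer_query, higher_score)
```

Doubling the query length doubles E, while raising the score by 10 shrinks E by a factor of e^(10λ), which is why small score differences change the E-value so sharply.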

32 What is a good E-value? (Rule of thumb.) E-values of less than 0.00001 show that sequences are almost always related. Greater E-values can represent functional relationships as well. Sometimes a real (biological) match has an E-value > 1. Sometimes a similar E-value occurs for a short exact match and a long, less exact match.

33 How to interpret a BLAST search: The score is a measure of the similarity of the query to the sequence shown. How do we know if the score is significant? -Statistical significance -Biological significance

34 Treating gaps in BLAST. >Human DNA CATGCGACTGACcgacgtcgatcgatacgactagctagcATCGATCATA >Human mRNA CATGCGACTGACATCGATCATA Sometimes corrections to the model are needed to infer biological significance.

35 Gap scores. Standard solution: affine gap model, w_x = g + r(x − 1), where w_x: total gap penalty; g: gap-open penalty; r: gap-extend penalty; x: gap length. –Once-off cost for opening a gap –Lower cost for extending the gap –Changes required to the algorithm
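The affine formula is easy to compare against a flat per-position penalty. In this sketch the penalty values (open = 10, extend = 1) are hypothetical, chosen only to show the effect, and `affine_gap_penalty` is an illustrative name.

```python
def affine_gap_penalty(length: int, gap_open: int, gap_extend: int) -> int:
    """Total penalty w_x = g + r(x - 1) for a gap of the given length."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


GAP_OPEN, GAP_EXTEND = 10, 1   # hypothetical values for illustration

for x in (1, 5, 27):
    affine = affine_gap_penalty(x, GAP_OPEN, GAP_EXTEND)
    linear = GAP_OPEN * x       # every gap position costs the full price
    print(x, affine, linear)
```

Under the affine model one long gap, such as an intron missing from an mRNA, costs far less than the same number of scattered single-position gaps, which matches the biological intuition behind the model.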

36 Gapped BLAST. 4. Connect several HSPs by aligning the sequences in between them: THEFIRSTLINIHFSGTWYAA____M_ESIRPATRICKREAD INVIEIAFDGTWTCATTNAMHEW___ASNINETEEN The Gapped BLAST algorithm allows several segments that are separated by short gaps to be connected together into one alignment.

37 BLAST is a family of programs. Query: DNA or protein. Database: DNA or protein.

