Sequence Alignment Lakshmanan Iyer, Ph. D.. The Building Blocks… ATGC VLMFNQEDHKRCSTPYW.


1 Sequence Alignment Lakshmanan Iyer, Ph. D.

2 The Building Blocks… ATGC VLMFNQEDHKRCSTPYW

3 Why Align Sequences? Discover functional, structural, and evolutionary information Similar Sequences may have similar function –Gene Regulation –Biochemical Function –Similar Structure Homology –Similar sequences may have a common ancestor

4 What is Sequence Alignment?
Global Alignment
  LGPSSKQTGKGS-SRIWDN
  |    |  |||   |  |
  LN-ITKSAGKGAIMRLGDA
Local Alignment
  -------TGKGS-------
          |||
  -------AGKGA-------
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/similarity.html

5 Example Sequence Alignment? Evolutionary Tree Example Alignment Conserved Similar

6 Methods of Sequence Alignment Pair-wise Sequence Alignment Multiple Sequence Alignment Dot Matrix Analysis Dynamic Programming Algorithm Word or k-tuple methods (FASTA, BLAST, BLAT)

7 Dot Matrix Alignment Place Sequences on X and Y axis and put a dot where there is a match Especially useful to detect repetitive structure
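
The dot matrix idea on this slide can be sketched in a few lines of Python. This is a minimal illustration only: the function name `dot_matrix` and the `*` / `.` display characters are choices made here, not part of the presentation, and practical tools usually add a sliding window and a threshold to suppress noise.

```python
def dot_matrix(seq_x, seq_y):
    """Return one text row per residue of seq_y; '*' marks a match with seq_x."""
    return ["".join("*" if a == b else "." for a in seq_x) for b in seq_y]


if __name__ == "__main__":
    # Comparing a sequence with itself: the main diagonal is the trivial
    # self-match, and parallel off-diagonal runs reveal the ATG repeat.
    seq = "ATGATGATG"
    print("  " + seq)
    for residue, row in zip(seq, dot_matrix(seq, seq)):
        print(residue + " " + row)
```

Plotting a sequence against itself, as in the usage lines, is the usual way to expose repetitive structure: every repeat shows up as a diagonal run parallel to the main diagonal.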

8 Dynamic Programming The problem at hand is divided into a series of sub-problems. The sub-problems are solved in steps. The results are compiled to find the final solution.

9 Scoring Systems
Position Independent Matrices
–Nucleic Acids: identity matrix
–Proteins: PAM Matrices (Percent Accepted Mutation). Implicit model of evolution. Higher PAM numbers are all calculated from PAM1. PAM250 widely used.
–Proteins: BLOSUM Matrices (BLOck SUbstitution Matrices). Empirically determined from alignment of conserved blocks. Each includes information up to a certain level of identity. BLOSUM62 widely used.
Position Specific Score Matrices (PSSMs)
–PSI and RPS BLAST

10 BLOSUM62
A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X
Common amino acids have low weights. Rare amino acids have high weights.
Negative for less likely substitutions. Positive for more likely substitutions.

11 Gapped Alignments
Gapping provides more biologically realistic alignments
Gapped BLAST parameters must be simulated
Affine gap costs = -(a+bk), where a = gap open penalty, b = gap extend penalty, k = gap length
A gap of length 1 receives the score -(a+b)
  LGPSSKQTGKGS-SRIWDN
  |    |  |||   |  |
  LN-ITKSAGKGAIMRLGDA
  -------TGKGS-------
          |||
  -------AGKGA-------
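
As a small worked illustration of the affine formula -(a+bk), here is a hedged Python sketch. The function name `affine_gap_score` is hypothetical, and the penalty values used in the example (a = 11, b = 1) are assumptions chosen so that a gap of length 1 scores -12, matching the gap score shown for BLOSUM62 on the Scores slide.

```python
def affine_gap_score(k, gap_open, gap_extend):
    """Score of one gap of length k under affine costs: -(a + b*k)."""
    if k < 0:
        raise ValueError("gap length must be non-negative")
    if k == 0:
        return 0  # no gap, no penalty
    return -(gap_open + gap_extend * k)


if __name__ == "__main__":
    # Assumed penalties: a = 11 (open), b = 1 (extend).
    for k in (1, 2, 5):
        print(k, affine_gap_score(k, gap_open=11, gap_extend=1))
```

The point of the affine form is that one long gap costs much less than several short ones: under these assumed values a single gap of length 5 scores -16, while five separate gaps of length 1 would score -60.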

12 Scores
           V   D   S   –   C   Y
           V   E   T   L   C   F   Total
BLOSUM62  +4  +2  +1 -12  +9  +3       7
PAM30     +7  +2   0 -10 +10  +2      11
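
The totals on this slide can be reproduced by summing column scores. The sketch below is illustrative only: the function name `alignment_score` is hypothetical, the pair scores are just the five substitutions that appear on the slide, and the split of each gap score into open and extend penalties (11 + 1 for BLOSUM62, 9 + 1 for PAM30) is an assumption consistent with the -12 and -10 shown.

```python
# Only the substitution scores that appear on the slide.
BLOSUM62_PAIRS = {("V", "V"): 4, ("D", "E"): 2, ("S", "T"): 1, ("C", "C"): 9, ("Y", "F"): 3}
PAM30_PAIRS = {("V", "V"): 7, ("D", "E"): 2, ("S", "T"): 0, ("C", "C"): 10, ("Y", "F"): 2}


def alignment_score(top, bottom, pair_scores, gap_open, gap_extend):
    """Sum substitution scores and affine gap costs over aligned columns."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            # Simplification: adjacent gap columns count as one gap run,
            # whichever sequence carries the gap character.
            total -= gap_extend + (0 if in_gap else gap_open)
            in_gap = True
        else:
            key = (a, b) if (a, b) in pair_scores else (b, a)
            total += pair_scores[key]  # KeyError if the pair is not in the table
            in_gap = False
    return total


if __name__ == "__main__":
    print(alignment_score("VDS-CY", "VETLCF", BLOSUM62_PAIRS, 11, 1))
    print(alignment_score("VDS-CY", "VETLCF", PAM30_PAIRS, 9, 1))
```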

13 Dynamic Programming. Global Alignment: Needleman-Wunsch
Calculate scores for site pairs (BLOSUM62)
        H   E   A
    P  -2  -1  -1
    A  -2  -1   4
    W  -2  -3  -3
Alignment matrix (first cells filled in)
         H   E   A
     0  -8 -16 -24
P   -8  -2  -9 -17
A  -16 -10  -3  -5
W  -24 -18 -11  -6

14 Global Alignment: Needleman-Wunsch (completed matrix)
         H   E   A   G   A   W   G   H   E   E
     0  -8 -16 -24 -32 -40 -48 -56 -64 -72 -80
P   -8  -2  -9 -17 -25 -33 -41 -49 -57 -65 -73
A  -16 -10  -3  -5 -13 -21 -29 -37 -45 -53 -61
W  -24 -18 -11  -6  -7 -15 -10 -18 -26 -34 -42
H  -32 -16 -18 -13  -8  -9 -17 -12 -10 -18 -26
E  -40 -24 -11 -19 -15  -9 -12 -19 -12  -5 -13
A  -48 -32 -19  -7 -15 -11 -12 -12 -20 -13  -6
E  -56 -40 -27 -15  -9 -16 -14 -14 -12 -15  -8
Trace Back
  HEAGAWGHEE
  --P-AWHEAE
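
The fill and trace-back steps on these two slides can be sketched as follows. This is a minimal Python illustration, not the presenter's code: the function names are hypothetical, the scores are the BLOSUM62 entries for the six residues in the example, and the linear gap penalty of -8 is inferred from the first row and column of the slide's matrix.

```python
GAP = -8  # linear gap penalty, inferred from the matrix's first row (0, -8, -16, ...)

# BLOSUM62 entries for the residues used in the example, keyed by sorted pair.
_SCORES = {
    "AA": 4, "AE": -1, "AG": 0, "AH": -2, "AP": -1, "AW": -3,
    "EE": 5, "EG": -2, "EH": 0, "EP": -1, "EW": -3,
    "GG": 6, "GH": -2, "GP": -2, "GW": -2,
    "HH": 8, "HP": -2, "HW": -2,
    "PP": 7, "PW": -4,
    "WW": 11,
}


def score(a, b):
    """Symmetric substitution score lookup."""
    return _SCORES["".join(sorted(a + b))]


def needleman_wunsch(x, y, gap=GAP):
    """Global alignment of x (columns) and y (rows); returns (matrix, top, bottom)."""
    n, m = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        F[i][0] = i * gap
    # Fill: each cell is the best of a diagonal (match/mismatch) or a gap move.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + score(y[i - 1], x[j - 1]),
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    # Trace back from the bottom-right cell, preferring the diagonal on ties.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(y[i - 1], x[j - 1]):
            top.append(x[j - 1]); bottom.append(y[i - 1]); i -= 1; j -= 1
        elif j > 0 and F[i][j] == F[i][j - 1] + gap:
            top.append(x[j - 1]); bottom.append("-"); j -= 1
        else:
            top.append("-"); bottom.append(y[i - 1]); i -= 1
    return F, "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    F, top, bottom = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
    print(top)
    print(bottom)
    print("score:", F[-1][-1])
```

With these assumptions the final cell holds -8 and the trace-back yields the alignment shown on the slide. Other tie-breaking orders can give different alignments with the same score.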

15 BLAST… NCBI Presentation …

16 NCBI Molecular Biology Resources January 2006 Peter Cooper Using NCBI BLAST

17 Sequence Similarity Searching Basic Local Alignment Search Tool

18 What BLAST tells you BLAST reports surprising alignments –Different than chance Assumptions –Random sequences –Constant composition Conclusions –Surprising similarities imply evolutionary homology Evolutionary Homology: descent from a common ancestor Does not always imply similar function

19 Basic Local Alignment Search Tool Widely used similarity search tool Heuristic approach based on the Smith-Waterman algorithm Finds best local alignments Provides statistical significance All combinations (DNA/Protein) of query and database –DNA vs DNA –DNA translation vs Protein –Protein vs Protein –Protein vs DNA translation –DNA translation vs DNA translation WWW, standalone, and network clients

20 BLAST and BLAST-like programs Traditional BLAST (blastall) nucleotide, protein, translations –blastn nucleotide query vs. nucleotide database –blastp protein query vs. protein database –blastx nucleotide query vs. protein database –tblastn protein query vs. translated nucleotide database –tblastx translated query vs. translated database Megablast nucleotide only –Contiguous megablast Nearly identical sequences –Discontiguous megablast Cross-species comparison Position Specific BLAST Programs protein only –Position Specific Iterative BLAST (PSI-BLAST) Automatically generates a position specific score matrix (PSSM) –Reverse PSI-BLAST (RPS-BLAST) Searches a database of PSI-BLAST PSSMs

21 Nucleotide Words
Query: GTACTGGACATGGACCCTACAGGAACGTATACGTAAG
GTACTGGACATGGACCCTACAGGAACGT TGGACATGGACCCTACAGGAACGTATAC CATGGACCCTACAGGAACGTATACGTAA ...
11-mer: GTACTGGACAT TACTGGACATG ACTGGACATGG CTGGACATGGA TGGACATGGAC GGACATGGACC GACATGGACCC ACATGGACCCT ...
Make a lookup table of words
WORD SIZE   Min.  Def.
megablast    12    28
blastn        7    11
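
The "lookup table of words" step can be sketched as a dictionary from each query word to its offsets. This is an illustrative Python sketch under stated assumptions: `build_lookup` and `exact_hits` are hypothetical names, and real BLAST uses a more compact data structure than a Python dict.

```python
from collections import defaultdict

QUERY = "GTACTGGACATGGACCCTACAGGAACGTATACGTAAG"  # query sequence from the slide


def build_lookup(query, word_size):
    """Map every overlapping word of length word_size to its query offsets."""
    table = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        table[query[i:i + word_size]].append(i)
    return dict(table)


def exact_hits(subject, table, word_size):
    """Scan a subject sequence; return (query_offset, subject_offset) pairs."""
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in table.get(subject[j:j + word_size], ()):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    table = build_lookup(QUERY, 11)  # 11 is the blastn default word size
    print(sorted(table, key=lambda w: table[w][0])[:3])
    print(exact_hits(QUERY[5:20], table, 11))
```

Each hit found this way is only a seed: the search then tries to extend it into a longer alignment.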

22 Protein Words
Query: GTQITVEDLFYNIATRRKALKN
Words: GTQ TQI QIT ITV TVE VED EDL DLF ...
Neighborhood Words (for ITV): LTV, MTV, ISV, LSV, etc.
Make a lookup table of words
Word size = 3 (default). Word size can only be 2 or 3.
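
Neighborhood words are the words that score at least some threshold against a query word under the substitution matrix. The Python sketch below enumerates them by brute force using the BLOSUM62 values from the matrix slide. The function names and the threshold values are assumptions made for illustration; the slide does not state which threshold produced its example list.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Lower triangle of BLOSUM62, row order as in AMINO_ACIDS (X row omitted).
_LOWER = """
4
-1 5
-2 0 6
-2 -2 1 6
0 -3 -3 -3 9
-1 1 0 0 -3 5
-1 0 0 2 -4 2 5
0 -2 0 -1 -3 -2 -2 6
-2 0 1 -1 -3 0 0 -2 8
-1 -3 -3 -3 -1 -3 -3 -4 -3 4
-1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
-1 2 0 -1 -3 1 1 -2 -1 -3 -2 5
-1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
-2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
-2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7
0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
"""


def _build_matrix():
    """Expand the lower triangle into a symmetric (a, b) -> score dict."""
    scores = {}
    for i, line in enumerate(_LOWER.strip().splitlines()):
        for j, value in enumerate(line.split()):
            scores[AMINO_ACIDS[i], AMINO_ACIDS[j]] = int(value)
            scores[AMINO_ACIDS[j], AMINO_ACIDS[i]] = int(value)
    return scores


BLOSUM62 = _build_matrix()


def word_score(word_a, word_b):
    """Sum of per-position substitution scores for two equal-length words."""
    return sum(BLOSUM62[a, b] for a, b in zip(word_a, word_b))


def neighborhood(word, threshold):
    """All words of the same length scoring >= threshold against word."""
    candidates = ("".join(c) for c in product(AMINO_ACIDS, repeat=len(word)))
    return [w for w in candidates if word_score(word, w) >= threshold]


if __name__ == "__main__":
    for w in ("ITV", "LTV", "MTV", "ISV", "LSV"):
        print(w, word_score("ITV", w))
```

Lowering the threshold enlarges the neighborhood, which makes the search more sensitive but slower.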

23 Minimum Requirements for a Hit
Nucleotide BLAST requires one exact match
  ATCGCCATGCTTAATTGGGCTT
       CATGCTTAATT          (exact word match, one match)
Protein BLAST requires two neighboring matches within 40 aa
  GTQITVEDLFYNI
  SEI, YYN                  (neighborhood words, two matches)

24 An alignment that BLAST can’t find
  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG
 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT
121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

25 Megablast: NCBI’s Genome Annotator Long alignments for similar DNA sequences Concatenation of query sequences Faster than blastn Contiguous Megablast –exact word match –Word size 28 Discontiguous Megablast –initial word hit with mismatches –cross-species comparison

26 Templates for Discontiguous Words
W    t    Type         Template
11   16   coding       1101101101101101
11   16   non-coding   1110010110110111
12   16   coding       1111101101101101
12   16   non-coding   1110110110110111
11   18   coding       101101100101101101
11   18   non-coding   111010010110010111
12   18   coding       101101101101101101
12   18   non-coding   111010110010110111
11   21   coding       100101100101100101101
11   21   non-coding   111010010100010010111
12   21   coding       100101101101100101101
12   21   non-coding   111010010110010010111
W = word size (number of matches in template)
t = template length (window size within which the word match is evaluated)
Reference: Ma B, Tromp J, Li M. PatternHunter: faster and more sensitive homology search. Bioinformatics, March 2002; 18(3):440-5
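
How a template is applied can be shown with a short Python sketch: positions marked 1 must match, positions marked 0 are ignored. The function name `spaced_word` and the example windows are hypothetical; only the template string comes from the slide.

```python
CODING_11_16 = "1101101101101101"  # W = 11, t = 16, coding template from the slide


def spaced_word(window, template):
    """Keep only the bases at template positions marked '1'."""
    if len(window) != len(template):
        raise ValueError("window and template must have the same length")
    return "".join(base for base, flag in zip(window, template) if flag == "1")


if __name__ == "__main__":
    a = "ACGTACGTACGTACGT"
    b = "ACATACGTACGTACGT"  # differs from a only at a position the template ignores
    print(spaced_word(a, CODING_11_16))
    print(spaced_word(b, CODING_11_16))
    print(CODING_11_16.count("1"))
```

In the coding templates every third position is a 0, so two windows that differ only at those positions still produce the same word. That is what lets discontiguous megablast seed alignments between diverged coding sequences.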

27 Local Alignment Statistics
High scores of local alignments between two random sequences follow the Extreme Value Distribution (applies to ungapped alignments)
Plot axes: Score, Alignments
$E = K m n e^{-\lambda S}$ or $E = m n 2^{-S'}$
K = scale for search space
$\lambda$ = scale for scoring system
S' = bitscore = $(\lambda S - \ln K)/\ln 2$
Expect Value E = number of database hits you expect to find by chance
Formula callouts: size of database, your score, expected number of random hits
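
The two forms of the E-value formula are algebraically the same, which a few lines of Python can confirm. This is a sketch under stated assumptions: the function names are hypothetical, the search-space sizes m and n are made-up numbers, and λ = 0.267 and K = 0.041 are assumed values often quoted for gapped BLOSUM62 searches, not figures taken from the slides. The raw score of 97 is the one reported in the MutL alignment later in the deck.

```python
import math


def bit_score(raw_score, lam, K):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def evalue_from_raw(raw_score, m, n, lam, K):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)


def evalue_from_bits(bits, m, n):
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


if __name__ == "__main__":
    lam, K = 0.267, 0.041   # assumed parameters, see the note above
    m, n = 131, 1_000_000   # hypothetical query length and database size
    bits = bit_score(97, lam, K)
    print(round(bits, 1))
    print(evalue_from_raw(97, m, n, lam, K))
    print(evalue_from_bits(bits, m, n))
```

Bit scores fold λ and K into the score itself, so they can be compared across searches that use different scoring systems.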

28 Scoring Systems
Position Independent Matrices
–Nucleic Acids: identity matrix
–Proteins: PAM Matrices (Percent Accepted Mutation). Implicit model of evolution. Higher PAM numbers are all calculated from PAM1. PAM250 widely used.
–Proteins: BLOSUM Matrices (BLOck SUbstitution Matrices). Empirically determined from alignment of conserved blocks. Each includes information up to a certain level of identity. BLOSUM62 widely used.
Position Specific Score Matrices (PSSMs)
–PSI and RPS BLAST

29 BLOSUM62
A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X
Common amino acids have low weights. Rare amino acids have high weights.
Negative for less likely substitutions. Positive for more likely substitutions.

30 Position Specific Substitution Rates Active site serine Typical serine

31 Position Specific Score Matrix (PSSM)
       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
206 D  0 -2  0  2 -4  2  4 -4 -3 -5 -4  0 -2 -6  1  0 -1 -6 -4 -1
207 G -2 -1  0 -2 -4 -3 -3  6 -4 -5 -5  0 -2 -3 -2 -2 -1  0 -6 -5
208 V -1  1 -3 -3 -5 -1 -2  6 -1 -4 -5  1 -5 -6 -4  0 -2 -6 -4 -2
209 I -3  3 -3 -4 -6  0 -1 -4 -1  2 -4  6 -2 -5 -5 -3  0 -1 -4  0
210 S -2 -5  0  8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5  1 -3 -7 -5 -6
211 S  4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1  4  3 -6 -5 -3
212 C -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5  0 -7 -4 -4 -5  0 -4
213 N -2  0  2 -1 -6  7  0 -2  0 -6 -4  2  0 -2 -5 -1 -3 -3 -4 -3
214 G -2 -3 -3 -4 -4 -4 -5  7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6
215 D -5 -5 -2  9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7
216 S -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4  7 -2 -6 -5 -5
217 G -3 -6 -4 -5 -6 -5 -6  8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7
218 G -3 -6 -4 -5 -6 -5 -6  8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7
219 P -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7  9 -4 -4 -7 -7 -6
220 L -4 -6 -7 -7 -5 -5 -6 -7  0 -1  6 -6  1  0 -6 -6 -5 -5 -4  0
221 N -1 -6  0 -6 -4 -4 -6 -6 -1  3  0 -5  4 -3 -6 -2 -1 -6 -1  6
222 C  0 -4 -5 -5 10 -2 -5 -5  1 -1 -1 -5  0 -1 -4 -1  0 -5  0  0
223 Q  0  1  4  2 -5  2  0  0  0 -4 -2  1  0  0  0 -1 -1 -3 -3 -4
224 A -1 -1  1  3 -4 -1  1  4 -3 -4 -3 -1 -2 -2 -3  0 -2 -2 -2 -3
Serine scored differently in these two positions
Active site nucleophile
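
Scoring against a PSSM is a lookup by position and residue rather than by residue pair. The sketch below copies four rows from the table above into Python. The function names are hypothetical, and reading positions 211 and 216 as the "two positions" the slide refers to is an inference from the serine scores, not something the slide states.

```python
COLUMNS = "ARNDCQEGHILKMFPSTWYV"

# Four rows copied from the slide's PSSM, keyed by position number.
PSSM_ROWS = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    215: [-5, -5, -2, 9, -7, -4, -1, -5, -5, -7, -7, -4, -7, -7, -5, -4, -4, -8, -7, -7],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
    217: [-3, -6, -4, -5, -6, -5, -6, 8, -6, -8, -7, -5, -6, -7, -6, -4, -5, -6, -7, -7],
}


def pssm_score(position, residue):
    """Score of placing residue at the given PSSM position."""
    return PSSM_ROWS[position][COLUMNS.index(residue)]


def window_score(start, residues):
    """Ungapped score of residues aligned to consecutive positions from start."""
    return sum(pssm_score(start + k, r) for k, r in enumerate(residues))


if __name__ == "__main__":
    print(pssm_score(211, "S"), pssm_score(216, "S"))  # same residue, different scores
    print(window_score(215, "DSG"), window_score(215, "DTG"))
```

A position-independent matrix such as BLOSUM62 gives serine the same score everywhere, whereas the PSSM rewards it more strongly at the conserved position and penalizes a threonine substitution there.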

32 Gapped Alignments
Gapping provides more biologically realistic alignments
Gapped BLAST parameters must be simulated
Affine gap costs = -(a+bk), where a = gap open penalty, b = gap extend penalty, k = gap length
A gap of length 1 receives the score -(a+b)

33 Scores
           V   D   S   –   C   Y
           V   E   T   L   C   F   Total
BLOSUM62  +4  +2  +1 -12  +9  +3       7
PAM30     +7  +2   0 -10 +10  +2      11

34 The Flavors of BLAST Position independent scoring –Standard BLAST traditional contiguous word hit nucleotide, protein and translations –Megablast can use discontiguous words nucleotide only optimized for large batch searches Position dependent scoring –PSI-BLAST constructs PSSMs automatically searches protein database with PSSMs –RPS BLAST searches a database of PSSMs basis of conserved domain database

35 WWW BLAST

36 The BLAST homepage Specialized Databases Standard databases

37 BLAST Databases: Non-redundant protein nr (non-redundant protein sequences) –GenBank CDS translations –NP_ RefSeqs –Outside Protein PIR, Swiss-Prot, PRF PDB (sequences from structures) pat protein patents env_nr environmental samples

38 Nucleotide Databases: Genomic Human and mouse genomes and reference transcripts now available

39 Nucleotide Databases: Standard

40 Nucleotide Databases: Traditional nr (nt) –Traditional GenBank –NM_ and XM_ RefSeqs refseq_rna refseq_genomic –NC_ RefSeqs dbest –EST Division est_human, mouse, others htgs –HTG division gss –GSS division wgs –whole genome shotgun env_nt –environmental samples

41 BLAST and Molecular Evolution
Figure: divergence times (3000 Myr, 1000 Myr, 540 Myr) across Yeast, Bacteria, Worm, Fly, Human. Disease genes: Alzheimer’s Disease, Ataxia telangiectasia, Colon cancer, Pancreatic carcinoma. Proteins: MLH1, MutL.

42 Protein BLAST Page >Mutated in Colon Cancer IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILE VQQHIESKLLGSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGS DKVYAHQMVRTDSREQKLDAFLQPLSKPLSS Protein database

43 Advanced Options: Entrez limit
all[Filter] NOT mammals[Organism]
gene_in_mitochondrion[Properties]
2003:2005 [Modification Date]
tpa[Filter]
Nucleotide
biomol_mrna[Properties]
biomol_genomic[Properties]

44 Advanced Options: Filters
Hides low complexity for initial word hits only
Masks regions of query in lower case (pre-masked)
Masks Human or Mouse interspersed repeats. Default for genome searches.
Protein / Nucleotide: Masks Low Complexity Sequence with X or n

45 Advanced Options: Composition based stats
Histone H1
Amino acid composition:
Ala (A) 42 19.6%   Arg (R)  4  1.9%   Asn (N)  4  1.9%   Asp (D)  1  0.5%
Cys (C)  0  0.0%   Gln (Q)  2  0.9%   Glu (E)  6  2.8%   Gly (G) 13  6.1%
His (H)  0  0.0%   Ile (I)  3  1.4%   Leu (L) 10  4.7%   Lys (K) 57 26.6%
Met (M)  0  0.0%   Phe (F)  1  0.5%   Pro (P) 19  8.9%   Ser (S) 23 10.7%
Thr (T) 14  6.5%   Trp (W)  0  0.0%   Tyr (Y)  1  0.5%   Val (V) 14  6.5%
Negatively charged residues (Asp + Glu): 7
Positively charged residues (Arg + Lys): 61

46 BLAST Formatting Page Conserved Domain

47 BLAST Output: Graphical Overview mouse over Sort by taxonomy

48 BLAST Output: Descriptions Link to Entrez Sorted by E values 3 × 10^-12 Default E value cutoff 10 Gene Linkout

49 TaxBLAST: Taxonomy Reports

50 BLAST Output: Alignments
>gi|127552|sp|P23367|MUTL_ECOLI DNA mismatch repair protein mutL
Length = 615
 Score = 42.0 bits (97), Expect = 3e-04
 Identities = 26/59 (44%), Positives = 33/59 (55%), Gaps = 9/59 (15%)
Query  9    LPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHF-----LHE---ESILEV-QQHIESKL  58
            L  +  P   L LEI P  VDVNVHP KHEV F     +H+   + +L V QQ +E+ L
Sbjct  280  LGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQLETPL  338
Callouts: identical match; positive score (conservative); negative substitution; gap
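
The Identities and Gaps figures in this output can be recomputed directly from the two aligned rows. The Python sketch below does that. The function name `alignment_stats` is hypothetical, and counting Positives is left out because it would also need the substitution matrix.

```python
QUERY_ROW = "LPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHF-----LHE---ESILEV-QQHIESKL"
SBJCT_ROW = "LGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQLETPL"


def alignment_stats(query_row, sbjct_row):
    """Return (identities, gaps, alignment_length) for two aligned rows."""
    if len(query_row) != len(sbjct_row):
        raise ValueError("aligned rows must have equal length")
    identities = sum(1 for q, s in zip(query_row, sbjct_row) if q == s and q != "-")
    gaps = sum(1 for q, s in zip(query_row, sbjct_row) if q == "-" or s == "-")
    return identities, gaps, len(query_row)


if __name__ == "__main__":
    identities, gaps, length = alignment_stats(QUERY_ROW, SBJCT_ROW)
    print(f"Identities = {identities}/{length} ({round(100 * identities / length)}%)")
    print(f"Gaps = {gaps}/{length} ({round(100 * gaps / length)}%)")
```

Note that the percentages are taken over the alignment length, which includes gap columns, not over the query length.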

51 Low Complexity Filter
>gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair protein Mlh1
Length=756
 Score = 231 bits (589), Expect = 1e-62
 Identities = 131/131 (100%), Positives = 131/131 (100%), Gaps = 0/131 (0%)
Query  1    IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL  60
            IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL
Sbjct  276  IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL  335
Query  61   GSNSSRMYFTQTLLPGLAGPSGEMVKsttsltssstsgssDKVYAHQMVRTDSREQKLDA  120
            GSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDA
Sbjct  336  GSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDA  395
Query  121  FLQPLSKPLSS  131
            FLQPLSKPLSS
Sbjct  396  FLQPLSKPLSS  406
Callout: low complexity sequence filtered (shown in lower case in the query)

52 Nucleotide: Human Repeats Human Albumin Genomic Region

53 Nucleotide: Human Repeat Filter Alb mRNAs

54 Nucleotide BLAST: New Output Crab-eating macaque CDC20 mRNA Default human database New output display

55 Sortable Results Pseudogene on Chromosome 9 Functional Gene on Chromosome 1 Separate Sections for Transcript and Genome

56 Total Score: All Segments Functional Gene Now First

57 Sorting in Exon Order Default Sorting Order: Score (longest exon usually first) Query start position: Exon order

58 Links to Map Viewer Chromosome 1 Chromosome 9

59 Service Addresses General Help info@ncbi.nlm.nih.gov BLAST blast-help@ncbi.nlm.nih.gov Telephone support: 301-496-2475

60 Back to Multiple Sequence Alignment

61 Multiple Sequence Alignment An extension of the pair-wise alignment… –We will learn by example –We will use Jalview to learn it

62 Jalview Viewing –Reads and writes alignments –Saves alignments and associated trees Editing –Insert/delete gaps –Insert/delete gaps in groups of sequences –Removal of gapped columns Analysis –Align sequences using Web Services –Amino acid conservation analysis –Alignment sorting options (by name, tree order, percent identity, group) –UPGMA and NJ trees calculated and drawn –Sequence clustering using principal component analysis –Removal of redundant sequences –Smith-Waterman pairwise alignment of selected sequences

63 Acknowledgement Dr. Peter Cooper at NCBI for permission to use the BLAST Powerpoint presentation Dr. Kurt Wollenberg for slides on Dynamic Programming

