Published by Joy Willis. Modified over 9 years ago.
1
designed by Manisha, NUS
Part I: SEQUENCE COMPARISON
PAIRWISE ALIGNMENT
Manisha Brahmachary
2
OUTLINE
- What is sequence comparison?
- Ways to do sequence comparison
- Dot plot
- BLAST
- FASTA
3
What is sequence alignment or sequence comparison?
- Given two sequences of letters and a scoring scheme for evaluating matching letters, find the best pairing of letters in one sequence with letters in the other.
- Example sequences:
    THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
    THIS IS A SHORT SENTENCE
- Align (# marks a gap):
    THIS IS A RATHER LONGER SENTENCE THAN THE NEXT.
    THIS IS A#######SHORT###SENTENCE############## (path 1)
  or
    THIS IS A SHORT#########SENTENCE############## (path 2)
4
Aligning biological sequences
- DNA (4-letter alphabet)
    TTGACAC
    TTTACAC
- Proteins (20-letter alphabet)
    RKVA--GMAKPNM
    RKIAVAAASKPAV
5
Why do sequence alignment?
- Finding novel genes in silico
- Phylogenetic/evolutionary analysis
- Finding a structure template for modelling
- Functional prediction
6
Types of sequence comparison
- Pairwise alignment
  - Comparison of two sequences
- Multiple alignment
  - Comparison of more than two sequences
7
CONCEPTS IN SEQUENCE COMPARISON: IDENTITY
- Percentage identity is the percentage of aligned positions at which the two sequences carry the same residue (nucleotide or amino acid).
8
Query:   RCICTRGFCRCLCRR
         || | || ||| | |
Subject: RCLCRRGVCRCICTR
Exact matches (shown by |): 10 identical residues.
Percentage identity for the example above: 10 identical matches / 15 residues in the aligned sequence × 100 ≈ 66.7% identity.
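The identity calculation can be sketched in a few lines of Python. This is only an illustration: the function name percent_identity is made up for this example, and it assumes the two rows are already aligned to equal length.

```python
def percent_identity(query: str, subject: str) -> float:
    """Percentage of aligned columns that hold the same residue in both rows."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    # A column counts as a match only if both residues are equal and not a gap.
    matches = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    return 100.0 * matches / len(query)


# The two rows from the slide above.
query = "RCICTRGFCRCLCRR"
subject = "RCLCRRGVCRCICTR"
print(round(percent_identity(query, subject), 1))  # 10 of 15 columns match
```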
9
MISMATCH(es) at positions 3, 5, 8, 12 and 14:
Query:   RCICTRGFCRCLCRR
Subject: RCLCRRGVCRCICTR
10
Gaps
- Where the characters differ (a mismatch), gaps may be inserted to improve the alignment.
- Gaps carry penalties:
  - Insertion of the first gap (GAP OPENING): high penalty (e.g. -2, i.e. subtract 2)
  - Insertion of each consecutive gap (GAP EXTENSION): lower penalty (e.g. -1, i.e. subtract 1 for each consecutive gap)
- The more gaps, the lower the score of the alignment.
Query:   RCICT-RGFCRCLC---RR
Subject: RCLCRRGVCRCICTAR
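A minimal sketch of this opening/extension scheme, using the example values from the slide (-2 to open, -1 for each further gap in the same run). The function name gap_penalty is hypothetical, and real tools differ in whether the first gap position is charged both penalties.

```python
import re


def gap_penalty(aligned: str, gap_open: int = -2, gap_extend: int = -1) -> int:
    """Total penalty for all gap runs ('-') in one row of an alignment."""
    total = 0
    for run in re.finditer(r"-+", aligned):
        length = len(run.group())
        # First gap of a run pays the opening penalty,
        # each following gap in the same run pays the extension penalty.
        total += gap_open + (length - 1) * gap_extend
    return total


# Query row from the slide: one run of 1 gap and one run of 3 gaps.
print(gap_penalty("RCICT-RGFCRCLC---RR"))
```

With these values the single gap costs -2 and the run of three costs -2 - 1 - 1 = -4, so one long gap is cheaper than several scattered ones.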
11
Substitution: scores less than an identical match, e.g. +1 per substitution.
RCICT-RGFCRCLC---RR
RCLCRRGVCRCICTAR-
12
- Substitution: replacing a residue with another of similar physicochemical properties.

Amino acid                                              Category
Ile (I), Leu (L), Met (M), Val (V)                      Hydrophobic
Ala (A), Cys (C), Gly (G), Pro (P), Ser (S), Thr (T)    Hydrophilic
Phe (F), Tyr (Y), Trp (W)                               Aromatic
His (H), Lys (K), Arg (R)                               Basic
Asp (D), Glu (E), Asn (N), Gln (Q)                      Acids and amides
13
Similarity
- Similarity = identical matches + substitutions
- E.g. (10 identical matches + 2 substitutions) / 15 aligned residues × 100 = 80% similarity
RCICT-RGFCRCLC---RR
RCLCRRGVCRCICTAR
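The 80% figure can be reproduced with a short sketch. It assumes the five residue categories from the substitution table in these slides, counts a mismatch as a substitution only when both residues fall in the same category, and uses the ungapped 15-residue rows from the identity example. The names GROUPS, CATEGORY and percent_similarity are illustrative only.

```python
# Residue categories as listed in the slides' substitution table.
GROUPS = {
    "hydrophobic": "ILMV",
    "hydrophilic": "ACGPST",
    "aromatic": "FYW",
    "basic": "HKR",
    "acids and amides": "DENQ",
}
CATEGORY = {aa: name for name, members in GROUPS.items() for aa in members}


def percent_similarity(query: str, subject: str) -> float:
    """(identical matches + same-category substitutions) / aligned length * 100."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    identical = similar = 0
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            continue  # gap columns count as neither match nor substitution
        if q == s:
            identical += 1
        elif q in CATEGORY and CATEGORY[q] == CATEGORY.get(s):
            similar += 1
    return 100.0 * (identical + similar) / len(query)


print(percent_similarity("RCICTRGFCRCLCRR", "RCLCRRGVCRCICTR"))
```

Here the I/L and L/I columns count as substitutions (both hydrophobic), while T/R, F/V and R/T do not, giving 10 + 2 out of 15.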
14
For DNA: only identity and gaps are applicable.
ACTCGGCCCCGCGCTCACTGC
ACTCGGAC--GCGCTCAGTGC
15
Similarity vs. homology
- Homology: two similar sequences are homologous when they come from a common ancestor.
- Homology is inferred from similarity.
- If two sequences are sufficiently similar, they are inferred to be homologous.
- Usually, the threshold is at least 30% identity over 400 bp for DNA sequences, or over 125 amino acids for proteins.
16
Scoring matrices used in sequence comparison
- What is a scoring matrix?
- Scoring matrices are used when we compare sequences with one another.
- A scoring matrix gives a measure of how readily one residue can be substituted by another.
17
Scoring matrices
- For amino acids, each amino acid is compared with every other and a score is given to each pair.
- High score if they are the same residue (e.g. cysteine compared with cysteine).
- Low score if they are very different (e.g. tryptophan compared with cysteine).
18
Scoring matrices for DNA
- DNA sequence: 4 characters only (A, T, G, C)
- Unitary matrix used for scoring: a scoring system in which only identical characters receive a positive score.

     A  C  G  T
  A  1  0  0  0
  C  0  1  0  0
  G  0  0  1  0
  T  0  0  0  1
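As a small illustration, the unitary matrix can be held in a dictionary and used to score an ungapped pair of DNA sequences. The names UNITARY and score_ungapped are hypothetical; the two sequences are the DNA example from the earlier slide.

```python
BASES = "ACGT"
# Unitary matrix: 1 on the diagonal (identical bases), 0 everywhere else.
UNITARY = {(a, b): 1 if a == b else 0 for a in BASES for b in BASES}


def score_ungapped(seq1: str, seq2: str, matrix=UNITARY) -> int:
    """Sum the matrix score over each aligned column (no gaps allowed)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))


print(score_ungapped("TTGACAC", "TTTACAC"))  # six identical columns, one mismatch
```

Swapping in a different dictionary (for example one with a negative mismatch score) changes the scoring scheme without touching the function.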
19
SCORING SCHEMES FOR PROTEIN SEQUENCE ALIGNMENTS
- Scoring matrices used are PAM (Point Accepted Mutation) and BLOSUM (BLOcks SUbstitution Matrix).
- BLOSUM45 to BLOSUM90: from more divergent to less divergent sequences.
- PAM30 to PAM250: from less divergent to more divergent sequences.
NOTE: Many different matrices are in use, and each gives different values to pairs of amino acids. Depending on how distantly related your sequences are, you may want to choose a different matrix for your comparison.
20
Scoring matrices

  MORE DIVERGENT                      LESS DIVERGENT
  BLOSUM45          BLOSUM62          BLOSUM90
  PAM250            PAM160            PAM100
21
Ways to do pairwise alignment
- Dot plot (simplest method)
- Statistical, computation-based methods
  - Local alignment, e.g. BLAST, FASTA
  - Global alignment, e.g. CLUSTAL
22
What are dot plots?
A dot plot is a graphical way of comparing sequences, used to find out:
- Are the two sequences similar?
- Are there repeat regions in your sequence?
23
STEPS IN DOT PLOT
- Take the two sequences to be compared:
    Sequence A: MEHRKPGTGQ
    Sequence B: MEHRKPGTGQ
- Place sequence A along the x-axis (as a row).
- Place sequence B along the y-axis (as a column).
[Figure: empty dot-plot grid with M E H R K P G T G Q along the x-axis and sequence B along the y-axis]
24
- Plot a dot every time an element of the row sequence matches an element of the column sequence.
- Do you see a diagonal line extending across the plot? If yes, the sequences match along that stretch.
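The dot-plot steps can be sketched as a small text-mode plot. This is an illustrative sketch, not the program used for the slides; dot_plot and render are made-up names, and a dot is drawn for every single-residue match with no window or threshold.

```python
def dot_plot(row_seq: str, col_seq: str) -> list[list[bool]]:
    """grid[y][x] is True when col_seq[y] matches row_seq[x]."""
    return [[r == c for r in row_seq] for c in col_seq]


def render(grid: list[list[bool]], row_seq: str, col_seq: str) -> str:
    """Draw the grid with the row sequence on top and the column sequence at left."""
    lines = ["  " + " ".join(row_seq)]
    for c, row in zip(col_seq, grid):
        lines.append(c + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


seq_a = seq_b = "MEHRKPGTGQ"
grid = dot_plot(seq_a, seq_b)
print(render(grid, seq_a, seq_b))
```

For these two identical sequences the main diagonal is fully dotted, and the two G residues add one extra dot on each side of it.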
25
Patterns in dot plot
When two sequences are "identical": the plot shows an unbroken main diagonal.
Sequence (on both axes): GGTCCTTGGCTGAAAGACCCCA
26
Application of dot plot
- Using self-comparison: finding repeats
- Sequence used (human ALU sequence): CATCTCAAAAACAACAACAAAAAAAAAAAAAAAAGAAAAAAAA
- Omit the main diagonal.
- Clusters of diagonal lines show repeats in the sequence.
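One way to sketch the self-comparison idea in code is to look for words of length k that occur at two different positions in the same sequence; each such pair corresponds to a short diagonal segment off the main diagonal. The function name repeated_words and the choice k = 6 are assumptions made for this illustration.

```python
def repeated_words(seq: str, k: int) -> list[tuple[int, int]]:
    """Return start pairs (i, j), i < j, where seq[i:i+k] == seq[j:j+k]."""
    positions: dict[str, list[int]] = {}
    for i in range(len(seq) - k + 1):
        positions.setdefault(seq[i:i + k], []).append(i)
    pairs = []
    for starts in positions.values():
        # Requiring i < j skips the main diagonal (i == j) and mirror duplicates.
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                pairs.append((starts[a], starts[b]))
    return sorted(pairs)


alu = "CATCTCAAAAACAACAACAAAAAAAAAAAAAAAAGAAAAAAAA"
hits = repeated_words(alu, 6)
print(len(hits), "off-diagonal word matches of length 6")
```

A large number of hits for a sequence compared with itself is the numerical counterpart of the clusters of diagonal lines in the plot.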
27
Notes: What are repeats?
- Repeats are stretches of residues that recur within a sequence.
- Importance of repeats:
  - In proteins:
    - Regulatory regions
    - Binding sites
  - In DNA:
    - Present in transposons and chromosomal mutational hotspots; many genetic diseases are associated with repeats, e.g. Huntington's disease.
28
Patterns in dot plot
When two sequences are similar: a broken diagonal; the interruptions show regions of mismatch.
GREGYPADSKGCKITCFLTAAGYCNTECTLKKGSSGYCAWPACYCYG
MKGMILFISCLLLIDIVVGGKEGYLMDHEGCKLSCFIRPSGYCGRECTLKKGS
29
Patterns in dot plot
Two different but related sequences: clusters of broken diagonals parallel to the central diagonal. The distance between the lines shows the number of insertions needed to get the alignment.
GREGYPADSKGCKITCFLTAAGYCNTECTLKKGSSGYCAWPA
ARDGYPVDEKGCKLSCLINDKWCNSACHSRGGKYGYCYTGGL
30
Two models of alignment: local and global alignments
- Global alignment:
  - Looks for similarity across the full extent of the sequences
    - Site: http://www2.igh.cnrs.fr/bin/align-guess.cgi
31
GLOBAL alignment
- The two sequences are matched across their whole length.
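The slides do not name an algorithm for global alignment. One standard choice is Needleman-Wunsch dynamic programming, sketched below in a score-only form. The scoring values (match +1, mismatch -1, a flat -2 per gap) and the function name are assumptions for illustration, not values taken from the slides.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best score for aligning all of a against all of b (Needleman-Wunsch)."""
    # Row 0: aligning an empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] against an empty prefix of b costs i gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    # The global score is the bottom-right cell: both sequences fully consumed.
    return prev[-1]


print(global_alignment_score("TTGACAC", "TTTACAC"))
```

Only two rows of the table are kept because this sketch returns the score alone; recovering the alignment itself would need the full table and a traceback.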
32
Local alignment
- Looks for regions of similarity in parts of the sequences only
- Software: BLAST, FASTA
33
Local alignment
- Example of a local alignment between two sequences using the lalign program (http://www.ch.embnet.org/software/LALIGN_form.html).
- Notice that the alignment is shown only for those regions that have strong identity or strong similarity.
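For comparison, local alignment is classically computed with Smith-Waterman dynamic programming. The sketch below is a score-only version under assumed scoring values (match +1, mismatch -1, gap -2); it is not a description of how lalign, BLAST or FASTA are implemented internally.

```python
def local_alignment_score(a: str, b: str, match: int = 1,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best score over all pairs of substrings of a and b (Smith-Waterman)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment restart anywhere,
            # which is what makes the result local rather than global.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


seq1 = "GREGYPADSKGCKITCFLTAAGYCNTECTLKKGSSGYCAWPACYCYG"
seq2 = "MKGMILFISCLLLIDIVVGGKEGYLMDHEGCKLSCFIRPSGYCGRECTLKKGS"
print(local_alignment_score(seq1, seq2))
```

The two sequences are the "similar" pair from the dot-plot slides; they share the stretch ECTLKKGS, so the local score is at least 8 under this scoring.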
34
Why two different models?
Global alignment
- For sequences with a high degree of homology
- Good for modelling
Local alignment
- Localised similarity (conserved regions of structural or functional importance, repeats, domains)
35
FASTA
- FASTA ("FAST-All", often glossed as "fast alignment"), by Pearson and Lipman.
- A heuristic method: it first looks for short word matches, then refines the best regions using dynamic programming.
- Websites available:
  - http://www.ebi.ac.uk/fasta33/
  - http://www.dna.affrc.go.jp/htdocs/Blast/fasta.html
36
What is BLAST?
- Basic Local Alignment Search Tool (BLAST)
- A method for pairwise alignment.
- Used to search a database (of nucleotide or protein sequences) for sequences homologous to a given query sequence.
- Related to FASTA: a similar word-based heuristic approach
  - Faster in generating output.
- Sites for doing BLAST:
  - http://www.ncbi.nlm.nih.gov
37
How to go about doing BLAST
SARS virus gene:
SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDTVYCPRHVICTAEDMLNPNYE
DLLIRKSNHSFLVQAGNVQLRVIGHSMQNCLLRLKVDTSNPKTPKYKFVRIQPGQTF
SVLACYNGSPSGVYQCAMRPNHTIKGSFLNGSCGSVGFNIDYDCVSFCYMHHMEL
PTGVHAGTDLEGKFYGPFVDRQTAQAAGTDTTITLNVLAWLYAAVINGDRWFLNRF
TTTLNDFNLVAMKYNYEPLTQDHVDILGPLSAQTGIAVLDMCAALKELLQNGMNGR
TILGSTILEDEFTPFDVVRQCSGVTFQ
39
BLAST output for a protein query sequence from a SARS virus
- Score (bits): the score accumulated letter by letter during alignment, based on the substitution matrix.
- The higher the score, the lower the E value.
40
E value
- E value: the number of hits with at least this score that one expects to get by chance alone.
- The lower the E value, the fewer chance hits are expected.
- An E value of zero or close to zero indicates a very good hit (highly homologous sequence).
- Some BLAST programs report a closely related probability, P(N).
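The link between bit score and E value can be made concrete with the standard Karlin-Altschul relation E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the database length. This formula is not on the slides; the sketch below also ignores the effective-length corrections that real BLAST applies, and the function name and the lengths used are hypothetical.

```python
def expected_chance_hits(bit_score: float, query_len: int, db_len: float) -> float:
    """E value from a bit score: E = m * n * 2**(-S'), without length corrections."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search: a 300-residue query against a 1e9-residue database.
for bits in (30, 50, 100):
    print(bits, expected_chance_hits(bits, 300, 1e9))
```

Each extra bit of score halves the E value, which is why high-scoring hits have E values that print as (almost) zero.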
41
BLAST OUTPUT
[Figure: BLAST alignment output, annotated to show where the identity and the similarity are reported]
42
BLAST query schemes: which program against which database?
- Amino acid sequence:
  - blastp (protein sequence db)
  - tblastn (translated nucleotide sequence db)
- DNA sequence:
  - blastn (nucleotide db)
  - blastx (protein sequence db)
  - tblastx (translated nucleotide sequence db)
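The query/database combinations above amount to a small lookup table. The sketch below encodes exactly the five pairings listed on the slide; the type labels and the function name choose_blast_program are invented for this example and are not options of any BLAST tool.

```python
# (query type, database type) -> BLAST program, as listed on the slide.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("protein", "translated nucleotide"): "tblastn",
    ("dna", "nucleotide"): "blastn",
    ("dna", "protein"): "blastx",
    ("dna", "translated nucleotide"): "tblastx",
}


def choose_blast_program(query_type: str, db_type: str) -> str:
    """Look up the program for a query/database pairing; raise if unsupported."""
    try:
        return BLAST_PROGRAMS[(query_type.lower(), db_type.lower())]
    except KeyError:
        raise ValueError(f"no BLAST program for {query_type} vs {db_type}") from None


print(choose_blast_program("DNA", "protein"))
```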
43
BLAST workflow for an unknown gene (cDNA)
- DNA sequencing gives the sequence of an unknown gene (cDNA):
    CTAACATGCTTAGGATAATGGCCTCTCTTGTTCTTGCTCGCAAACATAACACTTGCTGTAACTTATCACA
- Translate into 6 frames and choose the appropriate frame to get the amino acid sequence:
    NMLRIMASLVLARKHNTCCNLSHRFYRLANECAQVLSEMVMCGGSLYVKPGGTSSGDATTAYANSVFNIC
- Run BLAST.
- BLAST RESULTS: choose the best hit using the lowest E value and the highest % identity.
- If the hit has high % identity and a low E value: the function and family of the gene are found.
- Use multiple sequences with CLUSTAL to find conserved regions, domains and phylogenetic relations (which gene family is closest to your target gene/protein).
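The "translate into 6 frames" step can be sketched as follows, assuming the standard genetic code. The function names are illustrative, stop codons are written as *, and codons containing unexpected characters are written as X. The DNA string is the cDNA fragment shown on the slide.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate complete codons from the start of dna; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna: str) -> dict[str, str]:
    """Translations of the three forward and three reverse-complement frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


cdna = "CTAACATGCTTAGGATAATGGCCTCTCTTGTTCTTGCTCGCAAACATAACACTTGCTGTAACTTATCACA"
for name, protein in six_frames(cdna).items():
    print(name, protein)
```

For this fragment, forward frame 3 gives NMLRIMASLVLARKHNTCCNLS, which is the start of the amino acid sequence shown on the slide, so that is the frame to carry forward into BLAST.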
44
SUMMARY
Today we looked at methods to compare two sequences:
- Dot plots (simplest, graphical view)
- Different patterns of dot plots
- Local alignment
- Global alignment
- Difference between these two models
- FASTA
- BLAST
- Other types of BLAST