Published by Alexandrina Walsh. Modified over 8 years ago.
1
MGM workshop. 19 Oct 2010
Some frequently-used Bioinformatics Tools
Konstantinos Mavrommatis
Prokaryotic Superprogram
2
Outline
Pairwise Alignment
  Global/Local, Scoring
  BLAST, BLAT, SIM, LALIGN, Dotlet, Ublast
Multiple Sequence Alignment
  ClustalW, Kalign, MAFFT, Muscle, T-Coffee, MSA, DIALIGN, Match-Box, Multalin, MUSCA
Phylogenetic analysis and tree construction
  BIONJ, DendroUPGMA, PHYLIP, PhyML, Phylogeny.fr, POWER, BlastO, TraceSuite II
HMM
  Protein family profiles
http://expasy.org/tools/
3
Alignment
Find an arrangement of two sequences that identifies regions of similarity.
Insert spaces (gaps) at arbitrary locations so that both sequences have the same length and no position holds a space in both sequences.
4
Alignment methods: Dot plots
5
Global vs Local alignment
Global alignment: an alignment that assumes the two sequences are broadly similar over their entire length.
Local alignment: an alignment that searches for segments of the two sequences that match well.
It may seem that one should always use local alignment. However, each has its applications.
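The difference between the two modes can be sketched with one dynamic-programming routine that computes either score. This is a minimal illustration, not the method of any particular tool: the scoring values (match = 1, mismatch = -1, gap = -2) are assumptions chosen for readability, where a real aligner would use a substitution matrix and affine gap penalties. The function name `alignment_score` is hypothetical.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Best global (Needleman-Wunsch) or local (Smith-Waterman) alignment score."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global mode: leading gaps are penalised
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # Local mode: a segment may start anywhere, so never go below zero
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    # Global: whole sequences must be aligned; local: best-scoring segment pair
    return best if local else score[-1][-1]


s1 = "HGSAQVKGHG"
s2 = "KTEAEMKASEDLKKHGT"
global_score = alignment_score(s1, s2)
local_score = alignment_score(s1, s2, local=True)
print("global:", global_score, "local:", local_score)
```

The global score pays for every unmatched end of the longer sequence, while the local score only reflects the best-matching segment, which is why the two modes suit different questions.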
6
Substitution matrices http://www.russelllab.org/aas/
7
Scoring an alignment
8
Global alignment
S1 = HGSAQVKGHG
S2 = KTEAEMKASEDLKKHGT
9
KTEAEMKAESEDLKKHGT
--HG--SA--Q-VKGHG-
10
Local Alignment
11
How BLAST works
BLAST uses pre-indexed databases.
  It remembers the location of every 'word' of each database entry.
Identify High-scoring Segment Pairs (HSPs).
  Default word lengths: 11 bp or 3 aa.
  When two non-overlapping words within a certain distance of each other in the query are matched against a database entry, the region of the two sequences is called a segment pair.
  Slide query and target sequences across each other until the maximum number of HSPs for that target is found.
  Each segment pair is extended until the score drops by X below its maximum value.
Score the alignment.
  A scoring matrix is used.
  Gaps introduced between HSPs during sliding get a negative score.
  A match gets a positive score.
  The total alignment score is subjected to statistical analysis to calculate the significance of the score versus chance.
Repeat for every sequence in the database.
Return total results.
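The index, seed, and extend steps can be sketched in a few lines. This is a simplified illustration under stated assumptions, not BLAST's actual implementation: seeds are exact word matches from a single hit (no neighbourhood words, no two-hit rule), scoring is a flat match = 1 / mismatch = -1 instead of a substitution matrix, extension is ungapped, and no statistics are computed. The names `index_words`, `extend`, and `search` are hypothetical.

```python
from collections import defaultdict


def index_words(db, k=3):
    """Pre-index the database: word -> list of (entry name, offset)."""
    index = defaultdict(list)
    for name, seq in db.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def extend(query, subject, q, s, k, x_drop, match=1, mismatch=-1):
    """Ungapped extension of a seed at query[q:q+k] / subject[s:s+k].

    Each direction stops once the running score has dropped x_drop below
    its maximum. Returns (score, query_start, query_end).
    """
    def walk(step, qi, si):
        score = best = best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            score += match if query[qi] == subject[si] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score >= x_drop:
                break
            qi += step
            si += step
        return best, best_len

    right, right_len = walk(1, q + k, s + k)
    left, left_len = walk(-1, q - 1, s - 1)
    return k * match + left + right, q - left_len, q + k + right_len


def search(query, db, k=3, x_drop=3):
    """Return the best-scoring segment per database entry."""
    index = index_words(db, k)
    hits = {}
    for q in range(len(query) - k + 1):
        for name, s in index.get(query[q:q + k], []):
            hit = extend(query, db[name], q, s, k, x_drop)
            if hit[0] > hits.get(name, (0, 0, 0))[0]:
                hits[name] = hit
    return hits


query = "MLVTTILAFALFKNSYAQQCGSQAGGALCSNRLCCSKFGYCGSTDPYCGTGCQSQCGGGG"
db = {
    "subject": "VVWMLLVGGSYGVQCGTEAGGALCPRGLCCSQWGWCGSTIDYCGPGCQSQCGG",
    "unrelated": "WWWWWWWWWW",
}
for name, (score, start, end) in search(query, db).items():
    print(name, score, query[start:end])
```

Building the word index once is what makes the search fast: only database entries that share a word with the query are ever extended and scored.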
12
How BLAST works
Query:              MLVTTILAFALFKNSYAQQCGSQAGGALCSNRLCCSKFGYCGSTDPYCGTGCQSQCGGGG
Subject (database): VVWMLLVGGSYGVQCGTEAGGALCPRGLCCSQWGWCGSTIDYCGPGCQSQCGG
Common 3mer (extend): GCQSQCGG
HSP:
Score = 66.6 bits (161), Expect = 3e-12, Method: Compositional matrix adjust.
Identities = 32/53 (60%), Positives = 39/53 (74%), Gaps = 0/53 (0%)
Query  6   ILAFALFKNSYAQQCGSQAGGALCSNRLCCSKFGYCGSTDPYCGTGCQSQCGG  58
           ++   L   SY  QCG++AGGALC   LCCS++G+CGST  YCG GCQSQCGG
Sbjct  15  VVWMLLVGGSYGVQCGTEAGGALCPRGLCCSQWGWCGSTIDYCGPGCQSQCGG  67
13
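Types of BLAST
Nucleic sequence: atcgatatatatagactgactgact
Protein sequence: MTAVYHILRALRARARVARARVH

Program   Query                               Database
blastn    nucleic acid                        nucleic acid sequence database
blastp    protein                             protein sequence database
blastx    nucleic acid (6-frame translation)  protein sequence database
tblastn   protein                             nucleic acid sequence database (6-frame translation)
tblastx   nucleic acid (6-frame translation)  nucleic acid sequence database (6-frame translation)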
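The 6-frame translation used by the translated BLAST programs can be sketched as follows: three reading frames on the given strand plus three on its reverse complement. This is an illustrative sketch that assumes the standard genetic code; the helper names (`translate`, `six_frames`) are hypothetical, and codons containing anything other than A, C, G, T are rendered as `X`.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def reverse_complement(dna):
    """Reverse-complement a DNA string (result is upper case)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the translations of frames +1..+3 and -1..-3."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frames("atcgatatatatagactgactgact").items():
    print(frame, protein)
```

Searching all six frames lets a nucleotide sequence be compared at the protein level without knowing in advance which strand or frame encodes the protein.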
15
Exact multiple alignment by dynamic programming
Complexity = $O(N^S \, 2^S \, S^2)$
  N: length of sequences
  S: number of sequences
Only feasible for 4-5 sequences max.
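To see why only a handful of sequences is feasible, the growth can be tabulated directly. The sketch below reads the slide's complexity as N^S * 2^S * S^2 (the usual estimate for sum-of-pairs dynamic programming, stated here as an assumption); the sequence length of 300 is an arbitrary illustrative value and `exact_msa_cost` is a hypothetical helper.

```python
def exact_msa_cost(n, s):
    """Operation-count estimate N^S * 2^S * S^2 for aligning s sequences of length n."""
    return (n ** s) * (2 ** s) * (s ** 2)


# Each extra sequence multiplies the cost by roughly 2 * N
for s in range(2, 7):
    print(f"S={s}: {exact_msa_cost(300, s):.2e} operations")
```

Each additional sequence adds a full dimension to the dynamic-programming table, which is why practical tools fall back on heuristics such as progressive alignment.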
17
Neighbor Joining
18
Unrooted NJ tree
19
Comparison of multiple sequence alignment programs
20
Primary sequence changes:
21
Profiles
CGGSV
0.8 * 0.4 * 0.8 * 0.6 * 0.2 = 0.031
ln(0.8) + ln(0.4) + ln(0.8) + ln(0.6) + ln(0.2) = -3.48
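The arithmetic for CGGSV can be reproduced in a few lines. The five per-position probabilities are the ones on the slide; the rest of the profile is not given, so the sketch only scores this one sequence.

```python
import math

# Probability of each residue of CGGSV at its profile position (values from the slide)
position_probs = [0.8, 0.4, 0.8, 0.6, 0.2]

probability = math.prod(position_probs)               # product of the column probabilities
log_score = sum(math.log(p) for p in position_probs)  # same quantity in log space

print(round(probability, 3), round(log_score, 2))  # prints 0.031 -3.48
```

Working with the sum of logarithms gives the same ranking of sequences as the product, but avoids numerical underflow when a profile has many positions.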
22
Hidden Markov Models
Assumptions:
  Observations are ordered.
  The random process can be represented by a stochastic finite state machine with emitting states.
Probabilistic parameters of a Hidden Markov Model:
  x: states
  y: possible observations
  a: state transition probabilities
  b: output/emission probabilities
23
HMM estimation, usage & applications
Training/Estimation
  Feed an architecture (given in advance) a set of observation sequences.
  The training process will iteratively alter its parameters to fit the training set.
  The trained model will assign the training sequences high probabilities.
Usage
  Evaluate the probability of an observation sequence given the model (Forward).
  Find the most likely path through the model for a given observation sequence (Viterbi).
Applications
  Gene finding
  Protein family modeling
  …
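The two usage modes, Forward and Viterbi, can be sketched on a toy model. Everything about the model below is made up for illustration: the two state names, the three-letter alphabet, and all start, transition, and emission probabilities are assumptions, not values from the workshop. Training is not shown.

```python
# Toy two-state HMM; all names and numbers are hypothetical
states = ("match", "background")
start = {"match": 0.5, "background": 0.5}
trans = {
    "match": {"match": 0.8, "background": 0.2},
    "background": {"match": 0.3, "background": 0.7},
}
emit = {
    "match": {"C": 0.7, "G": 0.2, "S": 0.1},
    "background": {"C": 0.2, "G": 0.4, "S": 0.4},
}


def forward(obs):
    """Probability of the observations given the model, summed over all paths."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit[s][symbol] * sum(alpha[p] * trans[p][s] for p in states)
            for s in states
        }
    return sum(alpha.values())


def viterbi(obs):
    """Most likely state path and its probability, as (probability, path)."""
    best = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for symbol in obs[1:]:
        new = {}
        for s in states:
            # Best predecessor for state s
            prob, path = max(
                ((best[p][0] * trans[p][s], best[p][1]) for p in states),
                key=lambda t: t[0],
            )
            new[s] = (prob * emit[s][symbol], path + [s])
        best = new
    return max(best.values(), key=lambda t: t[0])


obs = "CCGS"
print("P(obs | model) =", forward(obs))
print("best path      =", viterbi(obs))
```

Forward sums over every possible state path, whereas Viterbi keeps only the single best one, so the Viterbi probability can never exceed the Forward probability for the same sequence.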
24
Profile HMMs
Families of functional biological sequences
  Primary sequences have diverged due to evolution, while maintaining structure/function.
Questions:
  Does a biological sequence belong to a certain protein family? For example, is a given protein (sequence) a globin?
  Given a set of sequences, find more sequences of the same family.
26
Trade-offs
Advantages        Disadvantages
Statistics        State independence
Modularity        Over-fitting
Transparency      Local maxima
Prior knowledge   Speed
27
Questions?