MGM workshop. 19 Oct 2010 Some frequently-used Bioinformatics Tools Konstantinos Mavrommatis Prokaryotic Superprogram
Outline
Pairwise Alignment
  Global/Local, Scoring
  BLAST, BLAT, SIM, LALIGN, Dotlet, Ublast
Multiple Sequence Alignment
  ClustalW, Kalign, MAFFT, Muscle, T-Coffee, MSA, DIALIGN, Match-Box, Multalin, MUSCA
Phylogenetic analysis and tree construction
  BIONJ, DendroUPGMA, PHYLIP, PhyML, Phylogeny.fr, POWER, BlastO, TraceSuite II
HMM
  Protein family profiles
Alignment
Find an arrangement of two sequences that identifies regions of similarity.
Insert spaces (gaps) at arbitrary locations so that both sequences have the same length and no two spaces fall in the same position.
Alignment methods: Dot plots
Global vs Local alignment
Global alignment: an alignment that assumes the two sequences are basically similar over their entire length.
Local alignment: an alignment that searches for segments of the two sequences that match well.
It may seem that one should always use local alignment. However, each has its applications.
Substitution matrices
Scoring an alignment
Global alignment
S1=HGSAQVKGHG
S2=KTEAEMKASEDLKKHGT
KTEAEMKAESEDLKKHGT
--HG--SA--Q-VKGHG-
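A global alignment like the one above can be computed by dynamic programming (the Needleman-Wunsch approach). The following is a minimal sketch, not the method used to produce the slide: the function name `global_align` and the scores (match +1, mismatch -1, gap -2) are assumptions chosen for illustration in place of a real substitution matrix, so the result need not reproduce the alignment shown.

```python
def global_align(s1, s2, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a simple (assumed) scoring scheme."""
    n, m = len(s1), len(s2)
    # score[i][j] = best score for aligning s1[:i] with s2[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in s2
                              score[i][j - 1] + gap)      # gap in s1
    # Trace back from the bottom-right corner to recover one optimal alignment.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                a1.append(s1[i - 1]); a2.append(s2[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-")
            i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(a1)), "".join(reversed(a2))


# The two sequences from the slide.
total, row1, row2 = global_align("HGSAQVKGHG", "KTEAEMKASEDLKKHGT")
print(total)
print(row1)
print(row2)
```

Both returned rows always have the same length, and removing the `-` characters gives back the input sequences.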
Local Alignment
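For comparison with global alignment, here is a minimal sketch of local alignment scoring in the style of Smith-Waterman. The only change to the dynamic programming recurrence is that a cell may restart at zero, so the best-scoring pair of segments can begin and end anywhere. The function name `local_align_score` and the scores (match +1, mismatch -1, gap -2) are illustrative assumptions, not values from the slides.

```python
def local_align_score(s1, s2, match=1, mismatch=-1, gap=-2):
    """Return (best local score, (end in s1, end in s2)) using assumed simple scores."""
    best, best_end = 0, (0, 0)
    prev = [0] * (len(s2) + 1)
    for i in range(1, len(s1) + 1):
        cur = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            # The 0 lets a new local alignment start at this cell.
            cur[j] = max(0,
                         prev[j - 1] + sub,
                         prev[j] + gap,
                         cur[j - 1] + gap)
            if cur[j] > best:
                best, best_end = cur[j], (i, j)
        prev = cur
    return best, best_end


# A shared segment embedded in unrelated flanks is found regardless of the flanks.
print(local_align_score("XXGATTACAXX", "YYGATTACAYY"))  # (7, (9, 9))
```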
How BLAST works
BLAST uses pre-indexed databases: it records the location of every 'word' of each database entry.
Identify High-scoring Segment Pairs (HSPs)
  Default word length: 11 bp or 3 aa.
  When two non-overlapping words within a certain distance of each other in the query are matched against a database entry, the region of the two sequences is called a segment pair.
  Slide query and target sequences across each other until the maximum number of HSPs for that target is found.
  Each segment pair is extended until the score drops by X below its maximum value.
Score the alignment
  A scoring matrix is used.
  A match gets a positive score.
  Gaps introduced between HSPs during sliding get a negative score.
  The total alignment score is subjected to statistical analysis to calculate the significance of the score relative to chance.
Repeat for every sequence in the database.
Return total results.
How BLAST works
Query:              MLVTTILAFALFKNSYAQQCGSQAGGALCSNRLCCSKFGYCGSTDPYCGTGCQSQCGGGG
Subject (database): VVWMLLVGGSYGVQCGTEAGGALCPRGLCCSQWGWCGSTIDYCGPGCQSQCGG
Common 3mer, extend: GCQSQCGG
HSP: ++   L   SY  QCG++AGGALC   LCCS++G+CGST  YCG GCQSQCGG

Score = 66.6 bits (161), Expect = 3e-12, Method: Compositional matrix adjust.
Identities = 32/53 (60%), Positives = 39/53 (74%), Gaps = 0/53 (0%)

Query  6   ILAFALFKNSYAQQCGSQAGGALCSNRLCCSKFGYCGSTDPYCGTGCQSQCGG  58
           ++   L   SY  QCG++AGGALC   LCCS++G+CGST  YCG GCQSQCGG
Sbjct  15  VVWMLLVGGSYGVQCGTEAGGALCPRGLCCSQWGWCGSTIDYCGPGCQSQCGG  67
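The seed-and-extend idea can be sketched in a few lines. This is a simplified illustration, not BLAST itself: it seeds on single exact word matches (rather than two nearby words), extends without gaps, and scores with an assumed +1/-1 scheme instead of a substitution matrix. The function names `build_word_index`, `extend` and `find_hsps` and the default `x_drop=3` are hypothetical.

```python
from collections import defaultdict


def build_word_index(subject, k):
    """Map every k-letter word of the subject to its start positions."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index


def extend(query, subject, q, s, step, score, x_drop, match, mismatch):
    """Walk along the diagonal in one direction; return (best score, letters gained)."""
    best, best_len, length = score, 0, 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # score fell more than X below the best seen so far
        q += step
        s += step
    return best, best_len


def find_hsps(query, subject, k=3, x_drop=3, match=1, mismatch=-1):
    """Return (score, query start, query end, subject start) tuples, best first."""
    index = build_word_index(subject, k)
    hsps = set()
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            score, right = extend(query, subject, q + k, s + k, 1,
                                  k * match, x_drop, match, mismatch)
            score, left = extend(query, subject, q - 1, s - 1, -1,
                                 score, x_drop, match, mismatch)
            hsps.add((score, q - left, q + k + right, s - left))
    return sorted(hsps, reverse=True)


# Toy example: the shared segment GCQSQCGG sits between unrelated flanks.
print(find_hsps("AAGCQSQCGGTT", "CCGCQSQCGGAA"))  # [(8, 2, 10, 2)]

# The query and subject from the slide; the top entry is the best ungapped segment
# under this simplified scoring.
query = "MLVTTILAFALFKNSYAQQCGSQAGGALCSNRLCCSKFGYCGSTDPYCGTGCQSQCGGGG"
subject = "VVWMLLVGGSYGVQCGTEAGGALCPRGLCCSQWGWCGSTIDYCGPGCQSQCGG"
print(find_hsps(query, subject)[0])
```

Seeds that lie on the same diagonal extend to the same segment, which is why the results are collected in a set.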
Types of BLAST
Nucleic sequence: atcgatatatatagactgactgact
Protein sequence: MTAVYHILRALRARARVARARVH

Program   Query                                Database
blastn    nucleic acid                         nucleic acid sequences
blastp    protein                              protein sequences
blastx    nucleic acid (6-frame translation)   protein sequences
tblastn   protein                              nucleic acid sequences (6-frame translation)
tblastx   nucleic acid (6-frame translation)   nucleic acid sequences (6-frame translation)
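The 6-frame translation used by the translated BLAST programs can be sketched as follows: three reading frames on the given strand and three on its reverse complement. This sketch assumes the standard genetic code; the function names and the frame labels (`+1` to `-3`) are illustrative choices, and incomplete trailing codons are simply dropped.

```python
BASES = "TCAG"
# Standard genetic code, listed with codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (upper case)."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return a dict mapping frame labels '+1'..'+3', '-1'..'-3' to protein strings."""
    frames = {}
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# The nucleic sequence from the slide; '*' marks a stop codon.
for label, protein in six_frame_translation("atcgatatatatagactgactgact").items():
    print(label, protein)
```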
Exact multiple alignment by dynamic programming
Complexity = $O(N^{S} \cdot 2^{S} \cdot S^{2})$
N: length of sequences
S: number of sequences
Only feasible for 4-5 sequences max.
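A quick calculation shows why the exact method stops being practical so early. The sketch below reads the complexity as N^S * 2^S * S^2 and treats it as a rough operation count; the function name `exact_msa_cost` and the choice N = 100 are assumptions for illustration.

```python
def exact_msa_cost(n, s):
    """Rough operation count N^S * 2^S * S^2 for S sequences of length N."""
    return (n ** s) * (2 ** s) * (s ** 2)


# Growth with the number of sequences for sequences of length 100.
for s in range(2, 7):
    print(s, f"{exact_msa_cost(100, s):.1e}")
```

Each added sequence multiplies the count by more than 2N, so even short sequences become intractable after a handful of sequences.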
Neighbor Joining
Unrooted NJ tree
Comparison of Multiple sequence alignment programs
Primary sequence changes:
Profiles
CGGSV
0.8 * 0.4 * 0.8 * 0.6 * 0.2 = 0.031
ln(0.8) + ln(0.4) + ln(0.8) + ln(0.6) + ln(0.2) = -3.48
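The arithmetic above can be checked with a small script. It is a sketch under one stated assumption: only the per-position probabilities of the observed residues (0.8, 0.4, 0.8, 0.6, 0.2 for C, G, G, S, V) are known from the slide, so each profile column below holds just that one entry and every other residue is treated as probability 0. The names `profile`, `profile_probability` and `profile_log_score` are illustrative.

```python
import math

# One dict per profile column; only the residues scored on the slide are filled in.
profile = [{"C": 0.8}, {"G": 0.4}, {"G": 0.8}, {"S": 0.6}, {"V": 0.2}]


def profile_probability(seq, profile):
    """Product of the per-column probabilities of the residues in seq."""
    p = 1.0
    for residue, column in zip(seq, profile):
        p *= column.get(residue, 0.0)
    return p


def profile_log_score(seq, profile):
    """Sum of natural-log probabilities; -inf if any residue has probability 0."""
    total = 0.0
    for residue, column in zip(seq, profile):
        prob = column.get(residue, 0.0)
        if prob == 0.0:
            return float("-inf")
        total += math.log(prob)
    return total


print(round(profile_probability("CGGSV", profile), 3))  # 0.031
print(round(profile_log_score("CGGSV", profile), 2))    # -3.48
```

Summing logarithms gives the same ranking as multiplying probabilities and avoids numerical underflow on long sequences.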
Hidden Markov Models
Assumptions:
  Observations are ordered.
  The random process can be represented by a stochastic finite state machine with emitting states.
Probabilistic parameters of a Hidden Markov Model:
  x: states
  y: possible observations
  a: state transition probabilities
  b: output/emission probabilities
HMM estimation, usage & applications
Training/Estimation
  Feed an architecture (given in advance) a set of observation sequences.
  The training process will iteratively alter its parameters to fit the training set.
  The trained model will assign the training sequences high probabilities.
Usage
  Evaluate the probability of an observation sequence given the model (Forward).
  Find the most likely path through the model for a given observation sequence (Viterbi).
Applications
  Gene finding
  Protein family modeling
  …
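The two usage modes, Forward and Viterbi, can be illustrated on a toy model. Everything in this sketch is hypothetical: the two states (`A_rich`, `B_rich`), the two-letter alphabet and all probabilities are made up for the example, and raw probabilities are multiplied directly, which is only safe for short sequences (real tools work in log space).

```python
def forward(obs, states, start, trans, emit):
    """Probability of the observation sequence, summed over all state paths."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {s: sum(alpha[r] * trans[r][s] for r in states) * emit[s][symbol]
                 for s in states}
    return sum(alpha.values())


def viterbi(obs, states, start, trans, emit):
    """Most likely state path and its probability."""
    v = {s: start[s] * emit[s][obs[0]] for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for symbol in obs[1:]:
        pointers, nxt = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: v[r] * trans[r][s])
            pointers[s] = best_prev
            nxt[s] = v[best_prev] * trans[best_prev][s] * emit[s][symbol]
        back.append(pointers)
        v = nxt
    last = max(states, key=lambda s: v[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, v[last]


# Hypothetical toy model (all numbers are illustrative assumptions).
states = ["A_rich", "B_rich"]
start = {"A_rich": 0.5, "B_rich": 0.5}
trans = {"A_rich": {"A_rich": 0.7, "B_rich": 0.3},
         "B_rich": {"A_rich": 0.3, "B_rich": 0.7}}
emit = {"A_rich": {"A": 0.9, "B": 0.1},
        "B_rich": {"A": 0.1, "B": 0.9}}

print(forward("AAB", states, start, trans, emit))
print(viterbi("AAB", states, start, trans, emit))
```

Forward sums over every path, so its result is never smaller than the probability of the single best path that Viterbi returns.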
Profile HMMs
Families of functional biological sequences
  Primary sequences have diverged through evolution while maintaining structure/function.
Questions:
  Does a biological sequence belong to a certain protein family? For example, is a given protein (sequence) a globin?
  Given a set of sequences, find more sequences of the same family.
Trade-offs
Advantages        Disadvantages
Statistics        State independence
Modularity        Over-fitting
Transparency      Local maxima
Prior knowledge   Speed
Questions?