Lecture 1 BNFO 601 Usman Roshan.

Slides:



Advertisements
Similar presentations
Global Sequence Alignment by Dynamic Programming.
Advertisements

Hidden Markov Model in Biological Sequence Analysis – Part 2
1 Introduction to Sequence Analysis Utah State University – Spring 2012 STAT 5570: Statistical Bioinformatics Notes 6.1.
Lecture 8 Alignment of pairs of sequence Local and global alignment
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Alignments 1 Sequence Analysis.
BNFO 602 Multiple sequence alignment Usman Roshan.
Lecture 6, Thursday April 17, 2003
Introduction to Bioinformatics Burkhard Morgenstern Institute of Microbiology and Genetics Department of Bioinformatics Goldschmidtstr. 1 Göttingen, March.
Lecture 1 BNFO 601 Usman Roshan. Course overview Perl progamming language (and some Unix basics) –Unix basics –Intro Perl exercises –Dynamic programming.
. Class 1: Introduction. The Tree of Life Source: Alberts et al.
Lecture 1 BNFO 135 Usman Roshan. Course overview Perl progamming language (and some Unix basics) –Unix basics –Intro Perl exercises –Programs for comparing.
Expected accuracy sequence alignment
Lecture 1 BNFO 240 Usman Roshan. Course overview Perl progamming language (and some Unix basics) Sequence alignment problem –Algorithm for exact pairwise.
Reconfigurable Computing S. Reda, Brown University Reconfigurable Computing (EN2911X, Fall07) Lecture 18: Application-Driven Hardware Acceleration (4/4)
BNFO 602 Lecture 2 Usman Roshan. Sequence Alignment Widely used in bioinformatics Proteins and genes are of different lengths due to error in sequencing.
Reminder -Structure of a genome Human 3x10 9 bp Genome: ~30,000 genes ~200,000 exons ~23 Mb coding ~15 Mb noncoding pre-mRNA transcription splicing translation.
Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.
BNFO 602 Lecture 1 Usman Roshan.
BNFO 240 Usman Roshan. Last time Traceback for alignment How to select the gap penalties? Benchmark alignments –Structural superimposition –BAliBASE.
BNFO 602, Lecture 3 Usman Roshan Some of the slides are based upon material by David Wishart of University.
Lecture 1 BNFO 601 Usman Roshan. Course overview Perl progamming language (and some Unix basics) –Unix basics –Intro Perl exercises –Sequence alignment.
Pairwise Alignment Global & local alignment Anders Gorm Pedersen Molecular Evolution Group Center for Biological Sequence Analysis.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 20, 2003.
Lecture 4 BNFO 235 Usman Roshan. IUPAC Nucleic Acid symbols.
BNFO 602 Lecture 2 Usman Roshan. Bioinformatics problems Sequence alignment: oldest and still actively studied Genome-wide association studies: new problem,
BNFO 602 Multiple sequence alignment Usman Roshan.
BNFO 235 Lecture 5 Usman Roshan. What we have done to date Basic Perl –Data types: numbers, strings, arrays, and hashes –Control structures: If-else,
. Computational Genomics Lecture #3a (revised 24/3/09) This class has been edited from Nir Friedman’s lecture which is available at
BNFO 136 Sequence alignment Usman Roshan. Pairwise alignment X: ACA, Y: GACAT Match=8, mismatch=2, gap-5 ACA---ACA---ACAACA---- GACATGACATGACATG--ACAT.
Computational Biology, Part 2 Sequence Comparison with Dot Matrices Robert F. Murphy Copyright  1996, All rights reserved.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Pairwise alignment Computational Genomics and Proteomics.
Lecture 1 BNFO 136 Usman Roshan. Course overview Pre-req: BNFO 135 or approval of instructor Python progamming language and Perl for continuing students.
Sequence comparison: Local alignment
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Developing Pairwise Sequence Alignment Algorithms
An Introduction to Bioinformatics
CISC667, S07, Lec5, Liao CISC 667 Intro to Bioinformatics (Spring 2007) Pairwise sequence alignment Needleman-Wunsch (global alignment)
Evolution and Scoring Rules Example Score = 5 x (# matches) + (-4) x (# mismatches) + + (-7) x (total length of all gaps) Example Score = 5 x (# matches)
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Comp. Genomics Recitation 3 The statistics of database searching.
Construction of Substitution Matrices
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Chapter 3 Computational Molecular Biology Michael Smith
Intro to Alignment Algorithms: Global and Local Intro to Alignment Algorithms: Global and Local Algorithmic Functions of Computational Biology Professor.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Construction of Substitution matrices
Expected accuracy sequence alignment Usman Roshan.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Evaluation of protein alignments Usman Roshan BNFO 236.
Lecture 1 BNFO 601 Usman Roshan. Course overview Perl progamming language (and some Unix basics) –Unix basics –Intro Perl exercises –Dynamic programming.
4.2 - Algorithms Sébastien Lemieux Elitra Canada Ltd.
Bioinformatics Overview
The ideal approach is simultaneous alignment and tree estimation.
Sequence comparison: Local alignment
Ab initio gene prediction
BNFO 602 Lecture 2 Usman Roshan.
BNFO 136 Sequence alignment
Pairwise sequence Alignment.
BNFO 602 Lecture 2 Usman Roshan.
Intro to Alignment Algorithms: Global and Local
Pairwise Sequence Alignment
BCB 444/544 Lecture 7 #7_Sept5 Global vs Local Alignment
Pairwise Alignment Global & local alignment
HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY
Sequence alignment with Needleman-Wunsch
Basic Local Alignment Search Tool (BLAST)
Reconfigurable Computing (EN2911X, Fall07)
Presentation transcript:

Lecture 1 BNFO 601 Usman Roshan

Course overview Perl progamming language (and some Unix basics) Intro Perl exercises Dynamic programming and Viterbi algorithm in Perl Sequence analysis Algorithms for exact and heuristic pairwise alignment Hidden Markov models BLAST Short read alignments Whole genome sequence alignment High performance methods for very large datasets

Overview (contd) Grade: Two 25% exams, 30% final exam, and six programming assignments (20%) Exams will cover Perl and bioinformatics algorithms Recommended Texts: Introduction to Bioinformatics Algorithms by Pavel Pevzner and Neil Jones Biological sequence analysis by Durbin et. al. Introduction to Bioinformatics by Arthur Lesk Beginning Perl for Bioinformatics by James Tisdall Introduction to Machine Learning by Ethem Alpaydin

Nothing in biology makes sense, except in the light of evolution -3 mil yrs -2 mil yrs -1 mil yrs today TAGGCCTT (Human) TAGCCCTTA (Monkey) A_C_CTT (Cat) A_CACTTC (Lion) _G_GCTT (Mouse) _GGGCTT TAGACCTT A_CACTT AAGGCTT T_GACTT AAGACTT AAGACTT T_GACTT AAGGCTT _GGGCTT TAGACCTT A_CACTT ACCTT (Cat) ACACTTC (Lion) TAGCCCTTA (Monkey) TAGGCCTT (Human) GGCTT (Mouse) _GGGCTT TAGACCTT A_CACTT AAGGCTT T_GACTT AAGACTT T_GACTT AAGGCTT AAGACTT AAGACTT

Representing DNA in a format manipulatable by computers DNA is a double-helix molecule made up of four nucleotides: Adenosine (A) Cytosine (C) Thymine (T) Guanine (G) Since A (adenosine) always pairs with T (thymine) and C (cytosine) always pairs with G (guanine) knowing only one side of the ladder is enough We represent DNA as a sequence of letters where each letter could be A,C,G, or T. For example, for the helix shown here we would represent this as CAGT.

Transcription and translation

Amino acids Proteins are chains of amino acids. There are twenty different amino acids that chain in different ways to form different proteins. For example, FLLVALCCRFGH (this is how we could store it in a file) This sequence of amino acids folds to form a 3-D structure

Protein folding

Protein folding The protein folding problem is to determine the 3-D protein structure from the sequence. Experimental techniques are very expensive. Computational are cheap but difficult to solve. By comparing sequences we can deduce the evolutionary conserved portions which are also functional (most of the time).

Protein structure Primary structure: sequence of amino acids. Secondary structure: parts of the chain organizes itself into alpha helices, beta sheets, and coils. Helices and sheets are usually evolutionarily conserved and can aid sequence alignment. Tertiary structure: 3-D structure of entire chain Quaternary structure: Complex of several chains

Key points DNA can be represented as strings consisting of four letters: A, C, G, and T. They could be very long, e.g. thousands and even millions of letters Proteins are also represented as strings of 20 letters (each letter is an amino acid). Their 3-D structure determines the function to a large extent.

Comparison of sequences Fundamental in bioinformatics Many applications Evolutionary conserved regions Functional sites in proteins Phylogeny reconstruction Protein structural contact prediction Gene splicing Genome assembly Database search

Pairwise sequence alignment Comparison of two sequences We want to determine the substitutions, insertions, and deletions that occurred between the two sequences

Pairwise alignment Definition of a sequence alignment

Pairwise alignment

Pairwise alignment X: ACA, Y: GACAT Match=8, mismatch=2, gap-5 ACA-- -ACA- --ACA ACA---- GACAT GACAT GACAT G—ACAT 2+2+2-5-5 -5+8+8+8-5 -5-5+2+2+2 2-5-5-5-5-5-5 Score =-4 14 -4 -28

Traceback We can compute an alignment of DNA (or protein or RNA) sequences X and Y with a traceback matrix T. Sequence X is aligned along the rows and Y along the columns. Each entry of the matrix T contains D, L, or U specifying diagonal, left or upper

Traceback X: ACA, Y=TACAG T A C G L U D

Traceback X: ACA, Y=TACAG T A C G L U D

Traceback pseudocode aligned_seq1 = "" aligned_seq2 = "" i = len(seq1) j = len(seq2) while(i !=0 or j != 0): if(T[i][j] == “L”): aligned_seq1 = “-” + aligned_seq1 aligned_seq2 = seq2[j-1] + aligned_seq2 j = j - 1 elif(T[i][j] == "U"): aligned_seq2 = "-" + aligned_seq2 aligned_seq1 = seq1[i-1] + aligned_seq1 i = i - 1 else:

Optimal alignment An alignment can be specified by the traceback matrix. How do we determine the traceback for the highest scoring alignment? Needleman-Wunsch algorithm for global alignment First proposed in 1970 Widely used in genomics/bioinformatics Dynamic programming algorithm

Needleman-Wunsch Input: Notation: X = x1x2…xn, Y=y1y2…ym (X is seq1 and Y is seq2) Notation: X1..i = x1x2…xi V(X1..i,Y1..j) = Optimal alignment score of sequences X1..i and Y1..j. Suppose we know the optimal alignment scores of X1…i-1 and Y1…j-1 X1…i and Y1...j-1 X1...i-1 and Y1…j

Needleman-Wunsch Then the optimal alignment score of X1…i and Y1…j is the maximum of V(X1…i-1,Y1…j-1) + match/mismatch V (X1…i,Y1…j-1) + gap V (X1…i-1,Y1…j) + gap We build on this observation to compute V(X1...n,Y1...m)

Needleman-Wunsch Define V to be a two dimensional matrix with len(X)+1 rows and len(Y)+1 columns Let V[i][j] be the score of the optimal alignment of X1…i and Y1…j. Let m be the match cost, mm be mismatch, and g be the gap cost.

NW pseudocode Initialization: for i = 1 to length of seq1 { V[i][0] = i*g; } For i = 1 to length of seq2 { V[0][i] = i*g; } Recurrence: for i = 1 to length of seq1{ for j = 1 to length of seq2{ V[i-1][j-1] + m(or mm) V[i][j] = max { V[i-1][j] + g V[i][j-1] + g if(maximum is V[i-1][j-1] + m(or mm)) then T[i][j] = ‘D’ else if (maximum is V[i-1][j] + g) then T[i][j] = ‘U’ else then T[i][j] = ‘L’ }

NW example V G A C A T A C T Input: seq1: ACA seq2: GACAT m = 5 mm = -4 gap = -20 seq1 is lined along the rows and seq2 is along the columns G A C A T -20 -40 -60 -80 -100 -4 -15 -35 -55 -75 -24 -8 -10 -30 -50 -44 -19 -12 -5 -25 A C T L U D

NW parameters How do we pick gap penalty and match/mismatch costs? Proteins/RNA both have 3-D structure. We can produce high quality manual alignments by hand if the structure is available. These alignments can then serve as a benchmark to train gap parameters so that the alignment program produces correct alignments.

Structural alignment - example 1 Alignment of thioredoxins from human and fly taken from the Wikipedia website. This protein is found in nearly all organisms and is essential for mammals. PDB ids are 3TRX and 1XWC.

Structural alignment - example 2 Taken from http://bioinfo3d.cs.tau.ac.il/Align/FlexProt/flexprot.html Unaligned proteins. 2bbm and 1top are proteins from fly and chicken respectively. Computer generated aligned proteins

Benchmark alignments Protein alignment benchmarks BAliBASE, SABMARK, PREFAB, HOMSTRAD are frequently used in studies for protein alignment. Proteins benchmarks are generally large and have been in the research community for sometime now. BAliBASE 3.0

NW parameters With a procedure called cross-validation we use the benchmark to train gap parameters so that the alignment program produces correct alignments. How can we quantify the “correctness” of a test alignment with respect to a reference? We measure the number of pairs in the test that are also aligned in the reference alignment. This is also called the sum-of-pairs accuracy.

NW parameters For match and mismatch costs we use a more direct approach: log likelihood scores We calculate probabilities from the benchmark The same cannot be done for gap penalties because in benchmarks because of uncertainties in gap placements

Biologically realistic scoring matrices PAM and BLOSUM are most popular PAM was developed by Margaret Dayhoff and co-workers in 1978 by examining 1572 mutations between 71 families of closely related proteins BLOSUM is more recent and computed from blocks of sequences with sufficient similarity

Computing scoring matrices Start with a set of reference alignments Suppose we want to compute the score of A aligning to C Count the number of times A aligns to C Count the number of A’s and C’s Compute pAC the probability of A aligning to C and pA and pC the background probabilities of A and C Compute the log likelihood ratio

Next week Basics of Unix Perl programming Basics Exercises Dynamic programming solution to sequence alignment in Perl