Lecture 1 BNFO 601 Usman Roshan. Course overview Perl progamming language (and some Unix basics) –Unix basics –Intro Perl exercises –Dynamic programming.

Slides:



Advertisements
Similar presentations
Hidden Markov Model in Biological Sequence Analysis – Part 2
Advertisements

1 Introduction to Sequence Analysis Utah State University – Spring 2012 STAT 5570: Statistical Bioinformatics Notes 6.1.
HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY CS 594: An Introduction to Computational Molecular Biology BY Shalini Venkataraman Vidhya Gunaseelan.
Introduction to Bioinformatics
BNFO 602 Multiple sequence alignment Usman Roshan.
CIS786, Lecture 7 Usman Roshan Some of the slides are based upon material by Dennis Livesay and David.
Lecture 1 BNFO 601 Usman Roshan. Course overview Perl progamming language (and some Unix basics) –Unix basics –Intro Perl exercises –Dynamic programming.
. Class 1: Introduction. The Tree of Life Source: Alberts et al.
Non-coding RNA William Liu CS374: Algorithms in Biology November 23, 2004.
Introduction to Bioinformatics Spring 2008 Yana Kortsarts, Computer Science Department Bob Morris, Biology Department.
Lecture 1 BNFO 135 Usman Roshan. Course overview Perl progamming language (and some Unix basics) –Unix basics –Intro Perl exercises –Programs for comparing.
Expected accuracy sequence alignment
Lecture 1 BNFO 240 Usman Roshan. Course overview Perl progamming language (and some Unix basics) Sequence alignment problem –Algorithm for exact pairwise.
BNFO 602, Lecture 2 Usman Roshan Some of the slides are based upon material by David Wishart of University.
Reconfigurable Computing S. Reda, Brown University Reconfigurable Computing (EN2911X, Fall07) Lecture 18: Application-Driven Hardware Acceleration (4/4)
BNFO 602 Lecture 2 Usman Roshan. Sequence Alignment Widely used in bioinformatics Proteins and genes are of different lengths due to error in sequencing.
BNFO 602 Lecture 1 Usman Roshan.
BNFO 240 Usman Roshan. Last time Traceback for alignment How to select the gap penalties? Benchmark alignments –Structural superimposition –BAliBASE.
BNFO 602, Lecture 3 Usman Roshan Some of the slides are based upon material by David Wishart of University.
Lecture 1 BNFO 601 Usman Roshan. Course overview Perl progamming language (and some Unix basics) –Unix basics –Intro Perl exercises –Sequence alignment.
Sequence similarity.
Lecture 4 BNFO 235 Usman Roshan. IUPAC Nucleic Acid symbols.
Sequence Alignment III CIS 667 February 10, 2004.
BNFO 602 Lecture 2 Usman Roshan. Bioinformatics problems Sequence alignment: oldest and still actively studied Genome-wide association studies: new problem,
BNFO 602 Multiple sequence alignment Usman Roshan.
BNFO 235 Lecture 5 Usman Roshan. What we have done to date Basic Perl –Data types: numbers, strings, arrays, and hashes –Control structures: If-else,
. Computational Genomics Lecture #3a (revised 24/3/09) This class has been edited from Nir Friedman’s lecture which is available at
Computational Genomics Lecture 1, Tuesday April 1, 2003.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
CIS786, Lecture 6 Usman Roshan Some of the slides are based upon material by David Wishart of University.
Alignment III PAM Matrices. 2 PAM250 scoring matrix.
Lecture 1 BNFO 136 Usman Roshan. Course overview Pre-req: BNFO 135 or approval of instructor Python progamming language and Perl for continuing students.
Introduction to Biological Sequences. Background: What is DNA? Deoxyribonucleic acid Blueprint that carries genetic information from one generation to.
Sequencing a genome and Basic Sequence Alignment
Exploration Session Week 8: Computational Biology Melissa Winstanley: (based on slides by Martin Tompa,
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)
An Introduction to Bioinformatics
Genomics and Personalized Care in Health Systems Lecture 9 RNA and Protein Structure Leming Zhou, PhD School of Health and Rehabilitation Sciences Department.
Good solutions are advantageous Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Classifier Evaluation Vasileios Hatzivassiloglou University of Texas at Dallas.
Amino Acid Scoring Matrices Jason Davis. Overview Protein synthesis/evolution Protein synthesis/evolution Computational sequence alignment Computational.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Hugh E. Williams and Justin Zobel IEEE Transactions on knowledge and data engineering Vol. 14, No. 1, January/February 2002 Presented by Jitimon Keinduangjun.
DNA alphabet DNA is the principal constituent of the genome. It may be regarded as a complex set of instructions for creating an organism. Four different.
CSCI 6900/4900 Special Topics in Computer Science Automata and Formal Grammars for Bioinformatics Bioinformatics problems sequence comparison pattern/structure.
Sequencing a genome and Basic Sequence Alignment
Comp. Genomics Recitation 3 The statistics of database searching.
Introduction to Bioinformatics Biostatistics & Medical Informatics 576 Computer Sciences 576 Fall 2008 Colin Dewey Dept. of Biostatistics & Medical Informatics.
Construction of Substitution Matrices
Chapter 3 Computational Molecular Biology Michael Smith
Bioinformatics Computing 1 CMP 807 – Day 1 Kevin Galens.
Overview of Bioinformatics 1 Module Denis Manley..
Introduction to Bioinformatics Algorithms Algorithms for Molecular Biology CSCI Elizabeth White
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Construction of Substitution matrices
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
Expected accuracy sequence alignment Usman Roshan.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Evaluation of protein alignments Usman Roshan BNFO 236.
4.2 - Algorithms Sébastien Lemieux Elitra Canada Ltd.
Bioinformatics Overview
Lecture 1 BNFO 601 Usman Roshan.
Ab initio gene prediction
BNFO 602 Lecture 2 Usman Roshan.
BNFO 602 Lecture 2 Usman Roshan.
Protein Structures.
HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY
Reconfigurable Computing (EN2911X, Fall07)
Presentation transcript:

Lecture 1 BNFO 601 Usman Roshan

Course overview Perl progamming language (and some Unix basics) –Unix basics –Intro Perl exercises –Dynamic programming and Viterbi algorithm in Perl Sequence analysis –Algorithms for exact and heuristic pairwise alignment –Hidden Markov models –BLAST –Short read and genome alignments Genome-wide association studies (if time permits)

Overview (contd) Grade: Two 25% exams, 30% final exam, and three programming assignments (20%) Exams will cover Perl and bioinformatics algorithms Recommended Texts: –Introduction to Bioinformatics Algorithms by Pavel Pevzner and Neil Jones –Biological sequence analysis by Durbin et. al. –Introduction to Bioinformatics by Arthur Lesk –Beginning Perl for Bioinformatics by James Tisdall –Introduction to Machine Learning by Ethem Alpaydin

Nothing in biology makes sense, except in the light of evolution AAGACTT -3 mil yrs -2 mil yrs -1 mil yrs today AAGACTT T_GACTTAAGGCTT _GGGCTTTAGACCTTA_CACTT ACCTT (Cat) ACACTTC (Lion) TAGCCCTTA (Monkey) TAGGCCTT (Human) GGCTT (Mouse) T_GACTTAAGGCTT AAGACTT _GGGCTTTAGACCTTA_CACTT AAGGCTTT_GACTT AAGACTT TAGGCCTT (Human) TAGCCCTTA (Monkey) A_C_CTT (Cat) A_CACTTC (Lion) _G_GCTT (Mouse) _GGGCTTTAGACCTTA_CACTT AAGGCTTT_GACTT AAGACTT

Representing DNA in a format manipulatable by computers DNA is a double-helix molecule made up of four nucleotides: –Adenosine (A) –Cytosine (C) –Thymine (T) –Guanine (G) Since A (adenosine) always pairs with T (thymine) and C (cytosine) always pairs with G (guanine) knowing only one side of the ladder is enough We represent DNA as a sequence of letters where each letter could be A,C,G, or T. For example, for the helix shown here we would represent this as CAGT.

Transcription and translation

Amino acids Proteins are chains of amino acids. There are twenty different amino acids that chain in different ways to form different proteins. For example, FLLVALCCRFGH (this is how we could store it in a file) This sequence of amino acids folds to form a 3-D structure

Protein folding

The protein folding problem is to determine the 3-D protein structure from the sequence. Experimental techniques are very expensive. Computational are cheap but difficult to solve. By comparing sequences we can deduce the evolutionary conserved portions which are also functional (most of the time).

Protein structure Primary structure: sequence of amino acids. Secondary structure: parts of the chain organizes itself into alpha helices, beta sheets, and coils. Helices and sheets are usually evolutionarily conserved and can aid sequence alignment. Tertiary structure: 3-D structure of entire chain Quaternary structure: Complex of several chains

Key points DNA can be represented as strings consisting of four letters: A, C, G, and T. They could be very long, e.g. thousands and even millions of letters Proteins are also represented as strings of 20 letters (each letter is an amino acid). Their 3-D structure determines the function to a large extent.

Comparison of sequences Fundamental in bioinformatics Many applications –Evolutionary conserved regions –Functional sites in proteins –Phylogeny reconstruction –Protein structural contact prediction –Gene splicing –Genome assembly –Database search

Pairwise sequence alignment Comparison of two sequences We want to determine the substitutions, insertions, and deletions that occurred between the two sequences

Pairwise alignment How to align two sequences? We use dynamic programming Treat DNA sequences as strings over the alphabet {A, C, G, T}

Pairwise alignment

Dynamic programming Define V(i,j) to be the optimal pairwise alignment score between S 1..i and T 1..j (|S|=m, |T|=n)

Dynamic programming Time and space complexity is O(mn) Define V(i,j) to be the optimal pairwise alignment score between S 1..i and T 1..j (|S|=m, |T|=n)

How do we understand this dynamic programming algorithm? Let’s first look at some example alignments Let’s look at gaps. How do we know where to insert gaps Let’s look at the structure of an optimal alignment of two sequences x and y and how it relates optimal alignments of subsequences of x and y

How do we pick gap parameters?

Structural alignments Recall that proteins have 3-D structure.

Structural alignment - example 1 Alignment of thioredoxins from human and fly taken from the Wikipedia website. This protein is found in nearly all organisms and is essential for mammals. PDB ids are 3TRX and 1XWC.

Structural alignment - example 2 Computer generated aligned proteins Unaligned proteins. 2bbm and 1top are proteins from fly and chicken respectively. Taken from

Structural alignments We can produce high quality manual alignments by hand if the structure is available. These alignments can then serve as a benchmark to train gap parameters so that the alignment program produces correct alignments.

Benchmark alignments Protein alignment benchmarks –BAliBASE, SABMARK, PREFAB, HOMSTRAD are frequently used in studies for protein alignment. –Proteins benchmarks are generally large and have been in the research community for sometime now. –BAliBASE 3.0BAliBASE 3.0

Comparison of two alignments How can we quantify the “correctness” of a test alignment with respect to a reference? We measure the number of pairs in the test that are also aligned in the reference alignment. This is also called the sum-of-pairs accuracy.

Biologically realistic scoring matrices PAM and BLOSUM are most popular PAM was developed by Margaret Dayhoff and co-workers in 1978 by examining 1572 mutations between 71 families of closely related proteins BLOSUM is more recent and computed from blocks of sequences with sufficient similarity

Computing scoring matrices Start with a set of reference alignments Suppose we want to compute the score of A aligning to C Count the number of times A aligns to C Count the number of A’s and C’s Compute p AC the probability of A aligning to C and p A and p C the background probabilities of A and C Compute the log likelihood ratio

Next week Basics of Unix Perl programming –Basics –Exercises –Dynamic programming solution to sequence alignment in Perl