Introduction to bioinformatics Lecture 3 High-throughput Biological Data -data deluge, bioinformatics algorithms- and evolution

Presentation transcript:

Introduction to bioinformatics Lecture 3 High-throughput Biological Data -data deluge, bioinformatics algorithms- and evolution Centre for Integrative Bioinformatics VU

Last lecture: many different genomics datasets:
– Genome sequencing: more than 300 species completely sequenced and data in public domain (i.e. information is freely available); a virus genome can be sequenced in a day
– Gene expression (microarray) data: many microarrays measured per day
– Proteomics: Protein Data Bank (PDB) - as of Tuesday February 07, 2006 there are … structures
– Protein-protein interaction data: many databases worldwide
– Metabolic pathway, regulation and signalling data: many databases worldwide

Growth in number of protein tertiary structures

The data deluge
Although a lot of tertiary structural data is being produced (preceding slide), there is a SEQUENCE-STRUCTURE-FUNCTION GAP. The gap between sequence data on the one hand, and structure or function data on the other, is widening rapidly, because sequence data grows much faster.

High-throughput Biological Data: the data deluge
Hidden in all these data classes is information that reflects
– existence, organization, activity, functionality … of biological machineries at different levels in living organisms
Using and analysing this information computationally as effectively as possible is essential for bioinformatics.

Data issues: from data to distributed knowledge
Data collection: getting the data
Data representation: data standards, data normalisation …
Data organisation and storage: database issues …
Data analysis and data mining: discovering “knowledge” and patterns/signals from data, establishing associations among data patterns
Data utilisation and application: from data patterns/signals to models for bio-machineries
Data visualization: viewing complex data …
Data transmission: data collection, retrieval, …

Bio-Data Analysis and Data Mining
Analysis and mining tools exist and are being developed for:
– DNA sequence assembly
– Genetic map construction
– Sequence comparison and database searching
– Gene finding
– Gene expression data analysis
– Phylogenetic tree analysis, e.g. to infer horizontally-transferred genes
– Mass spectrometry data analysis for protein complex characterization
– …

Bio-Data Analysis and Data Mining
As the amount and types of data and their cross-connections increase rapidly, the number of analysis tools needed will go up “exponentially” if we do not reuse techniques:
– blast, blastp, blastx, blastn, … from the BLAST family of tools (we will cover BLAST later)
– gene finding tools for human, mouse, fly, rice, cyanobacteria, …
– tools for finding various signals in genomic sequences: protein-binding sites, splice junction sites, translation start sites, …

Bio-Data Analysis and Data Mining
Many of these data analysis problems are fundamentally the same problem(s) and can be solved using the same set of tools, e.g. clustering or optimal segmentation by Dynamic Programming. We will cover both of these techniques in later lectures.

Bio-data Analysis, Data Mining and Integrative Bioinformatics
To have analysis capabilities covering a wide range of problems, we need to discover the common fundamental structures of these problems; HOWEVER, in biology one size does NOT fit all… An important goal of bioinformatics is the development of a data analysis infrastructure in support of genomics and beyond.

Protein structure hierarchical levels
PRIMARY STRUCTURE (amino acid sequence): VHLTPEEKSAVTALWGKVNVDE VGGEALGRLLVVYPWTQRFFE SFGDLSTPDAVMGNPKVKAHG KKVLGAFSDGLAHLDNLKGTFA TLSELHCDKLHVDPENFRLLGN VLVCVLAHHFGKEFTPPVQAAY QKVVAGVANALAHKYH
SECONDARY STRUCTURE (helices, strands)
TERTIARY STRUCTURE (fold)
QUATERNARY STRUCTURE (oligomers)

Protein complexes for photosynthesis in plants

Protein folding problem
PRIMARY STRUCTURE (amino acid sequence): VHLTPEEKSAVTALWGKVNVDE VGGEALGRLLVVYPWTQRFFE SFGDLSTPDAVMGNPKVKAHG KKVLGAFSDGLAHLDNLKGTFA TLSELHCDKLHVDPENFRLLGN VLVCVLAHHFGKEFTPPVQAAY QKVVAGVANALAHKYH
SECONDARY STRUCTURE (helices, strands)
TERTIARY STRUCTURE (fold)
Each protein sequence “knows” how to fold into its tertiary structure. We still do not understand exactly how and why.
– 1-step process: based on a hydrophobic collapse
– 2-step process: more common in forming larger proteins; called the framework model of folding

Protein folding: a step on the way is secondary structure prediction
Long history: the first widely used algorithm was by Chou and Fasman (1974). Different algorithms have been developed over the years to crack the problem:
– Statistical approaches
– Neural networks (first borrowed from speech recognition)
– K-nearest neighbour algorithms
– Support vector machines

Algorithms in bioinformatics (recap)
Sometimes the same basic algorithm can be re-used for different problems (1-method-multiple-problem)
Normally, biological problems are approached by different researchers using a variety of methods (1-problem-multiple-method)

Algorithms in bioinformatics
string algorithms
dynamic programming
machine learning (Neural Networks, k-Nearest Neighbour, Support Vector Machines, Genetic Algorithm, ..)
Markov chain models, hidden Markov models, Markov Chain Monte Carlo (MCMC) algorithms
molecular mechanics, e.g. molecular dynamics, Monte Carlo, simplified force fields
stochastic context free grammars
EM algorithms
Gibbs sampling
clustering
tree algorithms
text analysis
hybrid/combinatorial techniques
and more…

Sequence analysis and homology searching

Finding genes and regulatory elements
There are many different regulation signals, such as start, stop and skip messages, hidden in the genome for each gene, but what and where are they?

Expression data

Functional genomics Monte Carlo

Protein translation

What is life?
NASA astrobiology program: “Life is a self-sustained chemical system capable of undergoing Darwinian evolution”

Evolution
Four requirements:
Template structure providing stability (DNA)
Copying mechanism (DNA replication, passed on through meiosis)
Mechanism providing variation (mutations; insertions and deletions; crossing-over; etc.)
Selection: some traits lead to greater fitness of one individual relative to another. Darwin adopted the phrase “survival of the fittest” (coined by Herbert Spencer)
Evolution is a conservative process: the vast majority of mutations will not be selected (i.e. they will not persist, because they lead to worse performance or are even lethal). This is called negative (or purifying) selection.

Orthology/paralogy
Orthologous genes are homologous (corresponding) genes in different species, separated by a speciation event
Paralogous genes are homologous genes within the same species (genome), separated by a gene duplication event

Changing molecular sequences
Mutations: changing nucleotides (‘letters’) within DNA, also called ‘point mutations’
A & G: purines; C & T/U: pyrimidines
– Transition: purine -> purine or pyrimidine -> pyrimidine
– Transversion: purine -> pyrimidine or pyrimidine -> purine
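The transition/transversion rule above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_substitution` is hypothetical, and treating U as equivalent to T is an assumption made here so that DNA and RNA letters can be mixed.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}  # U is mapped to T below (assumption of this sketch)


def classify_substitution(ref: str, alt: str) -> str:
    """Classify a single-nucleotide change as a transition or a transversion."""
    ref = ref.upper().replace("U", "T")
    alt = alt.upper().replace("U", "T")
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("expected one of A, C, G, T/U")
    if ref == alt:
        raise ValueError("no substitution: both bases are the same")
    # Transition: both bases belong to the same chemical class.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"
```

For example, `classify_substitution("A", "G")` stays within the purines, while `classify_substitution("A", "T")` crosses from purine to pyrimidine.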

Types of point mutation
Synonymous mutation: mutation that does not lead to an amino acid change (where in the codon are these expected?)
Non-synonymous mutation: does lead to an amino acid change
– Missense mutation: one a.a. replaced by another a.a.
– Nonsense mutation: a.a. replaced by stop codon (what happens with the protein?)
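A minimal sketch of how these categories could be assigned in code, assuming the standard genetic code (NCBI translation table 1) and DNA codons written with T. The function names are hypothetical, and codons that are already stop codons are deliberately rejected because the slide does not cover that case.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_point_mutation(codon: str, position: int, new_base: str) -> str:
    """Classify a single-base change at `position` (0, 1 or 2) of a sense codon."""
    codon, new_base = codon.upper(), new_base.upper()
    before = CODON_TABLE[codon]
    if before == "*":
        raise ValueError("original codon is a stop codon (not handled in this sketch)")
    mutated = codon[:position] + new_base + codon[position + 1:]
    after = CODON_TABLE[mutated]
    if after == before:
        return "synonymous"
    return "nonsense" if after == "*" else "missense"


def synonymous_fraction_by_position() -> list:
    """Fraction of single-base changes that are synonymous, per codon position."""
    fractions = []
    for position in range(3):
        total = synonymous = 0
        for codon, aa in CODON_TABLE.items():
            if aa == "*":
                continue
            for base in BASES:
                if base == codon[position]:
                    continue
                total += 1
                if classify_point_mutation(codon, position, base) == "synonymous":
                    synonymous += 1
        fractions.append(synonymous / total)
    return fractions
```

Calling `synonymous_fraction_by_position()` gives one way to explore the question on the slide: under the standard code, synonymous changes are concentrated at the third codon position.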

Ka/Ks Ratios
Ks is defined as the number of synonymous nucleotide substitutions per synonymous site
Ka is defined as the number of nonsynonymous nucleotide substitutions per nonsynonymous site
The Ka/Ks ratio is used to estimate the type of selection exerted on a given gene or DNA fragment
Aligned orthologous sequences are needed to calculate Ka/Ks ratios (we will talk about alignment later).

Ka/Ks ratios
The frequency of different values of Ka/Ks for 835 mouse–rat orthologous genes. Figures on the x axis represent the middle figure of each bin; that is, the 0.05 bin collects data from 0 to 0.1

Ka/Ks ratios
Three types of selection:
1. Negative (purifying) selection -> Ka/Ks < 1
2. Neutral evolution (Kimura) -> Ka/Ks ~= 1
3. Positive selection -> Ka/Ks > 1
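As a rough sketch, the definitions of Ka and Ks and the three cases above translate directly into code. The counts of substitutions and sites are taken as given inputs here (estimating them from an alignment, and correcting for multiple substitutions at one site, is not shown). The function names and the `tolerance` used to decide what counts as "approximately 1" are assumptions of this example.

```python
def ka_ks_ratio(nonsyn_subs: float, nonsyn_sites: float,
                syn_subs: float, syn_sites: float) -> float:
    """Ka/Ks from raw counts: substitutions per site, nonsynonymous over synonymous."""
    ka = nonsyn_subs / nonsyn_sites  # nonsynonymous substitutions per nonsynonymous site
    ks = syn_subs / syn_sites        # synonymous substitutions per synonymous site
    if ks == 0:
        raise ValueError("Ks is zero; the ratio is undefined")
    return ka / ks


def selection_type(ratio: float, tolerance: float = 0.1) -> str:
    """Map a Ka/Ks ratio onto the three cases; `tolerance` is an arbitrary cut-off."""
    if abs(ratio - 1.0) <= tolerance:
        return "neutral"
    return "negative (purifying)" if ratio < 1.0 else "positive"
```

With hypothetical counts of 5 nonsynonymous changes over 500 nonsynonymous sites and 20 synonymous changes over 200 synonymous sites, Ka = 0.01 and Ks = 0.1, which falls in the purifying-selection case.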

Human Evolution

Divergent Evolution
Ancestral sequence: ABCD
Descendant 1: ACCD (B -> C), mutation
Descendant 2: ABD (C -> ø), deletion
Pairwise alignment:
ACCD      ACCD
AB─D  or  A─BD

Evolution
Ancestral sequence: ABCD
Descendant 1: ACCD (B -> C), mutation
Descendant 2: ABD (C -> ø), deletion
Pairwise alignment:
ACCD      ACCD
AB─D  or  A─BD
The first of these (ACCD over AB─D) is the true alignment.
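To see why the two candidate alignments are hard to tell apart, one can score both with a simple column-by-column scheme. This is a sketch under stated assumptions: the scoring values (match +1, mismatch -1, gap -2) and the function name `score_alignment` are illustrative choices, and the ASCII "-" stands in for the gap symbol used on the slide.

```python
def score_alignment(row_a: str, row_b: str,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Sum column scores for two equal-length alignment rows ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


true_alignment = score_alignment("ACCD", "AB-D")
alternative = score_alignment("ACCD", "A-BD")
```

Both alignments contain two matches, one mismatch and one gap, so they receive the same score under this scheme: the score alone cannot identify which one reflects the actual evolutionary history.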

Consequence of evolution
Notion of comparative analysis (Darwin): what you know about one species might be transferable to another, for example from mouse to human
This provides a framework for multi-level, large-scale analysis of the plethora of genomics data

Flavodoxin-cheY Multiple Sequence Alignment

This pathway diagram shows a comparison of pathways in (left) Homo sapiens (human) and (right) Saccharomyces cerevisiae (baker’s yeast). Changes have occurred in the controlling enzymes (square boxes in red) and in the pathway itself (yeast has one altered (‘overtaking’) path in the graph). We need to be able to do automatic pathway comparison (pathway alignment).

The citric-acid cycle

The citric-acid cycle
Fig. 1. (a) A graphical representation of the reactions of the citric-acid cycle (CAC), including the connections with pyruvate and phosphoenolpyruvate, and the glyoxylate shunt. When there are two enzymes that are not homologous to each other but that catalyse the same reaction (non-homologous gene displacement), one is marked with a solid line and the other with a dashed line. The oxidative direction is clockwise. The enzymes with their EC numbers are as follows: 1, citrate synthase ( ); 2, aconitase ( ); 3, isocitrate dehydrogenase ( ); 4, 2-ketoglutarate dehydrogenase (solid line; and ) and 2-ketoglutarate ferredoxin oxidoreductase (dashed line; ); 5, succinyl-CoA synthetase (solid line; ) or succinyl-CoA–acetoacetate-CoA transferase (dashed line; ); 6, succinate dehydrogenase or fumarate reductase ( ); 7, fumarase ( ) class I (dashed line) and class II (solid line); 8, bacterial-type malate dehydrogenase (solid line) or archaeal-type malate dehydrogenase (dashed line) ( ); 9, isocitrate lyase ( ); 10, malate synthase ( ); 11, phosphoenolpyruvate carboxykinase ( ) or phosphoenolpyruvate carboxylase ( ); 12, malic enzyme ( or ); 13, pyruvate carboxylase or oxaloacetate decarboxylase ( ); 14, pyruvate dehydrogenase (solid line; and ) and pyruvate ferredoxin oxidoreductase (dashed line; ).
M. A. Huynen, T. Dandekar and P. Bork, “Variation and evolution of the citric acid cycle: a genomic approach”, Trends Microbiol, 7, (1999)

The citric-acid cycle
(b) Individual species might not have a complete CAC. This diagram shows the genes for the CAC for each unicellular species for which a genome sequence has been published, together with the phylogeny of the species. The distance-based phylogeny was constructed using the fraction of genes shared between genomes as a similarity criterion [29]. The major kingdoms of life are indicated in red (Archaea), blue (Bacteria) and yellow (Eukarya). Question marks represent reactions for which there is biochemical evidence in the species itself or in a related species but for which no genes could be found. Genes that lie in a single operon are shown in the same color. Genes were assumed to be located in a single operon when they were transcribed in the same direction and the stretches of non-coding DNA separating them were less than 50 nucleotides in length.
M. A. Huynen, T. Dandekar and P. Bork, “Variation and evolution of the citric acid cycle: a genomic approach”, Trends Microbiol, 7, (1999)
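The operon rule stated in the caption (same transcription direction, less than 50 nucleotides of non-coding DNA between neighbours) can be sketched as a simple grouping pass. This is not the authors' code: the `Gene` record, the function name, the 1-based inclusive coordinates and the example gene names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gene:
    name: str
    start: int   # 1-based, inclusive (assumed convention)
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def group_into_operons(genes: List[Gene], max_gap: int = 50) -> List[List[Gene]]:
    """Group neighbouring genes on the same strand separated by < max_gap nucleotides."""
    operons: List[List[Gene]] = []
    for gene in sorted(genes, key=lambda g: g.start):
        if operons:
            previous = operons[-1][-1]
            gap = gene.start - previous.end - 1  # non-coding nucleotides in between
            if gene.strand == previous.strand and gap < max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return operons


# Hypothetical genes: geneA and geneB are close and co-directional; geneC is too
# far away; geneD is close to geneC but transcribed in the opposite direction.
example = [
    Gene("geneA", 1, 300, "+"),
    Gene("geneB", 320, 600, "+"),
    Gene("geneC", 700, 900, "+"),
    Gene("geneD", 920, 1100, "-"),
]
predicted = [[g.name for g in operon] for operon in group_into_operons(example)]
```

Only neighbouring genes are compared, so a long operon is built up one adjacent pair at a time.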

Thinking about evolution
Is the evolutionary model applicable to other systems?
– Story telling in old cultures
– Richard Dawkins’ book The Selfish Gene talks about memes
The Genetic Algorithm (GA) is arguably one of the best general-purpose computational optimisation strategies around, and is modelled directly on Darwinian evolution
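A toy Genetic Algorithm makes the link to the four requirements of evolution concrete: a template (a list of bits), copying (parents are copied into children), variation (crossover and mutation) and selection (only the fitter half reproduces). The sketch below maximises the number of 1s in a bit string; the problem, the parameter values and the function names are assumptions chosen for illustration, not a method taken from the lecture.

```python
import random


def fitness(bits):
    """Toy fitness: the number of 1s in the bit string."""
    return sum(bits)


def evolve(n_bits=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    """Run a small GA and return the best individual plus the best fitness per step."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best_history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        best_history.append(fitness(population[0]))
        parents = population[: pop_size // 2]   # selection: fitter half reproduces
        children = [population[0][:]]           # elitism: keep the current best unchanged
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)      # variation: one-point crossover
            child = mother[:cut] + father[cut:]
            # variation: flip each bit with a small probability (point mutation)
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        population = children
    best = max(population, key=fitness)
    best_history.append(fitness(best))
    return best, best_history


best, history = evolve()
```

Because the best individual is always carried over unchanged, the best fitness recorded in `history` never decreases from one generation to the next.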