High-throughput Biological Data The data deluge and bioinformatics algorithms Introduction to bioinformatics 2005 Lecture 3.

Presentation transcript:

High-throughput Biological Data The data deluge and bioinformatics algorithms Introduction to bioinformatics 2005 Lecture 3

Organisational:
Change to larger lecture rooms:
–week 8-11 Mon S201
–week 8-11 Wed S209
–week Wed S211
–week 18 Wed S209
–week Fri S209
Change of language: Dutch => English

Last lecture: Many different genomics datasets:
–Genome sequencing: more than 300 species completely sequenced and data in the public domain (i.e. the information is freely available); a virus genome can be sequenced in a day
–Gene expression (microarray) data: many microarrays measured per day
–Proteomics: Protein Data Bank (PDB) contains structures (on 2 Feb 2005)
–Protein-protein interaction data: many databases worldwide
–Metabolic pathway, regulation and signalling data: many databases worldwide

Growth in number of protein tertiary structures

The data deluge
Although a lot of tertiary structural data is being produced (preceding slide), there is a SEQUENCE-STRUCTURE-FUNCTION GAP: the gap between sequence data on the one hand, and structure or function data on the other, is widening rapidly, because sequence data grows much faster.

High-throughput Biological Data: The data deluge
Hidden in all these data classes is information that reflects the existence, organisation, activity, functionality, ... of biological machineries at different levels in living organisms.
Utilising and analysing this information computationally as effectively as possible is essential for bioinformatics.

Data issues: from data to distributed knowledge
Data collection: getting the data
Data representation: data standards, data normalisation, ...
Data organisation and storage: database issues, ...
Data analysis and data mining: discovering “knowledge” and patterns/signals from data, establishing associations among data patterns
Data utilisation and application: from data patterns/signals to models for bio-machineries
Data visualisation: viewing complex data, ...
Data transmission: data collection, retrieval, ...
...

Bio-Data Analysis and Data Mining
Existing/emerging bio-data analysis and mining tools for
–DNA sequence assembly
–Genetic map construction
–Sequence comparison and database searching
–Gene finding
–...
–Gene expression data analysis
–Phylogenetic tree analysis, e.g. to infer horizontally-transferred genes
–Mass spec. data analysis for protein complex characterisation
–...
Current mode of work: often enough, developing ad hoc tools for each individual application

Bio-Data Analysis and Data Mining
As the amount and types of data and their cross-connections increase rapidly, the number of analysis tools needed will go up “exponentially”:
–blast, blastp, blastx, blastn, ... from the BLAST family of tools
–gene finding tools for human, mouse, fly, rice, cyanobacteria, ...
–tools for finding various signals in genomic sequences: protein-binding sites, splice junction sites, translation start sites, ...

Bio-Data Analysis and Data Mining
Many of these data analysis problems are fundamentally the same problem(s) and can be solved using the same set of tools, e.g. clustering or optimal segmentation by Dynamic Programming.
Developing ad hoc tools for each application (by each group of individual researchers) may soon become inadequate as bio-data production capabilities ramp up further.
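As an illustration of "optimal segmentation by Dynamic Programming", here is a minimal sketch, not taken from the lecture. It assumes the data is a list of numbers, that the number of segments k is given, and that a segment's cost is its squared deviation from the segment mean. The function name `segment` and this cost function are assumptions chosen for the example.

```python
def segment(values, k):
    """Split values into k contiguous segments with minimal total squared error.

    Returns (total_cost, [(start, end), ...]) with half-open index ranges.
    """
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(values)")

    # Prefix sums of x and x^2 give the cost of any segment in O(1).
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, v in enumerate(values):
        s1[i + 1] = s1[i] + v
        s2[i + 1] = s2[i] + v * v

    def cost(i, j):
        # Squared error of values[i:j] around its own mean.
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[m][j]: minimal cost of splitting values[:j] into m segments.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = best[m - 1][i] + cost(i, j)
                if c < best[m][j]:
                    best[m][j] = c
                    back[m][j] = i

    # Follow the stored back-pointers to recover the segment boundaries.
    bounds = []
    j = n
    for m in range(k, 0, -1):
        i = back[m][j]
        bounds.append((i, j))
        j = i
    bounds.reverse()
    return best[k][n], bounds
```

The same recurrence works for any additive segment cost, which is the sense in which one tool can serve several problems: only `cost` changes between, say, expression profiles and sequence composition.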

Bio-data Analysis, Data Mining and Integrative Bioinformatics
To have analysis capabilities covering a wide range of problems, we need to discover the common fundamental structures of these problems; HOWEVER, in biology one size does NOT fit all...
The goal is the development of a data analysis infrastructure in support of Genomics and beyond.

Protein structure hierarchical levels
PRIMARY STRUCTURE (amino acid sequence): VHLTPEEKSAVTALWGKVNVDE VGGEALGRLLVVYPWTQRFFE SFGDLSTPDAVMGNPKVKAHG KKVLGAFSDGLAHLDNLKGTFA TLSELHCDKLHVDPENFRLLGN VLVCVLAHHFGKEFTPPVQAAY QKVVAGVANALAHKYH
SECONDARY STRUCTURE (helices, strands)
TERTIARY STRUCTURE (fold)
QUATERNARY STRUCTURE (oligomers)

Protein complexes for photosynthesis in plants

Protein folding problem
PRIMARY STRUCTURE (amino acid sequence): VHLTPEEKSAVTALWGKVNVDE VGGEALGRLLVVYPWTQRFFE SFGDLSTPDAVMGNPKVKAHG KKVLGAFSDGLAHLDNLKGTFA TLSELHCDKLHVDPENFRLLGN VLVCVLAHHFGKEFTPPVQAAY QKVVAGVANALAHKYH
SECONDARY STRUCTURE (helices, strands)
TERTIARY STRUCTURE (fold)
Each protein sequence “knows” how to fold into its tertiary structure. We still do not understand how and why.
Two folding routes are distinguished: a 1-step process and a 2-step process. The 1-step process is based on a hydrophobic collapse; the 2-step process, more common in forming larger proteins, is called the framework model of folding.

Protein folding: a step on the way is secondary structure prediction
Long history: the first widely used algorithm was by Chou and Fasman (1974)
Different algorithms have been developed over the years to crack the problem:
–Statistical approaches
–Neural networks (first from speech recognition)
–K-nearest neighbour algorithms
–Support Vector Machines
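To make the "statistical approaches" bullet concrete, the sketch below estimates per-residue propensities for one secondary-structure state from labelled sequences. This is the kind of statistic that propensity-based methods in the Chou and Fasman tradition start from. It is only the counting step, not a full predictor; the function name, the label alphabet ("H" for helix, "C" for coil) and the input data are assumptions made for illustration.

```python
from collections import Counter


def state_propensities(sequences, labels, state="H"):
    """Return {residue: propensity} for one secondary-structure state.

    propensity = (fraction of this residue observed in `state`)
                 / (fraction of all residues observed in `state`)
    Values above 1 mean the residue is over-represented in that state.
    """
    total = Counter()
    in_state = Counter()
    for seq, lab in zip(sequences, labels):
        if len(seq) != len(lab):
            raise ValueError("each sequence needs one label per residue")
        for aa, ss in zip(seq, lab):
            total[aa] += 1
            if ss == state:
                in_state[aa] += 1

    n_all = sum(total.values())
    n_state = sum(in_state.values())
    if n_state == 0:
        raise ValueError("state never occurs in the labels")

    background = n_state / n_all
    return {aa: (in_state[aa] / total[aa]) / background for aa in total}
```

With a real training set the labels would come from known structures, for example PDB entries with assigned secondary structure. A predictor would then combine these propensities over a sliding window, which is not shown here.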

Algorithms in bioinformatics (recap)
Sometimes the same basic algorithm can be re-used for different problems (1-method-multiple-problem).
Normally, biological problems are approached by different researchers using a variety of methods (1-problem-multiple-method).

Algorithms in bioinformatics
string algorithms
dynamic programming
machine learning (Neural Networks, k-Nearest Neighbour, Support Vector Machines, Genetic Algorithm, ...)
Markov chain models, hidden Markov models, Markov Chain Monte Carlo (MCMC) algorithms
molecular mechanics, e.g. molecular dynamics, Monte Carlo, simplified force fields
stochastic context-free grammars
EM algorithms
Gibbs sampling
clustering
tree algorithms
text analysis
hybrid/combinatorial techniques
and more...

Sequence analysis and homology searching

Finding genes and regulatory elements
There are many different regulation signals, such as start, stop and skip messages, hidden in the genome for each gene, but what and where are they?
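The simplest version of looking for "start" and "stop" messages is an open reading frame (ORF) scan. The sketch below is an illustrative assumption, not the lecture's method: it searches only the forward strand, treats ATG as the only start codon, uses the three standard stop codons, and ignores splicing ("skip" signals) entirely. Real gene finders model far more than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Find forward-strand ORFs in all three reading frames.

    Returns a sorted list of (start, end, sequence) with half-open
    coordinates; `end` includes the stop codon. `min_codons` counts
    the codons before the stop, including the ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3, dna[start:pos + 3]))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


if __name__ == "__main__":
    for start, end, seq in find_orfs("CCATGAAATAGGG"):
        print(start, end, seq)
```

The TGA at position 3 of the example sequence is ignored because no start codon is open in that frame, which shows why the reading frame matters when interpreting a signal.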

Expression data

Functional genomics Monte Carlo

Protein translation

Evolution
Four requirements:
–Template structure providing stability (DNA)
–Copying mechanism (meiosis)
–Mechanism providing variation (mutations; insertions and deletions; crossing-over; etc.)
–Selection: some traits lead to greater fitness of one individual relative to another. Darwin wrote “survival of the fittest”
Evolution is a conservative process: the vast majority of mutations will not be selected (i.e. will not make it, as they lead to worse performance or are even lethal).

Human Evolution

Evolution
Ancestral sequence: ABCD
Descendant sequences: ACCD (B → C, mutation) and ABD (C → ø, deletion)
Pairwise alignment:
ACCD      ACCD
AB─D  or  A─BD

Evolution
Ancestral sequence: ABCD
Descendant sequences: ACCD (B → C, mutation) and ABD (C → ø, deletion)
Pairwise alignment:
ACCD      ACCD
AB─D  or  A─BD
True alignment: ACCD over AB─D (the gap sits where the ancestral C was deleted)
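The ambiguity on this slide can be reproduced with a small global alignment by dynamic programming in the Needleman-Wunsch style. The sketch below is an illustration under assumed scores (match +1, mismatch -1, gap -1, no separate gap-opening penalty); the lecture does not specify a scoring scheme. It returns every alignment that reaches the optimal score, using "-" for the gap character.

```python
def global_alignments(a, b, match=1, mismatch=-1, gap=-1):
    """Return (best_score, sorted list of all optimal global alignments)."""
    n, m = len(a), len(b)

    # score[i][j]: best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a

    def trace(i, j):
        # Enumerate every path through the matrix that achieves score[i][j].
        if i == 0 and j == 0:
            return [("", "")]
        found = []
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                found += [(x + a[i - 1], y + b[j - 1]) for x, y in trace(i - 1, j - 1)]
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            found += [(x + a[i - 1], y + "-") for x, y in trace(i - 1, j)]
        if j > 0 and score[i][j] == score[i][j - 1] + gap:
            found += [(x + "-", y + b[j - 1]) for x, y in trace(i, j - 1)]
        return found

    return score[n][m], sorted(trace(n, m))


if __name__ == "__main__":
    best, alignments = global_alignments("ACCD", "ABD")
    print(best)
    for top, bottom in alignments:
        print(top)
        print(bottom)
        print()
```

Under these scores both alignments from the slide tie, so the scoring scheme alone cannot say which one reflects the true evolutionary history. Enumerating all optimal paths is exponential in the worst case and is only sensible for toy inputs like this one.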

Consequence of evolution
Notion of comparative analysis (Darwin): what you know about one species might be transferable to another, for example from mouse to human.
Provides a framework for multi-level, large-scale analysis of the plethora of genomics data.

Flavodoxin-cheY Multiple Sequence Alignment

This pathway diagram shows a comparison of pathways in (left) Homo sapiens (human) and (right) Saccharomyces cerevisiae (baker’s yeast). Changes in controlling enzymes (red) and in the pathway itself have occurred (yeast has one extra path in the graph).
We need to be able to do automatic pathway comparison (pathway alignment).

Thinking about evolution
Is the evolutionary model applicable to other systems?
–Story telling in old cultures
–Richard Dawkins’ book The Selfish Gene talks about memes
The Genetic Algorithm (GA) is arguably the best computational optimisation strategy around, and is based entirely on Darwinian evolution.
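The four requirements for evolution listed earlier map directly onto a genetic algorithm: a bit string as the template, copying into the next generation, variation by crossover and mutation, and selection by fitness. The sketch below is a minimal illustration under assumed settings. The population size, mutation rate, truncation selection, elitism and the toy "count the ones" fitness are choices made for this example, not taken from the lecture.

```python
import random


def genetic_algorithm(fitness, length, pop_size=30, generations=40,
                      mutation_rate=0.02, seed=0):
    """Evolve bit strings of the given length towards higher fitness.

    Returns (best_individual, history), where history holds the best
    fitness of each generation followed by the final best fitness.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))

        # Selection: only the fitter half may reproduce.
        parents = pop[: pop_size // 2]
        # Elitism: the current best is copied unchanged.
        children = [pop[0][:]]
        while len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, length)  # one-point crossover
            child = mum[:cut] + dad[cut:]
            # Variation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = children

    best = max(pop, key=fitness)
    return best, history + [fitness(best)]


if __name__ == "__main__":
    best, history = genetic_algorithm(sum, length=20)
    print(best, history[0], history[-1])
```

Because the best individual is always carried over, the best fitness can never decrease from one generation to the next. Whether the optimum is actually reached depends on the settings and the random seed.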