Current Challenges in Bioinformatics
SPIRE 2003, Manaus, Brazil
João Meidanis



Summary
Introduction
Current Challenges in Bioinformatics
– As seen by Bioscientists
– As seen by Computer Scientists
Broad Challenges
Specialized Challenges
My personal challenges
– In the Academic Sector
– In the Business Sector
Perspectives

Dependency among knowledge areas
Biology, Chemistry, Physics, Mathematics, Statistics, Computer Science, Bioinformatics

Applications of Comp Sci to Biology
Traditionally, number crunching applications (models for biological systems)
More recently, combinatorial applications, related to DNA and protein sequences, maps, genomes, etc.
Both Computer Science and Biology deal with very complex systems, e.g., software, cells

How to study complex systems
Study a complex system by taking “projections” or “slices” to focus on one aspect at a time
Example from CS: software: logical view, physical view, development view, etc.
Example from Biology: protein: cellular compartment, biological process, molecular function (as in the Gene Ontology initiative)

Bioinformatics Challenges as seen by Biologists
The view of Collins et al. on the future of genomics after the Human Genome Project: Computational Biology plays a role
Top Ten Challenges by Birney, Burge, and Fickett: as we will see, very biologically oriented

The future of Genomics Research
Collins et al, Nature 422, 835–847 (2003)
Resources, Technology development, Training, ELSI, Education

Top Ten Challenges
Birney (EBI), Burge (MIT), Fickett (GSK), Genome Technology 17, Jan 2002
Predict transcription
Predict splicing
Predict signal transduction
Predict DNA:protein and protein:protein recognition codes
Predict protein structure

Top Ten Challenges (cont.)
Birney (EBI), Burge (MIT), Fickett (GSK), Genome Technology 17, Jan 2002
Design small molecule inhibitors of proteins
Understand protein evolution
Understand speciation
Develop effective gene ontologies
Develop appropriate curricula for bioinformatics education

Top Ten, Global View – Part 1

Top Ten, Global View – Part 2

Top Ten, Global View – Part 3

Bioinformatics Challenges as seen by Computer Scientists
Broad challenges
– Information management, parallelism, programmability
Specialized challenges
– Related to several problems: sequence comparison, fragment assembly and clustering, phylogenetic trees, genome rearrangements and genome comparison, micro-array technology, protein classification

Broad Challenges
Information management challenge
– Large sets
– Semi-structured data
– Experimental errors
– Integration of loosely coupled data
Parallelism challenge
– Development of expressive control systems for heterogeneous, distributed computing
Programmability
– Development of higher level languages
– Programming is still hard and error prone

Limitations of relational databases
Lack of support for hierarchies
Changing the schema: all hell breaks loose
Query language (SQL): it can be challenging and nonintuitive to write an efficient query
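To make the hierarchy point concrete, here is a minimal sketch of how a small term hierarchy might be stored and queried in a relational database. The table name `term`, its columns, and the term names are hypothetical, chosen only for illustration. The hierarchy has to be encoded as parent pointers, and collecting all descendants needs a recursive query, which is arguably less intuitive than walking a tree directly.

```python
import sqlite3

def build_demo():
    """Create an in-memory database with a tiny, made-up term hierarchy."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: each term stores only a pointer to its parent.
    conn.execute("CREATE TABLE term (id TEXT PRIMARY KEY, parent TEXT)")
    conn.executemany(
        "INSERT INTO term VALUES (?, ?)",
        [
            ("molecular_function", None),
            ("binding", "molecular_function"),
            ("dna_binding", "binding"),
            ("catalytic_activity", "molecular_function"),
        ],
    )
    return conn

def descendants(conn, root):
    """Return all terms below `root`, at any depth, sorted by name."""
    rows = conn.execute(
        """
        WITH RECURSIVE sub(id) AS (
            SELECT id FROM term WHERE parent = ?
            UNION
            SELECT term.id FROM term JOIN sub ON term.parent = sub.id
        )
        SELECT id FROM sub ORDER BY id
        """,
        (root,),
    ).fetchall()
    return [r[0] for r in rows]
```

For example, `descendants(build_demo(), "molecular_function")` collects the two direct children and the one grandchild.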

Sequence comparison
Statement of the problem: find similarities between two or more sequences, usually together with an alignment, that point to a common origin and/or 3D structure
Many facets of the problem are well understood:
– Use of dynamic programming
– Gap-open and gap-extend penalties
– How to do it using linear space O(m + n)
– Global, local, semi-global, etc. variants
– Scoring systems for DNA and protein sequences (e.g., BLOSUM matrices)
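As an illustration of the dynamic programming facet, the following sketch computes a global alignment score while keeping only two rows of the table, so memory is linear in the length of one sequence. The scoring values (match +1, mismatch -1, a single linear gap penalty of -2) are assumptions for the example, not values from the talk. Recovering the alignment itself in linear space needs a divide-and-conquer scheme such as Hirschberg's, which is not shown here.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of a and b using two rows of the DP table."""
    # Row 0: aligning the empty prefix of a against prefixes of b (all gaps).
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against the empty prefix of b.
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr[j] = max(diag, up, left)
        prev = curr
    return prev[len(b)]
```

With these assumed scores, aligning `"ACGT"` against `"AGT"` gives three matches and one gap.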

Sequence comparison
But challenges still remain:
– How to compare very long sequences, e.g., genomes, avoiding the mosaic effect (good regions interspersed with bad regions)
– One possibility is the use of normalized alignments, where a minimum score per position ratio has to be maintained (Arslan et al, Bioinformatics 17, 2001)
– How to compare genomic DNA to cDNA sequences
– Multiple sequence alignment
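The mosaic effect can be seen with a small, simplified calculation. The sketch below scores an already-built alignment (two gapped strings of equal length) and also reports the score divided by the number of columns. This plain ratio is an assumption made for illustration and is not necessarily the exact normalization used by Arslan et al; the scoring values are also assumed.

```python
def alignment_scores(x, y, match=1, mismatch=-1, gap=-2):
    """Return (total score, score per column) for two aligned, gapped strings."""
    if len(x) != len(y) or not x:
        raise ValueError("aligned strings must be non-empty and of equal length")
    total = 0
    for a, b in zip(x, y):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total, total / len(x)

# A clean region versus a "mosaic": two good regions joined by a bad one.
clean = alignment_scores("ACGTACGT", "ACGTACGT")
mosaic = alignment_scores("ACGTACGT" + "TTTT" + "ACGTACGT",
                          "ACGTACGT" + "GGGG" + "ACGTACGT")
```

Here the mosaic alignment has the higher total score but the lower score per position, which is the situation a minimum-ratio requirement is meant to rule out.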

Fragment assembly
Statement of the problem: correctly reconstruct a genome (or piece of a genome) from fragments, i.e., contiguous substrings of length ~700
Facets of the problem that are well understood:
– Overlap-layout-consensus strategy, its strengths and limitations
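A toy version of the overlap and layout steps can be sketched as follows. It assumes error-free fragments from a single strand and merges greedily on the longest exact suffix-prefix overlap; real assemblers use approximate overlaps, handle both strands, and build a consensus. The function names and the `min_len` threshold are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest exact overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left: remaining fragments stay as separate contigs
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags
```

On three fragments that overlap by three bases each, the sketch returns one contig; fragments without overlaps are returned unchanged.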

Fragment assembly
Challenges:
– How to deal with repeats
– How to use mated pairs and scaffolds
– Strong dependency on thorough data clean-up
– Sequencing by hybridization: will it ever be a viable alternative?
– The Eulerian method: new approach that has not been extensively tested in a production setting (Pevzner et al, PNAS 98(17), describes the approach)
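The core idea behind Eulerian assembly can be sketched on a toy example: each distinct k-mer becomes an edge between its (k-1)-mer prefix and suffix, and a path using every edge once spells a candidate sequence. This is a simplified illustration that assumes error-free reads, counts each distinct k-mer once, and assumes the resulting graph has an Eulerian path; it is not the full method of Pevzner et al.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a graph with one edge per distinct k-mer: prefix -> suffix."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists in the graph."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((u for u in graph if len(graph[u]) - in_deg[u] == 1),
                 next(iter(graph)))
    edges = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if edges.get(u):
            stack.append(edges[u].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def spell(path):
    """Spell the sequence given by a path of overlapping (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])

reads = ["ATGGCGT", "GCGTGCA"]
assembled = spell(eulerian_path(de_bruijn(reads, 3)))
```

In this example the repeated 2-mers `TG` and `GC` allow more than one Eulerian path, so the spelled sequence has the same 3-mers as the source of the reads but is not guaranteed to be identical to it. This is a small instance of the repeat problem listed above.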

EST Clustering
Statement of the problem: given many samples of mRNAs from the same organism or from closely related organisms, group together in clusters those mRNAs that are related
Techniques used are similar to those for fragment assembly, but goals are different
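One very simple way to group related sequences is to link any two that share an exact k-mer and take connected components. The sketch below does this with a union-find structure. The criterion and the default `k=8` are assumptions for illustration; practical EST clustering also considers reverse complements, sequencing errors, and overlap quality.

```python
def cluster_ests(seqs, k=8):
    """Cluster sequence indices: two sequences are linked if they share a k-mer."""
    parent = list(range(len(seqs)))

    def find(x):
        # Find the cluster representative, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_owner = {}  # k-mer -> index of the first sequence containing it
    for idx, s in enumerate(seqs):
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            if kmer in first_owner:
                parent[find(idx)] = find(first_owner[kmer])
            else:
                first_owner[kmer] = idx

    clusters = {}
    for idx in range(len(seqs)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())
```

Because linkage is transitive, a single shared k-mer is enough to join two clusters, which hints at why chimeric clones and paralogs can be troublesome.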

EST Clustering
Challenges:
– Intended meaning for the cluster: transcript, gene, or gene family
– How to deal with alternative splicing
– Strong dependency on thorough data clean-up [Silva and Telles, Genetics and Molecular Biology 24(1-4):17-23, 2001 is a good example of thorough clean-up]
– Recognition of chimeric clones and clusters
– Separation of paralogs

Physical Mapping
Statement of the problem: position large, contiguous pieces of a genome in their correct relative location
Used to be an intermediate step before complete sequencing of a genome
Now people tend to sequence directly, without mapping first
Two versions of the problem:
– Data coming from digestion experiments
– Data coming from hybridization experiments
Recent developments
– PQR trees in almost linear time (Meidanis and Telles, 2003, in preparation)
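For hybridization data, one common formulation asks for an ordering of the probes in which the probes hitting each clone are consecutive. The brute-force sketch below tries every ordering, so it is only usable for a handful of probes; structures such as the PQR trees mentioned above exist precisely to avoid this enumeration. The input representation (each clone as a set of probe names) is an assumption for the example.

```python
from itertools import permutations

def consecutive_ones_order(clones, probes):
    """Return a probe order in which every clone's probes are adjacent, or None."""
    for order in permutations(probes):
        pos = {p: i for i, p in enumerate(order)}
        ok = True
        for clone in clones:
            if not clone:
                continue
            positions = [pos[p] for p in clone]
            # Probes are consecutive iff they span exactly len(clone) positions.
            if max(positions) - min(positions) + 1 != len(clone):
                ok = False
                break
        if ok:
            return list(order)
    return None
```

Three clones that pairwise share probes in a chain admit an ordering, while three clones covering every pair of three probes do not.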

Phylogenetic trees
Statement of the problem: construct a tree structure showing the evolution of a group of species from a common ancestor
Old problem: construction of phylogenetic trees was done using macroscopic characteristics of species before the genomic era
The area gained momentum with molecular data: differences at the molecular level can be used as characteristics
It is possible to use distance data originated from sequence comparison as well
Challenges (just one example):
– Consensus trees
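As a small illustration of building a tree from distance data, the sketch below computes the fraction of differing positions between pre-aligned sequences and clusters them with UPGMA (average linkage). UPGMA is chosen here only because it is short; the talk does not name a method, and the species names and sequences in the example are made up.

```python
def p_distance(a, b):
    """Fraction of positions at which two equal-length aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def upgma(seqs):
    """seqs: dict name -> aligned sequence. Returns the tree as nested tuples."""
    sizes = {name: 1 for name in seqs}  # cluster -> number of leaves
    names = list(seqs)
    dist = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            dist[frozenset((a, b))] = p_distance(seqs[a], seqs[b])
    while len(sizes) > 1:
        pair = min(dist, key=dist.get)   # closest pair of clusters
        a, b = sorted(pair, key=str)     # fixed order for a reproducible tree
        merged = (a, b)
        na, nb = sizes.pop(a), sizes.pop(b)
        del dist[pair]
        for c in sizes:
            dac = dist.pop(frozenset((a, c)))
            dbc = dist.pop(frozenset((b, c)))
            # Size-weighted average of the distances to the two merged clusters.
            dist[frozenset((merged, c))] = (na * dac + nb * dbc) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes))

example = {
    "human": "ACGTACGT",
    "chimp": "ACGTACGA",
    "mouse": "ACTTTCGA",
    "rat":   "ACTTTCCA",
}
tree = upgma(example)
```

In this made-up example the two closest pairs are joined first and then merged at the root.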

Genome rearrangements
Statement of the problem: given two genomes with the same genes, find the minimum number of rearrangement events that lead from one genome to the other
A crucial observation is that sometimes gene order evolves faster than gene sequence, e.g. in plant mitochondria (Palmer and Herbon, J. Molecular Evolution 28:87-97, 1988)
Possible rearrangement events: reversal, transposition, translocation, fission, fusion, etc.
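A genome with known gene orientations is often modeled as a signed permutation, and a reversal flips a segment and the signs inside it. The sketch below applies a reversal and counts breakpoints, i.e., adjacent positions that are not consecutive in the identity. The 0-based inclusive indices are a convention chosen for this example.

```python
def reversal(perm, i, j):
    """Reverse the segment perm[i..j] (inclusive, 0-based) and flip its signs."""
    return perm[:i] + [-x for x in reversed(perm[i:j + 1])] + perm[j + 1:]

def breakpoints(perm):
    """Count adjacencies (a, b) in 0, perm, n+1 where b is not a + 1."""
    extended = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(extended, extended[1:]) if b - a != 1)

genome = [1, -3, -2, 4]
sorted_genome = reversal(genome, 1, 2)
```

Here one reversal removes both breakpoints. Since a reversal changes only two adjacencies, the number of breakpoints divided by two is a simple lower bound on the reversal distance.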

Genome rearrangements
The problem was given this precise mathematical formulation recently
Challenges:
– To solve the transposition distance problem
– Combine several events
– How to deal with gene duplication, gene creation and gene loss (nonconservative comparison)
– How to compare multiple genomes under rearrangement events
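To make the transposition distance problem concrete: a transposition exchanges two adjacent blocks of an unsigned permutation, and the distance is the fewest such moves needed to sort it. The sketch below finds the distance by breadth-first search over all permutations, which is only feasible for very small inputs and says nothing about the complexity of the general problem.

```python
from collections import deque

def transpositions(p):
    """Yield every permutation obtained by swapping adjacent blocks p[i:j] and p[j:k]."""
    n = len(p)
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n + 1):
                yield p[:i] + p[j:k] + p[i:j] + p[k:]

def transposition_distance(perm):
    """Minimum number of transpositions needed to sort perm (exhaustive BFS)."""
    start, target = tuple(perm), tuple(sorted(perm))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        p = queue.popleft()
        if p == target:
            return dist[p]
        for q in transpositions(p):
            if q not in dist:
                dist[q] = dist[p] + 1
                queue.append(q)
```

For instance, `(2, 3, 1)` is sorted by moving the block `2 3` past `1`, while `(3, 2, 1)` cannot be sorted with a single transposition.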

Micro-array Analysis
Micro-array experiments are one way of measuring the expression pattern of genes, i.e., when and how often a gene is used to produce the corresponding product
This is not a single bioinformatics problem, but rather requires a collection of problems to be solved in order to design the experiments, gather the results as image files, quantify and normalize the images, and analyze the expression patterns
It is receiving a tremendous amount of attention
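Two of the steps named above, normalization and expression-pattern analysis, can be sketched in a few lines. The example assumes two-channel intensities already quantified from the image, uses log2 ratios with median centering as one possible normalization, and uses Pearson correlation as one possible similarity between expression profiles. These choices are illustrative assumptions, not the talk's recommendations.

```python
import math
from statistics import median

def log_ratios(red, green):
    """Log2 ratio of the two channel intensities for each spot."""
    return [math.log2(r / g) for r, g in zip(red, green)]

def median_center(values):
    """Shift values so that their median is zero."""
    m = median(values)
    return [v - m for v in values]

def pearson(x, y):
    """Pearson correlation between two expression profiles of equal length."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

Profiles with high correlation across experiments are the kind of genes a clustering algorithm would tend to group together.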

Micro-array Analysis
Requires strong statistical background
Challenges:
– Steps to take to guarantee the reproducibility of results (MIAME, the Minimum Information About a Micro-array Experiment initiative)
– Clustering algorithms: lots of alternatives, which is the best? (Datta and Datta, Bioinformatics 19, 2003)
– Data acquisition from images
– Development of benchmarks (Spellman et al, Molecular Biology of the Cell 9, 1998 presented a very influential benchmark set)

Protein Classification
Statement of the problem: given the sequence of a protein, classify it according to some predefined categorization, usually hierarchical
The goal is to predict protein function
There is a huge amount of sequences waiting to be classified
Challenges:
– Development of automatic classification methods
– Sequence comparison alone is not sufficient: sequence databases such as GenBank are full of erroneous annotations done by similarity

Specialized Challenges – Global View

My Personal Challenges
At the University of Campinas
Past challenges
– Bioinformatics support for the sequencing of Xylella fastidiosa, the first plant pathogen sequenced worldwide
– Bioinformatics support for the sequencing of sugarcane (EST project)
– Bioinformatics support for the sequencing of human cancer tissue (EST project)
– Bioinformatics support for the sequencing of two species of Xanthomonas
– Advisor of 5 Masters theses and 3 PhD dissertations in Bioinformatics

My Personal Challenges
At the University of Campinas
Current challenges
– Solving the transposition distance problem in genome rearrangements
– Solving a related, seemingly easier version: the prefix transposition problem
– Using integer programming (IP) to attack problems of unknown complexity. The rationale is: when the problem is easy, IP will solve it fast consistently
– Developing mathematical models for biologically relevant objects, e.g., interval graphs with repeats to model DNA with repeats
– Using permutation group theory, in particular a new divisibility theory, to attack genome rearrangement problems

My Personal Challenges
At Scylla Bioinformatics
Past challenges
– Construction of a web-based system for support of distributed sequencing projects, a complete redesign of the system we had at Unicamp
– Construction of a client-server system for discovery and analysis of single nucleotide polymorphisms (SNPs) based on DNA sequencing
– Construction of a web-based system for finding Simple Sequence Repeats (SSR)

My Personal Challenges
At Scylla Bioinformatics
Current challenges
– Building top quality software in terms of reliability, performance, ease of use, data security
– Organizing a sound, effective software development process
– Fostering the development of the biotechnology market in Brazil and Latin America
– Building value in terms of software products, intellectual property, and organizational processes
– Building and maintaining a team of highly qualified, motivated individuals around the preceding goals

Perspectives
The future of the area will likely include:
– Formation of larger, interdisciplinary groups
– Bioscientists and Computer Scientists increasingly understanding both fields
– Probability and statistics playing an important role
– Increased quantification
– Construction of benchmarks