

Bioinformatics Research Overview
Li Liao
Develop new algorithms and (statistical) learning methods that help solve biological problems
> Capable of incorporating domain knowledge
> Effective, Expressive, Interpretable
Li Liao, SIG NewGrad, 09/29/2008

Motivations
Understanding correlations between genotype and phenotype
Predicting phenotype from genotype
Some phenotype examples:
–Protein function
–Drug/therapy response
–Drug-drug interactions for expression
–Drug mechanism
–Interacting pathways of metabolism

Bioinformatics in a … cell

Credit: Kellis & Indyk

Projects
–Genome sequencing and assembly (funded by NSF)
–Homology detection, protein family classification (funded by a DuPont S&E award)
  -Support Vector Machines
  -Hidden Markov models
  -Graph theoretic methods
–Probabilistic modeling for BioSequence (funded by NIH)
  -HMMs, and beyond
  -Motif finding
  -Secondary structure
–Systems Bioinformatics
  -Prediction of Protein-Protein Interactions
  -Inference of Gene Regulatory Networks
  -Prediction of other regulatory elements
  -Pattern analysis for RNAi (funded by UDRF)
–Comparative Genomics
  -Identify genome features for diagnostic and therapeutic purposes (funded by an Army grant)

People
Current members:
-Dr. Wen-Zhong Wang (Postdoc Fellow)
-Roger Craig (PhD student)
-Alvaro Gonzalez (PhD student)
-Kevin McCormick (PhD student)
-Colin Kern (PhD student)
Past members:
-Robel Kahsay (Ph.D., currently at DuPont Central Research & Development)
-Kishore Narra (M.S., currently at VistaPrint, Inc.)
-Arpita Gandhi (M.S., currently at Colgate-Palmolive Company)
-Gaurav Jain (M.S., currently at Institute of Genomics, Univ. of Maryland)
-Shivakundan Singh Tej (M.S.)
-Tapan Patel (B.S., currently in MD/PhD program at U Penn)
-Laura Shankman (B.S., currently in PhD program at U Virginia)



Hybrid Hierarchical Assembly
Three types of reads: Sanger (~1000bp), 454 (~100bp), and SBS (~30bp).
Assembly of individual types using the best-suited assemblers.
–Phrap, TIGR, etc. for Sanger reads
–Euler assembler and Newbler for 454 reads
–Euler short, Shorty for SBS reads
Hybrid and hierarchical
–Use longer reads as scaffolding to resolve repeat regions that are difficult for shorter reads
–Use contigs from shorter reads (pyrosequencing) as pseudoreads to bridge gaps (nonclonable and hard stops) with Sanger reads.

Major Findings
Hybrid hierarchical assembly proved to be an effective way to assemble short reads.
An incremental approach to selecting ABI reads is more effective than a random approach in generating high-coverage contigs.
Staged assembly using Phrap is an effective alternative to the proprietary Newbler assembler.
Publications: Gonzalez & Liao, BMC Bioinformatics 2008, 9:

Blue lines are contigs generated from hybrid assembly

Detect remote homologues
Attributes:
-Sequence similarity, aggregate statistics (e.g., protein families), pattern/motif, and more attributes (presence in the phylogenetic tree).
How to incorporate domain-specific knowledge into the model so a classifier can be more accurate?
Results:
-Quasi-consensus based comparison of profile HMM for protein sequences (Kahsay et al., Bioinformatics 2005)
-Using extended phylogenetic profiles and support vector machines for protein family classification (Narra & Liao, SNPD04; Craig & Liao, ICMLA’05; Craig & Liao, SAC’06; Craig & Liao, Int’l J. Bioinfo & DM 2007)
-Combining Pairwise Sequence Similarity and Support Vector Machines for Detecting Remote Protein Evolutionary and Structural Relationships (JCB 2003)

Non-linear mapping to a feature space: $x_i \mapsto \Phi(x_i)$, $x_j \mapsto \Phi(x_j)$

$$L(\alpha) = \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j \, y_i y_j \, \Phi(x_i) \cdot \Phi(x_j)$$
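As a rough illustration of the dual objective on this slide, the sketch below evaluates L(α) with the inner product Φ(x_i)·Φ(x_j) replaced by a kernel function. The function names, the RBF kernel, and its `gamma` value are assumptions made for the example; the slide does not say which kernel was used.

```python
import math

def linear_kernel(x, z):
    """Plain inner product, i.e. Phi is the identity map."""
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, gamma=0.5):
    """exp(-gamma * ||x - z||^2), an inner product in an implicit feature space.
    The choice of kernel and gamma is an assumption for illustration only."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def dual_objective(alphas, labels, points, kernel=linear_kernel):
    """L(alpha) = sum_i alpha_i - 1/2 * sum_{i,j} alpha_i alpha_j y_i y_j K(x_i, x_j)."""
    linear_term = sum(alphas)
    quadratic_term = 0.0
    for a_i, y_i, x_i in zip(alphas, labels, points):
        for a_j, y_j, x_j in zip(alphas, labels, points):
            quadratic_term += a_i * a_j * y_i * y_j * kernel(x_i, x_j)
    return linear_term - 0.5 * quadratic_term

# Two hypothetical training points with opposite labels.
points = [(1.0, 0.0), (-1.0, 0.0)]
labels = [1, -1]
alphas = [0.5, 0.5]
print(dual_objective(alphas, labels, points))              # linear kernel
print(dual_objective(alphas, labels, points, rbf_kernel))  # RBF kernel
```

Because the data enter only through `kernel(x_i, x_j)`, swapping the kernel changes the feature space without touching the objective itself.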

Data: phylogenetic profiles
[Figure: example profiles x, y, z compared by Hamming distance and by tree-based distance]
-How to account for correlations among profile components?
  -Profile extension (Narra & Liao, SNPD 04)
  -Transductive learning (Craig & Liao, ICMLA’05, SAC’06, IJBDM, 2007)
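To make the Hamming distance between phylogenetic profiles concrete, here is a minimal sketch. The profile values below are hypothetical (the values on the original slide did not survive extraction), and the encoding 1 = homolog present, 0 = absent is an assumption.

```python
def hamming_distance(profile_a, profile_b):
    """Count the genomes in which the two profiles disagree."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same set of genomes")
    return sum(a != b for a, b in zip(profile_a, profile_b))

# Hypothetical profiles over six genomes (1 = homolog present, 0 = absent).
x = [1, 1, 0, 0, 1, 0]
y = [1, 0, 0, 0, 1, 1]
z = [0, 1, 0, 1, 1, 0]

print(hamming_distance(x, y))  # 2
print(hamming_distance(x, z))  # 2
```

Both pairs score 2 even though they disagree in different genomes. Hamming distance treats every genome as independent, so it cannot tell whether mismatches fall in closely related or distant genomes, which is the gap a tree-based distance is meant to address.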

Post-order traversal
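One way a post-order traversal could be used to extend a phylogenetic profile is sketched below: every internal node of the tree receives a score computed from its children, and the scores are collected in visiting order. The nested-tuple tree encoding, the function name `extend_profile`, and the use of the children's mean as the internal-node score are all assumptions for illustration; the slide names only the traversal.

```python
def extend_profile(tree, leaf_values):
    """Return node scores in post-order (children before their parent).

    tree        -- nested tuples; a leaf is a genome name (str)
    leaf_values -- dict mapping genome name to its profile value
    """
    extended = []

    def visit(node):
        if isinstance(node, str):
            value = float(leaf_values[node])
        else:
            # Visit all children first, then score the internal node.
            child_values = [visit(child) for child in node]
            # Assumption: an internal node scores the mean of its children.
            value = sum(child_values) / len(child_values)
        extended.append(value)
        return value

    visit(tree)
    return extended

# Hypothetical four-genome tree and presence/absence profile.
tree = (("A", "B"), ("C", "D"))
profile = {"A": 1, "B": 1, "C": 0, "D": 1}
print(extend_profile(tree, profile))
# [1.0, 1.0, 1.0, 0.0, 1.0, 0.5, 0.75]
```

The extended vector keeps the original leaf values and adds one entry per internal node, so proteins that co-occur within the same clade end up with similar entries at that clade's node.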


Sequence Models (HMMs and beyond)
Motivations: What is responsible for the function?
–Patterns/motifs
–Secondary structure
To capture long range correlations of bio sequences
–Transporter proteins
–RNA secondary structure
Methods: generative versus discriminative
–Linear dependent processes
–Stochastic grammars
–Model equivalence

TMMOD: An improved hidden Markov model for predicting transmembrane topology (Kahsay, Gao & Liao, Bioinformatics 2005)

Columns: Mod. / Reg. / Data set / Correct topology / Correct location / Sensitivity / Specificity

TMMOD 1 (a) (b) (c): S (78.3%) 51 (61.4%) 64 (77.1%) 67 (80.7%) 52 (62.7%) 65 (78.3%) 97.4% 71.3% 97.1% 97.4% 71.3% 97.1%
TMMOD 2 (a) (b) (c): S (73.5%) 54 (65.1%) 65 (78.3%) 61 (73.5%) 66 (79.5%) 99.4% 93.8% 99.7% 97.4% 71.3% 97.1%
TMMOD 3 (a) (b) (c): S (84.3%) 64 (77.1%) 74 (89.2%) 71 (85.5%) 65 (78.3%) 74 (89.2%) 98.2% 95.3% 99.1% 97.4% 71.3% 97.1%
TMHMM: S (77.1%) 69 (83.1%) 96.2%
PHDtm: S-83 (85.5%) (88.0%) 98.8% 95.2%

TMMOD 1 (a) (b) (c): S (73.1%) 92 (57.5%) 117 (73.1%) 128 (80.0%) 103 (64.4%) 126 (78.8%) 97.4% 77.4% 96.1% 97.0% 80.8% 96.7%
TMMOD 2 (a) (b) (c): S (75.0%) 97 (60.6%) 118 (73.8%) 132 (82.5%) 121 (75.6%) 135 (84.4%) 98.4% 97.7% 98.4% 97.2% 95.6% 97.2%
TMMOD 3 (a) (b) (c): S (75.0%) 110 (68.8%) 135 (84.4%) 133 (83.1%) 124 (77.5%) 143 (89.4%) 97.8% 94.5% 98.3% 97.6% 98.1%
TMHMM: S (76.9%) 134 (83.8%) 97.1% 97.7%



Inferring Regulatory Networks from Time Course Expression Data (Gandhi, Cogburn & Liao, 2008)
–Expression Profile Clustering
–Binary heat map
–Boolean network algorithm
–K-means