
REFERENCES

de Bruijn NG (1946). A Combinatorial Problem. Koninklijke Nederlandse Akademie v. Wetenschappen 49: 758–764.
Eddy SR (2009). A new generation of homology search tools based on probabilistic inference. Genome Informatics 23: doi: / _0019.
Hart PE, Nilsson NJ, Raphael B (1968). A Formal Basis for the Heuristic Determination of Minimum Cost Paths. IEEE Transactions on Systems Science and Cybernetics SSC4 4(2): 100–107. doi: /TSSC
Yen JY (1971). Finding the K Shortest Loopless Paths in a Network. Management Science Theory Series (July) 17(11): Published by: INFORMS.
Pell J, Hintze A, Canino-Koning R, Howe A, Tiedje JM, Brown CT (in press). Scaling metagenome sequence assembly with probabilistic de Bruijn graphs.

Fig. 2: Combined Graph (CG)

GLBRC DATASETS
• Two miscanthus and two switchgrass samples
• 500M 100-bp reads in each sample
• Assembled separately with Xander searching for nifH
• Examined nifH group composition in the four samples

Jordan A. Fish 1, Qiong Wang 1, Yanni Sun 2, C. Titus Brown 2,3, James M. Tiedje 1,3 and James R. Cole 1
1 Center for Microbial Ecology; 2 Department of Computer Science and Engineering; 3 Department of Microbiology and Molecular Genetics
Michigan State University, East Lansing, MI
Contact:

Very large metagenomes tax the abilities of current-generation short-read assemblers. In addition to space and time complexity issues, most assemblers are not designed to correctly treat reads from closely related populations of organisms. We are developing a gene-targeted approach for metagenome assembly. In this approach, information about specific genes is used to guide assembly, and gene annotation occurs concomitantly with assembly. The approach combines a space-efficient de Bruijn graph representation of the reads with a protein profile Hidden Markov Model (HMM) for the gene(s) of interest.
To limit the search, we first apply a heuristic that identifies nucleotide k-mers translating to peptides found in a set of representatives of the target protein family. These k-mers, together with the positions of the peptides in the HMM representation, define a set of search start points. Contigs are then assembled by applying graph path-finding algorithms in both directions on the combined de Bruijn-HMM graph structure. Using this technique, we have extracted complete nifH protein-coding regions from several 50G soil metagenomes, including metagenomes from an Iowa great prairie soil and from soils planted with Miscanthus and Switchgrass, two potential biofuel crops. In addition, we have extracted complete but genes, which code for butyryl-CoA transferase, from human gut metagenomes. Future work will focus on separating sequencing artifacts from low-coverage rare populations.

ACKNOWLEDGMENTS
This work was funded in part by the DOE Great Lakes Bioenergy Research Center (DOE BER Office of Science DE-FC02-07ER64494, DOE OBP Office of Energy Efficiency and Renewable Energy DE-AC05-76RL01830), the Great Prairie Soil Metagenomes Project sponsored by DOE's Joint Genome Institute (piloting for DOE's Grand Challenge Program), and the NIH/PHS Human Microbiome Project (The Role of Gut Microbiota in Ulcerative Colitis, grant UH3-DK ).

Figure legend: De Bruijn transitions; Combined graph transitions; HMM states

INTRODUCTION
Xander is a de Bruijn graph assembler designed for gene-targeted metagenomic assembly. We use a space-efficient graph representation to enable scaling to large datasets. Xander is a local assembly tool: starting from a node in the graph, we walk in each direction, using a Hidden Markov Model as a guide, to assemble genes of interest. To explore population-level diversity, we have developed methods to find additional, sub-optimal paths.

METHODS
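The start-point heuristic described above (nucleotide k-mers whose translation matches a peptide from the reference protein set) can be illustrated with a minimal Python sketch. This is not Xander's code: the function names, the toy value of k, and the use of offsets within the reference proteins as a stand-in for HMM positions are all assumptions made for illustration, and reverse-complement k-mers are ignored for brevity.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(kmer):
    """Translate a nucleotide k-mer (length divisible by 3) in frame 0."""
    return "".join(CODON_TABLE[kmer[i:i + 3]] for i in range(0, len(kmer), 3))


def index_reference_peptides(proteins, pep_len):
    """Map every peptide word of length pep_len to the offsets where it occurs.

    Assumption: the offset in a reference protein stands in for the
    position of the peptide in the HMM.
    """
    index = {}
    for protein in proteins:
        for offset in range(len(protein) - pep_len + 1):
            index.setdefault(protein[offset:offset + pep_len], set()).add(offset)
    return index


def find_start_points(reads, proteins, k):
    """Return {nucleotide k-mer: sorted reference offsets} for matching k-mers."""
    if k % 3 != 0:
        raise ValueError("k must be a multiple of 3")
    index = index_reference_peptides(proteins, k // 3)
    starts = {}
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) - set(BASES):
                continue  # skip k-mers containing N or other ambiguity codes
            offsets = index.get(translate(kmer))
            if offsets:
                starts[kmer] = sorted(offsets)
    return starts
```

For example, `find_start_points(["GGATGAAACGTTGGCC"], ["MKRW"], 9)` reports the two k-mers `ATGAAACGT` (peptide MKR, offset 0) and `AAACGTTGG` (peptide KRW, offset 1). Each reported k-mer, paired with its offset, plays the role of a search start point.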
PER-GENE PREPARATION
• Select high-quality representative sequences
• Build Forward and Reverse HMMs
• Select reference set of known protein sequences

SEARCHING
• The de Bruijn graph and HMM are combined on the fly to create a graph in which each node represents both a k-mer from the de Bruijn graph and an HMM state (position in the model and match/insert/delete state).
• Each edge represents both a transition between k-mers in the de Bruijn graph and a transition between positions in the HMM. Edges are weighted with transition and emission probabilities from the HMM.
• We find the best path from each starting node using the A* search algorithm, with the probability of the most probable path from the current node as the heuristic value function.

SUB-OPTIMAL PATHS
To capture the population-level diversity in metagenomic samples, we implemented a modified version of Yen's K-Shortest Path algorithm. Yen's algorithm finds the K shortest paths even if those paths all contain the same nodes, whereas we are interested in paths that contain new nodes. Once we have the K shortest paths, we extract the subgraph induced by the nodes contained in those K paths.

Fig. 1: Xander Pipeline

RESULTS
Fig. 3: nifH groups present in miscanthus and switchgrass samples (panels: Miscanthus #1, Miscanthus #2, Switchgrass #1, Switchgrass #2; Total 2,780; Group 1)

GENE: but (butyryl-CoA transferase)
Butyrate serves as the major energy source of colonocytes, has anti-inflammatory properties, and regulates gene expression, differentiation and apoptosis in host cells. In healthy individuals, the but pathway is the major pathway for butyrate production in the human gut.

RESULTS
Xander searched and assembled 56 unique protein sequences with length >100. Only two nearly identical sequences were full length. These were very similar (2 and 4 AA substitutions) to a but gene from the HMP reference genome sequence of Acidaminococcus sp. D21, isolated from a healthy human gut.
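Returning to the SEARCHING step under METHODS, the idea of running A* over a graph whose nodes pair a k-mer with a model position can be sketched in a few lines of Python. The sketch below is heavily simplified and is not Xander's implementation: it assumes a match-only, nucleotide-level profile (no insert or delete states, all transition probabilities equal to 1), extends only to the right of the start k-mer, and uses hypothetical names (`build_debruijn`, `astar_best_path`, `emissions`). Edge costs are negative log emission probabilities, and the heuristic is the cost of the most probable remaining path through the model, which never overestimates the true remaining cost.

```python
import heapq
import math
from collections import defaultdict


def build_debruijn(reads, k):
    """Map each k-mer to the set of k-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            graph[read[i:i + k]].add(read[i + 1:i + k + 1])
    return graph


def astar_best_path(graph, start_kmer, emissions, start_pos=0):
    """Extend start_kmer rightwards to the end of a match-only profile.

    emissions[p][base] is the probability of `base` at model position p
    (a simplifying assumption; a real profile HMM also has insert and
    delete states and transition probabilities).
    Returns (sequence, cost) with cost = -log(path probability), or None.
    """
    n = len(emissions)
    # best_tail[p]: cost of the most probable way to emit positions p..n-1.
    best_tail = [0.0] * (n + 1)
    for p in range(n - 1, -1, -1):
        best_tail[p] = best_tail[p + 1] - math.log(max(emissions[p].values()))

    pos = start_pos + len(start_kmer)  # next model position to emit
    frontier = [(best_tail[pos], 0.0, start_kmer, pos, start_kmer)]
    closed = set()
    while frontier:
        _, cost, kmer, pos, seq = heapq.heappop(frontier)
        if pos == n:
            return seq, cost  # first goal popped is optimal
        if (kmer, pos) in closed:
            continue
        closed.add((kmer, pos))
        for nxt in graph.get(kmer, ()):
            prob = emissions[pos].get(nxt[-1], 0.0)
            if prob <= 0.0:
                continue  # the model cannot emit this base here
            new_cost = cost - math.log(prob)
            heapq.heappush(
                frontier,
                (new_cost + best_tail[pos + 1], new_cost, nxt, pos + 1, seq + nxt[-1]),
            )
    return None
```

With reads `ACGTAC` and `ACGTTC`, k = 3, and a six-position profile that prefers T over A at position 4, the search started from `ACG` follows the `ACGTTC` branch. Searching leftwards would, under the same assumptions, use the reversed graph and a reversed model, in the spirit of the Forward and Reverse HMMs listed under PER-GENE PREPARATION.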
HMP DATASET
100M 101-bp reads (15G) of human gut metagenomic shotgun data from an ulcerative colitis (UC) patient who underwent a colectomy followed by ileal pouch anal anastomosis. In this procedure, the entire colon is resected, the terminal ileum is fashioned into a pouch and connected to the anal canal, and the intestinal flow is re-established.

Table: Substitutions found in the two full-length but sequences assembled by Xander
Columns: Sequence; Nucleotide Substitutions; AA Substitutions
AA substitutions, first sequence: V→I; 194 Q→P
AA substitutions, second sequence: V→I; 139 V→G; 141 A→S; 194 Q→P

MOCK DATASET
• Genome: Azospirillum brasilense Sp245
• Made mock reads using BioGrinder: 100-bp reads, simulated Illumina errors, targeted 10x coverage of the genome
• Assembled with Xander searching for nifH
• One k-mer selected for sub-optimal path extraction
• Examined k-mer coverage of known nifH k-mers

Legend: >7 occurrences; 2-7 occurrences; 1 occurrence

Fig. 4: Small portion of subgraph induced by top 100 paths from one start point
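The sub-optimal path post-processing and the k-mer coverage check can be illustrated together with a short Python sketch. It does not implement Yen's algorithm itself and is not Xander's code. It assumes a ranked list of paths is already available, and it shows one possible reading of "paths that contain new nodes": keep a path only if it visits a node not seen in a better-ranked kept path. The function names and the extra "0" bin for absent k-mers are assumptions; the other bins follow the ">7", "2-7" and "1" occurrence classes of the legend.

```python
from collections import Counter


def paths_with_new_nodes(ranked_paths):
    """Keep, in rank order, only paths that add at least one unseen node."""
    seen, kept = set(), []
    for path in ranked_paths:
        if any(node not in seen for node in path):
            kept.append(path)
            seen.update(path)
    return kept


def induced_subgraph(graph, paths):
    """Restrict `graph` (node -> successors) to the nodes used by `paths`."""
    nodes = {node for path in paths for node in path}
    return {
        node: sorted(succ for succ in graph.get(node, ()) if succ in nodes)
        for node in sorted(nodes)
    }


def coverage_bin(count):
    """Bin an occurrence count into the classes used in the legend."""
    if count > 7:
        return ">7"
    if count >= 2:
        return "2-7"
    if count == 1:
        return "1"
    return "0"  # assumption: extra bin for k-mers absent from the reads


def kmer_coverage_bins(reads, known_kmers, k):
    """Count how often each known k-mer occurs in the reads and bin the count."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer: coverage_bin(counts[kmer]) for kmer in known_kmers}
```

A typical use would be to filter the ranked paths, build the induced subgraph for display, and then label each node with its coverage bin, which is roughly the kind of view a figure such as Fig. 4 presents.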