Algorithms for Multisample Read Binning

Student: Gabriel Ilie
Major advisor: Ion Măndoiu
Associate advisor: Sanguthevar Rajasekaran
Associate advisor: Yufeng Wu
University of Connecticut
November 2013

Outline
- Motivation
- Previous approaches
- Algorithms
- Results
- Ongoing work

Human microbiome
- The microorganisms inhabiting our bodies comprise the microbiome.
- Microbes outnumber our own cells by 10 to 1 [Wooley et al, 2010].
- Diabetes, obesity, cancer and even attractiveness to mosquitoes seem to correlate with changes in our microbiome [Turnbaugh et al, 2010].
- To better understand our own condition we also need to understand the composition of the microbial communities inhabiting our bodies, and how they interact not just among themselves but also with their habitat (e.g. the host organism).

Single organism studies
- They rely on clonal cultures:
  - most microorganisms cannot be cultivated
  - they suffer from amplification bias
- Microbes do not live in single-species communities:
  - members of these communities interact not just with each other but also with their habitats, which include the host organism
- Data obtained from clonal cultures is highly biased and does not capture a true picture of microbial life.

Transcriptomics
- Genomes only provide information about the potential function of organisms.
- Having a gene does not mean that the gene is also expressed inside the host or under a particular condition.
- The transcriptome is the set of all RNA molecules produced in one cell or in a population of cells.
- To understand the physiology of the microorganisms we need to know their transcriptome.

Metatranscriptomics
- The transcriptome of a community is the union of the transcriptomes of its members.
- In metatranscriptomic studies:
  - bulk RNA is extracted directly from environmental samples
  - the RNA is reverse-transcribed
  - the resulting RNA-Seq libraries are sequenced
- Metatranscriptomic studies are essential for understanding the physiology of the microbiome.

Challenges of working with metatranscriptomic data
1. The volume of sequencing data is several orders of magnitude larger than for single organisms.
2. Reads can come from hundreds of different species, each with a different abundance level.
3. In addition to the range in abundance levels of the microorganisms, genes are expressed at drastically different levels.
4. Genes usually have multiple isoforms.
Metatranscriptomics has to deal with all of the challenges of metagenomics (1 and 2) plus some extra challenges (3 and 4); therefore algorithms devised for metagenomic data can also be applied to metatranscriptomic data.

(Meta)Transcriptome assembly types
- Genome independent reconstruction (de novo):
  - de Bruijn k-mer graph
  - Velvet (2008)
  - Trinity (2011)
- Genome guided reconstruction:
  - spliced read mapping
  - exon identification
  - splice graph
  - Cufflinks (2010)
  - TRIP (2012)

Clustering reads into bins
- Analysis of environmental samples is difficult.
- To simplify the assembly process, many metagenomic tools have been developed to cluster the reads into bins (i.e. species).
- Algorithms developed for binning metagenomic reads can also be applied to (meta)transcriptomic reads (bins then represent transcripts instead of species).

Types of reads binning algorithms
- Genome dependent: CompostBin (2008), Metacluster (2012)
  - DNA composition patterns: G+C content and dinucleotide frequencies vary among species.
  - Drawbacks:
    - achieve reasonable performance only for long reads (800-1000 bp, [Wu et al, 2011])
    - NGS technologies produce short reads
- Genome independent: AbundanceBin (2011), MultiBin (2011)
  - K-mer frequencies are usually linearly proportional to a genome's abundance.
  - Sufficiently long k-mers are usually unique.
  - Works with short sequencing reads.
  - Drawbacks:
    - group together reads from different species if they have close abundance levels
    - do not perform well on species with low abundances

Metacluster (2012)
- Two-round unsupervised binning algorithm:
  - the first round clusters high-abundance reads:
    - reads with 16-mers that appear fewer than T times are filtered out
    - the reads are grouped based on shared 36-mers
    - the groups are clustered using 5-mer distributions
  - the second round clusters the remaining (low-abundance) reads:
    - those with unique 16-mers are discarded
    - the reads are grouped based on shared 22-mers
    - the groups are clustered using 4-mer distributions
- Advantages:
  - better binning of reads from low abundance species
  - uses both techniques: k-mer abundances and DNA composition patterns

MultiBin (2011)
- Processes multiple samples (N > 1) of the same microbial community.
- Clusters the reads into b bins (b is the number of species).
- Binning algorithm:
  1. all reads are pooled together
  2. a graph G = (V, E) is generated: V is the set of reads, and edges connect reads with substantial overlap (50 bp)
  3. a maximal independent set in G, the set of tags, is found greedily; each read is now either a tag or affiliated with a tag which substantially overlaps it
  4. each tag t gets a vector V_t = (c_t1, c_t2, ..., c_tN), where c_ti = number of reads from sample i which substantially overlap tag t
  5. k-medoids clustering is performed on the set of tags, using these vectors
  6. each non-tag read is assigned to the bin which holds its affiliated tag
- b needs to be known in advance or estimated somehow.
- Uses abundance differences from any of the samples to tell low abundance species apart.
- The algorithm is quadratic in the total number of reads.
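The tag selection and count-vector steps can be sketched as follows. This is a minimal illustration, not MultiBin's actual code: it assumes the overlap graph is already available as an adjacency dictionary, the names `pick_tags`, `tag_count_vectors` and `sample_of` are hypothetical, and the k-medoids step is left out.

```python
def pick_tags(adj):
    """Greedily pick a maximal independent set of reads (the tags).

    adj maps each read id to the set of reads it substantially overlaps.
    Returns the tags and, for every read, the tag it is affiliated with.
    """
    tags, affiliation = [], {}
    for read in adj:                      # greedy: first uncovered read wins
        if read in affiliation:
            continue                      # already covered by an earlier tag
        tags.append(read)
        affiliation[read] = read
        for neighbour in adj[read]:
            affiliation.setdefault(neighbour, read)
    return tags, affiliation


def tag_count_vectors(adj, sample_of, tags, n_samples):
    """For each tag, count the overlapping reads coming from each sample.

    Assumption: only the neighbours of a tag are counted, not the tag itself.
    """
    vectors = {}
    for tag in tags:
        counts = [0] * n_samples
        for neighbour in adj[tag]:
            counts[sample_of[neighbour]] += 1
        vectors[tag] = tuple(counts)
    return vectors


# Hypothetical toy input: five reads from two samples.
adj = {
    "r1": {"r2", "r3"},
    "r2": {"r1"},
    "r3": {"r1", "r4"},
    "r4": {"r3", "r5"},
    "r5": {"r4"},
}
sample_of = {"r1": 0, "r2": 0, "r3": 1, "r4": 1, "r5": 1}
tags, affiliation = pick_tags(adj)
vectors = tag_count_vectors(adj, sample_of, tags, n_samples=2)
```

The resulting vectors are what a clustering step would consume; each non-tag read then inherits the bin of `affiliation[read]`.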

Our approach
- We propose a novel unsupervised, abundance-based algorithm for binning reads from multiple samples.
- In brief, for N > 1 samples:
  1. we split the reads into k-mers and count how many times they appear in each sample
  2. we run a sample-by-sample error removal algorithm
  3. we pool the k-mer counts together to get N-dimensional count vectors
  4. we run the error removal algorithm once more on all of the k-mers
  5. we use the structure of the de Bruijn graph defined by the k-mers, together with the counts, to partition the graph into paths or chordless cycles (putative pseudo-exons)

Pseudo-exons
- Pseudo-exons are defined as substrings whose k-mers and (k+1)-mers appear in the same transcripts with the same multiplicity.
- Example transcripts:
  - T1: A B C D
  - T2: A B C A
  - T3: A B C E F
  - T4: F E D
- Exon signatures:
  - A: T1x1 T2x2 T3x1
  - B: T1x1 T2x1 T3x1
  - C: T1x1 T2x1 T3x1
  - D: T1x1 T3x1
  - E: T3x1 T4x1
  - F: T3x1 T4x1
- Pseudo-exons: A, BC, D, E, F
- Notice that exons E and F form different pseudo-exons because in T4 they are in reverse order compared to T3.
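A small exon-level sketch of this definition follows. It is a simplification: the slide defines pseudo-exons on k-mers and (k+1)-mers, whereas here single exons stand in for k-mers and adjacent exon pairs stand in for (k+1)-mers. The transcript layout (T1 = ABCD, T2 = ABCA, T3 = ABCEF, T4 = FED) is one reading of the flattened figure and should be treated as an assumption, as should the helper names.

```python
def signature(transcripts, pattern):
    """Multiplicity of a contiguous exon pattern in each transcript."""
    n = len(pattern)
    sig = {}
    for name, exons in transcripts.items():
        hits = sum(
            1
            for i in range(len(exons) - n + 1)
            if tuple(exons[i:i + n]) == pattern
        )
        if hits:
            sig[name] = hits
    return sig


def pseudo_exons(transcripts):
    """Merge adjacent exons whose signatures match the signature of the pair."""
    exons = sorted({e for t in transcripts.values() for e in t})
    pairs = {(t[i], t[i + 1]) for t in transcripts.values() for i in range(len(t) - 1)}
    parent = {e: e for e in exons}

    def find(e):
        while parent[e] != e:
            e = parent[e]
        return e

    for a, b in pairs:
        sig_a = signature(transcripts, (a,))
        sig_ab = signature(transcripts, (a, b))
        sig_b = signature(transcripts, (b,))
        if sig_a == sig_ab == sig_b:
            parent[find(b)] = find(a)

    groups = {}
    for e in exons:
        groups.setdefault(find(e), []).append(e)
    # Exons inside a group are listed alphabetically, not in path order.
    return sorted(groups.values())


# Assumed reading of the example figure.
transcripts = {"T1": "ABCD", "T2": "ABCA", "T3": "ABCEF", "T4": "FED"}
result = pseudo_exons(transcripts)
```

Under this reading, B and C merge because B, C and the pair BC share one signature, while E and F stay apart because the pair EF occurs only in T3 and the pair FE only in T4.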

K-mer counting
- We count k-mers and (k+1)-mers using Jellyfish (2011).
  - Jellyfish was designed for shared memory parallel computers with more than one core; it uses several lock-free data structures and multi-threading to count k-mers much faster than other tools.
- Formally, for a given value of k, we count the number of occurrences of all k-mers in each of the N samples.
- Our algorithms assume strand non-specific data, therefore the counts of complementary k-mers are summed together, as they are indistinguishable.
  - We store k-mers in canonical form (the lexicographically smaller of a k-mer and its reverse complement).
- We combine the counts over all the samples into a list of N-dimensional vectors.
- The maximum value for k supported by Jellyfish is 31.
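The counting scheme can be illustrated with a few lines of plain Python. This is only a sketch of canonical counting and pooling for toy inputs; it does not use Jellyfish, and the function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def count_kmers(reads, k):
    """Count canonical k-mers in one sample."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts


def pooled_count_vectors(samples, k):
    """Combine per-sample counts into one N-dimensional vector per k-mer."""
    per_sample = [count_kmers(reads, k) for reads in samples]
    all_kmers = set().union(*per_sample)
    return {kmer: tuple(c[kmer] for c in per_sample) for kmer in all_kmers}


# Hypothetical toy data: two samples, k = 3.
samples = [["ACGT", "ACGA"], ["TCGT"]]
vectors = pooled_count_vectors(samples, 3)
```

Here `vectors["ACG"]` holds the combined counts of ACG and CGT in each sample, which is the summing of complementary k-mers described above.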

De Bruijn graph
- We construct the de Bruijn graph; vertices are k-mers and edges are (k+1)-mers.
- Because we store vertices in canonical form, each vertex represents two k-mers: itself and its reverse complement.
- We add an edge between two k-mers if and only if there is a (k+1)-mer in the reads such that its prefix of length k, in canonical form, matches one of the vertices, while its suffix of length k, in canonical form, matches the other one.
- We will sometimes use vertices to refer to k-mers and edges to refer to (k+1)-mers.

De Bruijn graph
- Example (figure):
  - k-mer pairs: ACG/CGT, CGC/GCG, CGA/TCG, ATC/GAT
  - Vertices: ACG, CGC, CGA, ATC
  - Edges: ACGC, CGCG, ACGA, ATCG
- The lists of k-mers and (k+1)-mers define an implicit representation of the de Bruijn graph, therefore we don’t need to construct it explicitly.
- Relative to the canonical form of a vertex, for each edge we can say whether it is incoming or outgoing:
  - if the k-mer matches the 5’ end of either form of an edge, then that is an outgoing edge
  - else it is an incoming edge
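The incoming/outgoing rule can be tried on the example edges with a short sketch. The dictionary layout and the names `direction` and `build_graph` are assumptions made for illustration; an explicit dictionary is built here only to make the rule visible, whereas the slides keep the graph implicit.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def canonical(seq):
    return min(seq, reverse_complement(seq))


def direction(vertex, edge):
    """'out' if the canonical vertex is the 5' end of either form of the edge."""
    forms = (edge, reverse_complement(edge))
    return "out" if any(form.startswith(vertex) for form in forms) else "in"


def build_graph(edges):
    """Map each canonical k-mer to its incoming and outgoing (k+1)-mers."""
    graph = {}
    for edge in edges:
        edge = canonical(edge)
        k = len(edge) - 1
        # A self-edge has a single endpoint, hence the set.
        endpoints = {canonical(edge[:k]), canonical(edge[1:])}
        for vertex in sorted(endpoints):
            slots = graph.setdefault(vertex, {"in": [], "out": []})
            slots[direction(vertex, edge)].append(edge)
    return graph


# Edges from the example above.
graph = build_graph(["ACGC", "CGCG", "ACGA", "ATCG"])
```

Note that ATCG is outgoing for vertex CGA because its reverse complement, CGAT, starts with CGA, while ACGA is incoming for the same vertex.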

Error removal
- A common approach is to remove k-mers whose counts are at most some threshold t >= 1.
- We found that even for t = 1 we lose too much information, because removing unique k-mers compromises the results for ultra-low abundance transcripts.
- We found that “tip removal” and “bubble removal” give much better results [Zerbino et al, Velvet, 2008].
- These methods use the structure of the de Bruijn graph, instead of coverage information, to remove k-mers affected by sequencing errors.

Tip and bubble error removal
- When a read contains a sequencing error:
  - the first few k-mers may be correct, until they start to overlap the position where the error occurred
  - this creates a branch going out of the “correct” path
  - this new branch will either end in a leaf (creating a “tip”) or, if the read is long enough, the k-mers will stop overlapping the error and the branch will merge back into the path (creating a “bubble”)
- A “tip” is a chain of nodes that is disconnected on one end:
  - we expect the majority of tips to have a maximum length of 2k
  - removing tips is straightforward: we remove all tips which have a length up to some threshold
  - removing a tip does not disrupt the connectivity of the graph
- Implementing “bubble” removal is still ongoing work.
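A minimal sketch of length-based tip removal is given below. It makes several simplifying assumptions that are not in the slides: the graph is treated as undirected and stored as a dictionary of neighbour sets, there are no self-loops, a chain counts as a tip only if it ends at a branching vertex, and a single pass is made. The helper names are hypothetical.

```python
def from_edges(edges):
    """Build an undirected adjacency dictionary from vertex pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def find_tip(adj, leaf, max_len):
    """Return the chain from a leaf up to (not including) a branching vertex,
    or None if the chain is longer than max_len or never reaches a branch."""
    chain = [leaf]
    prev, cur = None, leaf
    while len(chain) <= max_len:
        onward = [n for n in adj[cur] if n != prev]
        if len(onward) != 1:
            return None                  # dead end: an isolated path, not a tip
        prev, cur = cur, onward[0]
        if len(adj[cur]) > 2:
            return chain                 # reached the branching vertex
        chain.append(cur)
    return None


def remove_tips(adj, max_len):
    """Remove every tip of at most max_len vertices (single pass)."""
    pruned = {v: set(ns) for v, ns in adj.items()}
    leaves = [v for v, ns in pruned.items() if len(ns) == 1]
    for leaf in leaves:
        if leaf not in pruned or len(pruned[leaf]) != 1:
            continue
        tip = find_tip(pruned, leaf, max_len)
        if tip is None:
            continue
        for v in tip:
            for n in pruned.pop(v):
                if n in pruned:
                    pruned[n].discard(v)
    return pruned


# Hypothetical toy graph: a path p0..p7 with a two-vertex tip hanging off p4.
edges = [(f"p{i}", f"p{i + 1}") for i in range(7)] + [("p4", "t1"), ("t1", "t0")]
pruned = remove_tips(from_edges(edges), max_len=2)
```

Because only length is tested, the ends of a genuine path that lie within `max_len` vertices of a branch would be removed as well, which is one reason to choose the threshold with care.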

Partitioning the de Bruijn graph
- From the de Bruijn graph we want to extract putative pseudo-exons.
- These putative pseudo-exons, if we ignore self-edges, correspond to paths or chordless cycles in the de Bruijn graph.
- We use the structure of the graph and the count vectors to do the partitioning.

Partitioning the de Bruijn graph
- Assuming perfect data, finding the putative pseudo-exons would simply mean removing all incoming (outgoing) edges of vertices that have an indegree (outdegree) greater than 1.
- Because we have sequencing errors, we need to distinguish between erroneous and real edges. We have the following two cases:
  - If a correct edge and at least one wrong edge come out of the same vertex, we expect the abundance of the correct edge to dwarf the sum of the abundances of the erroneous edges; the ratio test then removes the wrong edges and keeps the correct one.
  - If two correct edges come out of the same vertex, we want to remove both of them. Assuming neither edge comes from an ultra-low abundance transcript, we expect neither edge to pass the ratio test, so our algorithm should remove both.

Partitioning the de Bruijn graph
We have the following cases:
- If a vertex has indegree and outdegree of at most 1 (it is on a path), we do nothing.
- If a vertex has outdegree (indegree) greater than 1, then we remove all outgoing (incoming) edges from that vertex.
  - However, we keep the most abundant edge if the ratio between its abundance and the sum of the abundances of the other edges is higher than a threshold 0 < e < 1.
- The value of e should be close to 1 (e.g. 0.97).
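The edge removal rule for a single vertex can be sketched as below. Two assumptions are made here and are not taken from the slides: each edge's abundance is a single number (for instance its count vector summed over the samples), and the ratio is read as the most abundant edge's share of the total abundance of all competing edges, which is what keeps the threshold meaningful in the range 0 < e < 1. The function name is hypothetical.

```python
def edges_to_keep(abundances, threshold=0.97):
    """Decide which outgoing (or incoming) edges of one vertex survive.

    abundances maps each competing edge to a scalar abundance.
    """
    if len(abundances) <= 1:
        return set(abundances)           # vertex is on a path: nothing to do
    total = sum(abundances.values())
    best = max(abundances, key=abundances.get)
    if total > 0 and abundances[best] / total > threshold:
        return {best}                    # dominant edge: the rest look like errors
    return set()                         # no dominant edge: cut all of them


kept_dominant = edges_to_keep({"e1": 990, "e2": 10})
kept_balanced = edges_to_keep({"e1": 60, "e2": 40})
```

With these toy numbers the first vertex keeps only `e1`, while the second loses both edges, matching the two cases discussed for erroneous and genuine branches.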

Test data - error free
- GNF Atlas [Su et al, 2004] is a dataset which contains information about the expression levels of a set of genes in several human tissues.
- From this dataset we extracted the expression levels of 19,371 genes in 10 human tissues.
- We used only one isoform per gene.
- We simulated 30 million error free RNA-Seq paired reads of length 50 from this dataset, using a tool called Grinder (2012).

Test data - with sequencing errors
- Grinder was very useful for simulating the error free data; however, when we wanted to introduce errors its long running time became an issue.
- Instead, we simulated sequencing errors starting from the error free data.
- We simulated only one type of error, substitutions, because these are the most common type found in Illumina datasets.
- We introduced substitutions into the error free reads with a probability of 0.1%, 0.5% and 1% per base.
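Introducing substitutions at a fixed per-base rate can be sketched as follows. This is an illustration under assumptions, not the script used for the experiments: each base is mutated independently, the replacement is drawn uniformly from the three other bases, and the function name is hypothetical.

```python
import random


def add_substitutions(read, error_rate, rng):
    """Replace each base, with probability error_rate, by a different base."""
    bases = "ACGT"
    mutated = []
    for base in read:
        if rng.random() < error_rate:
            # Assumption: the three alternative bases are equally likely.
            mutated.append(rng.choice([b for b in bases if b != base]))
        else:
            mutated.append(base)
    return "".join(mutated)


rng = random.Random(0)                   # fixed seed for reproducibility
read = "ACGT" * 25                       # hypothetical 100 bp read
noisy = add_substitutions(read, 0.001, rng)
```

Since a single substitution affects up to k overlapping k-mers, even the 0.1% rate can create a large number of erroneous k-mers.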

K-mer counts in the simulated data

Transcripts: 49,572,543 30-mers; 49,611,691 31-mers

Reads:

error   k    #correct k-mers   #incorrect k-mers   #missing k-mers   % incorrect k-mers
0%      30   49,512,839        -                   59,704            -
0%      31   49,546,279        -                   65,412            -
0.1%    30   49,508,364        109,380,690         64,179            68.84%
0.1%    31   49,540,750        108,528,925         70,941            68.66%
0.5%    30   49,483,820        404,415,622         88,723            89.1%
0.5%    31   49,510,299        402,714,823         101,392           89.05%
1%      30   49,433,000        708,460,569         139,543           93.48%
1%      31   49,446,451        706,531,643         165,240           93.46%

- Because of ultra-low abundance transcripts, even in the error free data we have missing k-mers.
- Even for an error rate of 0.1%, the number of distinct k-mers more than triples compared to the error free data, and almost 70% of them do not appear in the transcripts.
- This shows the importance of error removal/correction algorithms.

Efficiency of different error removal techniques

Error removal method                                                    error   #correct 31-mers   #incorrect 31-mers   #missing 31-mers
none                                                                    0%      49,546,279         -                    65,412
non-unique in at least 1 sample                                         0%      46,520,486         -                    3,091,205
non-unique in at least 1 sample                                         0.1%    46,257,210         6,906,069            3,354,481
remove tips <= 21 over the union of the samples                         0.1%    49,502,470         21,772,808           109,221
remove tips <= 60 over the union of the samples                         0.1%    49,236,255         12,120,069           375,436
remove tips <= 21 sample-by-sample and over the union of the samples    0.1%    49,127,456         10,998,956           484,235
remove tips <= 60 sample-by-sample and over the union of the samples    0.1%    48,707,567         5,304,983            904,124

Results for the graph partitioning

Columns: (1) error; (2) error removal technique; (3) edge removal threshold; (4) #pseudo-exons >= 50bp; (5) #pseudo-exons >= 50bp that do not have wrong 31-mers; (6) #30-mers; (7) % of the transcriptome covered by col 5

(1)     (2)                                                                     (3)    (4)       (5)      (6)          (7)
0%      none                                                                    1      60,285    -        49,299,665   99.5%
0%      none                                                                    1      65,051    -        49,239,335   99.3%
0.1%    remove tips <= 60 over the union of the samples                         0.97   416,853   60,335   37,815,256   76.3%
0.1%    remove tips <= 60 sample-by-sample and over the union of the samples    0.97   228,841   77,010   46,699,963   94.2%

The first row uses all k-mers; its counts are computed from the transcripts.

Ongoing work
- We believe that bubble removal will help us get rid of most of the erroneous k-mers which are still present in the data after tip removal.
- We want to incorporate an error correction algorithm, SEECER (2013), to correct the erroneous reads before starting the k-mer counting.
- Currently our algorithms do not take into account strand-specific RNA-Seq data. Optimizing the algorithms to take advantage of this information, when available, represents another opportunity to improve the results of this approach.

Q&A