Efficient Clustering of Large EST Data Sets on Parallel Computers CECS 694-04 Bioinformatics Journal Club September 17, 2003 Nucleic Acids Research, 2003,

Efficient Clustering of Large EST Data Sets on Parallel Computers CECS Bioinformatics Journal Club September 17, 2003 Nucleic Acids Research, 2003, 31(11), Presented by Elizabeth Cha

2 Problem Statement We are given an EST database from a single species, where multiple EST sequences may belong to the same gene. We want to find an efficient algorithm to cluster EST sequences, so that all EST sequences in a cluster belong to a single gene. (It’s possible to have more than one cluster for a gene.)

3 Efficient Algorithm Considerations Memory efficiency: reduce the memory required to linear in the size of the input. Computational efficiency: reduce computation without sacrificing the quality of clustering. Run time: reduce the time to cluster large EST data sets by parallel processing (e.g. MPI).

4 EST database (dbEST) Expressed Sequence Tag (EST) representations provide a dynamic view of genome content and expression. > 5 million human ESTs; > 3.5 million mouse ESTs. Reference information: dbEST (ncbi.nlm.nih.gov/dbEST/dbEST_summary.html)

5 What is EST? A unique DNA sequence derived from a cDNA library. An EST is typically about 200 to 500 nucleotides long. ESTs are generated by sequencing either one or both ends of an expressed gene. The EST can be mapped, by a combination of genetic mapping procedures, to a unique locus in the genome and serves to identify that gene locus.

6 An overview of the process of protein synthesis Image adopted by

7 An overview of how ESTs are generated. Image adopted from ncbi.nlm.nih.gov/About/primer/est.html

8 Current Problems in dbEST Imposing size of the EST database; low sequence quality; highly similar (but distinct) gene family members; chimeric cDNA clones; retained introns and alternatively spliced transcripts; incomplete gene coverage; other limitations.

9 Types of alternative splicing Skipped exons; retained introns; alternative donor or acceptor site. Image adopted from Trends in Genetics, 2002, 18(1), 53-57

10 How to solve the problems Remove the redundancy by clustering ESTs representing the same native transcripts. Current software for clustering ESTs: UniGene; STACK (Sequence Tag Alignment and Consensus Knowledgebase); HGI (Human Gene Index); TIGR Assembler; CAP3; Phrap.

11 Goals of clustering ESTs Each cluster represents a distinct gene, including all alternative transcript isoforms derived from the same gene (e.g. UniGene). Each cluster is deemed to represent a distinct mRNA transcript (e.g. CAP3, TIGR Assembler, Phrap). ESTs are first categorized by their RNA source and are subsequently clustered separately for each source sample (e.g. STACK).

12 Ideas to get evidential gene or transcript 1. Pairwise sequence alignment with dynamic programming algorithm 2. Fast identification of promising pairs with good quality overlap 3. Report pairs based on maximal common substrings

13

14 PaCE (Parallel Clustering of ESTs) A software program for EST clustering on parallel computers. Two reasons this combination enables clustering and assembly of large-scale EST data sets: (1) the memory requirement grows linearly in the size of the input; (2) the input size is reduced from the complete set of ESTs to the size of the biggest cluster.

15 EST Clustering Given: ESTs drawn from multiple mRNAs. Partition: the ESTs into clusters such that ESTs from the same gene are put together in a distinct cluster.

16 EST Clustering (Cont’d) Image adopted from Dr. Aluru’s summer lecture, Iowa State Univ.

17 EST Clustering Algorithm Initially, treat each EST as a cluster by itself. If two ESTs from two different clusters show significant overlap, merge the clusters. Output the clusters once finished.
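The three steps on this slide can be sketched in a few lines of Python. This is only an illustration: the overlap test used here (longest exact suffix-prefix match, with a hypothetical `min_overlap` threshold) stands in for the alignment-based overlap test that a real EST clustering tool would use, and the function names are assumptions, not part of PaCE.

```python
def suffix_prefix_overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def cluster_ests(ests, min_overlap=4):
    """Greedy clustering: start with singletons, merge on significant overlap."""
    # cluster_of[i] is the label of the cluster that currently holds EST i.
    cluster_of = list(range(len(ests)))
    for i in range(len(ests)):
        for j in range(i + 1, len(ests)):
            if cluster_of[i] == cluster_of[j]:
                continue  # already in the same cluster, nothing to test
            overlap = max(suffix_prefix_overlap(ests[i], ests[j]),
                          suffix_prefix_overlap(ests[j], ests[i]))
            if overlap >= min_overlap:
                # Merge: relabel every member of j's cluster with i's label.
                old, new = cluster_of[j], cluster_of[i]
                cluster_of = [new if c == old else c for c in cluster_of]
    clusters = {}
    for idx, label in enumerate(cluster_of):
        clusters.setdefault(label, []).append(idx)
    return sorted(clusters.values())


# Toy input (made up for illustration): ESTs 0-2 chain together, 3-4 chain together.
toy_ests = ["ACGTACGGA", "CGGATTTC", "TTTCAGG", "GGGGCCCC", "CCCCAAAT"]
toy_clusters = cluster_ests(toy_ests)
```

This version tests every pair and relabels clusters in linear time per merge, which is exactly the cost that the later slides (promising pairs, union-find) aim to avoid.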

18 EST Clustering (Cont’d) Merging Clusters Successful overlap results in: Image adopted from Dr. Aluru’s summer lecture, Iowa State Univ.

19 Determining Overlaps Compute only lower and upper rectangles. Do banded dynamic programming. Image adopted from Dr. Aluru’s summer lecture, Iowa State Univ.
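To make "banded dynamic programming" concrete, here is a minimal sketch of a global alignment score that fills only the cells within a fixed distance of the main diagonal. The scoring values and the `band` parameter are illustrative assumptions; the slide does not give the scheme PaCE uses, and this sketch does not reproduce the "lower and upper rectangles" step.

```python
def banded_alignment_score(a, b, band=3, match=2, mismatch=-1, gap=-2):
    """Global alignment score using only cells with |i - j| <= band.

    Returns None when the end cell (len(a), len(b)) lies outside the band.
    """
    NEG = float("-inf")
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    # Row 0: only gaps in a, and only inside the band.
    prev = [NEG] * (m + 1)
    for j in range(min(m, band) + 1):
        prev[j] = j * gap
    for i in range(1, n + 1):
        cur = [NEG] * (m + 1)
        lo, hi = max(0, i - band), min(m, i + band)
        for j in range(lo, hi + 1):
            if j == 0:
                cur[j] = i * gap  # column 0: only gaps in b
                continue
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = cur[j - 1] + gap   # gap in a
            cur[j] = max(diag, up, left)
        prev = cur
    return prev[m]
```

Cells outside the band stay at negative infinity, so they can never contribute to a path. The work drops from O(nm) to roughly O(n · band), at the price of missing alignments that need more than `band` net insertions or deletions.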

20 Maximal Common Substring Given: a set of strings. Find: pairs of strings that have a maximal common substring of length ≥ a threshold φ. Image adopted from Dr. Aluru’s summer lecture, Iowa State Univ.
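The problem statement can be illustrated with a brute-force sketch: compute the longest common substring of every pair by dynamic programming and keep the pairs that reach the threshold. A pair has a maximal common substring of length ≥ φ exactly when its longest common substring does. This quadratic approach is an assumption made for clarity; the deck's point is that a generalized suffix tree finds such pairs without comparing every pair. The names `promising_pairs` and `phi` are hypothetical.

```python
from itertools import combinations


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest substring shared by a and b (O(len(a) * len(b)))."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common substring ending at a[i-2], b[j-2].
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def promising_pairs(seqs, phi):
    """Pairs (length, i, j) whose longest common substring has length >= phi,
    listed longest first."""
    pairs = []
    for i, j in combinations(range(len(seqs)), 2):
        length = longest_common_substring(seqs[i], seqs[j])
        if length >= phi:
            pairs.append((length, i, j))
    pairs.sort(key=lambda t: -t[0])
    return pairs


# Toy sequences, made up for illustration.
toy_seqs = ["GATTACAGG", "TTACAGC", "CCCGGG", "ACAGGTT"]
toy_pairs = promising_pairs(toy_seqs, phi=4)
```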

21 Organization of PaCE 1. Build a distributed representation of the GST data structure in parallel. 2. Use a single processor to maintain and update the EST clusters.

22 Generalized Suffix Tree (GST) A GST for a set of n sequences is a suffix tree constructed using all suffixes of the n sequences.

23 Basic Concept of Suffix Tree A suffix tree is a data structure that exposes the internal structure of a string in a deeper way than does the fundamental preprocessing.

24 Definition of Suffix Tree 1. A suffix tree T for an m-character string S is a rooted directed tree with exactly m leaves numbered 1 to m. 2. Each internal node has at least 2 children and each edge is labeled with a nonempty substring of S. 3. No 2 edges out of a node can have edge-labels beginning with the same character.

25 Definition of Suffix Tree (Cont’d) 4. Key feature: for any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S that starts at position i.

26 Ukkonen’s Algorithm to Construct a Suffix Tree
Construct tree I_1. (It is just the single edge labeled by character S(1).)
for i = 1 to m-1 do
begin {phase i+1}
    for j = 1 to i+1 do
    begin {extension j}
        Find the end of the path from the root labeled S[j..i] in the current tree.
        If needed, extend that path by adding character S(i+1).
    end;
end;
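The phase and extension loops above can be run literally on an uncompressed trie of nested dictionaries. Note the assumptions: this is the naive reading of the high-level description (cubic time in the worst case), not the linear-time algorithm, which additionally relies on suffix links and edge-label compression; and the structure built here is an implicit suffix trie, with one character per edge, rather than a compact suffix tree. The function names are hypothetical.

```python
def implicit_suffix_trie(s: str) -> dict:
    """Build an implicit suffix trie of s by phases and extensions.

    Indices are 0-based here: the loop variable i is the position of the
    character added in the phase, which the slide calls S(i+1).
    """
    if not s:
        return {}
    root = {s[0]: {}}  # tree I_1: a single edge labeled by the first character
    for i in range(1, len(s)):       # phase that adds character s[i]
        for j in range(0, i + 1):    # extension for the suffix s[j:i] of s[:i]
            node = root
            for ch in s[j:i]:        # find the end of the path labeled s[j:i]
                node = node[ch]
            node.setdefault(s[i], {})  # extend by s[i] only if needed
    return root


def contains(trie: dict, pattern: str) -> bool:
    """True if pattern labels a path from the root, i.e. is a substring."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


gaac_trie = implicit_suffix_trie("gaac")
```

After the last phase every suffix of the string labels a path from the root, so substring queries reduce to walking the trie.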

27 Suffix Tree Construct a suffix tree of sequence gaac
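As a worked version of this exercise, the sketch below builds a compact suffix tree for gaac by inserting one suffix at a time and splitting edges where paths diverge. It is a simple quadratic-time construction chosen for readability, not Ukkonen's algorithm. It also appends a terminator character `$` (assumed not to occur in the sequence), a common convention that guarantees every suffix ends at its own leaf; with the terminator the tree has m + 1 leaves instead of m. The class and function names are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}        # first character -> (edge label, child Node)
        self.suffix_start = None  # 1-based start position, set on leaves only


def build_suffix_tree(s: str, terminator: str = "$") -> Node:
    """Naive compact suffix tree of s + terminator (terminator must not occur in s)."""
    text = s + terminator
    root = Node()
    for start in range(len(text)):
        node, rest = root, text[start:]
        while True:
            first = rest[0]
            if first not in node.children:
                leaf = Node()
                leaf.suffix_start = start + 1
                node.children[first] = (rest, leaf)
                break
            label, child = node.children[first]
            k = 0  # length of the common prefix of the edge label and rest
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                node, rest = child, rest[k:]  # edge fully matched, descend
                continue
            # Mismatch inside the edge: split it at offset k.
            mid = Node()
            mid.children[label[k]] = (label[k:], child)
            node.children[first] = (label[:k], mid)
            node, rest = mid, rest[k:]
    return root


def leaves(node: Node, path: str = "") -> dict:
    """Map each leaf's suffix number to the string spelled from the root."""
    if not node.children:
        return {node.suffix_start: path}
    found = {}
    for label, child in node.children.values():
        found.update(leaves(child, path + label))
    return found


gaac_tree = build_suffix_tree("gaac")
```

For gaac the root gets four outgoing edges (`g...`, `a`, `c$`, `$`), and the edge labeled `a` leads to the only internal node, whose two children spell the suffixes starting at positions 2 and 3.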

28 Suffix Tree (Cont’d) Image adopted from article (1999) Nucleic Acids Research, 27,

29 Main idea to use GST data structure: Maximal Common Substring. Image adopted from Dr. Aluru’s summer lecture, Iowa State Univ.

30 Parallel Clustering A master-slave paradigm is used. Master processor: maintains and updates the clusters. Slave processors: 1. Generate pairs as demanded by the master processor. 2. Perform pairwise alignments of the pairs dispatched by the master processor. Data structure for maintaining the clusters: union-find.
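A minimal sketch of the master's bookkeeping is shown below: a union-find structure with path compression and union by rank, plus a loop that skips any pair whose members already share a cluster. Everything runs serially here. In the scheme described on the slide the pair generation and alignments happen on the slave processors, so the `aligns_well` callback is a hypothetical stand-in for dispatching a pair and receiving the alignment result.

```python
class UnionFind:
    def __init__(self, n: int):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x: int) -> int:
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a: int, b: int) -> bool:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.rank[ra] < self.rank[rb]:  # union by rank
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True


def master_loop(n_ests, pairs, aligns_well):
    """Process promising pairs in order; return (clusters, alignments performed)."""
    uf = UnionFind(n_ests)
    alignments = 0
    for i, j in pairs:
        if uf.find(i) == uf.find(j):
            continue  # already clustered together: no alignment needed
        alignments += 1
        if aligns_well(i, j):
            uf.union(i, j)
    clusters = {}
    for est in range(n_ests):
        clusters.setdefault(uf.find(est), []).append(est)
    return sorted(clusters.values()), alignments


# Toy run: pair (2, 3) is the only one whose (pretend) alignment fails.
toy_pairs = [(0, 1), (1, 2), (0, 2), (3, 4), (2, 3)]
toy_result = master_loop(5, toy_pairs, lambda i, j: (i, j) != (2, 3))
```

In the toy run the pair (0, 2) is never aligned because ESTs 0 and 2 were already merged through EST 1, which is the kind of saving that processing the most promising pairs first is meant to produce.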

31 Software availability PaCE is freely available for non-profit, academic use. To request source code and executables Contact information :

32 Quality Assessment Benchmark data set: Arabidopsis thaliana, 168,200 ESTs. Small genome (114.5 Mb / 125 Mb total), sequenced in 2000. Reference information: utarabidopsis.html

33 Achievements of PaCE Reduce the worst-case memory requirement from quadratic to linear. Generate promising pairs in decreasing order of maximal common substring length and cluster the ESTs such that the number of pairwise alignments is reduced by an order of magnitude without affecting the quality of clustering. Reduce the number of duplicates generated for each promising pair.

34 Future Research Extend PaCE to do assembly and build consensus sequences in parallel. Incorporate quality values available for ESTs as part of the input. Ensure quality clustering and assembly.

35 System used to implement IBM xSeries cluster; 30 dual-processor nodes; 1.26 GHz Intel Pentium III processors connected by Myrinet; 2.25 GB memory at each node; 512 MB of RAM.

36 Quality Assessment of PaCE and CAP3