Gena Tang Pushkar Pande Tianjun Ye Xing Liu Racchit Thapliyal Robert Arthur Kevin Lee.

Slides:



Advertisements
Similar presentations
CS 336 March 19, 2012 Tandy Warnow.
Advertisements

Lecture 24 Coping with NPC and Unsolvable problems. When a problem is unsolvable, that's generally very bad news: it means there is no general algorithm.
Proprietary Signal Generation and Imaging Photons Generated Reagent Flow PicoTiterPlate Wells Sequencing By Synthesis 1600K field of addressable wells.
Kelley Bullard, Henry Dewhurst, Kizee Etienne, Esha Jain, VivekSagar KR, Benjamin Metcalf, Raghav Sharma, Charles Wigington, Juliette Zerick Genome Assembly.
MCB Lecture #9 Sept 23/14 Illumina library preparation, de novo genome assembly.
Next Generation Sequencing, Assembly, and Alignment Methods
SplitMEM: graphical pan-genome analysis with suffix skips Shoshana Marcus May 29, 2014.
Lecture 14 Genome sequencing projects
Alignment Problem (Optimal) pairwise alignment consists of considering all possible alignments of two sequences and choosing the optimal one. Sub-optimal.
CS273a Lecture 4, Autumn 08, Batzoglou Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector.
DNA Sequencing Lecture 9, Tuesday April 29, 2003.
Genome Sequence Assembly: Algorithms and Issues Fiona Wong Jan. 22, 2003 ECS 289A.
Class 02: Whole genome sequencing. The seminal papers ``Is Whole Genome Sequencing Feasible?'' ``Whole-Genome DNA.
CS262 Lecture 11, Win07, Batzoglou Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector the.
DNA Sequencing. The Walking Method 1.Build a very redundant library of BACs with sequenced clone- ends (cheap to build) 2.Sequence some “seed” clones.
DNA Sequencing Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector the circular genome (host)
Assembly.
Sequencing and Assembly Cont’d. CS273a Lecture 5, Win07, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
Novel multi-platform next generation assembly methods for mammalian genomes The Baylor College of Medicine, Australian Government and University of Connecticut.
Sequencing and Assembly Cont’d. CS273a Lecture 5, Aut08, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
CS273a Lecture 4, Autumn 08, Batzoglou Hierarchical Sequencing.
CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics.
Genome sequencing and assembling
Utilizing Fuzzy Logic for Gene Sequence Construction from Sub Sequences and Characteristic Genome Derivation and Assembly.
Genome sequencing. Vocabulary Bac: Bacterial Artificial Chromosome: cloning vector for yeast Pac, cosmid, fosmid, plasmid: cloning vectors for E. coli.
Assembling Genomes BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.
De-novo Assembly Day 4.
Mouse Genome Sequencing
Mon C222 lecture by Veli Mäkinen Thu C222 study group by VM  Mon C222 exercises by Anna Kuosmanen Algorithms in Molecular Biology, 5.
CS 394C March 19, 2012 Tandy Warnow.
Bacterial Genome Assembly C. Victor Jongeneel Bacterial Genome Assembly | C. Victor Jongeneel | PowerPoint by Casey Hanson.
Todd J. Treangen, Steven L. Salzberg
CUGI Pilot Sequencing/Assembly Projects Christopher Saski.
A hierarchical approach to building contig scaffolds Mihai Pop Dan Kosack Steven L. Salzberg Genome Research 14(1), pp , 2004.
What is comparative genomics? Analyzing & comparing genetic material from different species to study evolution, gene function, and inherited disease Understand.
PE-Assembler: De novo assembler using short paired-end reads Pramila Nuwantha Ariyaratne.
Graphs and DNA sequencing CS 466 Saurabh Sinha. Three problems in graph theory.
1 Velvet: Algorithms for De Novo Short Assembly Using De Bruijn Graphs March 12, 2008 Daniel R. Zerbino and Ewan Birney Presenter: Seunghak Lee.
Next generation sequence data and de novo assembly For human genetics By Jaap van der Heijden.
Sequence assembly using paired- end short tags Pramila Ariyaratne Genome Institute of Singapore SOC-FOS-SICS Joint Workshop on Computational Analysis of.
Metagenomics Assembly Hubert DENISE
Robert Arthur Kevin Lee Xing Liu Pushkar Pande Gena Tang Racchit Thapliyal Tianjun Ye.
Problems of Genome Assembly James Yorke and Aleksey Zimin University of Maryland, College Park 1.
RNA Sequence Assembly WEI Xueliang. Overview Sequence Assembly Current Method My Method RNA Assembly To Do.
Human Genome.
Robert Arthur Kevin Lee Xing Liu Pushkar Pande Gena Tang Racchit Thapliyal Tianjun Ye.
Whole Genome Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 13, 2005 ChengXiang Zhai Department of Computer Science University of.
CS 173, Lecture B Introduction to Genome Assembly (using Eulerian Graphs) Tandy Warnow.
Sequencing technologies and Velvet assembly Lecturer : Du Shengyang September 29 , 2012.
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
OPERA highthroughput paired-end sequences Reconstructing optimal genomic scaffolds with.
Meet the ants Camponotus floridanus Carpenter ant Harpegnathos saltator Jumping ant Solenopsis invicta Red imported fire ant Pogonomyrmex barbatus Harvester.
ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads
Assembly S.O.P. Overlap Layout Consensus. Reference Assembly 1.Align reads to a reference sequence 2.??? 3.PROFIT!!!!!
MERmaid: Distributed de novo Assembler Richard Xia, Albert Kim, Jarrod Chapman, Dan Rokhsar.
JERI DILTS SUZANNA KIM HEMA NAGRAJAN DEEPAK PURUSHOTHAM AMBILY SIVADAS AMIT RUPANI LEO WU Genome Assembly Final Results
Bacterial Genome Assembly Tutorial: C. Victor Jongeneel Bacterial Genome Assembly v9 | C. Victor Jongeneel1 Powerpoint: Casey Hanson.
Lesson: Sequence processing
Assembly algorithms for next-generation sequencing data
Sequence assembly Jose Blanca COMAV institute bioinf.comav.upv.es.
CAP5510 – Bioinformatics Sequence Assembly
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
Fragment Assembly (in whole-genome shotgun sequencing)
Genome sequence assembly
Introduction to Genome Assembly
Removing Erroneous Connections
CS 598AGB Genome Assembly Tandy Warnow.
Genome Sequencing and Assembly
Fragment Assembly 7/30/2019.
Presentation transcript:

Gena Tang Pushkar Pande Tianjun Ye Xing Liu Racchit Thapliyal Robert Arthur Kevin Lee

 Biology background  Algorithms De novo - Overlap-Layout-Consensus - De Bruijn graphs Reference  Tools and Techniques  Work flow & strategy  Group management

 Biology background  Algorithms De novo - Overlap-Layout-Consensus - De Bruijn graphs Reference  Tools and Techniques  Work flow & strategy  Group management

Flow Order TACGTACG 1-mer 2-mer 3-mer 4-mer KEY (TCAG) Measures the presence or absence of each nucleotide at any given position

Margulies et al., /watch?v=kYAGFrbGl6E

 ~300nt per read  40X coverage (but widely varying, 12-80)

 ~ 2 Mb genome ( Mb)  Mostly coding sequences ◦ Good for assembly  Reference genomes of 4 closely related species ◦ H. influenzae ◦ H. parasuis ◦ H. ducreyi ◦ H. somnus

 High degree of genomic plasticity ◦ 10% of genes in clinically isolated strains are novel  In situ horizontal gene transfer  Supragenome – distributed genome hypothesis  Reference mapping relatively ineffective  Making life difficult!!!

De novoReference mapping Number of large contigs Unmapped reads N50 contig size Bases in large contigs (Mb) 21.3

 Assemble all of the H. hemolyticus genomes together  Should give a more complete mapping because of higher coverage ◦ 40X * 5 genomes  200X coverage  But … we get 3700 contigs ◦ (average of 50 for single strain assembly)

 These data hint at rampant recombination  Reference mapping relatively worthless

 Whole-genome alignment (intra-species) ◦ On average, 27 insertions, 147 deletions (>90bp) ◦ Average length of non-matching seq = 321kb (18%)

 Biology background  Algorithms De novo - Overlap-Layout-Consensus - De Bruijn graphs Reference  Tools and Techniques  Work flow & strategy  Group management

 Read: a long word that comes out of sequencer  Mate pair: a pair of reads from two ends of the same insert fragment  Contig: a contiguous sequence formed by several overlapping reads with no gaps.  Supercontig (scaffold) an ordered and oriented set of contigs, usually by mate pairs.  Consensus sequence: sequence derived from the multiple alignment of reads in a contig

 Goal: Find the shortest common sequence of a set of reads.  Input: reads {s1, s2, s3, …}  Output: find the shortest string T such that every s_i is a substring of T.  Comment: This is NP-hard problem, we need to use some approximation algorithm.

 Process: (1) Calculate pairwise alignments of all fragments. (2) Choose two fragments with the largest overlap. (3) Merge chosen fragments. (4) Repeat step 2 and 3 until only one fragment is left.

Best one Take pairwise alignment Merge the best one Input reads

 Comment:  Greedy algorithm was the first successful assembly algorithm implemented.  Used for organisms such as bacteria, single- celled eukaryotes.  It has some efficiency limitation

 This approach is based on graph theory.  Assemblers based on this approach: Arachne, Celera, Newbler etc.

 Sort all k-mers in reads (k~24) (1) Find pairs of reads sharing a k-mer (2) Extend to full alignment-throw away if not > 95% similar TACATAGATTACACAGATTACT GA | | | | | | | | | | | TAGTTAGATTACACAGATTACTAG A

 One caveat: repeats  A k-mer that appears N times, initiates N^2 comparisons.  Solution:  Discard all k-mers that appear more than c*Coverage, (c~10)

 A graph is constructed: (1) Nodes are reads (2) Edges represent overlapping reads CGTAGTGGCAT ATTCACGTAG Overlap graph

 A graph is constructed: (1) Nodes are reads (2) Edges represent overlapping reads CGTAGTGGCAT ATTCACGTAG Overlap graph

 Terminology in graph theory: (1) Simple path--- a path in the graph contains each node at most once. (2) Longest simple path---a simple path that cannot be extended. (3) Hamiltonian path– a path in the graph contains each node exactly once. CGTAGTGGCAT ATTCACGTAG

 Recall: Now we got several contigs(i.e. several longest simple paths)  Find the multiple alignments of these contigs, and get one consensus sequence as our final contig.

 Summary: (1) Based on graph theory (2) Eularian path: a path in a graph which visits every edge exactly once. (3) Example: Euler, Velvet, Allpath, Abyss, SOAPdenovo… (4) Eularian path is more efficient, however, in partice both are equally fast.

 Break reads into overlapping k-mers. Example: 10bp read: ATTCGACTCC for k=5-mers: ATTCG TTCGA TCGAC CGACT GACTC ACTCC

 Nodes: k-mers  Edges: if (k-1) suffix of a node equals (k-1) prefix of a node, add a directional edge between them. ATTCG TTCGATCGAC

 Whenever a node A has only one outgoing arc that points to another node B that has only one ingoing arc, the two nodes are merged. ATTGC TGCAT TGCAG TTGCA ATTGCA

 In Velvet: (1) Error removal (2) Removing tips Tip: a chain of nodes that is disconnected on one end.

 Consider two paths redundant if they start and end at the same nodes (forming a “bubble”) and contain similar sequences.  Such bubbles can be created by errors or biological variants, such as SNPs or cloning artifacts prior to sequencing. Erroneous bubbles are removed by an algorithm called “Tour Bus”.

 Algorithm for directed graphs: (1) Start with an empty stack and an empty circuit (Eulerian path). - If all vertices have same out-degrees as in-degrees - choose any of them. - If all but 2 vertices have same out-degree as in-degree, and one of those 2 vertices has out-degree with one greater than its in-degree, and the other has in-degree with one greater than its out-degree - then choose the vertex that has its out-degree with one greater than its in-degree. - Otherwise no Euler circuit or path exists. (2) If current vertex has no out-going edges (i.e. neighbors) - add it to circuit, remove the last vertex from the stack and set it as the current one. Otherwise (in case it has out-going edges, i.e. neighbors) - add the vertex to the stack, take any of its neighbors, remove the edge between that vertex and selected neighbor, and set that neighbor as the current vertex. (3) Repeat step 2 until the current vertex has no more out-going edges (neighbors) and the stack is empty.

 C_k=C*(L-k+1)/L  N50 size: 50% of genome is in contigs larger than N50 Example: 1Mbp genome Contigs: 300, 100, 50, 45, 30, 20, 15, 15, 10,… N50=30kbp ( =525>=500kbp) Note: N50 is meaningful for comparison only when genome size is the same

 Map k-mer on the reference sequence, get a “location map”.  Map each read onto the “location map” according to the k-mer. AATTGGGTTA location map of 5-ker CCCAATTGAAA AATGGTTACCA

 Biology background  Algorithms De novo - Overlap-Layout-Consensus - De Bruijn graphs Reference  Tools and Techniques  Work flow & strategy  Group management

 Standard flowgram format (SFF) A binary file format used to encode results of pyrosequencing from the 454 Life Sciences platform for high-throughput sequencing. a header section + read data sections

A summary of general information regarding the file content

Reads' universal accession numbers (h), sequence information (s), quality scores of basecalls (q), clipping positions (c), flowgram values (f) flowgram indices (i) the nucleotide bases + the quality scores

 6 genomes, 6.sff files  Number of reads ranges from to Reads Contigs/ Scaffold Assembler

 GS De Novo Assembler: a software package designed specifically for assembling sequence data generated by pyrosequencing platforms  De novo assembly  Overlap-Layout-Consensus methodology  Better deal with reads greater than 250bp in length  GS Reference Mapper

 Algorithms for de novo assembly  Short read assembly (25~50bp)  Using de Bruijn graphs.  Applying Velvet to very short reads and paired-ends information only can produce contigs of significant length, up to 50-kb N50 length in simulations of prokaryotic data and 3-kb N50 on simulated mammalian BACs.

 Open-source whole genome assembly software - Assemblers: Minimus2 - Validation and Visualization: Hawkeye - Scaffolding: Bambus - Trimming, Overlapping, & Error Correction

 Celera  MIRA  Edena Finishing is a big challenge !

 Sequencing errors: base pair misread, poly A…  It is possible that some portions of genomes are unsequenced  Identical and nearly identical sequences (repeats) can increase the time and space complexity of algorithms exponentially  Gaps & errors

Finishing is taking contigs and yielding a complete sequence. Scaffolder orders contigs into scaffolds based on clone-mate pair information. Some assemblers have a simple quality-control method Check and manually assemble unresolved repeat regions Resequencing

 Biology background  Algorithms De novo - Overlap-Layout-Consensus - De Bruijn graphs Reference  Tools and Techniques  Work flow & strategy  Group management

 Plan will evolve ◦ Different beast than expected  Write scripts to automate the pipeline  Velvet did not work  MIRA3 requires much time  Edena is not optimized for 454 data

454 sequencing reads.sff files Newbler Newbler? Reference genome What can we use to reconcile assemblies? Merged assembly contigs? Scaffolding? How do we visualize and evaluate assemblies? Convert to.fasta Make all reads equal sized Velvet

454 sequencing reads.sff files Newbler Newbler? Reference genome What can we use to reconcile assemblies? Merged assembly contigs? Scaffolding? How do we visualize and evaluate assemblies? Convert to.fasta Make all reads equal sized Velvet

 Human and chimp = 99% sequence similarity  H. influenzae and H. influenzae = 80% s.s.  H. influenzae and H. hemolyticus = ??? (<80%)

454 sequencing reads.sff files Newbler MIRA3 Add a third? AMOScmp Reference genome Minimus2 Merged assembly contigs Scaffolding: Bambus Visualize and evaluate: Mauve, Hawkeye, amosvalidate Specialized algorithm

MIRA3 Hash tags Contigs for each genome Hash tag indicates repeat? BLAST ends against contigs of conspecifics BLAST ends against completed genomes Ends from different contigs match same contig in 1 or more genomes Link together No mapping or multimapping Further processing? Ends from different contigs map to similar RefSeq regions Link together No mapping or multimapping Further processing? Yes No

 Plans change, and knowledge changes  An automated pipeline is invaluable ◦ What if 30  15 contigs? Gene finding group ◦ Just re-run the scripts

 Core genes vs unique genes  Gene clustering  Codon usage

 Biology background  Algorithms De novo - Overlap-Layout-Consensus - De Bruijn graphs Reference  Tools and Techniques  Work flow & strategy  Group management

 Some task can be divided, some are not.  Complexity of communication  The optimal group size is 3-6 Mini groups

 Group meetings  Wiki main page discussion page  Everyone is special, thus valuable  Start earlier, get closer to perfection