JAMES LINDSAY*, HAMED SALOOTI, ALEX ZELIKOVSKI, ION MANDOIU* ACM-BCB 2012 Scaffolding Large Genomes Using Integer Linear Programming University of Connecticut*Georgia.

Slides:



Advertisements
Similar presentations
JAMES LINDSAY*, HAMED SALOOTI, ALEX ZELIKOVSKI, ION MANDOIU* Scaffolding Large Genomes Using Integer Linear Programming University of Connecticut*Georgia.
Advertisements

ILP-BASED MAXIMUM LIKELIHOOD GENOME SCAFFOLDING James Lindsay Ion Mandoiu University of Connecticut Hamed Salooti Alex ZelikovskyGeorgia State University.
Pamela Ferretti Laboratory of Computational Metagenomics Centre for Integrative Biology University of Trento Italy Microbial Genome Assembly 1.
Lectures on Network Flows
SEQUENCING-related topics 1. chain-termination sequencing 2. the polymerase chain reaction (PCR) 3. cycle sequencing 4. large scale sequencing stefanie.hartmann.
Automated Layout and Phase Assignment for Dark Field PSM Andrew B. Kahng, Huijuan Wang, Alex Zelikovsky UCLA Computer Science Department
Background: Scan-Based Delay Fault Testing Sequentially apply initialization, launch test vector pairs that differ by 1-bit shift A vector pair induces.
CS273a Lecture 4, Autumn 08, Batzoglou Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector.
Genome Sequence Assembly: Algorithms and Issues Fiona Wong Jan. 22, 2003 ECS 289A.
Class 02: Whole genome sequencing. The seminal papers ``Is Whole Genome Sequencing Feasible?'' ``Whole-Genome DNA.
CS262 Lecture 11, Win07, Batzoglou Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector the.
Assembly.
Sequencing and Assembly Cont’d. CS273a Lecture 5, Win07, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
Sequencing and Assembly Cont’d. CS273a Lecture 5, Aut08, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
Projects. Dataflow analysis Dataflow analysis: what is it? A common framework for expressing algorithms that compute information about a program Why.
CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics.
Henrik Lantz - BILS/SciLife/Uppsala University
Compartmentalized Shotgun Assembly ? ? ? CSA Two stated motivations? ?
Genome Assembly Bonnie Hurwitz Graduate student TMPL.
Sequencing a genome and Basic Sequence Alignment
(Not in text).  An LP with additional constraints requiring that all the variables be integers is called an all-integer linear program (IP).  The LP.
Genome sequencing and assembly Mayo/UIUC Summer Course in Computational Biology Genome sequencing and assembly.
Reconstruction of Haplotype Spectra from NGS Data Ion Mandoiu UTC Associate Professor in Engineering Innovation Department of Computer Science & Engineering.
De-novo Assembly Day 4.
3.4 Solving Systems of Linear Inequalities Objectives: Write and graph a system of linear inequalities in two variables. Write a system of inequalities.
Lecture 12-2: Introduction to Computer Algorithms beyond Search & Sort.
CS 394C March 19, 2012 Tandy Warnow.
A hierarchical approach to building contig scaffolds Mihai Pop Dan Kosack Steven L. Salzberg Genome Research 14(1), pp , 2004.
GENOME SEQUENCING AND ASSEMBLY Mayo/UIUC Summer Course in Computational Biology.
O PTICAL M APPING AS A M ETHOD OF W HOLE G ENOME A NALYSIS M AY 4, 2009 C OURSE : 22M:151 P RESENTED BY : A USTIN J. R AMME.
Next generation sequence data and de novo assembly For human genetics By Jaap van der Heijden.
Meraculous: De Novo Genome Assembly with Short Paired-End Reads
Sequence assembly using paired- end short tags Pramila Ariyaratne Genome Institute of Singapore SOC-FOS-SICS Joint Workshop on Computational Analysis of.
Steps in a genome sequencing project Funding and sequencing strategy source of funding identified / community drive development of sequencing strategy.
Order independent structural alignment of circularly permutated proteins T. Andrew Binkowski Bhaskar DasGupta  Jie Liang ‡ Bioengineering Computer Science.
Sequencing a genome and Basic Sequence Alignment
The Changing Face of Sequencing
Tao Lin Chris Chu TPL-Aware Displacement- driven Detailed Placement Refinement with Coloring Constraints ISPD ‘15.
Problems of Genome Assembly James Yorke and Aleksey Zimin University of Maryland, College Park 1.
CSE 589 Part VI. Reading Skiena, Sections 5.5 and 6.8 CLR, chapter 37.
Jigsaw Puzzles with Pieces of Unknown Orientation Andrew C. Gallagher Eastman Kodak Research Laboratories Rochester, New York.
billion-piece genome puzzle
Class 01 – Fragment assembly. DNA sequence data DNA sequence data is the motherlode of molecular biology. 10^10 base pairs. One human genome/year. It.
Graphs A graphs is an abstract representation of a set of objects, called vertices or nodes, where some pairs of the objects are connected by links, called.
Accessing and visualizing genomics data
OPERA highthroughput paired-end sequences Reconstructing optimal genomic scaffolds with.
Meet the ants Camponotus floridanus Carpenter ant Harpegnathos saltator Jumping ant Solenopsis invicta Red imported fire ant Pogonomyrmex barbatus Harvester.
An Integer Programming Approach to Novel Transcript Reconstruction from Paired-End RNA-Seq Reads Serghei Mangul Department of Computer Science Georgia.
ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads
CSE280Stefano/Hossein Project: Primer design for cancer genomics.
Learning Hidden Graphs Hung-Lin Fu 傅 恆 霖 Department of Applied Mathematics Hsin-Chu Chiao Tung Univerity.
When the next-generation sequencing becomes the now- generation Lisa Zhang November 6th, 2012.
Linear Programming (LP) Vector Form Maximize: cx Subject to : Ax  b c = (c 1, c 2, …, c n ) x = b = A = Summation Form Maximize:  c i x i Subject to:
Cross_genome: Assembly Scaffolding using Cross-species Synteny
Genomics Sequencing genomes.
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
Metafast High-throughput tool for metagenome comparison
Denovo genome assembly of Moniliophthora roreri
M. roreri de novo genome assembly using abyss/1.9.0-maxk96
Fragment Assembly (in whole-genome shotgun sequencing)
Genome sequence assembly
Professors: Dr. Gribskov and Dr. Weil
Research in Computational Molecular Biology , Vol (2008)
Sequence comparison: Local alignment
Very important to know the difference between the trees!
Transcriptome Assembly
Introduction Basic formulations Applications
MTH-5101 Practice Test (1) x ≥ 40 (2) y ≥ 0 (3) y ≤ 140
Lecture 19 Linear Program
Fragment Assembly 7/30/2019.
Presentation transcript:

JAMES LINDSAY*, HAMED SALOOTI, ALEX ZELIKOVSKI, ION MANDOIU* ACM-BCB 2012 Scaffolding Large Genomes Using Integer Linear Programming University of Connecticut*Georgia State University

De-novo Assembly Paradigm shotgun sequencing short contigs the scaffolds short reads the genome denovo assembly scaffolding

Why Scaffolding? Annotation  Comparative biology Re-sequencing and gap filling Structural variation! gene XYZ3’ UTR5’ UTR Scaffold gene XYZ No scaffold

Why Scaffolding? Annotation  Comparative biology Re-sequencing and gap filling Structural variation! gene XYZ3’ UTR5’ UTR Sanger Sequencing gene XYZ3’ UTR5’ UTR Biologist: There are holes in my genes!

Why Scaffolding? Annotation  Comparative biology Re-sequencing and gap Filling Structural variation!

Massive Sequencing Projects Effects of Read Length I5k  5000 insect and arthropod species G10k  10,000 vertebrate species Dog Genome  7.5x Sanger  N50: 180Kb Chicken Genome  6x Illumina  N50: 12Kb Human Genome  100x Illumina  N50: 24Kb Fragmented Genomes

The Scaffolding Problem GIVEN CONTIGS, PAIRED READS FIND ORIENTATION, ORDERING, RELATIVE DISTANCE GOAL RECREATE TRUE SCAFFOLDS

Paired Read Construction Paired Read Styles Mate Pair Paired End Paired Reads 2kb same strand and orientation R1 R2 100b 10kb different strand and orientation R1 R2

Linkage Information Possible States (mate pair) Two contigs are adjacent if:  A read pair spans the contigs State (A, B, C, D)  Depends on orientation of the read  Order of contigs is arbitrary Each read pair can be “consistent” with one of the four states 5’ 3’ contig icontig j R1 R2 A B C D

Nodes Edges Nodes are contigs Adjacent contigs have 4 edges (one for each state) Weighted by overlap with repetitive region Scaffolding Graph contig i contig j State A

Integer Linear Program Formulation Variables Contig pair state: Contig orientation: Adjacent contig consistency: Objective Maximize weight of consistent pairs

Constraints Variables Contig pair state: Contig orientation: Adjacent contig consistency: Pairwise Orientation

Constraints Variables Contig pair state: Contig orientation: Adjacent contig consistency: State Variables

Constraints Variables Contig pair state: Contig orientation: Adjacent contig consistency: Mutual Exclusivity

Constraints Forbid 2 Cycles Forbid 3 Cycles *larger cycles are broken at the end

Largest Connected Component

Graph Decomposition: Articulation Points solvesolve solvesolve Articulation point MIP, Salmela 2011

Largest Biconnected Component

Non-Serial Dynamic Programming A technique which exploits the sparsity of the scaffolding graph by computing the solution in stages, incorporating the results from previous stages ~inspired by (Neumaier, 06)

Non-Serial Dynamic Programming 2-cut

Non-Serial Dynamic Programming Objective Modification:

SPQR-tree Based Implementation SPQR-tree efficiently finds 2 cuts (Tarjan, 73) DFS of SPQR-tree is used to schedule elimination order for NSDP

Post Processing ILP Solution May have cycles Not a total ordering for each connected components A B C D F E ILP Solution outgoing incoming A B C D E F A B C D E F Bipartite matching Objectives:  Max weight  Max cardinality  Max cardinality / Max weight

GAGE Framework GenomeSize (Mb)# reads Staphlococcus Aureus2.93,494,070 Rhodobacter sphaeorides4.62,050,868 Human Chr ,669,408 Assembled using:  ABySS, Allpaths-LG, Bambus2, CABOG, MSR-CA, SGA, SOAPdenovo, Velvet Scaffolded using:  SILP (our method), Opera, MIP, Bambus2

Testing Metrics

Results

Conclusions Success  ILP solves scaffolding problem!  NSDP works Improvements  Include SOAPdenovo, Allpaths-LG scaffolds in comparison  Look at parameter effects  Practical considerations (read style, multi-libraries, merge ctgs) Future Work  Where else can I apply NSDP?  Scaffold before assembly … promising  Structural Variation??

Questions?