Compartmentalized Shotgun Assembly ? ? ? CSA Two stated motivations? ?

Presentation transcript:


Matcher matched…
…matched Celera reads with PFP BACTIGS,
– 20.76 million Celera reads matched (76%),
– 0.62 million had a mate pair that matched,
– 2.97 million Celera reads were unique and unscreened,
– 1.189 Gbp of unique DNA sequence, at 5.11X, yields a predicted 240 Mbp of unique Celera sequence.

Combining Assembler assembles…
“…Celera and PFP sequence for a transient assembly”
…first, Celera reads,
– are checked for over-collapsed regions,
– sequences with mate pairs that match the region are kept,
– more mate pair matches = higher value assembly,
…then Celera reads are combined with PFP reads,
– “Greedy” program recognizes highest value assemblies first in order to build contigged sequence,
…then “Stones” to fill the gaps.
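The "highest value first" idea on this slide can be illustrated with a small sketch. This is not Celera's Greedy program: the tuple layout, the use of a plain mate-pair count as the "value", and the rule that a candidate is skipped when it overlaps an already accepted one are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical candidate record: (name, start, end, mate_pair_support)
Candidate = Tuple[str, int, int, int]

def greedy_select(candidates: List[Candidate]) -> List[str]:
    """Accept candidate assemblies in order of decreasing mate-pair support,
    skipping any candidate that overlaps one already accepted."""
    accepted = []  # list of (name, start, end)
    for name, start, end, _support in sorted(candidates, key=lambda c: -c[3]):
        # Half-open intervals: two pieces are compatible if one ends
        # before (or exactly where) the other starts.
        if all(end <= s or start >= e for _, s, e in accepted):
            accepted.append((name, start, end))
    accepted.sort(key=lambda c: c[1])  # report in left-to-right order
    return [name for name, _, _ in accepted]

layout = greedy_select([
    ("a", 0, 100, 5),
    ("b", 50, 150, 9),
    ("c", 150, 300, 2),
    ("d", 120, 200, 1),
])
print(layout)
```

Here "b" is taken first because it has the most mate-pair support, which excludes the overlapping "a" and "d"; "c" still fits to the right of "b".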

Results… PFP vs. CSA
– The GenBank (PFP) data for the Phase 1 and 2 BACs yielded an average of 19.8 bactigs per BAC, of average size 8099 bp.
– Application of the Combining Assembler resulted in individual Celera/BAC assemblies being put together into an average of 1.83 scaffolds (median of 1 scaffold) per BAC region, consisting of an average of 8.57 contigs of average size 18,973 bp.
(p. 1313, 1st column, last paragraph)

Next Paper? Sorcerer II News? mycoplasma CR

Monday?

Compartmentalized Shotgun Assembly ?

Celera Unique Scaffolds WGA
The 5.89 million Celera fragments not matching the GenBank data were assembled with the whole-genome assembler. The Celera assembly resulted in a set of scaffolds totaling 442 Mbp in span and consisting of 326 Mbp of sequence. More than 20% of the scaffolds were >5 kbp long, and these averaged 63% sequence and 27% gaps with a total of 302 Mbp of sequence.

Compartmentalized Shotgun Assembly ? ?

Tiler tiles…
Scaffolds into larger components using
– Mate End Pairs,
– BAC-end pairs,
– STS.
Heuristic: a rule of thumb, simplification, or educated guess that reduces or limits the search for solutions in domains that are difficult and poorly understood. Unlike algorithms, heuristics do not guarantee optimal (or even feasible) solutions and are often used with no theoretical guarantee.

Compartmentalized Shotgun Assembly * 3,845 Components shredded, WGA

> 100 kbp Scaffolds;
– 92% sequence, 8% gaps,
– 105,264 gaps, 1,935 scaffolds,
– 1.3 Mbp scaffold size, 23,242 bp contig size,
– > 49% gaps < 500 bp,
– > 62% gaps < 1 kb,
– No gap larger than 100 kbp.
93%
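Summary figures of this kind (percent sequence, percent gaps, gap count, mean contig size, share of small gaps) can be derived from a scaffold's contig and gap lengths. The sketch below is only an illustration: the function name, the dictionary keys, and the toy lengths are assumptions, not values or code from the paper.

```python
def scaffold_summary(contig_lengths, gap_lengths):
    """Summarize one scaffold from its contig lengths and gap lengths (in bp)."""
    sequence_bp = sum(contig_lengths)
    gap_bp = sum(gap_lengths)
    span_bp = sequence_bp + gap_bp  # span = sequence plus estimated gaps
    n_gaps = len(gap_lengths)
    return {
        "pct_sequence": 100.0 * sequence_bp / span_bp,
        "pct_gaps": 100.0 * gap_bp / span_bp,
        "n_gaps": n_gaps,
        "mean_contig_bp": sequence_bp / len(contig_lengths),
        # Fraction of gaps shorter than 500 bp (0.0 if there are no gaps).
        "frac_gaps_under_500bp": (
            sum(1 for g in gap_lengths if g < 500) / n_gaps if n_gaps else 0.0
        ),
    }

# Toy scaffold: two contigs separated by gaps of 400 bp and 1200 bp.
example = scaffold_summary([9200, 9200], [400, 1200])
print(example)
```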

How do you compare assemblies?

WGA vs. CSA
This gives some measure of consistent coverage:
– 1.982 Gbp (95.00%) of the WGA is covered by the CSA,
– Gbp (87.69%) of the CSA is covered by the WGA.
Only 31 scaffolds were ~unique to an assembly,
– 295 kb (0.012%) CSA inconsistent with WGA,
– Mb (0.11%) WGA inconsistent with CSA,
– small regions.
Overall, CSA slightly better than WGA… Why?
How does the CSA compare with the Clone-by-Clone approach?
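One way to think about "X% of assembly A is covered by assembly B" is as the fraction of A's bases that fall inside some aligned interval. The sketch below assumes the alignments are already available as half-open (start, end) intervals on A; how the paper actually computed its figures is not shown on the slide, so treat this as an illustration only.

```python
def covered_fraction(total_len, aligned_intervals):
    """Fraction of a sequence of length total_len covered by the union
    of half-open (start, end) intervals, counting overlaps only once."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(aligned_intervals):
        if cur_end is None or start > cur_end:
            # Disjoint from the current run: close it and start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or touching: extend the current run.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / total_len

# Toy example: 1000 bp sequence, three alignments, two of which overlap.
frac = covered_fraction(1000, [(0, 300), (200, 500), (700, 800)])
print(frac)
```

Running the same calculation in both directions (A on B, then B on A) gives the two asymmetric percentages of the kind quoted on the slide.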

Hierarchical Clone-by-Clone: Map first, then sequence
Whole Genome Assembly: Sequence first, then map

Mapping Scaffolder GM99 and fingerprint maps

Tab. 4 ?

Assembly and Validation Analysis …did it really work?
Completeness: % of euchromatic sequence in the assembly,
– estimate the size and # of gaps (Table 3),
– CSA: 92.2% sequence, 7.8% gaps; 116,442 gaps,
– WGA: 91% sequence, 9% gaps; 102,068 gaps,
– PFP: 92.5% sequence, 12.9% gaps; small gaps (554 bp) = 145,514 gaps, large gaps (35 kb) = 4076 gaps.

Assembly and Validation Analysis …did it really work?
Completeness: % of euchromatic sequence in the assembly,
– estimate the size and # of gaps (Table 3),
– compare to “finished” sequences of 21, Mb gaps, 75% gaps are repeats,
– match with STS data (ePCR, BLAST), 93.4% tested found assembled, 5.5% in “chaff” = 98.9%,
Correctness:
– Mate-Pair analysis.

Mate Pair Analysis
Valid: correct orientation and correct distance (within ± 3 SD of the library mean).
2.7% were found to be invalid.
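The validity rule on this slide can be written as a small check. This is a minimal sketch under stated assumptions: the function and parameter names are hypothetical, "correct orientation" is taken to mean a forward read followed by a reverse read, and the distance is measured between the two start coordinates.

```python
def mate_pair_is_valid(pos_fwd, pos_rev, strand_fwd, strand_rev,
                       lib_mean, lib_sd, n_sd=3.0):
    """Return True if the two mates face each other and their separation
    lies within n_sd standard deviations of the library's mean insert size."""
    # Assumed orientation rule: forward read upstream, reverse read downstream.
    correct_orientation = (
        strand_fwd == "+" and strand_rev == "-" and pos_fwd < pos_rev
    )
    distance = pos_rev - pos_fwd
    correct_distance = abs(distance - lib_mean) <= n_sd * lib_sd
    return correct_orientation and correct_distance

# Toy 2 kb library with SD 200 bp: the accepted separation is 1400 to 2600 bp.
ok = mate_pair_is_valid(1000, 3000, "+", "-", lib_mean=2000, lib_sd=200)
too_far = mate_pair_is_valid(1000, 3700, "+", "-", lib_mean=2000, lib_sd=200)
misoriented = mate_pair_is_valid(1000, 3000, "+", "+", lib_mean=2000, lib_sd=200)
print(ok, too_far, misoriented)
```

The two failure modes correspond to the two violation classes in the chromosome comparison key on the slides: misoriented pairs and distance violations.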

CSA vs. PFP What does this show?

PFP Chromosome 21 CSA
Green: Same Order, Orientation
Yellow: Same Orientation
Red: Out of Order, Orientation
Blue: Gaps
Violations: Red: misoriented, Yellow: distance

Chromosome 8 PFP CSA

PFP CSA

What’s the take home message?

Fig. 7, key
Blue: breaks
Red: gaps > 10 kb
PFP CSA

Fig. 7

Gene Prediction and Annotation
Why’s it So Hard to Find Genes?
– Exons/Introns,
– Alternative Splicing/Termination,
– Alternate transcription start/stop sites,
– Tandem Repeats,
– Pseudogenes, etc.
We don’t really understand all there is to know about gene and genome structure, etc.

Gene Number Predictions? …before PFP, WGA or CSA
Textbooks: ~100,000
Upgraded to 142,634?
EST data “…counts [that] fall far short…”
EST Data --> 35,000
35,000 genes based on the density of Chromosome 22
28, ,000 Humans vs. pufferfish

Automated Gene Annotation OTTO
Tell me how it works. How was it validated, including Table 7?
…if necessary, use the Online Primer and other NCBI resources to broaden your understanding,
– cDNAs, ESTs, RefSeq, Protein Sequence Databases, BLAST, etc. are described in appropriate detail on the Web.

Questions?

Repeat Resolver
...most of the remaining gaps were due to repeats.
“Rocks”
– use “low Discriminator Value” contig sets to fill gaps,
– find two or more mate pairs with unambiguous matches in the scaffold near the gap (2 kb, 10 kb or 50 kb), (1 in 10^7),
“Stones”
– find mate pair matches 2 kb, 10 kb, and 50 kb from gap, place the mate in the gap, check to see if it’s consistent with other “placed” sequences.
