Whole Genome Assembly. WGA 1. Screener 2. Overlapper 3. Unitigger, 4. Scaffolder, 5. Repeat Resolver.

Overlapper...looks for end-to-end overlaps of at least 40 bp with no more than 6% differences in the match. What’s the significance?...a one in ... event. Sequencing fidelity: 99.96%.
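The overlap criterion on this slide can be sketched in a few lines. This is only an illustration, not the assembler's actual algorithm: the function name `end_overlap` is hypothetical, and it counts substitutions only (no insertions or deletions), which a real overlapper would have to handle.

```python
def end_overlap(a, b, min_len=40, max_diff=0.06):
    """Length of the longest suffix-of-a / prefix-of-b overlap that passes, else 0.

    Assumption for illustration: differences are substitutions only,
    so the two ends are compared position by position.
    """
    # Try the longest possible overlap first, down to the 40 bp minimum.
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        # Accept if no more than 6% of the overlapping positions differ.
        if mismatches <= max_diff * length:
            return length
    return 0
```

With the default thresholds, a 40 bp overlap tolerates at most 2 differing positions (6% of 40 is 2.4).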

However...the Screener doesn’t include all of the “low frequency” repeats,...so a majority of the Overlapper's outputs are bogus.

Unitigger...differentiates between a true overlap and an overlap that includes more than one locus.

8X...over-collapsed...in a world where real data matches expected data, each locus would have 8X coverage,...if there were repeats, the collapsed contigs would be “over-represented”, carrying on average 8X more coverage per additional repeat copy.
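The coverage argument reduces to simple arithmetic, sketched below. The function name and parameters are hypothetical, and the plain ratio is an assumption made for illustration; the slides do not give the formula the assembler actually uses to flag over-collapsed contigs.

```python
def estimated_copies(n_reads, read_len, contig_len, expected_coverage=8.0):
    """Rough number of genomic copies collapsed into one contig.

    depth = total bases of reads placed in the contig / contig length.
    A unique locus should sit near the expected 8X; a contig built from
    k collapsed repeat copies should sit near k * 8X.
    """
    depth = n_reads * read_len / contig_len
    return depth / expected_coverage
```

For example, 160 reads of 500 bp in a 10,000 bp contig give 8X depth (one copy), while 480 such reads give 24X, which suggests three collapsed copies.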

What Now?...uniquely assembled contigs (unitigs) are readily identifiable: all of the assembled sequences match over all of the known sequence, and they are consistent with 8X coverage.

Unitigs...the contig cluster is consistent with the expected size,...there are no dissimilar sequences between any members,...all other contigs are sent to the Discriminator.

Discriminator...parses the “over-collapsed” contig by using sequence outside of the overlap region.

Discriminator...may yield unitigs.

Unitigger Output...correctly assembled contigs covering 73.6% of the genome.

Repeat Resolver...most of the remaining gaps were due to repeats. 1. Allow “low Discriminator Value” contigs to fill gaps, 2. Find BAC sequences that unambiguously match outside the nearest unitig (1 in 10^7 chance of being wrong), 3. Ensure that the mate end sequence of each candidate BAC matches.

If That Doesn’t Work...find a mate pair that spans the gap and sequence it. Chromosome Walking...make a sequencing primer from the BES (BAC end sequence)...

Scaffolder...“contigs the contigs”: orders and orients the contigs into scaffolds, using mate-pair information.
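A minimal sketch of how mate-pair information can link contigs is shown below. The names `link_contigs` and `read_to_contig`, and the requirement of at least two supporting mate pairs, are assumptions made for illustration; orientation and distance checks, which a real scaffolder needs, are left out.

```python
from collections import Counter

def link_contigs(read_to_contig, mate_pairs, min_links=2):
    """Count mate pairs whose two reads fall in different contigs.

    read_to_contig: dict mapping read id -> contig id (unplaced reads absent).
    mate_pairs: iterable of (read_id, read_id) tuples from the same clone.
    Returns {(contig_a, contig_b): link_count} for pairs with enough support.
    """
    links = Counter()
    for left, right in mate_pairs:
        c1, c2 = read_to_contig.get(left), read_to_contig.get(right)
        # Skip unplaced reads and pairs that land in the same contig.
        if c1 is None or c2 is None or c1 == c2:
            continue
        links[tuple(sorted((c1, c2)))] += 1
    # Keep only contig pairs supported by several independent mate pairs.
    return {pair: n for pair, n in links.items() if n >= min_links}
```

Requiring more than one link guards against a single chimeric clone joining two unrelated contigs.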

WGA Result...91% sequence, 9% gaps.

Compartmentalized Shotgun Assembly Mapping

Scaffolds

Sequence Tagged Sites (STS)...PCR primers are designed for unique regions of the genome or chromosome,...the chromosome is cut,...two PCR products are assayed; the frequency of co-amplification indicates...
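The co-amplification measurement can be sketched as a simple frequency over a panel of fragment pools. This is an assumption-laden illustration: the function name and the 0/1 scoring are hypothetical. In mapping panels of this kind, markers that frequently amplify together are generally taken to lie close to each other.

```python
def coamplification_frequency(marker_a, marker_b):
    """Fraction of pools positive for either marker in which both amplify.

    marker_a, marker_b: equal-length sequences of 0/1 PCR results,
    one entry per fragment pool in the panel.
    """
    results = list(zip(marker_a, marker_b))
    both = sum(1 for a, b in results if a and b)
    either = sum(1 for a, b in results if a or b)
    return both / either if either else 0.0
```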

Compartmentalized Shotgun Assembly...ideally 24 components (one per chromosome),...really 3845.

Assembly   Sequence   Gaps
CSA        92.2%      7.8%
WGA        91%        9%

Figure: Chromosome 21, PFP vs. CSA. Colors: green = same order and orientation; yellow = same orientation; red = out of order and orientation; blue = gaps. Violations: red = misoriented; yellow = distance.

Figure: Chromosome 8, PFP vs. CSA.

PFP CSA

Major Public Sequence Databases

281 Curated Databases, “...facilitating Biological Discovery”.

What Do We Know? (based on functional group analysis) Science 291 (5507),

Functional Groups. 1st: the GenBank NR protein database was partitioned into clusters using BLASTP.
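One common way to turn pairwise similarity hits into clusters is single-linkage grouping with a union-find structure, sketched below. The slide does not say which clustering rule was used, so single linkage is an assumption, and the hit pairs are taken as already filtered output from a BLASTP run rather than produced here.

```python
def cluster_proteins(ids, hits):
    """Group protein ids into clusters connected by significant hits.

    ids: iterable of protein identifiers.
    hits: iterable of (id_a, id_b) pairs judged significant beforehand.
    Returns a list of sets, one per cluster (single linkage).
    """
    parent = {p: p for p in ids}

    def find(p):
        # Follow parent links to the cluster root, compressing the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in hits:
        parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for p in parent:
        clusters.setdefault(find(p), set()).add(p)
    return list(clusters.values())
```

Single linkage is simple but can chain loosely related families together through multi-domain proteins, so the result depends heavily on how strictly the hits are filtered.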

Describing Aligned Sequences. 2nd: statistical descriptions of each cluster are developed and tested. Hidden Markov Models (HMMs): statistical descriptions of aligned sequences.

Functional Group Annotation. 3rd: categorization was done by manual review of the family and subfamily names,...by examining SwissProt and GenBank records,...and by review of the literature as well as resources on the World Wide Web.

Outcomes? A relatively small number of structural and functional domains are used in a large number of different proteins. Pfam: 527 families; average length is 275 residues; 456 had “annotated functions”. Nucleic Acids Research 26,

New Genes. 4th: newly sequenced genes are translated in silico, and the predicted proteins are searched against the raw sequence and HMM databases,...significance cut-off levels are determined for each functional group family.
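The per-family cut-off idea can be sketched as a lookup against family-specific thresholds. The family names, scores, and cut-off values below are hypothetical; in practice the scores would come from searching a predicted protein against each family's model.

```python
def assign_families(scores, cutoffs):
    """Return the families whose score meets that family's own cut-off.

    scores: dict family name -> match score for one predicted protein.
    cutoffs: dict family name -> minimum score considered significant.
    Families without a defined cut-off are ignored.
    """
    return sorted(
        family
        for family, score in scores.items()
        if family in cutoffs and score >= cutoffs[family]
    )


# Hypothetical example: one protein scored against three family models.
example_scores = {"kinase": 35.0, "zinc_finger": 12.0, "unlisted": 99.0}
example_cutoffs = {"kinase": 25.0, "zinc_finger": 20.0}
families = assign_families(example_scores, example_cutoffs)
```

Using a separate threshold per family reflects the slide's point that a single global cut-off would be too strict for some families and too loose for others.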