How to Build a Horse
Megan Smedinghoff

Background
In February 2007, the Broad Institute released a draft genome of the horse (Equus caballus).
The project cost $15 million and was funded by the National Human Genome Research Institute and the National Institutes of Health.
300,000 Bacterial Artificial Chromosomes were provided by the University of Veterinary Medicine in Hanover, Germany, and the Helmholtz Centre for Infection Research in Braunschweig, Germany.

Horse Genome Statistics
The horse genome contains approximately 2.7 billion base pairs.
The assembly was done using 6.8-fold coverage.
The sequenced horse was a thoroughbred mare named Twilight from Cornell University.
[Photo: Twilight posing for a picture at Cornell]
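As a rough illustration of what 6.8-fold coverage means, the sketch below converts the genome size and coverage from this slide into a total base count and an approximate read count. The 750 bp read length is an assumption taken as a typical Sanger read length, not a figure reported for the horse project, and the function names are hypothetical.

```python
GENOME_SIZE_BP = 2_700_000_000  # approximate horse genome size (from the slide)
COVERAGE = 6.8                  # fold coverage (from the slide)
READ_LENGTH_BP = 750            # assumption: typical Sanger read length


def total_sequenced_bases(genome_size_bp, coverage):
    """Coverage is total sequenced bases divided by genome size."""
    return genome_size_bp * coverage


def estimated_read_count(genome_size_bp, coverage, read_length_bp):
    """Number of reads needed to reach the coverage at a fixed read length."""
    return round(genome_size_bp * coverage / read_length_bp)


total_bases = total_sequenced_bases(GENOME_SIZE_BP, COVERAGE)
reads = estimated_read_count(GENOME_SIZE_BP, COVERAGE, READ_LENGTH_BP)
print(f"{total_bases / 1e9:.2f} Gbp sequenced, about {reads:,} reads")
```

Under these assumptions the assembler has to handle on the order of 24 million reads, which is why the overlap step is the expensive part of the pipeline.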

Why Sequence the Horse?
It allows scientists to study diseases that primarily affect horses, such as glanders.
SNP information can be used to connect DNA to physical characteristics and explain differences between breeds.
Much general information about mammals can be gained by looking at the horse, since very few large mammals have been sequenced.

How the Horse Genome Affects Us
There are over 80 known genetic conditions in the horse that are analogous to human disorders.
Horses have some conditions traditionally found in humans, such as allergies and arthritis.
Having the complete horse genome helps infer the order of evolution.
Horse racing?

Project Proposal
Reassemble the horse genome using the Celera Assembler.
Use existing UMD software to compare my assembly with the Broad assembly and produce a reconciled horse genome.
Deposit the improved assembly in GenBank.
Advisor: Jim Yorke

Introduction to Genome Sequencing
[Figure: shotgun sequencing workflow. DNA target sample; SHEAR; SIZE SELECT (e.g., 10 Kbp ± 8% std. dev.); LIGATE & CLONE into a vector; SEQUENCE from the primer to produce end reads (mates) of 750 bp]
Slide courtesy of Art Delcher

How Genomes are Assembled
Trim the reads
Calculate overlaps
Build unitigs
Build contigs
Build scaffolds
Closure

Assembly: Calculating Overlaps
[Figure: Read A and Read B shown in each combination of forward (5' to 3') and reverse (3' to 5') orientation]
Compare every possible combination of reads to find every overlap of a certain length (~40 bp).
Both the forward and the reverse orientation of each pair of reads must be compared.
Comparisons must take into account the possibility of sequencing errors and use alignment algorithms such as Smith-Waterman.
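The orientation check can be sketched as follows. This is a deliberately simplified illustration: it looks for exact suffix-to-prefix matches rather than running an error-tolerant alignment such as Smith-Waterman, and the function names and the short example reads are made up for the example.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of a that exactly equals a prefix of b.

    Returns 0 if no overlap of at least min_len exists.
    """
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def best_overlap(a, b, min_len):
    """Try read b in both orientations and report the longer overlap."""
    forward = suffix_prefix_overlap(a, b, min_len)
    reverse = suffix_prefix_overlap(a, reverse_complement(b), min_len)
    if forward == 0 and reverse == 0:
        return 0, None
    if forward >= reverse:
        return forward, "forward"
    return reverse, "reverse"


# Toy reads with a 3 bp minimum; a real run would use about 40 bp.
print(best_overlap("ACGTTGCA", "TGCAGGTC", 3))  # (4, 'forward')
print(best_overlap("ACGTTGCA", "GACCTGCA", 3))  # (4, 'reverse')
```

Running this over every pair of reads is quadratic in the number of reads, which is why real assemblers first narrow the candidates with a k-mer index.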

Assembly: Creating Unitigs
[Figure: reads linked into a unitig]
A unitig is a set of reads that have been linked together based on overlaps.
A unitig has no ambiguities.

Assembly: Creating Unitigs (cont.)
Best Buddy Algorithm for Unitig Assembly: if the longest overlap with read A is read B and the longest overlap with read B is read A, then reads A and B are best buddies.
[Figure: two example layouts of reads A to D. In the first, Read A and Read B are best buddies; in the second, Read A and Read B are NOT best buddies.]
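The best buddy rule can be written down in a few lines. The sketch below assumes overlaps have already been computed and are given as a hypothetical dictionary from read pairs to overlap lengths; it treats each read as a whole and ignores which end of the read the overlap is on, which a real assembler would track.

```python
def best_partner(overlaps):
    """Map each read to the read it has its longest overlap with.

    overlaps: dict mapping (read_a, read_b) -> overlap length in bp.
    Ties are broken in favour of the first pair encountered.
    """
    best = {}
    for (a, b), length in overlaps.items():
        for read, other in ((a, b), (b, a)):
            if read not in best or length > best[read][1]:
                best[read] = (other, length)
    return {read: other for read, (other, _) in best.items()}


def best_buddies(overlaps):
    """Return the pairs of reads that are each other's longest overlap."""
    partner = best_partner(overlaps)
    return {
        frozenset((read, other))
        for read, other in partner.items()
        if partner.get(other) == read
    }


# Hypothetical overlap lengths: B overlaps C more strongly than it overlaps A.
overlaps = {("A", "B"): 300, ("B", "C"): 450, ("C", "D"): 200}
print(best_buddies(overlaps))  # {frozenset({'B', 'C'})}
```

Here A's longest overlap is with B, but B's longest overlap is with C, so A and B are not best buddies, while B and C are.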

Assembly: Creating Contigs
[Figure: Unitig A and Unitig B, linked by Read 1 and Read 2, which are mates]
A contig is a set of overlapping unitigs.
Contigs are assembled by using mate pair information.
Since we know the distance between mates and the orientation of the mates, we can infer the placement of the unitigs.
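A minimal sketch of that inference, assuming the simplest case: both unitigs are in the same orientation, Read 1 points forward in Unitig A, and its mate Read 2 points backward in Unitig B. The function name, its parameters, and the example coordinates are hypothetical; the 10 Kbp insert size echoes the size-selection example from the sequencing slide.

```python
def estimated_gap(insert_size, unitig_a_length, read1_start, read2_end):
    """Estimate the gap between Unitig A and Unitig B from one mate pair.

    insert_size:     expected distance from the start of Read 1 to the end of Read 2
    unitig_a_length: length of Unitig A
    read1_start:     offset in Unitig A where Read 1 starts (read points right)
    read2_end:       offset in Unitig B where Read 2 ends (read points left)

    A negative result suggests the two unitigs overlap by that many bases.
    """
    covered_in_a = unitig_a_length - read1_start  # part of the insert inside A
    covered_in_b = read2_end                      # part of the insert inside B
    return insert_size - covered_in_a - covered_in_b


print(estimated_gap(10_000, 6_000, 2_000, 3_500))  # 2500
print(estimated_gap(10_000, 6_000, 2_000, 7_000))  # -1000
```

Because insert sizes vary (the sequencing slide quotes a standard deviation of about 8%), a single mate pair gives only an approximate distance; several pairs linking the same two unitigs are normally combined.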

Assembly: Building Scaffolds
[Figure: Contig A and Contig B, linked by reads]
Scaffolds are built from contigs.
The orientation and approximate distances between contigs are inferred from mate pair information.
When possible, the gaps between contigs are filled in with leftover sequence.

Arachne Assembler
24-mer indexing: any two reads that share at least one 24-mer are paired.
Each pair is scored.
Contigs are created by merging paired pairs.
Repeat regions are avoided during contig assembly but used during scaffold assembly.
Subreads are placed after scaffold assembly.
[Photo: Serafim Batzoglou, Arachne author]
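The k-mer indexing idea can be illustrated with a small sketch. It is not Arachne's implementation: it ignores reverse complements, does no scoring, and uses hypothetical function names. The example uses k = 4 so the shared k-mers are visible by eye, where the slide's value would be 24.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def candidate_pairs(reads, k):
    """Pair up any two reads that share at least one k-mer.

    reads: dict mapping read name -> sequence.
    Returns a set of (name_a, name_b) tuples with name_a < name_b.
    """
    index = defaultdict(set)  # k-mer -> names of reads containing it
    for name, seq in reads.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)

    pairs = set()
    for names in index.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


reads = {"r1": "ACGTACGGA", "r2": "ACGGATTC", "r3": "TTTTTTTT"}
print(candidate_pairs(reads, 4))  # {('r1', 'r2')}
```

Only the pairs returned here need a full alignment, which avoids comparing every read against every other read.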

Celera Assembler
Find overlaps of at least 40 bp with less than 6% error.
Overlaps are found using 22-mers.
After overlaps are calculated, Celera does error correction using a voting algorithm.
Contigs are assembled using the best buddy algorithm.
Scaffolds are assembled from mate pair information.
Scaffold gaps are filled when possible.
[Photo: Gene Myers, former vice president of Celera Genomics]
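The two overlap thresholds can be expressed as a simple filter. This sketch assumes the overlapping regions of the two reads have already been extracted and are the same length, and it counts only mismatches; the actual assembler works from alignments that also allow insertions and deletions. The constant and function names are hypothetical.

```python
MIN_OVERLAP_BP = 40    # minimum overlap length (from the slide)
MAX_ERROR_RATE = 0.06  # overlaps must have less than 6% error (from the slide)


def overlap_error_rate(a_region, b_region):
    """Fraction of mismatched positions between two equal-length regions."""
    if len(a_region) != len(b_region) or not a_region:
        raise ValueError("regions must be non-empty and of equal length")
    mismatches = sum(x != y for x, y in zip(a_region, b_region))
    return mismatches / len(a_region)


def accept_overlap(a_region, b_region):
    """Apply the length and error-rate thresholds to a candidate overlap."""
    if len(a_region) < MIN_OVERLAP_BP:
        return False
    return overlap_error_rate(a_region, b_region) < MAX_ERROR_RATE


region = "A" * 50
print(accept_overlap(region, "A" * 48 + "CC"))    # True: 2 of 50 differ (4%)
print(accept_overlap(region, "A" * 46 + "CCCC"))  # False: 4 of 50 differ (8%)
```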

Project Expectations
Fall 2007: Produce Celera assembly
Spring 2008: Produce reconciled assembly
General goals:
Tackle the unexpected problems that accompany genome assembly.
Document my work.
Validate my work wherever possible.

Validation
Genome assemblies are not perfect.
I plan to validate my assembly by comparing it to the current draft.
I expect about 1.5% difference between the Celera assembly and the Broad assembly.
I will use MUMmer to measure similarity between genomes.

MUMmer
MUMmer is a piece of software created by CBCB that is used to compare genomes.
MUMmer locates strings of at least 18 bp that are present in each genome.
Plotting the results makes it easy to see insertions, deletions, inversions, etc.
Graphs courtesy of Adam Phillippy
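To show what "strings present in each genome" look like as plot coordinates, here is a toy sketch that reports k-mers occurring exactly once in each of two sequences. It is not how MUMmer works internally (MUMmer uses suffix-based data structures and reports maximal matches); the hash-based approach, the function names, and the example sequences are all assumptions made for illustration, with k = 5 standing in for the 18 bp minimum.

```python
from collections import defaultdict


def kmer_positions(seq, k):
    """Map each k-mer in seq to the list of positions where it starts."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return positions


def unique_shared_kmers(seq_a, seq_b, k):
    """Return sorted (pos_in_a, pos_in_b) for k-mers unique in both sequences."""
    pos_a = kmer_positions(seq_a, k)
    pos_b = kmer_positions(seq_b, k)
    matches = []
    for kmer, a_hits in pos_a.items():
        b_hits = pos_b.get(kmer)
        if b_hits and len(a_hits) == 1 and len(b_hits) == 1:
            matches.append((a_hits[0], b_hits[0]))
    return sorted(matches)


print(unique_shared_kmers("GATTACAGGCT", "CCGATTACATT", 5))
# [(0, 2), (1, 3), (2, 4)]
```

Plotted as points, matches with a constant offset fall on a diagonal line; a shift of the diagonal indicates an insertion or deletion, and an anti-diagonal segment (found by also matching the reverse complement, which this sketch omits) indicates an inversion.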

Implementation Details
I plan to use the Genome cluster at the University of Maryland to produce my assembly.
Much of my project will utilize existing software.
I intend to use Perl to write any additional scripts that may be needed.

Time Permitting
The University of Maryland has recently produced a lot of software for the genome assembly pipeline, much of which has not been tested on large genomes.
I hope to use programs like the UMD overlapper and Figaro to see how these programs affect my assembly.
[Photos: Mihai Pop, James White]

Acknowledgements
James Yorke, Aleksey Zimin, and the Genome Group for advising me on the nature of this project
Steven Salzberg, Art Delcher, and Adam Phillippy for giving lectures and producing slides on genome assembly topics
Gene Myers' paper on Drosophila
Serafim Batzoglou's paper on Arachne
Wikipedia