Genome sequencing and assembling

Slides:



Advertisements
Similar presentations
Sequencing a genome. Definition Determining the identity and order of nucleotides in the genetic material – usually DNA, sometimes RNA, of an organism.
Advertisements

3. Lecture WS 2003/04Bioinformatics III1 Whole Genome Shotgun Assembly Two strategies for sequencing: clone-by-clone approach whole-genome shotgun approach.
Lecture 14 Genome sequencing projects
9 Genomics and Beyond Brief Chapter Outline
Alignment Problem (Optimal) pairwise alignment consists of considering all possible alignments of two sequences and choosing the optimal one. Sub-optimal.
CS273a Lecture 4, Autumn 08, Batzoglou Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector.
DNA Sequencing Lecture 9, Tuesday April 29, 2003.
Genome Sequence Assembly: Algorithms and Issues Fiona Wong Jan. 22, 2003 ECS 289A.
DNA Sequencing – “Plus and Minus” Plus –Incubate with T4 DNA Polymerase and single dNTP –T4 Polymerase degrades 3’ ends in absence of dNTP –Fractionated.
Physical Mapping I CIS 667 February 26, Physical Mapping A physical map of a piece of DNA tells us the location of certain markers  A marker is.
Class 02: Whole genome sequencing. The seminal papers ``Is Whole Genome Sequencing Feasible?'' ``Whole-Genome DNA.
CS262 Lecture 11, Win07, Batzoglou Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector the.
Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.
DNA Sequencing. The Walking Method 1.Build a very redundant library of BACs with sequenced clone- ends (cheap to build) 2.Sequence some “seed” clones.
DNA Sequencing Some Terminology insert a fragment that was incorporated in a circular genome, and can be copied (cloned) vector the circular genome (host)
Mining SNPs from EST Databases Picoult-Newberg et al. (1999)
DNA Sequencing. CS273a Lecture 3, Spring 07, Batzoglou DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT.
Assembly.
DNA Sequencing and Assembly
DNA Sequencing.
The Human Genome Race. Collins vs. Venter Collins Venter.
Sequencing and Assembly Cont’d. CS273a Lecture 5, Win07, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
Sequencing and Assembly Cont’d. CS273a Lecture 5, Aut08, Batzoglou Steps to Assemble a Genome 1. Find overlapping reads 4. Derive consensus sequence..ACGATTACAATAGGTT..
CS273a Lecture 4, Autumn 08, Batzoglou Hierarchical Sequencing.
DNA Sequencing and Assembly. DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA.
CS273a Lecture 2, Autumn 10, Batzoglou DNA Sequencing (cont.)
DNA Sequencing. CS273a Lecture 3, Autumn 08, Batzoglou DNA sequencing How we obtain the sequence of nucleotides of a species …ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT.
CSE182-L10 LW statistics/Assembly. Whole Genome Shotgun Break up the entire genome into pieces Sequence ends, and assemble using a computer LW statistics.
Goals of the Human Genome Project determine the entire sequence of human DNA identify all the genes in human DNA store this information in databases improve.
Utilizing Fuzzy Logic for Gene Sequence Construction from Sub Sequences and Characteristic Genome Derivation and Assembly.
Genome sequencing. Vocabulary Bac: Bacterial Artificial Chromosome: cloning vector for yeast Pac, cosmid, fosmid, plasmid: cloning vectors for E. coli.
Genome Analysis Determine locus & sequence of all the organism’s genes More than 100 genomes have been analysed including humans in the Human Genome Project.
Bacterial Genome Finishing Using Optical Mapping Dibyendu Kumar, Fahong Yu and William Farmerie Interdisciplinary Center for Biotechnology Research, University.
Comparative Genomics of the Eukaryotes
Presentation on genome sequencing. Genome: the complete set of gene of an organism Genome annotation: the process by which the genes, control sequences.
How to Build a Horse Megan Smedinghoff.
Mouse Genome Sequencing
CS 394C March 19, 2012 Tandy Warnow.
Classifier Evaluation Vasileios Hatzivassiloglou University of Texas at Dallas.
O PTICAL M APPING AS A M ETHOD OF W HOLE G ENOME A NALYSIS M AY 4, 2009 C OURSE : 22M:151 P RESENTED BY : A USTIN J. R AMME.
Sequence assembly using paired- end short tags Pramila Ariyaratne Genome Institute of Singapore SOC-FOS-SICS Joint Workshop on Computational Analysis of.
Genome sequencing Haixu Tang School of Informatics.
Steps in a genome sequencing project Funding and sequencing strategy source of funding identified / community drive development of sequencing strategy.
Biological Motivation for Fragment Assembly Rhys Price Jones Anne R. Haake.
A Sequenciação em Análises Clínicas Polymerase Chain Reaction.
SIZE SELECT SHEAR Shotgun DNA Sequencing (Technology) DNA target sample LIGATE & CLONE Vector End Reads (Mates) SEQUENCE Primer.
Fragment assembly of DNA A typical approach to sequencing long DNA molecules is to sample and then sequence fragments from them.
Initial sequencing and analysis of the human genome Averya Johnson Nick Patrick Aaron Lerner Joel Burrill Computer Science 4G October 18, 2005.
Linkage and Mapping. Figure 4-8 For linked genes, recombinant frequencies are less than 50 percent.
Human Genome.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Mojavensis: Issues of Polymorphisms Chris Shaffer GEP 2009 Washington University.
Whole Genome Sequencing (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 13, 2005 ChengXiang Zhai Department of Computer Science University of.
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
Plan A Topics? 1.Making a probiotic strain of E.coli that destroys oxalate to help treat kidney stones in collaboration with Dr. Lucent and Dr. VanWert.
Chapter 5 Sequence Assembly: Assembling the Human Genome.
ALLPATHS: De Novo Assembly of Whole-Genome Shotgun Microreads
Genome Research 12:1 (2002), Assembly algorithm outline ● Input and trimming ● Overlap detection ● Error correction ● Evaluation of alignments.
Assembly S.O.P. Overlap Layout Consensus. Reference Assembly 1.Align reads to a reference sequence 2.??? 3.PROFIT!!!!!
Genome Analysis. This involves finding out the: order of the bases in the DNA location of genes parts of the DNA that controls the activity of the genes.
Cse587A/Bio 5747: L2 1/19/06 1 DNA sequencing: Basic idea Background: test tube DNA synthesis DNA polymerase (a natural enzyme) extends 2-stranded DNA.
DNA Sequencing Project
Genomics Sequencing genomes.
COMPUTATIONAL GENOMICS GENOME ASSEMBLY
Fragment Assembly (in whole-genome shotgun sequencing)
Genome sequence assembly
Pre-genomic era: finding your own clones
Stuff to Do.
CSCI 1810 Computational Molecular Biology 2018
Introduction to Sequencing
Presentation transcript:

Genome sequencing and assembling Chitta Baral

Basic ideas and limitations Current lab techniques can sequence small (say 700 base pairs) DNA pieces. Use restriction enzymes to cut DNA pieces Sort pieces of different sizes using gel electrophoresis and use the sorting to read them Mapping and Walking Sequence one piece, get 700 letters, make a primer that allowed you to read the next 700, and work sequentially down the clone Estimate for human genome sequencing using this method: 100 years Shotgun sequencing (introduced by Sanger et al. 1977) for sequencing genomes Obtain random sequence reads from a genome Assemble them into contigs on the basis of sequence overlaps Straightforward for simple genomes (with no or few repeat sequences) Merge reads containing overlapping sequence Shotgun sequencing is more challenging for complex (repeat-rich) genomes: two approaches

Shotgun sequencing – 2 approaches Hierarchical shotgun approach Generating an overlapping set of intermediate-sized (e.g. bacterial artificial chromosomes with 200 KB inserts) clones, and keeping a map of that (it took 2 yrs for mapping e-coli) Subjecting each of these clones to shotgun sequencing, and using the map to get the whole sequence. Used in S. cerevisiae (yeast), C. elegans (nematode), A. thaliana (mustard weed) and by the International Human Genome Sequencing Consortium (started in 1990, draft made available in 2000) Whole-genome shotgun (WGS) approach Generating sequence reads directly from a whole-genome library Using computational techniques to reassemble in one step. Used for Drosophila melanogaster (fruit fly) and by Celera Genomics (formed 1998) for human genome.

Sequencing small DNA pieces T C -------------- Use DNA cloning or PCR to make multiple copies. Put in 4 testtubes marked G, A, T and C In testtube G use restriction enzymes that cuts at G. Do the above step for the other testubes. Use gel electrophoresis separately for the content in each testtube. The data results in the table on the left. Reading the table we get G has lengths 1, 7, 12, 13, 19; A has lengths 2, 6, 8, 11, 14,15,16; T has length 4, 5, 9, 18 and C has length 3, 10, 17. This gives us the sequence.

The ARACHNE WGS assembler: outline of assembly algorithm Input data: Paired end reads obtained by sequencing both ends of a plasmid of known insert size. Assumes each base in each read has an associated quality score (say one obtained by PHRED program) Quality score q corresponds to the probability 10-q/10 that the base is incorrect (40 corresponds to 99.99% accuracy) Initial step: eliminates terminal regions whose quality is low. Eliminates reads containing very little high-quality sequence Eliminates known vector sequences and known contaminants (eg. Sequence from the bacterial host or cloning vector)

Cont. Overlap detection and alignment Create a sorted table of each k-letter subword (k-mer) together with its source (which read) and its position within the read. Exclude k-mers that occur with extremely high frequency corresponds to highly repeated sequences; used to increase the efficiency of the overlap detection process Identify all instances of read pairs that share one or more overlapping k-mer, and a 3 step process (similar to FASTA) to align the reads effciently (i) Merge overlapping shared k-mers, (ii) Extend the shared k-mers to alignments, (iii) Refine the alignment by dynamic programming. Some valid alignments may be missed and some invalid ones may result.

ARACHNE: Error correction Error detection and correction Generate multiple alignments among overlapping reads Identify instances where a base is overwhelmingly outvoted by bases aligned to it (taking into account the score quality) Similarly correct occasional inserts and deletes (mostly due to sequencing errors)

ARACHNE: Evaluation of alignments Assign a penalty score to each aligned pair of overlapping reads Penalty scores are assigned to each discrepant base, based on the sequence quality score at the base and flanking bases on either side. Discrepancies in high quality sequences are assigned high penalty, and discrepancies in low quality sequences are penalized less heavily. The penalty scores for individual discrepancies are combined to yield an overall penalty score for the alignment. Overlaps incurring too high a penalty are discarded Likely chimeric reads are also detected and discarded Reads that contain genomic sequence from two disparate locations are termed chimeric.

ARACHNE: paired pairs Identification of paired pairs Paired reads: reads which are known to be related with respect to orientation and distance. Searches for instances of two plasmids of similar insert size with sequence overlap occurring at both ends. (together called paired pairs) These instances are extended by building complexes of such pairs *********** ********* ********** ********** ********* ******** Collection of paired pairs are merged together into contigs.

ARACHNE: Contig assembly When repeats are absent: correct assembly can be easily obtained by merging all the overlapping reads. In presence of repeats, false overlaps may arise between reads derived from different copies of a repeat ARACHNE identifies potential repeat boundaries and avoids assembling contigs across such boundaries Potential repeat boundary: a read r can be extended by x and y, but x and y don’t overlap Merge overlapping read pairs that do not cross a marked repeat boundary.

ARACHNE: repeat contigs and supercontigs Detection of repeat contigs – identified 2 ways Unusually high depth of coverage Conflicting links to multiple, distinct, non-overlapping contigs, reflecting the multiple regions that flank the repeat in the genome. aRb, cRd, eRf … will result in –aR-, -cR-, … Creation of supercontigs After marking repeat contigs the unmarked contigs (called unitigs) are assembled. Use forward-reverse links from reads to order and orient unique contigs into supercontigs

ARACHNE: Filling gaps in supercontigs Layout is a set of contigs each of which is an ordered list of contigs with interleaved gaps: corresponding to 2 kind of regions Regions marked as repeat contigs (which were omitted in supercontig construction) Regions for which there are insufficient number of shotgun reads to allow assembly Fill gap using repeat contigs For every pair of consecutive contigs with an interleaving gap in a supercontig S, the program tries to find a path of pairwise overlapping contigs that fill the gap. Forward-reverse links from S guide the construction of the path by identifying contigs likely to fall in the gap.

Consensus derivation and postconsensus merger The layout of overlapping reads is converted into consensus sequence with quality scores. Done by converting pair-wise alignments of reads into multiple alignments, and deriving the consensus base by weighed voting.