Large-scale genome projects

Presentation transcript:

Large-scale genome projects. Project stages (slide navigation bar): Libraries, Sequencing, Release, Assembly, Annotation, Closure, Strategy. Sequencing DNA molecules in the Mb size range. All strategies employ the same underlying principle: random shotgun sequencing.

Shotgun sequencing workflow (figure): genomic DNA -> shearing/sonication, subclone and sequence -> shotgun reads -> assembly -> contigs -> finishing reads, finishing -> complete sequence.
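
The assembly step of this workflow can be illustrated with a toy greedy overlap merger. This is a minimal sketch, assuming error-free reads from one strand and no repeats; the function names and the `min_len` parameter are made up for the example, and it is not how Phrap or any production assembler works.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    contigs = [r for r in contigs if not any(r != o and r in o for o in contigs)]
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # no overlaps left: these are the contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


# Hypothetical 22 bp "genome" sampled as three overlapping reads.
genome = "ATGGCGTGCAATCGGATTACCA"
reads = [genome[12:22], genome[0:10], genome[6:16]]
contigs = greedy_assemble(reads)
```

With real data, sequencing errors and dispersed repeats make the greedy choice unreliable, which is why the later slides spend so much time on quality values, appraisal and finishing.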

Nucleotide Database Growth

EMBL breakdown by organism

EMBL Release 65

Progress on Large Sequencing Projects

Strategies for sequencing: how big can you go? Large-insert clones (cosmids, 30-40 kb; BACs/PACs, 50-100 kb); whole chromosomes; whole genomes.

Genome size and sequencing strategies (figure; x-axis: genome size, log Mb). Genomes shown: E. coli (4 Mb), S. cerevisiae (14 Mb), P. falciparum (30 Mb), C. elegans (100 Mb), D. melanogaster (170 Mb), H. sapiens (3000 Mb). Strategies shown: clone-by-clone; whole chromosome shotgun (WCS); whole genome shotgun (WGS); WGS with clone 'skims'.

Strategies for sequencing, considerations: size and GC composition of the genome; volume of data; ease of cloning; ease of sequencing; genome complexity (dispersed repetitive sequence, telomeres and centromeres); politics/funding.

Strategies: clone by clone. Simple (0.5-2 K reads); few problems with repeats; relatively simple informatics; scalability. Quality of physical map: fingerprint/STS maps, end sequencing.

Strategies: whole chromosome shotgun (WCS). Requires chromosome isolation; moderate complexity (tens of K reads); problems with repeats; complex informatics; inefficient in isolation. Quality of physical map: skims of mapped clones.

Strategies: whole genome shotgun (WGS). Moderate to high complexity (tens to hundreds of K reads); problems with repeats; complex informatics. Quality of physical map: fingerprint map, STS markers, end-sequences, skims of mapped clones.
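
The read counts quoted for the three strategies follow from simple coverage arithmetic: reads needed = coverage x target size / read length. The sketch below assumes 8-fold coverage and 500 bp reads, which are illustrative values and not figures from the slides; the uncovered fraction uses the standard Poisson (Lander-Waterman) approximation exp(-coverage).

```python
import math


def reads_needed(target_bp: int, coverage: float, read_len: int) -> int:
    """Number of reads needed to reach a given average coverage."""
    return math.ceil(coverage * target_bp / read_len)


def expected_uncovered_fraction(coverage: float) -> float:
    """Poisson approximation: chance that a given base is hit by no read."""
    return math.exp(-coverage)


# Assumed values for illustration only: 8x coverage, 500 bp reads.
cosmid_reads = reads_needed(40_000, 8, 500)      # one 40 kb cosmid
ecoli_reads = reads_needed(4_000_000, 8, 500)    # a 4 Mb genome by WGS
gap_fraction = expected_uncovered_fraction(8)
```

Under these assumptions a single cosmid needs 640 reads and a 4 Mb genome needs 64,000, consistent with the "0.5-2 K" and "tens to hundreds of K" ranges on the slides.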

Sequencing my genome (figure): Politics, Production, Finishing, Annotation; axes: TIME, MONEY.

What do you get? DATA!!, DATA!!, and more DATA!! Sequence: incomplete v. complete; first-pass annotation; gene discovery; full annotation; a starting point for research.

Genome annotation is central to functional genomics: gene knockout; expression microarray; RNAi phenotypes; ORFeome-based functional genomics.

Sequencing: library construction; colony picking; DNA preparation; sequencing reactions; electrophoresis; tracking/base calling.

Libraries: essentially sub-cloning. Generation of small-insert libraries in a well-characterised vector; ease of propagation; ease of DNA purification; e.g. pUC18, M13.

Libraries - testing. Simple concepts: insert/vector ratio. Real data: insert size, sequence …. Simple analysis.

Sequence generation: pick colonies; template preparation; sequence reactions (standard terminator chemistry). pUC libraries are sequenced with forward and reverse primers.

Sequence generation: electrophoresis of products. Old style: slab gels, 32 > 64 > 96 lanes. New style: capillary gels, 96 lanes. Transfer of gel image to UNIX: sequencing machines use a slave Mac/PC; move data to a centralised storage area for processing.

Gel image processing: light-to-dye estimation; lane tracking; lane editing; trace extraction; trace standardisation; mobility correction; background subtraction.

Pre-processing: base calling using Phred (modifies SCF file); quality clipping; vector clipping (sequencing vector, cloning vector); screen for contaminants; feature mark-up (repeats/transposons).

Finishing. Assembly: the process of turning raw single-pass reads into contiguous consensus sequence. Closure: the process of ordering and merging consensus sequences into a single contiguous sequence. Finished is defined as sequenced on both strands using multiple clones; in the absence of multiple clones, the clone must be sequenced with multiple chemistries. The overall error rate is estimated at less than 1 error per 10 kb.
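
The finished-sequence error rate can be related to Phred-style quality values through the standard definition Q = -10 log10(P). A small sketch of that arithmetic (function names are for illustration only):

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability for a Phred quality value."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred quality value for an error probability."""
    return -10 * math.log10(p)


# 1 error per 10 kb expressed as a quality value.
finished_q = error_to_phred(1 / 10_000)
```

On this scale, 1 error per 10 kb corresponds to Q40, while a single base called at Q20 has a 1 in 100 chance of being wrong.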

Genome assembly: pre-assembly; assembly; automated appraisal; manual review.

Pre-assembly: convert to CAF format (flatfile text format); choice of assembler; choice of post-assembly modules; choice of assembly editor. www.sanger.ac.uk/Software/CAF

Assembly: assemble using Phrap. Read fasta and quality scores from the CAF file; merge existing Phrap .ace file as necessary; adjust clipping.

Assembly appraisal. auto-edit: removes 70% of read discrepancies. Remove cloning vector; mark up sequence features. finish: identify low-quality regions; cover using 're-runs' and 'long-runs'. Compare with current databases (plate contamination).

Manual assembly appraisal: use a sequence editor (GAP/consed); tools to identify internal joins; tools to identify and import data from overlapping projects; tools to check failed or mis-assembled reads for inclusion in the project.

Manual editing: Sanger uses a 100% edit strategy. Where additional data is required: check clipping; additional sequencing (template/primer/chemistry); assemble new data into the project (GAP4 auto-assemble); repeat the whole process.

Manual quality checks: force annotation tag consistency; all unedited data is re-assembled using Phrap; all high-quality discrepancies are reviewed; confirm restriction digest (clones); check for inverted repeats. Manually check: areas of high-density edits; areas with no supporting unedited data; areas of low read coverage.

Gap closure: read pairs; PCR reactions (long-range/combinatorial); small-insert libraries; transposon-insertion libraries.

Gap closure - contig ordering: read pair consistency; STS mapping; physical mapping; genetic mapping; optical mapping; large-insert clones (skims, end-sequencing).
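
Read pair consistency can be used for contig ordering by counting how many clones have their forward and reverse reads in different contigs. The sketch below assumes each read has already been placed in a contig; the read and contig names are hypothetical, and orientation and insert-size checks are left out for brevity.

```python
from collections import Counter


def contig_links(placements, pairs):
    """Count read pairs whose two ends fall in different contigs.

    placements: dict mapping read name -> contig name
    pairs: iterable of (forward_read, reverse_read) names from the same clone
    """
    links = Counter()
    for fwd, rev in pairs:
        if fwd in placements and rev in placements:
            a, b = placements[fwd], placements[rev]
            if a != b:
                links[tuple(sorted((a, b)))] += 1
    return links


# Hypothetical placements for four clones (p1..p4).
placements = {
    "p1.f": "c1", "p1.r": "c2",
    "p2.f": "c1", "p2.r": "c2",
    "p3.f": "c2", "p3.r": "c3",
    "p4.f": "c1", "p4.r": "c1",   # both ends in one contig: no link
}
pairs = [("p1.f", "p1.r"), ("p2.f", "p2.r"), ("p3.f", "p3.r"), ("p4.f", "p4.r")]
links = contig_links(placements, pairs)
```

Here c1-c2 is supported by two clones and c2-c3 by one, suggesting the order c1, c2, c3; clones spanning a gap are also the natural templates for closing it.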

Annotation: DNA features (repeats/similarities); gene finding; peptide features; initial role assignment; others (regulatory regions).

Annotation of eukaryotic genomes (figure): genomic DNA -> transcription -> unprocessed RNA -> RNA processing -> mature mRNA (Gm3, AAAAAAA) -> translation -> nascent polypeptide -> folding -> active enzyme -> function (reactant A, product B). Annotation steps shown alongside: ab initio gene prediction; comparative gene prediction; functional identification.

Genome analysis overview: C.elegans

DNA features (similarity features). Mapping repeats: simple, tandem and inverted, repeat families. Mapping DNA similarities: EST/mRNAs in eukaryotes; duplications, RNAs. Mapping peptide similarities: protein similarities.

Gene finding. ORF finding (simple but messy). Ab initio prediction: measures of codon bias; simple statistical frequencies. Comparative prediction: using similarity data; using cross-species similarities.
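
ORF finding is simple enough to sketch in a few lines. The example below scans all six reading frames for the first ATG followed by an in-frame stop codon. The minimum length and the function names are assumptions for illustration; real gene finders add codon bias, similarity data and, in eukaryotes, intron handling.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 3):
    """Return (strand, frame, start, end, orf) tuples for ATG..stop ORFs.

    Coordinates are 0-based, half-open, on the strand that was scanned.
    min_codons counts the stop codon as well.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        results.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None                   # look for the next ORF
    return results


# Hypothetical 16 bp sequence with one short ORF in frame 2.
orfs = find_orfs("CCATGAAATGGTAGCC")
```

The "messy" part is visible even here: short ORFs occur by chance in every frame, so a length threshold alone produces many false positives on real genomes.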

Peptide features: low-complexity regions; trans-membrane regions; structural information (coiled-coil). Similarities and alignments. Protein families (InterPro/COGS).
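
One common way to flag low-complexity regions is to slide a window along the peptide and measure the Shannon entropy of its residue composition. The sketch below illustrates that idea; the window size and entropy threshold are arbitrary assumptions, and it is not the algorithm of any particular annotation tool.

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Entropy in bits of the residue composition of s."""
    n = len(s)
    return -sum(c / n * math.log2(c / n) for c in Counter(s).values())


def low_complexity_windows(peptide: str, window: int = 12, max_entropy: float = 2.2):
    """Start positions of windows whose entropy is at or below max_entropy."""
    return [
        i
        for i in range(len(peptide) - window + 1)
        if shannon_entropy(peptide[i:i + window]) <= max_entropy
    ]


# Hypothetical peptide with a 12-residue glutamine run starting at position 10.
peptide = "MKTLLVAGSD" + "Q" * 12 + "RPEWHCY"
hits = low_complexity_windows(peptide)
```

Masking such windows before similarity searches avoids spurious matches between unrelated proteins that merely share a biased composition.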

Initial role assignment: a simple attempt to describe the functional identity of a peptide. Uses data from peptide similarities and protein families. Vital for data mining. A large number of predicted genes remain hypothetical or unknown.

Other regulatory features: ribosomal binding sites; promoter regions.

Data release. DNA release: unfinished, finished. Nucleotide databases: GENBANK/EMBL/DDBJ. Peptide databases: SWISSPROT/TREMBL/GENPEPT. Others.