Genome Sequencing: Technology and Strategies Chuong Huynh Acknowledgement: Daniel Lawson (Sanger Institute) and Jane.


1 Genome Sequencing: Technology and Strategies
Chuong Huynh, NIH/NLM/NCBI, huynh@ncbi.nlm.nih.gov
Acknowledgement: Daniel Lawson (Sanger Institute) and Jane Carlton (TIGR)

2 Bioinformatics Flow Chart
1a. Sequencing
1b. Analysis of nucleic acid seq.
2. Analysis of protein seq.
3. Molecular structure prediction
4. Molecular interaction
5. Metabolic and regulatory networks
6. Gene & protein expression data
7. Drug screening: ab initio drug design OR drug compound screening in database of molecules
8. Genetic variability

3 How to sequence a genome
- development of sequencing strategy and source of funding
- procurement of DNA and initial library construction
- test sequencing
- large-scale random sequencing of small (2-3 kb), medium (10 kb) and large (>50 kb) libraries
- analysis of raw sequence data by BLAST, RepeatFinder etc.
- release of genome data onto sequencing center website
- at 8-10X coverage, random sequencing stops
- closure of sequence gaps and physical gaps
- comparison to physical map
- gene model prediction
- final gene model annotation
- release of data to GenBank and publication

4 Full shotgun sequencing [diagram]
- Genomic DNA
- large insert library (20-500 kb): minimal tiling path (Marker1, Marker2)
- shotgun library: small (2-3 kb) and medium (10 kb)
- Sequencing (8-10X)
- Assembly (contig, scaffold)
- Gap closure
- gene prediction, annotation and analysis

5 Partial shotgun sequencing [diagram]
- Genomic DNA
- shotgun library: small (2-3 kb) and medium (10 kb)
- Sequencing (5X)
- Assembly (contig, scaffold)
- Analysis

6 Genome sequencing terms
- Raw sequence: unassembled sequence reads produced from sequencing of inserts from individual recombinant clones of a genomic DNA library.
- Finished sequence: complete sequence of a genome with no gaps and an accuracy of > 99.9%.
- Genome coverage: average number of times a nucleotide is represented by a high-quality base in random raw sequence.
- Full shotgun coverage: genome coverage in random raw sequence required to produce finished sequence, usually 8-10 fold (‘8-10X’).
- Partial shotgun coverage: typically 3-6X random coverage of a genome, which produces sequence data of sufficient quality to enable gene identification but which is not sufficient to produce a finished genome sequence.
- Paired reads: sequence reads determined from both ends of a cloned insert in a recombinant clone.
- Contig: contiguous DNA sequence produced from joining overlapping raw sequence reads.
- Singleton: single sequence read that cannot be joined (‘assembled’) into a contig.
- Scaffold: a group of ordered and orientated contigs known to be physically linked to each other by paired read information.
- EST: expressed sequence tag generated by sequencing one end of a recombinant clone from a cDNA library. ESTs are single-pass reads and therefore prone to contain sequence errors.
- GSS: genome survey sequence generated by sequencing one end of a recombinant clone from a genomic DNA library. The genomic DNA library can in some instances be enriched for the presence of coding regions, for example through use of mung bean nuclease digestion of genomic DNA prior to cloning.
- SNP: single nucleotide polymorphism.
- ORF: open reading frame, stretches of codons in the same reading frame uninterrupted by STOP codons and calculated from a six-frame translation of DNA sequence.
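
Worked example (not from the original slides): the "genome coverage" definition above reduces to coverage = number of reads x high-quality bases per read / genome size. The sketch below applies that arithmetic to a 30 Mb genome. The 500-base read length is an assumption chosen for illustration, and the uncovered-fraction estimate uses the standard Poisson approximation for randomly placed reads, which the slides do not state.

```python
import math


def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average fold coverage: total high-quality bases divided by genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Number of random reads required to reach a target fold coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


def expected_fraction_uncovered(fold_coverage: float) -> float:
    """Poisson approximation: chance that a given base is hit by no read."""
    return math.exp(-fold_coverage)


if __name__ == "__main__":
    genome_size = 30_000_000  # ~30 Mb, the P. falciparum estimate quoted in the slides
    read_length = 500         # ASSUMPTION: mean high-quality bases per read
    for fold in (5, 8, 10):
        n = reads_needed(fold, read_length, genome_size)
        missing = expected_fraction_uncovered(fold)
        print(f"{fold}X: {n:,} reads, ~{missing:.4%} of bases expected uncovered")
```

Under these assumptions the gap between partial (about 5X) and full (8-10X) shotgun coverage is mostly a matter of how many bases remain unsampled, which is why partial coverage supports gene identification but not a finished sequence.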

7 Jan 2003

8 NCBI Trace Archive Sep 23, 2003

9 Large-scale genome projects
Stages: Libraries, Sequencing, Release, Assembly, Annotation, Closure, Strategy
- Sequencing DNA molecules in the Mb size range
- All strategies employ the same underlying principle: random shotgun sequencing

10 [Diagram: shotgun sequencing] Genomic DNA -> shearing/sonication -> subclone and sequence -> shotgun reads -> assembly -> contigs -> finishing (finishing reads) -> complete sequence

11 Strategies for sequencing
How big can you go?
- Large-insert clones: cosmids 30-40 kb, BACs/PACs 50-100 kb
- Whole chromosomes
- Whole genomes

12 Genome size and sequencing strategies
[Figure: genome size on a log scale, axis 0-4 (log Mb)]
- Genomes shown: E.coli (4 Mb), S.cerevisiae (14 Mb), P.falciparum (30 Mb), C.elegans (100 Mb), D.melanogaster (170 Mb), H.sapiens (3000 Mb)
- Strategies shown: Whole Genome Shotgun (WGS), Whole Chromosome Shotgun (WCS), Clone-by-clone, Whole Genome Shotgun (WGS) with clone ‘skims’

13 [Diagram: shotgun sequencing] Genomic DNA -> shearing/sonication -> subclone and sequence -> shotgun reads -> assembly -> contigs -> finishing (finishing reads) -> complete sequence

14 Strategies for sequencing
- Size and GC composition of genome
- Volume of data
- Ease of cloning
- Ease of sequencing
- Genome complexity: dispersed repetitive sequence, telomeres & centromeres
- Politics/Funding

15 Strategies: Clone by Clone
- Simple (0.5-2 K reads)
- Few problems with repeats
- Relatively simple informatics
- Scalability
- Quality of physical map
- Fingerprint / STS maps
- End sequencing

16 Strategies: Whole Chromosome shotgun (WCS)
- Requires chromosome isolation
- Moderate complexity (10s of K reads)
- Problems with repeats
- Complex informatics
- Inefficient in isolation
- Quality of physical map (want good physical map)
- Skims of mapped clones

17 Strategies: Whole Genome shotgun (WGS)
- Moderate to high complexity (10-100s of K reads)
- Massive problems with repeats
- Complex informatics
- Quality of physical map
- Fingerprint map
- STS markers
- End-sequences
- Skims of mapped clones

18 Sequencing my genome
- Annotation, Finishing, Production, Politics
- TIME, MONEY

19 What do you get?
- Sequence: incomplete or complete
- First-pass annotation
- Gene discovery
- Full annotation
- A starting point for research
- DATA!!, DATA!!, and more DATA!!

20 Genome annotation is central to functional genomics Gene Knockout Expression Microarray RNAi phenotypes ORFeome based functional genomics

21 Where is the problem?
- Most genomes can and will be sequenced; few problems are unsolvable.
- The problem lies in understanding what you have:
  - gene prediction
  - annotation

22 Sequencing
- Library construction
- Colony picking (random)
- DNA preparation (isolate DNA)
- Sequencing reactions
- Electrophoresis
- Tracking/Base calling

23 Libraries
- Essentially sub-cloning
- Generation of small insert libraries in a well characterised vector
- Ease of propagation
- Ease of DNA purification
- e.g. pUC18, M13

24 Libraries - testing
- Simple concepts: insert/vector ratio (blue/white ratio)
- Real data: insert size, sequence ...
- Simple analysis

25 Sequence generation
- Pick colonies -> growth medium
- Template preparation (DNA isolation)
- Sequence reactions
- Standard terminator chemistry
- pUC libraries sequenced with forward and reverse primers
- Tracking and noise

26 Sequence generation
- Electrophoresis of products
  - Old style: slab gels, 32 > 64 > 96 lanes
  - New style: capillary gels, 96 lanes
- Transfer of gel image to UNIX
  - Sequencing machines use a slave Mac/PC
  - Move data to centralised storage area for processing

27 Gel image processing
- Light-to-Dye estimation
- Lane tracking
- Lane editing
- Trace extraction
- Trace standardisation
- Mobility correction
- Background subtraction

28 Pre-processing
- Base calling using Phred (modifies SCF file format)
- Quality clipping from Phred
- Vector clipping: sequencing vector, cloning vector
- Screen for contaminants
- Feature mark up (repeats/transposons)
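
Illustrative sketch (not from the original slides): Phred quality values encode the estimated error probability of each base call as Q = -10 log10(P), so Q20 means a 1-in-100 chance of error and Q40 means 1 in 10,000. The clipping function below is a simplified maximum-scoring-segment trim; the 0.05 error cutoff and the function names are assumptions for illustration, and this is not presented as Phred's own implementation.

```python
import math


def phred_to_error(q: float) -> float:
    """Convert a Phred quality value to an estimated error probability."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Convert an error probability to a Phred quality value."""
    return -10 * math.log10(p)


def quality_clip(quals, cutoff=0.05):
    """Return (start, end), half-open, of the best high-quality segment.

    Each base scores (cutoff - error probability); the segment with the
    highest total score is kept. Returns (0, 0) if no base scores above zero.
    """
    best_sum, best = 0.0, (0, 0)
    run_sum, run_start = 0.0, 0
    for i, q in enumerate(quals):
        run_sum += cutoff - phred_to_error(q)
        if run_sum <= 0:
            # The running segment is no longer worth extending: restart after i.
            run_sum, run_start = 0.0, i + 1
        elif run_sum > best_sum:
            best_sum, best = run_sum, (run_start, i + 1)
    return best


if __name__ == "__main__":
    quals = [5, 8, 30, 35, 40, 40, 32, 9, 6]  # made-up per-base qualities
    start, end = quality_clip(quals)
    print(f"keep bases {start}..{end - 1}")
```

Low-quality bases at both ends of a read are trimmed away, which is the effect quality clipping is meant to have before vector clipping and assembly.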

29 Finishing
- Assembly: process of taking raw single-pass reads into contiguous consensus sequence (Phred/Phrap)
- Closure: process of ordering and merging consensus sequences into a single contiguous sequence
- Finished is defined as sequenced on both strands using multiple clones. In the absence of multiple clones, the clone must be sequenced with multiple chemistries.
- The overall error rate is estimated at less than 1 error per 10 kb
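
Illustrative sketch (not from the original slides): the idea of turning overlapping reads into contiguous consensus sequence can be shown with a toy greedy merge. This is not how Phrap works internally; it ignores base qualities, sequencing errors, reverse-complement reads and repeats, and the minimum overlap of 4 bases is an arbitrary assumption that only suits a tiny example.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the longest such overlap is shorter than `min_len`.
    """
    longest = min(len(a), len(b))
    for k in range(longest, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def drop_contained(seqs):
    """Remove exact duplicates and sequences fully contained in another one."""
    unique = list(dict.fromkeys(seqs))
    return [s for s in unique if not any(s != t and s in t for t in unique)]


def greedy_assemble(reads, min_len=4):
    """Repeatedly merge the pair of sequences with the longest overlap.

    Returns the remaining sequences: merged contigs plus any singletons.
    """
    contigs = drop_contained(reads)
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                k = overlap(a, b, min_len)
                if k > best_k:
                    best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to join
        merged = contigs[best_i] + contigs[best_j][best_k:]
        rest = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs = drop_contained(rest + [merged])


if __name__ == "__main__":
    genome = "ATGCGTACGTTAGCCTAGGCTTAACG"  # made-up 26-base "genome"
    reads = [genome[0:10], genome[6:16], genome[12:22], genome[18:26]]
    print(greedy_assemble(reads))
```

A read that overlaps nothing is returned unchanged, which corresponds to the singleton of the glossary. Repeats longer than a read break this approach, which is the reason the strategy slides flag repeats as the main problem for whole-genome shotgun projects.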

30 Genome Assembly
- Pre-assembly (assembly algorithm)
- Assembly
- Automated appraisal
- Manual review

31 Pre-Assembly
- Convert to CAF format (flatfile text format)
  - choice of assembler
  - choice of post-assembly modules
  - choice of assembly editor
- www.sanger.ac.uk/Software/CAF

32 Assembly
- Assemble using Phrap
- Read fasta & quality scores from CAF file
- Merge existing Phrap .ace file (previous assembly) as necessary
- Adjust clipping (where vector, quality start)

33 Assembly appraisal
- auto-edit removes 70% of read discrepancies in the sequence assembly (highlights misassembly); the rest are handled manually
- Remove cloning vector
- Mark up sequence features (for finisher)
- “Finish” program (or program “AutoFinish”)
  - Identify low-quality regions
  - Cover using ‘re-runs’ and ‘long-runs’
- Compare with current databases (plate contamination)

34 Manual Assembly appraisal
- Use a sequence editor (GAP/consed)
- Tools to identify internal joins
- Tools to identify and import data from overlapping projects
- Tools to check failed or mis-assembled reads for inclusion in project

35 Manual editing
- Sanger uses 100% edit strategy
- Where additional data is required:
  - Check clipping
  - Additional sequencing (template / primer / chemistry)
- Assemble new data into project (GAP4 auto-assemble)
- Repeat whole process

36 Manual Quality Checks
- Force annotation tag consistency
- All unedited data is re-assembled using Phrap
- All high-quality discrepancies are reviewed
- Confirm restriction digest (clones)
- Check for inverted repeats
- Manually check:
  - Areas of high-density edits
  - Areas with no supporting unedited data
  - Areas of low read coverage (need to confirm)

37 Gap closure
- Read pairs
- PCR reactions (long-range / combinatorial)
- Small-insert libraries
- Transposon-insertion libraries

38 Gap closure - contig ordering
- Read pair consistency
- STS mapping
- Physical mapping
- Genetic mapping
- Optical mapping
- Large-insert clones: skims, end-sequencing
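
Illustrative sketch (not from the original slides): read-pair consistency can order contigs because the two end reads of one clone must lie close together in the genome. If the forward read of several clones lands in one contig and the reverse read in another, those contigs are probably neighbours. The toy below counts such links and chains them into scaffolds. The input format, the threshold of two supporting pairs and the function names are all assumptions; read orientation and insert-size distances, which real scaffolding relies on, are ignored.

```python
from collections import Counter


def contig_links(read_pairs, min_pairs=2):
    """Count clones whose two end reads fall in different contigs.

    `read_pairs` is an iterable of (contig_with_forward_read,
    contig_with_reverse_read). Links seen fewer than `min_pairs`
    times are discarded as unreliable.
    """
    counts = Counter()
    for first, second in read_pairs:
        if first != second:  # both ends in one contig give no ordering information
            counts[(first, second)] += 1
    return {link: n for link, n in counts.items() if n >= min_pairs}


def order_contigs(links):
    """Chain links a -> b into ordered scaffolds.

    Simplification: assumes each contig has at most one successor and
    one predecessor; a contig is never visited twice within a chain.
    """
    successor = {a: b for a, b in links}
    has_predecessor = set(successor.values())
    scaffolds = []
    for start in sorted(set(successor) - has_predecessor):
        chain = [start]
        while chain[-1] in successor and successor[chain[-1]] not in chain:
            chain.append(successor[chain[-1]])
        scaffolds.append(chain)
    return scaffolds


if __name__ == "__main__":
    pairs = [
        ("c1", "c2"), ("c1", "c2"),
        ("c2", "c3"), ("c2", "c3"), ("c2", "c3"),
        ("c4", "c1"),  # a single pair: below the threshold, ignored
        ("c5", "c5"),  # both ends in the same contig
    ]
    print(order_contigs(contig_links(pairs)))
```

Links supported by only one clone are dropped here because chimeric clones and repeats produce occasional false pairings; the other methods listed on the slide (STS, physical, genetic and optical mapping) provide independent checks on the resulting order.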

39 Annotation
- DNA features (repeats/similarities)
- Gene finding
- Peptide features
- Initial role assignment
- Others: regulatory regions

40 Annotation of eukaryotic genomes
[Diagram] Genomic DNA -> (transcription) -> unprocessed RNA -> (RNA processing) -> mature mRNA with poly-A tail -> (translation) -> nascent polypeptide -> (folding) -> active enzyme -> function (Reactant A -> Product B)
- Approaches shown: ab initio gene prediction, comparative gene prediction, functional identification

41 Genome analysis overview: C.elegans

42 DNA features
- Similarity features
- Mapping repeats: simple tandem and inverted repeats, repeat families
- Mapping DNA similarities: EST/mRNAs in eukaryotes, duplications, RNAs
- Mapping peptide similarities: protein similarities

43 Gene finding
- ORF finding (simple but messy)
- ab initio prediction
  - Measures of codon bias
  - Simple statistical frequencies
- Comparative prediction
  - Using similarity data
  - Using cross-species similarities

44 Peptide features
- Peptide features
  - low-complexity regions
  - trans-membrane regions
  - structural information (coiled-coil)
- Similarities and alignments
- Protein families (InterPro/COGS)

45 Initial role assignment
- Simple attempt to describe the functional identity of a peptide
- Uses data from: peptide similarities, protein families
- Vital for data mining
- Large number of predicted genes remain hypothetical or unknown

46 Other regulatory features
- Ribosomal binding sites
- Promoter regions

47 Data Release
- DNA release: unfinished, finished
- Nucleotide databases: GENBANK/EMBL/DDBJ
- Peptide databases: SWISSPROT/TREMBL/GENPEPT
- Others

48 Real World Example: Malaria Genome Project (if time permits)

49 Sequencing the Plasmodium genomes
- Four species of malaria infect man: Plasmodium falciparum, P. vivax, P. malariae, P. ovale
- Four species of malaria infect rodents: P. yoelii, P. berghei, P. chabaudi, P. vinckei

50 Plasmodium falciparum
- ~30 million base pairs (30 Mb)
- 80% (A+T)
- 14 chromosomes
- DNA “unstable” in E. coli
- No large insert DNA clones suitable for sequencing
- Too large for whole genome shotgun (‘96)
- Whole chromosome shotgun strategy was selected
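
Illustrative sketch (not from the original slides): base composition figures such as "80% (A+T)" are simple counts over the sequence. The function below computes the (A+T) fraction of a DNA string, ignoring ambiguous bases such as N; the function name and the example sequence are made up for illustration.

```python
def at_content(seq: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are A or T."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return (seq.count("A") + seq.count("T")) / counted


if __name__ == "__main__":
    fragment = "AATTAATTGCNN"  # made-up fragment; the two Ns are ignored
    print(f"(A+T) content: {at_content(fragment):.1%}")
```

The (G+C) content reported in the genome comparison tables is one minus this value.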

51

52 Comparison of genome features
Feature | P.y.yoelii | P.falciparum
Size (Mb) | 23.1 | 22.9
No. chroms | 14 | 14
Coverage (fold) | 5 | 14.5
No. gaps | 5,812 | 93
(G+C) content (%) | 22.6 | 19.4
No. genes | 5,878 | 5,268
Mean gene length (bp) | 1,298 | 2,283
Gene density (bp/gene) | 2,566 | 4,338
Genes with introns (%) | 54.2 | 53.9
Genes with ESTs (%) | 48.9 | 49.1
Genes with proteomic data (%) | 18.2 | 51.8
Exons: mean no./gene | 2.0 | 2.4
Exons: (G+C) content (%) | 24.8 | 23.7
Introns: (G+C) content | 21.1 | 13.5
Intergenic sequences: (G+C) content | 20.7 | 13.6
RNAs: no. tRNAs | 39 | 43
RNAs: no. 5S rRNAs | 3 | 3
RNAs: no. rRNA units | 4 | 7

53 P. falciparum genome status
Chr | Size (bp) | No. gaps | Fold coverage
1 | 643,293 | 0 | 13.3
2 (TIGR) | 947,102 | 0 | 11.1
3 | 1,060,087 | 0 | 10.9
4 | 1,204,112 | 0 | 16.8
5 | 1,343,552 | 0 | 15.1
6 | 1,377,956 | 8 | 16.8
7 | 1,350,452 | 14 | 15.8
8 | 1,323,195 | 24 | 16.2
9 | 1,541,723 | 0 | 17.9
10 (TIGR) | 1,694,445 | 4 | 15.6
11 (TIGR) | 2,035,250 | 3 | 11.3
12 (Stanford) | 2,271,477 | 0 | 16.3
13 | 2,747,327 | 37 | 17.2
14 (TIGR) | 3,291,006 | 3 | 9.2
0 | 22,788 | 0 | ND
Total | 22,853,764 | 93 | 14.5

54 Eukaryotic annotation - TIGR EGC Annotation Station/Manatee DDS/DPS Annotation DB Project DB Functional assignments BLAST PFAM/TIGRFAM SignalP/TMHMM Gene models Gene finders Alignments of genomic to proteins and ESTs

55 PFB0680w

56

57 The P. falciparum genome

58 Distribution of gene lengths 15.5% 3.0-3.6%

59 The P. falciparum proteome

60 52% of predicted gene products detected by proteomics (Florens et al., Nature 419:520-526)

61 Metabolism and transport
- Analysis based on similarity searches with sequences of known enzymes
- 14% (733) of genes encode enzymes
- Lower than in bacterial genomes (25-33%)
- Either enzymes are more difficult to identify, due to the AT-rich genome and the evolutionary distance between P.f. and other sequenced organisms
- Or P.f. has a smaller proportion of its genome devoted to enzymes, i.e. reduced metabolic potential

62

63 Analysis of transporters in P. falciparum

64 Organization of multi-gene families in P. falciparum

65

66 P. falciparum Genome Summary
Feature | Value | Comments
Genome size | 24 million base pairs | 1% of the human genome
Number of chromosomes | 14 | 23 pairs
Number of gaps | 93 (0-37 per chr) | Genome >98% complete
(A+T) content | ~80.6% | Most (A+T) rich genome sequenced to date
Number of genes | ~5,300 | Yeast: 5,770; Human: ~35,000
Proteins of unknown function | 60% | More than other genomes
Possible surface proteins | ~900 | Test for use in vaccines
Gene products detected by proteomics | 52% | See Florens et al.; see Lasonder et al.
Genes conserved in rodent malaria P. yoelii yoelii | 60% | See Carlton et al.

67 Extra Slides

68

