Large-scale genome projects


1 Large-scale genome projects
Libraries Sequencing Release Assembly Annotation Closure Strategy
Sequencing DNA molecules in the Mb size range. All strategies employ the same underlying principle: random shotgun sequencing.

2 Genomic DNA, shotgun reads, contigs, complete sequence
Genomic DNA -> (shearing/sonication; subclone and sequence) -> shotgun reads -> (assembly) -> contigs -> (finishing reads; finishing) -> complete sequence
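
The reads-to-contigs step can be illustrated with a toy greedy overlap merge. This is only a minimal sketch under stated assumptions: the function names, the example sequence, and the `min_len` cutoff are hypothetical, the reads are error-free and all on one strand, and it is not the assembler used in the projects described here.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def drop_contained(seqs):
    """Remove duplicates and any sequence that lies wholly inside another."""
    unique = list(dict.fromkeys(seqs))
    return [s for s in unique if not any(s != t and s in t for t in unique)]


def greedy_assemble(reads, min_len=4):
    """Repeatedly merge the pair of sequences with the longest suffix/prefix overlap."""
    contigs = drop_contained(reads)
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge; remaining gaps need finishing
        merged = contigs[best_i] + contigs[best_j][best_len:]
        rest = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs = drop_contained(rest + [merged])


# Hypothetical 31-base "genome" cut into four overlapping shotgun reads.
genome = "ATGGCGTACGTTAGCCTAGGATCCGATTACA"
reads = [genome[0:12], genome[8:20], genome[16:28], genome[24:31]]
contigs = greedy_assemble(reads, min_len=4)
```

With real data, repeats longer than the overlap cutoff make greedy merging unreliable, which is one reason the deck lists repeats as a problem for the shotgun strategies.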

3 Nucleotide Database Growth

4 EMBL breakdown by organism

5 EMBL Release 65

6 Progress on Large Sequencing Projects

7 Strategies for sequencing
How big can you go? Large-insert clones (cosmids kb, BACs/PACs kb); whole chromosomes; whole genomes

8 Genome size and sequencing strategies
[Figure: genome size on a log scale (log Mb, axis 1 to 4) against sequencing strategy: whole genome shotgun (WGS); clone-by-clone; whole chromosome shotgun (WCS); WGS with clone 'skims'.]
Genomes shown: H. sapiens (3000 Mb); D. melanogaster (170 Mb); C. elegans (100 Mb); P. falciparum (30 Mb); S. cerevisiae (14 Mb); E. coli (4 Mb)

9 Genomic DNA, shotgun reads, contigs, complete sequence
Genomic DNA -> (shearing/sonication; subclone and sequence) -> shotgun reads -> (assembly) -> contigs -> (finishing reads; finishing) -> complete sequence

10 Strategies for sequencing
Size and GC composition of genome; volume of data; ease of cloning; ease of sequencing; genome complexity (dispersed repetitive sequence, telomeres & centromeres); politics/funding

11 Strategies: Clone by Clone
Simple ( K reads); few problems with repeats; relatively simple informatics; scalability; quality of physical map; fingerprint / STS maps; end sequencing

12 Strategies: Whole Chromosome shotgun (WCS)
Requires chromosome isolation; moderate complexity (10s of K reads); problems with repeats; complex informatics; inefficient in isolation; quality of physical map; skims of mapped clones

13 Strategies: Whole Genome shotgun (WGS)
Moderate to high complexity (10s to 100s of K reads); problems with repeats; complex informatics; quality of physical map; fingerprint map; STS markers; end-sequences; skims of mapped clones

14 Sequencing my genome
Politics; production; finishing; annotation; TIME; MONEY

15 What do you get?
DATA!!, DATA!!, and more DATA!! Sequence (incomplete v complete); first-pass annotation; gene discovery; full annotation; a starting point for research

16 Genome annotation is central to functional genomics
Gene knockout; expression microarray; RNAi phenotypes; ORFeome-based functional genomics

17

18

19 Sequencing
Library construction; colony picking; DNA preparation; sequencing reactions; electrophoresis; tracking/base calling

20 Libraries
Essentially sub-cloning: generation of small-insert libraries in a well-characterised vector. Ease of propagation; ease of DNA purification; e.g. pUC18, M13

21 Libraries - testing
Simple concepts; insert/vector ratio; real data; insert size; sequence ….; simple analysis

22 Sequence generation
Pick colonies; template preparation; sequence reactions (standard terminator chemistry; pUC libraries sequenced with forward and reverse primers)

23 Sequence generation
Electrophoresis of products: old style, slab gels (32 > 64 > 96 lanes); new style, capillary gels (96 lanes). Transfer of gel image to UNIX: sequencing machines use a slave Mac/PC; move data to centralised storage area for processing

24 Gel image processing
Light-to-dye estimation; lane tracking; lane editing; trace extraction; trace standardisation; mobility correction; background subtraction

25 Pre-processing
Base calling using Phred (modifies SCF file); quality clipping; vector clipping (sequencing vector, cloning vector); screen for contaminants; feature mark-up (repeats/transposons)
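
Quality clipping can be sketched as finding the stretch of a read whose Phred qualities, taken together, sit furthest above a cutoff. This is a minimal illustration, assuming a maximum-scoring-segment approach; the function name, the cutoff of 20, and the example qualities are hypothetical and not taken from the pipeline described here.

```python
def quality_clip(quals, threshold=20):
    """Return (start, end), half-open, of the segment maximising sum(q - threshold).

    Returns (0, 0) when no base exceeds the threshold.
    """
    best_sum, best = 0, (0, 0)
    run_sum, run_start = 0, 0
    for i, q in enumerate(quals):
        run_sum += q - threshold
        if run_sum <= 0:
            # The running segment no longer pays for itself; restart after this base.
            run_sum, run_start = 0, i + 1
        elif run_sum > best_sum:
            best_sum, best = run_sum, (run_start, i + 1)
    return best


# Hypothetical per-base qualities: poor at both ends, one weak base in the middle.
quals = [5, 8, 30, 35, 40, 38, 12, 33, 9, 4]
start, end = quality_clip(quals)
```

The single weak base at position 6 is kept because the good bases on either side outweigh it; a stricter scheme could cut there instead.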

26

27 Finishing
Assembly: the process of turning raw single-pass reads into contiguous consensus sequence.
Closure: the process of ordering and merging consensus sequences into a single contiguous sequence.
Finished is defined as sequenced on both strands using multiple clones. In the absence of multiple clones, the clone must be sequenced with multiple chemistries. The overall error rate is estimated at less than 1 error per 10 kb.
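
For a sense of scale, the 1-error-per-10-kb target can be put on the Phred quality scale, where Q = -10 log10(P) and P is the per-base error probability. The helper names below are hypothetical; only the Phred definition itself is standard.

```python
import math


def phred_from_error(p):
    """Phred quality for a per-base error probability p."""
    return -10 * math.log10(p)


def error_from_phred(q):
    """Per-base error probability for a Phred quality q."""
    return 10 ** (-q / 10)


# 1 error in 10 kb is an error probability of 1e-4 per base.
finished_q = phred_from_error(1 / 10_000)
```

So the finished standard corresponds to a consensus quality of about Q40, compared with the Q20 (1 error in 100) often used as a cutoff for single reads.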

28 Genome Assembly
Pre-assembly; assembly; automated appraisal; manual review

29 Pre-Assembly
Convert to CAF format: flatfile text format; choice of assembler; choice of post-assembly modules; choice of assembly editor

30 Assembly
Assemble using Phrap; read FASTA & quality scores from CAF file; merge existing Phrap .ace file as necessary; adjust clipping

31 Assembly appraisal
auto-edit: removes 70% of read discrepancies; remove cloning vector; mark up sequence features. finish: identify low-quality regions; cover using 're-runs' and 'long-runs'. Compare with current databases (plate contamination)

32 Manual Assembly appraisal
Use a sequence editor (GAP/consed); tools to identify internal joins; tools to identify and import data from overlapping projects; tools to check failed or mis-assembled reads for inclusion in project

33 Manual editing
Sanger uses 100% edit strategy. Where additional data is required: check clipping; additional sequencing (template / primer / chemistry); assemble new data into project (GAP4 auto-assemble); repeat whole process

34 Manual Quality Checks
Force annotation tag consistency; all unedited data is re-assembled using Phrap; all high-quality discrepancies are reviewed; confirm restriction digest (clones); check for inverted repeats. Manually check: areas of high-density edits; areas with no supporting unedited data; areas of low read coverage

35 Gap closure
Read pairs; PCR reactions (long-range / combinatorial); small-insert libraries; transposon-insertion libraries

36 Gap closure - contig ordering
Read pair consistency; STS mapping; physical mapping; genetic mapping; optical mapping; large-insert clones (skims, end-sequencing)
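
Read-pair evidence for contig ordering can be sketched as counting clones whose forward and reverse reads landed in different contigs. This is a minimal illustration only: the read names, contig names, and the `count_contig_links` helper are hypothetical, and real scaffolding would also check read orientation and insert size.

```python
from collections import Counter


def count_contig_links(read_to_contig, read_pairs):
    """Count read pairs whose two ends were assembled into different contigs."""
    links = Counter()
    for fwd, rev in read_pairs:
        a = read_to_contig.get(fwd)
        b = read_to_contig.get(rev)
        if a is None or b is None or a == b:
            continue  # unplaced read, or both ends in the same contig
        links[tuple(sorted((a, b)))] += 1
    return links


# Hypothetical placements of forward (.f) and reverse (.r) reads from four clones.
placement = {
    "p1.f": "contig1", "p1.r": "contig2",
    "p2.f": "contig1", "p2.r": "contig2",
    "p3.f": "contig2", "p3.r": "contig3",
    "p4.f": "contig3", "p4.r": "contig3",
}
pairs = [("p1.f", "p1.r"), ("p2.f", "p2.r"), ("p3.f", "p3.r"), ("p4.f", "p4.r")]
links = count_contig_links(placement, pairs)
```

Contig pairs supported by several independent clones are better candidates for adjacency (and for gap-spanning PCR) than pairs linked only once.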

37

38 Annotation
DNA features (repeats/similarities); gene finding; peptide features; initial role assignment; others - regulatory regions

39 Annotation of eukaryotic genomes
[Figure: genomic DNA -> (transcription) -> unprocessed RNA -> (RNA processing) -> mature mRNA (Gm3, AAAAAAA) -> (translation) -> nascent polypeptide -> (folding) -> active enzyme -> function (reactant A -> product B). Annotation steps shown alongside: ab initio gene prediction; comparative gene prediction; functional identification.]

40 Genome analysis overview: C.elegans

41 DNA features
Similarity features. Mapping repeats: simple tandem and inverted; repeat families. Mapping DNA similarities: EST/mRNAs in eukaryotes; duplications, RNAs. Mapping peptide similarities: protein similarities

42 Gene finding
ORF finding (simple but messy); ab initio prediction (measures of codon bias; simple statistical frequencies); comparative prediction (using similarity data; using cross-species similarities)
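
ORF finding really is simple to write down, which also shows why it is messy. The sketch below is a minimal version under stated assumptions: the function name, the example sequence, and the `min_codons` cutoff are hypothetical, and only the three forward frames are scanned (a fuller scan would repeat this on the reverse complement).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for each ATG...stop stretch on the forward strand.

    Coordinates are 0-based and half-open; the stop codon is included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Hypothetical sequence: a 4-codon ORF in frame 2 and a 2-codon ORF in frame 0.
seq = "CCATGAAATTTTAGGATGTAA"
orfs = find_orfs(seq)
```

The length cutoff does all the work here: set it low and spurious short ORFs flood the output, set it high and genuine short genes are lost, and spliced eukaryotic genes are missed either way.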

43 Peptide features
Peptide features: low-complexity regions; trans-membrane regions; structural information (coiled-coil). Similarities and alignments. Protein families (InterPro/COGs)

44 Initial role assignment
Simple attempt to describe the functional identity of a peptide. Uses data from peptide similarities and protein families. Vital for data mining. A large number of predicted genes remain hypothetical or unknown.

45 Other regulatory features
Ribosomal binding sites; promoter regions

46

47 Data Release
DNA release: unfinished; finished. Nucleotide databases: GENBANK/EMBL/DDBJ. Peptide databases: SWISSPROT/TREMBL/GENPEPT. Others

