DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005.


1 DM Church- NCBI Assembling and Annotating Genomes Deanna M. Church NCBI January 12, 2005

2 DM Church- NCBI Of mice and men

3 DM Church- NCBI Fleischman et al. (1991) PNAS 88:10885-10889 Both carry mutations in the Kit gene. Of mice and men

4 DM Church- NCBI The Basic Model Gene Structure Mature Peptide ProPeptide mRNA Transcript Chromosome Resources (Maps, Clones, etc) Genomes Organisms Function/Phenotype Disease

5 DM Church- NCBI Why sequence?
I. Complete ‘parts’ list for a given organism (genes, promoters, regulatory regions, variation, ????): high quality, finished (or ‘essentially finished’) sequence
II. Genes, Genes, Genes…: draft is probably good enough
III. Annotating a finished genome (Human, soon to be mouse): low coverage (2X sequence coverage)

6 DM Church- NCBI What data is represented in GenBank… Data in GenBank is an interpretation of primary sequence data
- Sequence reaction
- Read gel/call chromatogram (Phred/TraceTuner)
- Submit sequence (steps for small, single pass sequence)
- Assemble sequence and submit consensus (Phrap, CAP3, CAP4) (last step for large molecules: BACs, fosmids, long cDNAs)

7 DM Church- NCBI Getting the raw data NCBI Vladimir Alekseyev Alexey Egorov Anton Butanayev Sergiy Ponomarev Eugene Yaschenko Deanna Church Breadth * 237 defined species * 2 environmental sample Depth * 9 Drosophilas * 3 Canis * 5 Aspergillus >500 Million Traces (and counting…) http://www.ncbi.nlm.nih.gov/Traces/

8 DM Church- NCBI Getting the raw data NCBI Vladimir Alekseyev Alexey Egorov Anton Butanayev Sergiy Ponomarev Eugene Yaschenko Deanna Church And they just keep coming…

9 DM Church- NCBI Getting the raw data NCBI Vladimir Alekseyev Alexey Egorov Anton Butanayev Sergiy Ponomarev Eugene Yaschenko Deanna Church Scripted access for bulk retrieval

10 DM Church- NCBI Genome Sequencing Strategies Not all bases are created equal. Phred quality scores measure the probability that a base is incorrect. If a base has a 1/1000 probability of being incorrect, it has a Phred score of 30. [Figure: plot of quality scores by base; axis mark at 20]
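
The relationship quoted on this slide follows the standard Phred definition, Q = -10 log10(P). The sketch below only illustrates that conversion; the function names are made up for this example and are not part of Phred or TraceTuner.

```python
import math


def phred_from_error_prob(p_error: float) -> float:
    """Phred quality Q = -10 * log10(P), P = probability the base call is wrong."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_prob_from_phred(q: float) -> float:
    """Inverse conversion: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


# A 1/1000 chance of error corresponds to Phred 30 (the slide's example).
print(round(phred_from_error_prob(1 / 1000), 6))
# Phred 20, the usual "high quality" cutoff, is a 1/100 chance of error.
print(error_prob_from_phred(20))
```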

11 DM Church- NCBI Science (June, 1998) Craig Venter Private and public efforts… Science (September, 1998)

12 DM Church- NCBI Putting Genomes Together: Hierarchical Shotgun Assembly
BAC insert in BAC vector: shotgun sequence, then assemble. This part is relatively cheap and easy.
Fold sequence coverage for a 200 Kb BAC at 0.5 Kb/read: 400 reads = 1X, 2000 reads = 5X.
GAPS: deeper sequence coverage rarely resolves all gaps. This part is hard and expensive: “finishers” go in to manually fill the gaps, often by PCR.
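
The coverage arithmetic on this slide can be checked with a few lines. This is a minimal sketch; the function names are illustrative. The last helper uses the idealized Lander-Waterman estimate (uncovered fraction about e^-c for uniformly random reads), which is an assumption added here and not something stated on the slide. Real clones have cloning bias and repeats, which is one reason gaps persist beyond what this model predicts.

```python
import math


def fold_coverage(num_reads: int, read_length_bp: int, target_length_bp: int) -> float:
    """Fold coverage = total bases sequenced / length of the target."""
    return num_reads * read_length_bp / target_length_bp


def reads_needed(coverage: float, read_length_bp: int, target_length_bp: int) -> int:
    """Number of reads required to reach a given fold coverage."""
    return math.ceil(coverage * target_length_bp / read_length_bp)


def expected_uncovered_fraction(coverage: float) -> float:
    """Idealized (Lander-Waterman) fraction of bases hit by no read: e^-c."""
    return math.exp(-coverage)


# Slide numbers: 200 Kb BAC, 0.5 Kb per read.
print(fold_coverage(400, 500, 200_000))    # 1X
print(fold_coverage(2000, 500, 200_000))   # 5X
print(reads_needed(5, 500, 200_000))
print(expected_uncovered_fraction(5))
```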

13 DM Church- NCBI HTGS keywords
htgs_phase0: low coverage sequence, 1-2X
htgs_phase1: generally 4-5X sequence coverage, several fragments not ordered or oriented
htgs_phase2: sequence coverage can vary (generally 5-10X) but fragments are ordered and oriented
htgs_phase3: highly accurate, finished sequence. Error rate <10^-5
Draft sequence: phase 1 or 2, but >90% of the bases are high quality (phred 20 or better)
htgs_active_fin: center has finished shotgun phase and moved to finishing
htgs_cancelled: sequencing has been discontinued on this clone
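
The "draft sequence" criterion above (more than 90% of bases at phred 20 or better) is easy to express as a check over per-base quality values. The sketch below assumes the qualities are already available as a list of integers; the function names and thresholds-as-parameters are illustrative, not an NCBI tool.

```python
def fraction_high_quality(quals, min_q=20):
    """Fraction of bases whose Phred quality is at least min_q."""
    if not quals:
        raise ValueError("no quality values supplied")
    return sum(q >= min_q for q in quals) / len(quals)


def meets_draft_quality(quals, min_q=20, min_fraction=0.90):
    """True if strictly more than min_fraction of the bases reach min_q."""
    return fraction_high_quality(quals, min_q) > min_fraction


# 95 good bases and 5 poor ones: passes the >90% criterion.
print(meets_draft_quality([30] * 95 + [10] * 5))
# Exactly 90% good bases does not satisfy "more than 90%".
print(meets_draft_quality([30] * 90 + [10] * 10))
```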

14 DM Church- NCBI The Raw Data

15 DM Church- NCBI Putting genomes together (UCSC: Jim Kent; NCBI: Paul Kitts, Greg Schuler, Richa Agarwala)
- Remove contaminants (vector, E. coli, other organisms, virus)
- Bin clones by chromosome arm
- Incorporate clone order information using TPF
- Identify fragment overlaps
- Determine fragment order and orientation, remove sequence redundancy (this produces sequence contigs given NT_XXXXXX type accession numbers)
- Place contigs on chromosome

16 DM Church- NCBI Overlapping draft clones Break clones into constituent fragments Reassemble using sequence overlaps Order using ESTs, mRNAs, plasmid ends, curation When BAC clones overlap, the sequence can be made non-redundant. These “contigs” are given NT_XXXXXX accession numbers UCSC Jim Kent NCBI Paul Kitts Greg Schuler Richa Agarwala Putting genomes together

17 DM Church- NCBI Sequence Tagged Sites (STS)
STS marker D6S1606: forward primer, microsatellite, reverse primer. PCR product size: 92 - 100 bases
Sequence: GAGTTTGCACCATTGCACTCCAGCCTGGGCAAC (CA)n AACGTGGCATGTGCCTGTACTCTCC
Complementary strand: CTCAAACGTGGTAACGTGAGGTCGGACCCGTTG (GT)n TTGCACCGTACACGGACATGAGAGG
A common language for physical mapping of the human genome. M. Olson, L. Hood, C. Cantor, and D. Botstein. Science 245, 1434-1435 (1989).

18 DM Church- NCBI The Original Genome Resources- STS Maps
Fragment the genome by: meiosis- genetic; radiation- RH; clones- clone based
- each line represents an individual cell line/animal that carries a particular break
- STSs can be amplified from DNA in these cell lines/animals
- based on cell line/animal marker content, the breaks can be determined and the markers ordered.
[Figure: gel lanes 1-30 with 129, water, and hamster controls, typed for marker D2Wsu129e]

19 DM Church- NCBI Electronic PCR (e-PCR)
STS marker D6S1606: forward primer, microsatellite repeat, reverse primer. PCR product size: 92 - 100 bases
Sequence: GAGTTTGCACCATTGCACTCCAGCCTGGGCAACAAGAGTGAAACTCTGTCACAGA (CA)n AACGTGGCATGTGCCTGTACTCTC
Complementary strand: CTCAAACGTGGTAACGTGAGGTCGGACCCGTTGTTCTCACTTTGAGACAGTGTCT (GT)n TTGCACCGTACACGGACATGAGAG
E-PCR software searches DNA sequences for exact matches to both primers in the correct order, orientation, and spacing to be consistent with the known PCR product size. Schuler (1997), Genome Research 7, 541-550
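
The search described on this slide (exact primer matches in the right order, orientation, and spacing) can be sketched in a few lines. This is not the NCBI e-PCR program: it is a toy that scans only the given strand, uses naive string search, and runs on made-up primers and a made-up template rather than D6S1606. All names are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string (A/C/G/T only)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def epcr_hits(template, fwd_primer, rev_primer, min_size, max_size):
    """Return (start, product_size) for each exact STS hit on the given strand.

    A hit needs the forward primer, then downstream the reverse complement of
    the reverse primer, with a total product size in [min_size, max_size].
    """
    template = template.upper()
    fwd = fwd_primer.upper()
    rev_site = reverse_complement(rev_primer.upper())
    hits = []
    start = template.find(fwd)
    while start != -1:
        pos = template.find(rev_site, start + len(fwd))
        while pos != -1:
            size = pos + len(rev_site) - start
            if size > max_size:
                break  # sites further downstream only give larger products
            if size >= min_size:
                hits.append((start, size))
            pos = template.find(rev_site, pos + 1)
        start = template.find(fwd, start + 1)
    return hits


# Toy example: forward primer, a (CA)n repeat, then the reverse primer site.
toy_template = "TTT" + "GATTACA" + "CACACACACA" + "GGCCTTAA" + "TTT"
print(epcr_hits(toy_template, "GATTACA", "TTAAGGCC", 20, 30))
```

Requiring the product size to fall in a window is what lets a length-polymorphic marker such as a microsatellite still be found when the repeat count differs between the sequenced allele and the original STS.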

20 DM Church- NCBI Electronic PCR (e-PCR) http://www.ncbi.nlm.nih.gov/sutils/e-pcr/

21 DM Church- NCBI A B C D E F G H I J K L M N O A B C D F G H K L O N Ideally… Non-sequence based Map (flip) A B C D F G H K L O N Putting genomes together

22 DM Church- NCBI More like… A B C D E F G H I J K L M N O A B C Z Y X W H J M V N O A B H I J C D Y L M N O A B H I J L M N O ? Putting genomes together

23 DM Church- NCBI Human assembly: Build 35
The Starting Material:
              Phase 1   Phase 2   Phase 3
Number        10632     777       30470
Length (Kb)   1726.24   101.11    3621.30
http://www.ncbi.nlm.nih.gov/genome/guide/human/HsStats.html
Contig Information:
Framework assemblies: 388 contigs- 3.02 Gb
Type of source sequence   Number used   Length (bp)
Draft only                46            10,284,900
Finished only             334           2,833,780,000
Reference Contig N50: 38.5 Mb
N50 length: Contig length at which 50% of the bases in the assembly reside in a contig of at least that size.
Assembly is now defined by AGP* files rather than a formal assembly process. These are maintained by chromosome coordinators.
*AGP= A Golden Path
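
The N50 definition on this slide translates directly into code: sort contigs from longest to shortest and walk down until half of the total bases are accounted for. The sketch below is a minimal illustration with made-up contig lengths; it is not how NCBI computes its published statistics.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    if not lengths:
        raise ValueError("no contig lengths supplied")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy assembly: 300 bases in total; the two longest contigs (80 + 70)
# reach the halfway point, so the N50 is 70.
print(n50([80, 70, 50, 40, 30, 20, 10]))
```

Because it is weighted by bases and not by contig count, N50 is barely affected by thousands of tiny contigs, which is why it is a more informative summary than the mean contig length.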

24 DM Church- NCBI Current Human assembly: Build 34 (the essentially finished genome)
Contig information:
Range in kb   Number   Length (kb)   Percent of total
<300          218      30,276        1
300-1000      74       44,028        1.45
1000-5000     87       208,365       6.89
>5000         119      2,737,630     90.64
N50: 29,105
N50 length: Contig length at which 50% of the bases in the assembly reside in a contig of at least that size.

25 DM Church- NCBI Contigs and components in the MapViewer

26 DM Church- NCBI Aug. 2001 19.2M (3.5X) Oct. 2001 25.2M (4.5X) Nov. 2001 30M (5.5X) Feb. 2002 40.1M (7X) Mouse Genome Sequencing

27 DM Church- NCBI Putting genomes together: WGS (David Jaffe, Jim Mullikin)
- Restrict and make libraries: 2, 4, 8, 10, 40, 150 kb
- For mouse project only 40 kb clones and BAC clones are available; BAC clones were constructed and end sequenced before WGS project started
- End-sequence all clones and retain pairing information (“mate-pairs”); each end sequence is referred to as a read
- Find sequence overlaps to build a WGS contig

28 DM Church- NCBI Constructing Supercontigs (scaffolds) David Jaffe Jim Mullikin Putting genomes together

29 DM Church- NCBI Intermediate assemblies (Sanger Institute: Jim Mullikin; WIBR: David Jaffe; NCBI: Richa Agarwala, Victor Sapojnikov, Wratko Hlavina, Deanna Church)
Contig N50                   Arachne     Phusion
Oct 4.5X                     6.6 Kb      6.2 Kb
Nov 5.5X                     9.6 Kb      11.3 Kb
Feb 7X                       24.8 Kb     20.1 Kb
SuperContig N50              Arachne     Phusion
Oct 4.5X                     60.3 Kb     33.5 Kb
Nov 5.5X                     110.7 Kb    88.6 Kb
Feb 7X                       17,727 Kb   6495 Kb
Transcript Coverage: mRNA    Arachne     Phusion
Oct 4.5X                     57.49%      68.14%
Nov 5.5X                     72.40%      74.47%
Feb 7X                       95.71%      96.88%
Transcript Coverage: RefSeq  Arachne     Phusion
Oct 4.5X                     61.38%      73.25%
Nov 5.5X                     79.74%      80.98%
Feb 7X                       98.91%      99.32%

30 DM Church- NCBI The Mouse Genome- MGSCv3 (The Mouse Genome Sequencing Consortium) Waterston et al, 2004
David Jaffe- Arachne; Jim Mullikin- Phusion
The Starting Material:
- 40.7 million WGS reads (2,4,6,10,40 Kb)
- ~450,000 BAC end sequences (RPCI-23: 197 Kb; RPCI-24: 155 Kb)
The Assembly (+ 274 finished BACs, 49.5 Mb):
- Total length of the assembly: 2.5 Gb (90.9 % of genome)*
- 224,713 WGS contigs (CAAA01000100)
  - Length of contigs > 1kb: 2.53 Gb
  - Length of contigs with >= 1 BES: 2.06 Gb
  - Length of contigs with >= 1 mapped STS: .344 Gb
  - N50 length: 24.8 Kb
  - Mapped: 173550
- 42,620 Supercontigs (NW_XXXXXX)
  - Length of sc >= 1 BES: 2.41 Gb
  - Length of sc >= 1 mapped STS: 2.4 Gb
  - N50 length: 17.7 Mb
  - Mapped: 366 (others in ChrUn)
  - N50 of mapped supercontigs: 17 Mb
  - N50 of unmapped supercontigs: 4.9 Kb
* Assumes a 2.75 Gb genome

31 DM Church- NCBI The Mouse Genome- over time… MGSCv3 [Figure: chromosomes 1-19 and X shaded by sequence source: Finished, Draft, WGS, Gap]

32 DM Church- NCBI Contig/Supercontig size by chromosome [Figure: Contig (Kb) and Supercontig (Mb) size for chromosomes 1-19 and X; axis 0-80]

33 DM Church- NCBI How does MGSCv3 compare to Non-Sequence based maps Chromosome 7 ~80% of STS markers on WI-Genetic Map localized by e-PCR ~72% of STS markers on WI/MRC RH Map localized by e-PCR <3% chromosome conflict. WI-Gen map WI/MRC RH map

34 DM Church- NCBI Finished NT Contig By Build Build 29 Build 30 Build 32 Build 33 Estimated Length Finished sequences are used to build hand-curated contigs (NT contigs) Currently ~1.8 Gb (mostly) non-redundant sequence 1.1 Gb in Build 33

35 DM Church- NCBI The Mouse Genome- over time… NCBI Richa Agarwala
Mouse Build 30: Integrated 730 Mb of Finished C57BL/6J sequence into the assembly. MGSCv3 was used as a Tiling Path to guide the assembly.
Freeze date: Jan 27, 2003. Release date: Feb 27, 2003
[Figure: chromosomes 1-19 and X shaded by sequence source: Finished, Draft, WGS, Gap]

36 DM Church- NCBI The Mouse Genome- combining resources… NCBI Richa Agarwala Deanna Church Unplaced versus Total curated Contigs Build 30 Unplaced Total contigs.56%.27% 1.83% 1.93% 4.07% 3.64% 3.61% 1.19% 2.94% 0 0 5.56% 1.38% 4.48% 0 0 1.27% 1.41% 0 0.9% 100% 780 Mb of Curated NT Sequence

37 DM Church- NCBI The Mouse Genome- combining resources… NCBI Richa Agarwala Deanna Church
Mmu4 unplaced contigs (Build 30): 10 unplaced NT contigs (11 GenBank accessions)
- Do align to WGS contigs mapped to Mmu4: NT_039271 NT_039272 NT_039276 NT_039280
- Align to WGS contigs mapped to another chromosome: NT_039273 (MmuX)
- No hits/bad hits (mostly chrUn): NT_039269 NT_039270 NT_039274 NT_039278 NT_039279

38 DM Church- NCBI Segmental Duplications: large, nearly identical copies of genomic DNA (> 1 Kb, > 90% identity), either intrachromosomal or interchromosomal. Case Western Reserve Evan Eichler Jeff Bailey NCBI Deanna Church

39 DM Church- NCBI Segmental Duplications
WGAC Analysis (Whole Genome Assembly Comparison): BLAST the genome against itself and look for sequence similarity. Caveat: difficult to distinguish between biological duplication and artificial duplication introduced when producing draft assemblies.
WSSD Analysis (Whole Genome Shotgun Sequence Detection): BLAST WGS reads against an assembly and look for increased depth of coverage.
Case Western Reserve Evan Eichler Jeff Bailey NCBI Deanna Church
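
The WSSD idea (a collapsed duplication attracts reads from every copy, so it shows excess read depth) can be illustrated with a small depth-of-coverage scan. Everything in the sketch below is an assumption made for illustration: reads are given as already-aligned start positions of fixed length, the window size is arbitrary, and the cutoff is a simple multiple of the genome-wide mean depth. The published WSSD analysis used its own alignment filters and thresholds, which are not reproduced here.

```python
def per_base_depth(read_starts, read_length, genome_length):
    """Read depth at every position, built with a difference array."""
    diff = [0] * (genome_length + 1)
    for start in read_starts:
        end = min(start + read_length, genome_length)
        diff[start] += 1
        diff[end] -= 1
    depth, running = [], 0
    for i in range(genome_length):
        running += diff[i]
        depth.append(running)
    return depth


def high_depth_windows(read_starts, read_length, genome_length,
                       window, fold_threshold=2.0):
    """Windows whose mean depth is >= fold_threshold x the overall mean depth."""
    depth = per_base_depth(read_starts, read_length, genome_length)
    mean_depth = sum(depth) / genome_length
    flagged = []
    for w in range(0, genome_length, window):
        chunk = depth[w:w + window]
        if sum(chunk) / len(chunk) >= fold_threshold * mean_depth:
            flagged.append((w, w + len(chunk)))
    return flagged


# Toy data: 1X tiling of a 100 bp "genome", plus four extra layers of reads
# over positions 40-59, mimicking a collapsed duplication.
starts = list(range(0, 100, 10)) + [40, 50] * 4
print(high_depth_windows(starts, read_length=10, genome_length=100, window=20))
```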

40 DM Church- NCBI Segmental Duplications MGSCv3 (>20Kb; >95%) Case Western Reserve Evan Eichler Jeff Bailey NCBI Deanna Church

41 DM Church- NCBI Segmental Duplications MGSCv3 (>90% ID; >10 Kb) 60% of all duplications map to chrUn in MGSCv3 Case Western Reserve Evan Eichler Jeff Bailey NCBI Deanna Church

42 DM Church- NCBI Segmental Duplications: Comparison of duplication in the Mouse and Human Genomes (WGAC analysis)
Size     Human- Build 31   MGSCv3 (2.55 Gb)       Mouse Build 29 (0.439 Gb, Finished BACs only)
         (2.75 Gb)         w/ unpl    w/o unpl    initial    filtered
>1 Kb    5.25%             ND         ND          3.74%      2.35%
>5 Kb    4.78%             1.95%      1.01%       3.25%      2.00%
>10 Kb   4.52%             0.70%      0.38%       2.71%      1.60%
>20 Kb   4.06%             0.11%      0.10%       2.23%      1.14%
Duplications are underrepresented in the Whole Genome Assembly (MGSCv3)
Case Western Reserve Evan Eichler Jeff Bailey NCBI Deanna Church

43 DM Church- NCBI Segmental Duplications Unique: pre-quality score Unique: post-quality score Duplicated: pre-quality score Duplicated: post-quality score WSSD Finished BACs Case Western Reserve Evan Eichler Jeff Bailey NCBI Deanna Church

44 DM Church- NCBI Segmental Duplications WSSD (>95% id) analysis of Build 30 BACs >10 Kb >20 Kb >5 Kb >1 Kb BACs MGSCv3 w/ Un w/o Un ND 1.51% 1.46% 2.09% 0.27% 2.01% 0.23% (4298 BACs tested) 141 dup pos BACs. The 6 BACs (5 NT clones) from Mmu4 that hit chrUn are on the duplication positive list. Case Western Reserve Evan Eichler Jeff Bailey NCBI Deanna Church

45 DM Church- NCBI Segmental Duplications Case Western Reserve Evan Eichler Jeff Bailey

46 DM Church- NCBI Segmental Duplications Bari Italy Mario Ventura Mariano Rocchi RP23-3D2 chr.X_A3 Validated 18/27 (67%) of in silico predictions by FISH; 16/18 (~90%) were clustered intrachromosomal duplications. This region is described in Mileham and Brown (1996) as ‘a repeat sequence island’

47 DM Church- NCBI Segmental Duplications Gene Content of Duplications DomainUDEnrichment serpin39657.5 lectin_c75419.9 7tm20835.4 ANF_receptor34333 Defensin_propep33373.5 KRAB68316.5 defensins23560.3 lipocalin23232.5 AAA35110.7 DEAD4119.1 ENV_polyprotein4193.4 MAGE5174.7 RNA_helicase10137.4 Human: ~ 5% of the Genome is in Duplicated regions ~6% of RefSeqs align to these regions Mouse: ~1.5-2% of the Genome is in Duplicated regions ~0.5% of RefSeqs align to these regions Case Western Reserve Evan Eichler Jeff Bailey NCBI Deanna Church

48 DM Church- NCBI MGSCv3 Duplication Analysis Build 33 data (both non redundant dup) Evan Eichler Xinwei She Ginger Chang Eray Tuzan Deanna Church
Columns: chromosome; WGAC (Mb); WSSD supported WGAC (Mb); WSSD overlap WGAC (%); WSSD (Mb); WGAC overlap WSSD (%); Proportion of WSSD supported WGAC in chrom (%)
chr1    3.25    0.38    11.58   0.57    66.51   0.21
chr2    2.03    0.13    6.57    0.32    42.11   0.08
chr3    2.17    0.11    5.23    0.16    69.09   0.08
chr4    2.19    0.27    12.12   0.69    38.64   0.19
chr5    2.81    0.42    14.96   0.88    47.92   0.31
chr6    3.72    0.37    9.97    0.86    43.00   0.27
chr7    4.48    0.78    17.41   2.10    37.16   0.64
chr8    1.54    0.15    9.54    0.27    54.63   0.12
chr9    1.56    0.10    6.11    0.34    28.03   0.08
chr10   1.62    0.10    5.94    0.19    51.39   0.08
chr11   1.13    0.08    6.94    0.21    36.63   0.07
chr12   1.79    0.39    21.85   0.88    44.42   0.37
chr13   1.86    0.41    22.08   1.01    40.66   0.38
chr14   1.19    0.15    12.39   0.33    44.38   0.14
chr15   0.94    0.04    3.87    0.05    77.47   0.04
chr16   1.08    0.01    0.75    0.02    40.64   0.01
chr17   3.35    0.22    6.62    0.99    22.30   0.26
chr18   0.75    0.02    2.62    0.02    87.52   0.02
chr19   0.92    0.05    5.53    0.31    16.52   0.09
chrUn   23.78   13.03   54.80   82.02   15.89   12.91
chrX    3.17    0.31    9.91    0.86    36.41   0.23

49 DM Church- NCBI The Mouse Genome- combining resources… NCBI Richa Agarwala Deanna Church
Mouse Build 32: Integrated Finished and Draft C57BL/6J sequence into the assembly
- Clone based Tiling Paths for chromosomes: 2,4,5,7,11,15,18,19,X,Y
- MGSCv3 used as a Tiling Path for chromosomes: 1,3,6,8,9,10,12,13,14,16,17
Freeze date: Sep 26, 2003. Release date: Nov 4, 2003
Many problems:
- in silico duplication
- genes ‘thrown off’ chromosomes
[Figure: chromosomes 1-19, X and Y shaded by sequence source: Finished, Draft, WGS, Gap]

50 DM Church- NCBI Mouse assemblies: Build 32
Framework assemblies:
                 Number   Length
All contigs      39717    2.74 Gb
Mapped contigs   440      2.55 Gb
Contig information (All):
Range in kb   Number   Length      Percent of total
<300          39373    1.94x10^8   7.10
300-1000      72       4.23x10^7   1.54
1000-5000     116      3.14x10^8   11.46
>5000         156      2.19x10^9   79.92
Contig information (Mapped):
Range in kb   Number   Length      Percent of total
<300          98       8.33x10^6   0.33
300-1000      70       4.13x10^7   1.62
1000-5000     116      3.14x10^8   12.30
>5000         156      2.19x10^9   85.76

51 DM Church- NCBI

52

53

54 Mapped Scaffold N50 Build 30 Build 32 all_clone finished_clone all_wgs finished_wgs Combined_2

55 DM Church- NCBI The Mouse Genome- combining resources… RefSeqs with multiple alignments to the genome: all_clone all_wgs finished_wgs combined

56 DM Church- NCBI Finished Sequence in 'Random' Bin all_clone all_wgs fin_combined combined_2

57 DM Church- NCBI The Mouse Genome- combining resources… NCBI Richa Agarwala Deanna Church
Mouse Build 33 (current): each tiling path (TPF) was tried with Finished + draft + wgs and with Finished + wgs
- Clone based TPF: local Order and Orientation problems
- MGSCv3 based TPF: increased artificial duplication; lots of finished sequence in random bin
- Combined TPF: not perfect, but better outcome. Manual curation helps
And the winner is…

58 DM Church- NCBI Build 33 Reference assembly N50: 22.3 Mb [Figure: chromosomes 1-19, X and Y shaded by sequence source: WGS, Finished, Draft, Gap]

59 DM Church- NCBI Chromosome 7 inversion still present…

60 DM Church- NCBI Mmu7 (3M – 6M)

61 DM Church- NCBI Segmental Duplication: Genome annotation will under-represent the gene content if segmental duplications are not included in the reference assembly.

62 DM Church- NCBI Large scale variation in the genome Nature Genetics, Sept. 2004

63 DM Church- NCBI Types of annotation
Feature: Method
Genes: By alignment, by prediction
Markers: By ePCR
Clones/Cytogenetic location: By alignment (BAC ends, insert) or assembly
Variation: By alignment
Phenotype: Via Gene identification, associated markers
Cytogenetic Position: By annotated BAC-END sequenced clones; by FISH-mapped clones used in assembly
Sequence characteristics: CpG islands, source of assembly
Gene Trap Clones: By alignment
Note: Genes from other organisms are also positioned based on alignment of mRNAs from one species to the genome of another. Example: the human Map Viewer shows the position of ESTs and other mRNAs from cow, pig, mouse, and rat.

64 DM Church- NCBI Reference Sequences…
Goal: One sequence entry for each naturally occurring DNA, RNA and protein molecule
NC_000000: chromosome
NT_000000/NW_000000: contig
NG_000000: genomic
NM_000000, NR_000000: RNA
NP_000000: protein
XM_000000/XR_000000: predicted RNA
XP_000000: predicted protein
Key: Curated annotation; Calculated annotation
Multiple products for one gene are instantiated as separate RefSeqs with the same LocusID.
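
Since the accession prefix encodes the molecule type, a small lookup is enough to classify RefSeq accessions by the categories named on this slide. The sketch below covers only the prefixes listed here; the function name, the regular expression, and the choice to return None for non-RefSeq accessions are assumptions made for illustration.

```python
import re

# Prefix -> molecule type, limited to the prefixes shown on the slide.
REFSEQ_PREFIXES = {
    "NC": "chromosome",
    "NT": "contig",
    "NW": "contig",
    "NG": "genomic",
    "NM": "RNA",
    "NR": "RNA",
    "NP": "protein",
    "XM": "predicted RNA",
    "XR": "predicted RNA",
    "XP": "predicted protein",
}

# Two-letter prefix, underscore, digits, optional ".version" suffix.
_ACCESSION = re.compile(r"^([A-Z]{2})_\d+(\.\d+)?$")


def classify_refseq(accession):
    """Return the molecule type for a RefSeq accession, or None if unrecognized."""
    match = _ACCESSION.match(accession.strip())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


for acc in ["NM_003490", "NT_039271", "XP_000000", "AY254099"]:
    print(acc, "->", classify_refseq(acc))
```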

65 DM Church- NCBI Why do we need RefSeq? Entrez Nucleotide

66 DM Church- NCBI mRNA alignment (Sim4, est2genome, Spidey, BLAT, SPLIGN)
General alignment:
– at least 50% of length or >1.0 kb
– >95% identity, unless short exon
– No longer one alignment per contig per strand (changed recently because this led to failure to annotate all members of a gene cluster)
– Constraints on intron length (compactness)
– Shift within 3 nt to find splice sites conforming to consensus (GT-AG, GC-AG, AT-AC)
– Rank alignment by bit score, % identity, score, gaps, compactness
– global alignment
Best placement:
– Add to score for introns to compensate for gap penalty
– Known ambiguity if gene/pseudogene pairs are highly related, and few introns in gene
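
The "shift within 3 nt to find splice sites conforming to consensus" rule can be sketched as follows. This is a simplified illustration, not the NCBI pipeline: it slides both intron boundaries together, prefers the smallest shift, breaks ties in the order GT-AG, GC-AG, AT-AC (an assumption), and does not re-check that the exon bases still match the cDNA after the shift. The toy genomic string uses "x" for unspecified intron bases.

```python
# Consensus donor/acceptor dinucleotide pairs, in assumed order of preference.
CONSENSUS_PAIRS = [("GT", "AG"), ("GC", "AG"), ("AT", "AC")]


def splice_pair(genome, intron_start, intron_end):
    """First and last two bases of the intron genome[intron_start:intron_end]."""
    intron = genome[intron_start:intron_end].upper()
    return intron[:2], intron[-2:]


def consensus_rank(genome, intron_start, intron_end):
    """Index into CONSENSUS_PAIRS, or None if the intron ends are non-consensus."""
    pair = splice_pair(genome, intron_start, intron_end)
    return CONSENSUS_PAIRS.index(pair) if pair in CONSENSUS_PAIRS else None


def shift_to_consensus(genome, intron_start, intron_end, max_shift=3):
    """Slide the intron by up to max_shift nt to reach consensus splice sites."""
    for shift in range(max_shift + 1):
        best = None
        for offset in sorted({-shift, shift}):
            start, end = intron_start + offset, intron_end + offset
            if start < 0 or end > len(genome):
                continue
            rank = consensus_rank(genome, start, end)
            if rank is not None and (best is None or rank < best[0]):
                best = (rank, start, end)
        if best is not None:
            return best[1], best[2]
    return None


toy_genome = "ACATGTxxxxACAGGAG"
print(splice_pair(toy_genome, 3, 13))         # non-consensus ends
print(shift_to_consensus(toy_genome, 3, 13))  # one base to the right gives GT-AG
```

On this toy string a shift of one base in either direction reaches a consensus pair (AT-AC to the left, GT-AG to the right), which shows why different aligners can report different exon boundaries for the same cDNA.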

67 DM Church- NCBI Aligning cDNAs to the genome
- Different algorithms can produce different results
- Trying to balance alignment with searching for splice sites.
NM_003490 (synapsin 3), between exons 7 and 8:
Sim4:
ACAG++++++++++GAG
|||
ACATGTxxxxACAGGAG
splign/gpipe/BLAT:
AC++++++++++AGGAG
||          |||||
ACATGTxxxxACAGGAG
spidey:
ACA++++++++++GGAG
|||          ||||
ACATGTxxxxACAGGAG

68 DM Church- NCBI Making Gene Models (at NCBI)
1. Align RefSeq mRNAs to the genome
2. Select the best alignment (by score? exon structure?)
3. Run ab initio gene prediction on regions between these alignments. We use gnomon (GeneScan, GenomeScan, TwinScan, SGP)
4. Select best gene models: RefSeq alignments (NM_XXXXXXXXX); ab initio models with support (XM_XXXXXXXXX)
Known issues:
- Don’t make ab initio models in introns of known genes
- Skewed to what is known
- Don’t really predict non-coding RNAs well
- Hard to sort out genes vs. pseudogenes

69 DM Church- NCBI

70 Conflict resolution
Integrated comparison with Ensembl and UCSC:
- Placement of CDS
- Placement of and consensus splice junctions
- % identity between RefSeq and Genome
- Reading frame
Possible Actions:
- Review current evidence
- Review alignment algorithms
- Review current RefSeqs

71 DM Church- NCBI Future consensus annotation
CCDS identifier assigned to annotated proteins that are consistently placed.
Sequence may not be identical because NCBI annotates and places existing RefSeqs that are based on cDNAs and Ensembl generates mRNA and protein products solely from the reference genome:
– cDNA (and thus protein) from a different allele
– RNA editing
– selenoproteins
– ribosomal slippage
– non-AUG initiation codon
– cDNA source has undetected sequence errors

72 DM Church- NCBI Future consensus annotation: Preliminary Statistics based on Human Build 34.3
Count   Total   Conditions Satisfied
7802    7802    100% nucleotide+position
1499    9301    100% protein+position
3053    12336   100% exon position
23      12359   NCBI/Hinxton both "good"
1540    13899   NCBI annotation projected
1772    15671   One model better
52      15723   Other model better

73 DM Church- NCBI Now that the genome is together I. Text based queries Entrez: - organism restriction - molecule type restrictions - keyword restrictions II. Sequence comparisons BLAST (Basic Local Alignment Search Tool). SSAHA (Sequence Search and Alignment by Hashing Algorithm) BLAT III. Query by location Base pair position cM position cytogenetic position http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?taxid=10090

74 DM Church- NCBI Data Access: Entry point into the Genome- view BLAST results in the Map Viewer
http://www.ncbi.nlm.nih.gov/genome/seq/MmBlast.html (also /HsBlast.html, /RnBlast.html, /DrBlast.html)
DATABASES
- Assembled Sequence: Reference assembly (C57BL/6J); Alternate assemblies: Celera Mouse 16
- Input Sequences: HTGS, WGS Traces, All other Traces, BAC ends
- Transcribed Sequences: Reference mRNA, Build RNA, ESTs
- Proteins: Reference proteins, Build proteins
- Other data sets: Gene Trap Clones

75 DM Church- NCBI Data Access

76 DM Church- NCBI Navigating by location Jump to chromosome 15M 30M Add & Remove Maps Change Map Order Add Rulers Add another organism

77 DM Church- NCBI Multiple assemblies can be a good thing… Alignment of human Reference mRNAs: 256: Reference assembly only 10: Celera assembly only Assembly Gaps Assembly Errors Biological variation

78 DM Church- NCBI Multiple assemblies can be a good thing…

79 DM Church- NCBI

80

81 Multiple assemblies can be a good thing…
Inversions: An exon of DOCK3 is inverted in the reference assembly relative to other available information. Other sequence data indicate the reference assembly includes an inversion:
NM_004947 181 tgaaggggatctttcctgcaaattacattcacttgaaaaaggcaattgtcagtaataggg 240
AY254099 181 ............................................................ 240
AY145303 158 ............................................................ 217
AY145302 509 .a......c................t.....t.......................c.... 568
AK172930 518 .a......c................t.....t.......................c.... 577
AK122353 445 .c.....t..a......t.c.gc..tg.............t..ctg...a.ag..c.aa. 504
AY233380 158 .c.....t..a......t.c.gc...g.............t..ctg...a.ag..c.aa. 217
AC121608 21865 .....c................t.....t.......................c.... 21921
AL672208 61296 .....c................t.....t.......................c.... 61240
[Figure labels: Reference Assembly, Celera Assembly; strand indicators + and -]

82 DM Church- NCBI Multiple assemblies can be a good thing…

83 DM Church- NCBI

84 Genome assembly and annotation is an ongoing issue. Weigh all of the evidence carefully Multiple lines of evidence better than a single thread

85 DM Church- NCBI Take home messages… Genome assembly and annotation is still not a trivial problem Be critical and review the evidence… http://www.ncbi.nlm.nih.gov/projects/assembly

86 DM Church- NCBI Assembly Database NCBI Eugene Yaschenko Vladimir Alekseyev Mike Dicuccio Deanna Church TIGR Martin Shumway Steve Salzberg

87 DM Church- NCBI Acknowledgments RefSeq Curator Staff BLAST Team Entrez Team NCBI Service Desk Staff Genome Team: Richa Agarwala Hsiu-Chuan Chen Slava Chetvernin Deanna Church Olga Ermolaeva Wratko Hlavina Wonhee Jang Jonathan Kans Yuri Kapustin Ken Katz Paul Kitts Donna Maglott Jim Ostell Kim Pruitt Sergey Resenchuk Victor Sapojnikov Greg Schuler Steve Sherry Andrei Shkeda Alexandre Souvorov Tatiana Tatusova Lukas Wagner Trace and Assembly Archive Vladimir Alekseyev Anton Butanaev Alexey Egorov Andrew Klymenko Sergey Pomorov Eugene Yaschenko Mike Dicuccio Duplication Analysis Evan Eichler Xinwei She Ze Cheng Eray Tuzan Jeff Bailey Mario Ventura Mariano Rocchi

88 DM Church- NCBI Acknowledgments: Mouse Genome Sequencing Consortium: Sanger Institute; Washington University Genome Sequencing Center; Whitehead (Broad) Institute Genome Center; Baylor College of Medicine; Cold Spring Harbor Laboratory; Genome Therapeutics Corporation; Harvard Partners Genome Center; Joint Genome Institute; NIH Intramural Sequencing Center; UK-MRC Sequencing Consortium; The University of Oklahoma Advanced Center for Genome Technology; The University of Texas Southwest

