Aligning Transcribed Sequences to the Human and Mouse Genomes


Aligning Transcribed Sequences to the Human and Mouse Genomes
  Yongchang Gan, Jonathan Crabtree, Chris Stoeckert
  Computational Biology and Informatics Laboratory (CBIL)
  Center for Bioinformatics, University of Pennsylvania

The Genomes: Human
  Recent events
    June 2000: “working drafts” announced
    Feb. 2001: first analyses published
    Feb. 2002: UCSC exits assembly business
  Current public draft sequence
    July 2002: NCBI Build #30
    June 28, 2002 freeze of GenBank data
    87% finished sequence, estimated 94-97% coverage

The Genomes: Mouse
  Recent events (public sequence)
    Late 2000: shotgun sequencing begun
    Late 2001: first assemblies created
    April 2002: Arachne chosen over Phusion
  Current public draft sequence
    April 2002: MGSCv3
    February 2002 freeze of ~7X shotgun
    Estimated 90-95% coverage

The Transcribed Sequences
  dbEST expressed sequence tags (ESTs)
    ~4 million human
    ~2.5 million mouse
    Highly variable quality
  GenBank mRNAs and RefSeqs
    Many are “full length”, high quality
    Includes RIKEN cDNAs
    Did not include GenBank HTC division

DoTS: Database of Transcribed Sequences
  Cluster ESTs & mRNAs by similarity
  Assemble the clusters with CAP4
  Annotate the resulting consensus sequences
    Predict protein sequences
    Run BLAST searches
    Predict GO function
    Link to RH maps, gene trap cell lines, expression data, MGI, GeneCards, etc.
  Results at http://www.allgenes.org

A Sample DoTS Assembly

Is EST assembly still relevant?
  Not every organism has a genome project
  EST sequencing is still a relatively cheap way to survey a transcriptome
    Though array-based approaches are also very powerful if the sequence is known
  Not every EST will necessarily align to the draft genome
  The annotation component is useful regardless of assembly method

Aligning transcripts with DNA 5’ UTR CDS 3’ UTR Transcribed sequences (e.g., mRNA) Genome (i.e., DNA)

Aligning transcripts with DNA 5’ UTR CDS 3’ UTR Transcribed sequences (e.g., mRNA) Genome (i.e., DNA) exon 1 exon 2 exon 3 *** DRAMATIZATION ***

What are the goals?
  Find genes & delineate their boundaries
  Investigate alternative splicing
  Validate DoTS assemblies
  Gain insight into sources of error
  Assess whether we gain anything by assembling ESTs before aligning them

Potential “unsplicing” tools
  BLAST
    Good general-purpose local alignment tool
    But not well-suited to this specific task
  Special-purpose alignment tools
    e.g., est2genome (Birney, Durbin), est_genome (Mott), sim4 (Florea et al.)
    Do a good job but are very slow

Unsplicing: our first attempt
  BLAST-sim4 heuristic algorithm
  Employs a two-step approach
    BLASTN: find candidate locations
    sim4: perform precise alignments
  Much faster than sim4 alone
  But still slow for whole-genome analysis
  Similar to Spidey (Wheelan et al.)

Unsplicing: BLAT
  BLAT: BLAST-Like Alignment Tool
  Written by Jim Kent @ UCSC
  Indexes target db, not query sequence
  Takes advantage of additional constraints
    Adjusts exon boundaries using splice signals
    Attempts to locate small exons
  500x speedup with no loss of sensitivity
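
To illustrate what "indexes target db, not query sequence" means, here is a toy sketch in Python. It is not BLAT's implementation: the k-mer size, the non-overlapping tiling of the target, and the function names `build_target_index` and `find_seed_hits` are assumptions chosen for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_target_index(target: str, k: int = 4) -> Dict[str, List[int]]:
    """Index non-overlapping k-mers of the target (genome) by start position."""
    index = defaultdict(list)
    for pos in range(0, len(target) - k + 1, k):
        index[target[pos:pos + k]].append(pos)
    return dict(index)


def find_seed_hits(query: str, index: Dict[str, List[int]], k: int = 4) -> List[Tuple[int, int]]:
    """Look up every overlapping k-mer of the query in the target index.

    Returns (query_pos, target_pos) pairs; hits on the same diagonal
    (target_pos - query_pos) are candidates for extension into an alignment.
    """
    hits = []
    for qpos in range(len(query) - k + 1):
        for tpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, tpos))
    return hits


# Hypothetical toy data: the query is a substring of the target at offset 6.
target = "ACGTACGGTTCAGGCATTAC"
index = build_target_index(target, k=4)
hits = find_seed_hits("GGTTCAGG", index, k=4)
print(hits)
```

The index is built once per target and reused for every query, which is where the saving comes from when many transcripts are aligned to the same genome.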

Overview of alignment process
  BLAT searches (vs. human and mouse)
    RefSeqs and DoTS consensus sequences
  Load alignments into database
  Compute summary information
    Including alignment “quality”
  Merge selected alignments into “genes”

Introducing GUS plugin LoadBLATAlignments
  Process raw BLAT output
    Perl modules BLAT::Alignment, BLAT::PSL
  Load alignments into GUS
    BLATAlignment table (not Similarity)
    10% minimum length cutoff applied
  Compute and store summary info.
    Alignment quality (requires target seq.)
    Poly(A) detection (requires query seq.)
    max_query_gap, unaligned_3p_bases, etc.
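
A minimal Python sketch of the kind of summary that can be derived from one line of BLAT PSL output. The column order follows the standard 21-column PSL format; everything else is an assumption for illustration: the function names `parse_psl_line` and `summarize_alignment`, reading the 10% cutoff as a fraction of query length, and the example record. It is not the code of BLAT::PSL or LoadBLATAlignments.

```python
from typing import Dict, List


def _int_list(field: str) -> List[int]:
    """PSL list fields are comma-separated with a trailing comma."""
    return [int(x) for x in field.rstrip(",").split(",")]


def parse_psl_line(line: str) -> Dict:
    """Parse one tab-separated PSL record (standard 21-column layout)."""
    f = line.rstrip("\n").split("\t")
    return {
        "matches": int(f[0]), "mismatches": int(f[1]), "rep_matches": int(f[2]),
        "strand": f[8], "q_name": f[9], "q_size": int(f[10]),
        "q_start": int(f[11]), "q_end": int(f[12]),
        "t_name": f[13], "t_start": int(f[15]), "t_end": int(f[16]),
        "block_sizes": _int_list(f[18]),
        "q_starts": _int_list(f[19]),
        "t_starts": _int_list(f[20]),
    }


def summarize_alignment(aln: Dict, min_fraction: float = 0.10) -> Dict:
    """Compute simple summary values (assumed definitions, see lead-in)."""
    sizes, q_starts = aln["block_sizes"], aln["q_starts"]
    aligned = sum(sizes)
    # Unaligned query bases between consecutive alignment blocks.
    query_gaps = [q_starts[i + 1] - (q_starts[i] + sizes[i]) for i in range(len(sizes) - 1)]
    compared = aln["matches"] + aln["mismatches"] + aln["rep_matches"]
    return {
        "fraction_aligned": aligned / aln["q_size"],
        "passes_length_cutoff": aligned >= min_fraction * aln["q_size"],
        "max_query_gap": max(query_gaps, default=0),
        "percent_identity": 100.0 * (aln["matches"] + aln["rep_matches"]) / compared,
        "unaligned_query_start": aln["q_start"],
        "unaligned_query_end": aln["q_size"] - aln["q_end"],
    }


# Hypothetical record: a 320-base query aligned in three 100-base blocks.
example = "\t".join([
    "290", "10", "0", "0", "1", "3", "2", "1700", "+", "DT.1", "320", "5", "308",
    "chr22", "47000000", "1000", "3000", "3", "100,100,100,", "5,105,208,", "1000,1800,2900,",
])
summary = summarize_alignment(parse_psl_line(example))
print(summary)
```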

BLATAlignmentQuality
  Very good (formerly “consistent”)
    >= 95% identity (average)
    max_query_gap <= 5
    both ends consistent
      no more than 10bp mismatch unless polyA
      not polyA on both ends

BLATAlignmentQuality II
  Very good with gaps
    same as very good, but internal and end mismatches are allowed if there is a sufficiently large genomic sequence gap (within 10X mismatch length for ends)
  Good
    same as very good, but with max_query_gap <= 15 (allow large internal gaps if there is a sufficiently large genomic sequence gap), and inconsistent ends allowed if unaligned_bases <= 50
  Not so good
    everything else
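
The quality classes above can be sketched as a small decision function. This is a simplified reading of the slides, not the plugin's actual logic. Assumptions: the field names in `AlignmentSummary` are hypothetical, the "sufficiently large genomic sequence gap" test is reduced to a single precomputed boolean, and `unaligned_bases <= 50` is taken to mean the total over both ends.

```python
from dataclasses import dataclass


@dataclass
class AlignmentSummary:
    percent_identity: float
    max_query_gap: int
    unaligned_5p: int
    unaligned_3p: int
    polya_5p: bool = False
    polya_3p: bool = False
    # Assumed to be computed elsewhere: True when the mismatched or
    # unaligned query bases coincide with a large enough gap in the genome.
    genomic_gap_explains_mismatch: bool = False


def end_consistent(unaligned: int, is_polya: bool, limit: int = 10) -> bool:
    """An end is consistent with at most `limit` unaligned bases, unless it is polyA."""
    return unaligned <= limit or is_polya


def classify(s: AlignmentSummary) -> str:
    # Every class other than "not so good" requires >= 95% identity
    # and forbids polyA on both ends.
    if s.percent_identity < 95 or (s.polya_5p and s.polya_3p):
        return "not so good"
    ends_ok = (end_consistent(s.unaligned_5p, s.polya_5p)
               and end_consistent(s.unaligned_3p, s.polya_3p))
    if s.max_query_gap <= 5 and ends_ok:
        return "very good"
    if s.genomic_gap_explains_mismatch:
        return "very good with gaps"
    if s.max_query_gap <= 15 and (ends_ok or s.unaligned_5p + s.unaligned_3p <= 50):
        return "good"
    return "not so good"


print(classify(AlignmentSummary(98.0, 2, 3, 4)))
print(classify(AlignmentSummary(98.0, 12, 3, 4)))
```

Checking the strictest class first keeps each later branch short, since it only has to state what it relaxes.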

Why “very good with gaps”, and how did we arrive at it?
  Align RefSeqs to hChr22 and mChr5
  Compare consistent (very good) alignments to annotations at UCSC
    False positives: close to 0
    False negatives: ~18% and ~35%
  With the new quality filter, false negatives are reduced to ~15% and ~13%

Why “good”, and how did we arrive at it?
  If a RefSeq has a very good (consistent) alignment, does the assembly containing that RefSeq have one too?
    hChr22: 98/255 (38%) did not
    mChr5: 109/271 (40%) did not
    Mostly due to minor problems at the end(s)
  With the new filter, false negatives are reduced to 25/255 (~10%) and 33/271 (~12%)
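
The percentages on this slide follow directly from the counts. A short Python check, where the helper name `false_negative_rate` is hypothetical:

```python
def false_negative_rate(missed: int, total: int) -> float:
    """Percentage of RefSeq-containing assemblies lacking a qualifying alignment."""
    return 100.0 * missed / total


# Counts taken from the slide: (missed, total) before and after the new filter.
counts = {
    "hChr22 before": (98, 255),
    "mChr5 before": (109, 271),
    "hChr22 after": (25, 255),
    "mChr5 after": (33, 271),
}
rates = {label: false_negative_rate(m, t) for label, (m, t) in counts.items()}
for label, rate in rates.items():
    print(f"{label}: {rate:.1f}%")
```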

Some alignment statistics
hDoTS (08/02) vs hGenome (GP 06/02)
Total DoTS sequences: 859,545
Alignments loaded: 5,544,300 / 8,975,529

Quality   #Align. (non-singleton)   #Seq. (non-singleton)   #Align. / #Seq.
1         343,819 (119,301)         303,555 (107,818)       1.13 (1.11)
2         3,554 (1,108)             3,305 (1,069)           1.08 (1.04)
3         262,885 (78,228)          195,248 (60,596)        1.35 (1.29)
1,2,3     610,258 (198,637)         494,023 (166,960)       1.24 (1.19)
4         4,934,042 (809,600)       292,320 (79,155)        16.9 (10.23)

Some alignment statistics II
mDoTS (07/02) vs mGenome (GP 02/02)
Total DoTS sequences: 579,906
Alignments loaded: 3,208,572 / 4,663,903

Quality   #Align. (non-singleton)   #Seq. (non-singleton)   #Align. / #Seq.
1         163,270 (57,993)          155,444 (56,035)        1.05 (1.03)
2         64,062 (23,556)           25,470 (8,524)          2.52 (2.76)
3         140,542 (40,063)          101,565 (32,932)        1.38 (1.22)
1,2,3     367,874 (121,612)         271,476 (93,695)        1.36 (1.30)
4         2,840,698 (883,619)       300,595 (52,493)        9.45 (16.8)

“Gene” creation algorithm
  Select BLAT alignments
    Parameters: min quality, genomic region
  Merge overlapping alignments
  Merge nearby alignments with at least one EST sequence in each assembly from a common clone
    Parameter: max distance (default 20kb)
  Merge nearby alignments
    Parameter: max distance (default off)
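
The merging steps above can be sketched in Python as follows. This is a simplified illustration, not the dotsAlignment2Gene.pl code. Assumptions: the `Seed` type and the function names are hypothetical, all seeds lie on one chromosome, strand is ignored, alignments are already filtered by quality and region, and only neighbouring seeds are compared when merging by clone or proximity.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Seed:
    start: int
    end: int
    assemblies: Set[str]
    clones: Set[str]


def _absorb(target: Seed, other: Seed) -> None:
    """Extend `target` to cover `other` and pool their assemblies and clones."""
    target.end = max(target.end, other.end)
    target.assemblies |= other.assemblies
    target.clones |= other.clones


def merge_overlapping(seeds: List[Seed]) -> List[Seed]:
    """Step 'am': merge alignments whose genomic spans overlap."""
    merged: List[Seed] = []
    for s in sorted(seeds, key=lambda x: x.start):
        if merged and s.start <= merged[-1].end:
            _absorb(merged[-1], s)
        else:
            merged.append(Seed(s.start, s.end, set(s.assemblies), set(s.clones)))
    return merged


def merge_nearby(seeds: List[Seed], max_distance: int, require_shared_clone: bool) -> List[Seed]:
    """Steps 'cm' / 'lm': merge neighbours within max_distance (input sorted, non-overlapping)."""
    merged: List[Seed] = []
    for s in seeds:
        last = merged[-1] if merged else None
        close = last is not None and s.start - last.end <= max_distance
        if close and (not require_shared_clone or last.clones & s.clones):
            _absorb(last, s)
        else:
            merged.append(Seed(s.start, s.end, set(s.assemblies), set(s.clones)))
    return merged


def create_genes(seeds: List[Seed], cm: Optional[int] = 20000, lm: Optional[int] = None) -> List[Seed]:
    genes = merge_overlapping(seeds)
    if cm is not None:
        genes = merge_nearby(genes, cm, require_shared_clone=True)
    if lm is not None:
        genes = merge_nearby(genes, lm, require_shared_clone=False)
    return genes


# Hypothetical alignments of four DoTS assemblies on one chromosome.
alignments = [
    Seed(1000, 5000, {"DT.1"}, {"c1"}),
    Seed(4500, 9000, {"DT.2"}, {"c2"}),
    Seed(20000, 24000, {"DT.3"}, {"c2"}),
    Seed(30000, 33000, {"DT.4"}, {"c9"}),
]
genes = create_genes(alignments)
print([(g.start, g.end, sorted(g.assemblies)) for g in genes])
```

With the defaults (clone merge at 20kb, proximity merge off), DT.1 and DT.2 merge by overlap, DT.3 joins them through the shared clone c2, and DT.4 stays separate. Passing `lm=10000` would merge it as well.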

“Gene” creation algorithm II
  [ygan@zeus ~/dots2gene]$ ./dotsAlignment2Gene.pl -h
  Usage: dotsAlignment2Gene.pl --sp sp --chr chr --start start --end end --qf qf --xs xs --am am --cm cm --lm lm --mis mis --of out [--test] [--debug debug] [--help], where...
    sp: scientific/common name of species, e.g. human, Mus musculus
    chr: chromosome of interest, e.g. 5, 22, X, 3_random
    start: start genomic position, default to 1
    end: end genomic position, default to chromosome length
    qf: quality filter to select blat alignments for gene creation
    xs: exclude blat alignments of singletons
    am: merge by genomic alignment overlap
    cm: merge by shared clone info between gene seeds within specified distance
    lm: merge by alignment proximity (within specified distance)
    mis: min intron size for a gene to be kept
    out: output format, one of s[ummary], v[erbose] or gff
    test: test case using DeiGeorge region
    debug: specify level of debug output, can be 1, 2, 3, and 4+

Initial Algorithm Calibration
  Human chr22q (~34Mb) as test case
    Sanger annotation release 2.3: 832 genes (341 gene, 118 gene_segment, 112 related, 109 predicted, 152 pseudogenes)
  Focus on DiGeorge Region
    DGCR6 to ZNF74 (~1.6Mb)
    Contains 24-33 genes based on literature (Sanger: 44 genes with 33 known)
  * Used DoTS 02/02 release vs Golden Path 12/01 release, and the old BlatAlignment table (limited quality classes).

Choosing initial parameters

# DiGeorge Chromosome Region (DGCR6 - ZNF74, 1.6Mb)
CBIL Gene Param     Num CBIL    Num Sanger   Num Overlap   Avg %overlap
qf=4, am, cm=10k    27/50       26/44        28            88.7 vs 71.3
qf=4, am, cm=20k    24/47       26/44        27            81.4 vs 75.5
qf=4, am, cm=50k    20/39       25/44        26            63.8 vs 77.6
qf=6, am, cm=10k    26/69       29/44        30            77.7 vs 75.9
qf=6, am, cm=20k    25/66       28/44        31            69.8 vs 80.5
qf=6, am, cm=50k    17/54       24/44        25            53.0 vs 87.4

# Chr22 (Chr22q ~34M)
qf=4, am, cm=20k    335/737     352/829      383           70.7 vs 72.4
qf=6, am, cm=20k    327/1074    377/829      399           64.9 vs 81.0

* The Sanger annotation uses a different coordinate system; coordinates were translated approximately.

Initial parameters
  Derived from old alignments
    qf=4: “spliced” and “consistent” alignments
    xs=off: do not exclude alignments of singleton assemblies
    am=on: merge by alignment overlap
    cm=20K: merge by shared clone
    lm=off: merge by genomic location proximity
    mis=15: filter putative genes by max “intron” size
  Adjusted for the new alignment quality categorization
    qf=7: “spliced” alignments of quality id 1, 2 or 3

Preliminary results
  Applied algorithm to new alignments of hChr22 and mChr5 [4-6 hours each]
  Displayed as custom tracks at UCSC genome browser
    DiGeorge region CBIL and Sanger genes
    Human chr22 CBIL and Sanger genes
    Mouse chr5 CBIL genes

Preliminary full genome runs
  Excluded singletons
  Tried lm parameter at off, 30, 50
  ~24 hours per run, with some stress on database server
  Per-chromosome statistics for mouse: see next slide

Lm off, qf7/am/xs/cm20k/mis15
Lm 30, qf7/am/xs/cm20k/mis15 (* publicly visible DoTS only)
Lm 50
mChr #assm. #gene ratio
1 2599 1705 1.52 2246 1449 1.55 1448
2 3525 2097 1.68 3024 1777 1.70 3025 1774
3 2162 1367 1.58 1860 1159 1.60
4 2665 1621 1.64 2279 1390 1389
5 2843 1747 1.63 2429 1487 2430 1485
6 2245 1409 1.59 1954 1200
7 3325 1997 1.66 2864 1696 1.69 1695
8 2216 1364 1.62 1912 1150 1148 1.67
9 2478 1503 1.65 2118 1287 1286
10 2063 1274 1784 1095
11 3723 2188 3160 1821 1.74
12 1541 958 1.61 1358 822 1359
13 1501 1008 1.49 1315 878 1.50 1316 877
14 1548 982 1355 836
15 1907 1129 1652 893 1.85 891
16 1489 927 1322 758 757 1.75
17 2194 1356 1910 1054 1.81 1053
18 1236 731 1044 595 593 1.76
19 1544 921 769 1.71
X 1320 867 1108 753 1.47
Un 1874 2306 0.81 1713 1665 1.03 1659
Total <44124 27151 <38009 22869 <38013 22851 128615 90829 1.42

Directions Assessment of results in selected 14Mb on mouse chr5 (Maja Bucan lab)

Directions II Quantitative evaluation of results Correlation coefficients – next slide (Science Kapranov et al. 296 (5569): 916.)

Directions III
  Fine-tune/revise algorithm parameters
    qf: recruit more alignments
    cm: widen from 20K to 500K?
    lm: turn on (lm<75bp or 0<lm<75bp)?
    mis: intron size distribution model?
  Evidence suggesting new parameters
    See next slide
    Conservatively assume UCSC ref genes uniformly sample 1/3 of all genes on chr22

Genomic Sizes
       hChr21 RefGenes      hChr22 RefGenes      hChr22 gene bounds
       +         -          +         -          +         -
#      88        95         195       182        370       380
Min    189       408        675       309        190       343
Med    34619     20084      16903     17225      10744     9875
Avg    53134     59300      31290     35909      31999     27700
Max    259519    834228     288888    647340     715044    647564

Distances (<0, 0-10, >10)
11 22 19 14
-148864 -117263 -135416 -647338 -665965 -190248
-47831 -17229 -19026 -20601 -46312 -33082
-60639 -26611 -29580 -57429 -112114 -76554
-7955 -5807 -1752 -476 -3179 -8554
1 6 76 83 172 162 354 366 1138 223 216 108 109
104203 144877 62721 67898 27971 34914
372763 379780 161200 174314 70366 68781
4099173 6896846 3517239 3055720 861971 889255

Directions IV
  Problem fixes
    Handle assemblies on the wrong strand (see next slide)
  Error rate estimations
    Simulate effects of sequence/assembly errors on BLAT

Ongoing work Combine alignments with other sequence signals (Artemis) Detailed examination of regions of interest on mouse chr. 5 (Maja) Incorporate alignments into DoTS assembly process

Acknowledgements
  Alignments
    Yongchang Gan (see poster!)
  Database of Transcribed Sequences
    Brian Brunk
    Steve Fischer
    Deborah Pinney
  Manual annotation
    Joan Mazzarelli
    Kolchanov group