Aligning Transcribed Sequences to the Human and Mouse Genomes

Yongchang Gan, Jonathan Crabtree, Chris Stoeckert
Computational Biology and Informatics Laboratory (CBIL)
Center for Bioinformatics
University of Pennsylvania

The Genomes: Human
Recent events
  June 2000: “working drafts” announced
  Feb. 2001: first analyses published
  Feb. 2002: UCSC exits assembly business
Current public draft sequence
  July 2002: NCBI Build #30
  June 28, 2002 freeze of GenBank data
  87% finished seq., est % coverage

The Genomes: Mouse
Recent events (public sequence)
  Late 2000: shotgun sequencing begun
  Late 2001: first assemblies created
  April 2002: Arachne chosen over Phusion
Current public draft sequence
  April 2002: MGSCv3
  February 2002 freeze of ~7X shotgun
  Estimated 90-95% coverage

The Transcribed Sequences
dbEST expressed sequence tags (ESTs)
  ~4 million human
  ~2.5 million mouse
  Highly variable quality
GenBank mRNAs and RefSeqs
  Many are “full length”, high quality
  Includes RIKEN cDNAs
  Did not include GenBank HTC division

DoTS: Database of Transcribed Sequences
Cluster ESTs & mRNAs by similarity
Assemble the clusters with CAP4
Annotate resulting consensus seqs.
  Predict protein sequences
  Run BLAST searches
  Predict GO function
  Link to RH maps, gene trap cell lines, expression data, MGI, GeneCards, etc.
Results at

A Sample DoTS Assembly

Is EST assembly still relevant?
Not every organism has a genome project
EST sequencing is still a relatively cheap way to survey a transcriptome
  Though array-based approaches are also very powerful if the sequence is known
Not every EST will necessarily align to the draft genome
Annotation component is useful, regardless of assembly method

Aligning transcripts with DNA
[Figure: a transcribed sequence (e.g., mRNA) with its 5’ UTR, CDS, and 3’ UTR, drawn above the genome (i.e., DNA)]

Aligning transcripts with DNA
[Figure: the same transcribed sequence (5’ UTR, CDS, 3’ UTR) aligned to the genome (i.e., DNA) as three separate exons. *** DRAMATIZATION ***]

What are the goals?
Find genes & delineate their boundaries
Investigate alternative splicing
Validate DoTS assemblies
  Gain insight into sources of error
Assess whether we gain anything by assembling ESTs before aligning them

Potential “unsplicing” tools
BLAST
  Good general-purpose local alignment tool
  But not well-suited to this specific task
Special-purpose alignment tools
  e.g., est2genome (Birney, Durbin), est_genome (Mott), sim4 (Florea et al.)
  Do a good job but are very slow

Unsplicing: our first attempt
BLAST-sim4 heuristic algorithm
Employs a two-step approach
  BLASTN: find candidate locations
  sim4: perform precise alignments
Much faster than sim4 alone
  But still slow for whole-genome analysis
Similar to Spidey (Wheelan et al.)

Unsplicing: BLAT
BLAT: BLAST-Like Alignment Tool
  Written by Jim Kent, UCSC
  Indexes target db, not query sequence
  Takes advantage of additional constraints
    Adjusts exon boundaries using splice signals
    Attempts to locate small exons
  500x speedup with no loss of sensitivity
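The point about indexing the target database rather than the query can be illustrated with a toy k-mer index. This is only a sketch of the general idea, not BLAT's implementation: the function names `build_index` and `find_seed_hits`, the choice of k=4, the use of non-overlapping target k-mers, and the sequences themselves are all assumptions made for illustration.

```python
from collections import defaultdict


def build_index(target: str, k: int) -> dict:
    """Map each non-overlapping k-mer of the target to its start positions.

    Hypothetical helper: the index is built once for the (large) target
    and can then be reused for every query.
    """
    index = defaultdict(list)
    for pos in range(0, len(target) - k + 1, k):
        index[target[pos:pos + k]].append(pos)
    return index


def find_seed_hits(query: str, index: dict, k: int) -> list:
    """Look up every overlapping k-mer of the query in the target index.

    Returns (query_pos, target_pos) pairs. A real aligner would go on to
    cluster these seeds and extend them into full alignments.
    """
    hits = []
    for qpos in range(len(query) - k + 1):
        for tpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, tpos))
    return hits


# Made-up sequences, for illustration only.
target = "ACGTTTGACCAGGTACCGTA"
query = "TTGACCAG"
index = build_index(target, k=4)
hits = find_seed_hits(query, index, k=4)
print(hits)
```

Both seed hits in this example lie on the same diagonal (target position minus query position), which is the kind of signal a seed-based aligner can use to pick candidate regions before doing any detailed alignment.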

Overview of alignment process
BLAT searches (vs. human and mouse)
  RefSeqs and DoTS consensus sequences
Load alignments into database
Compute summary information
  Including alignment “quality”
Merge selected alignments into “genes”

Introducing GUS plugin LoadBLATAlignments
Process raw BLAT output
  Perl modules BLAT::Alignment, BLAT::PSL
Load alignments into GUS
  BLATAlignment table (not Similarity)
  10% minimum length cutoff applied
Compute and store summary info.
  Alignment quality (requires target seq.)
  Poly(A) detection (requires query seq.)
  max_query_gap, unaligned_3p_bases, etc.
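To make the summary fields concrete, here is a minimal Python sketch that parses one line in the standard 21-column PSL layout and derives a few values of the kind listed above. It is not the BLAT::PSL or BLAT::Alignment Perl code. The class and function names are hypothetical, the sample line is invented, and two definitions are assumptions: the 10% cutoff is taken to mean aligned bases relative to query length, and unaligned end bases are counted in forward query coordinates.

```python
from dataclasses import dataclass


@dataclass
class PslAlignment:
    matches: int
    mismatches: int
    strand: str
    q_name: str
    q_size: int
    q_start: int
    q_end: int
    t_name: str
    t_start: int
    t_end: int
    block_sizes: list
    q_starts: list
    t_starts: list


def _int_list(field: str) -> list:
    # PSL list fields are comma-separated with a trailing comma.
    return [int(x) for x in field.rstrip(",").split(",")]


def parse_psl_line(line: str) -> PslAlignment:
    f = line.rstrip("\n").split("\t")
    return PslAlignment(
        matches=int(f[0]), mismatches=int(f[1]), strand=f[8],
        q_name=f[9], q_size=int(f[10]), q_start=int(f[11]), q_end=int(f[12]),
        t_name=f[13], t_start=int(f[15]), t_end=int(f[16]),
        block_sizes=_int_list(f[18]),
        q_starts=_int_list(f[19]),
        t_starts=_int_list(f[20]),
    )


def max_query_gap(aln: PslAlignment) -> int:
    """Largest run of query bases skipped between consecutive blocks."""
    gaps = [
        aln.q_starts[i + 1] - (aln.q_starts[i] + aln.block_sizes[i])
        for i in range(len(aln.block_sizes) - 1)
    ]
    return max(gaps, default=0)


def unaligned_end_bases(aln: PslAlignment) -> tuple:
    """(5' unaligned, 3' unaligned), in forward query coordinates (assumption)."""
    return aln.q_start, aln.q_size - aln.q_end


def passes_length_cutoff(aln: PslAlignment, min_fraction: float = 0.10) -> bool:
    """Assumed reading of the 10% cutoff: aligned bases / query length."""
    return sum(aln.block_sizes) >= min_fraction * aln.q_size


# Invented two-block alignment of a 120 bp query to chr22.
sample = "\t".join([
    "95", "5", "0", "0", "1", "10", "1", "900", "+",
    "DT.1", "120", "5", "115",
    "chr22", "1000000", "1000", "2000",
    "2", "60,40,", "5,75,", "1000,1960,",
])
aln = parse_psl_line(sample)
print(max_query_gap(aln), unaligned_end_bases(aln), passes_length_cutoff(aln))
```

Values such as alignment quality and poly(A) detection need the target and query sequences themselves, as the slide notes, so they are left out of this sketch.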

BLATAlignmentQuality
Very good (formerly “consistent”)
  >= 95% identity (average)
  max_query_gap <= 5
  both ends consistent
    no more than 10bp mismatch unless polyA
  not polyA on both ends

BLATAlignmentQuality II
Very good with gaps
  same as very good, but internal and end mismatches allowed if there is a sufficiently large genomic sequence gap (within 10X mismatch length for ends)
Good
  same as very good, but with max_query_gap <= 15 (allow large internal gaps if there is a sufficiently large genomic sequence gap), and inconsistent ends allowed if unaligned_bases <= 50
Not so good
  everything else
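The four quality classes can be read as a decision cascade. The sketch below is a simplified paraphrase of the slide text, not the plugin's actual logic. Everything named here is hypothetical: `AlignmentSummary`, `classify`, and in particular the boolean `genomic_gap_explains_mismatch`, which stands in for the "sufficiently large genomic sequence gap" test that the slides do not spell out. It also assumes the 50-base limit applies to each end separately.

```python
from dataclasses import dataclass


@dataclass
class AlignmentSummary:
    percent_identity: float
    max_query_gap: int
    unaligned_5p_bases: int
    unaligned_3p_bases: int
    polya_5p: bool = False  # unaligned 5' end looks like poly(A)
    polya_3p: bool = False  # unaligned 3' end looks like poly(A)
    # Placeholder for the genomic-gap test (assumption, see lead-in).
    genomic_gap_explains_mismatch: bool = False


def end_consistent(unaligned: int, is_polya: bool, limit: int = 10) -> bool:
    """An end is consistent if at most `limit` bases are unaligned, unless polyA."""
    return unaligned <= limit or is_polya


def classify(s: AlignmentSummary) -> str:
    # Shared requirements: >= 95% identity, and not polyA on both ends.
    if s.percent_identity < 95.0 or (s.polya_5p and s.polya_3p):
        return "not so good"
    ends_ok = (end_consistent(s.unaligned_5p_bases, s.polya_5p)
               and end_consistent(s.unaligned_3p_bases, s.polya_3p))
    if s.max_query_gap <= 5 and ends_ok:
        return "very good"
    if s.genomic_gap_explains_mismatch:
        return "very good with gaps"
    # Assumption: the 50-base allowance is checked per end.
    ends_tolerable = ends_ok or max(s.unaligned_5p_bases,
                                    s.unaligned_3p_bases) <= 50
    if s.max_query_gap <= 15 and ends_tolerable:
        return "good"
    return "not so good"


print(classify(AlignmentSummary(98.0, 3, 2, 4)))
print(classify(AlignmentSummary(98.0, 12, 2, 30)))
```

Ordering the checks from strictest to loosest means each alignment lands in the best class it qualifies for, which matches how the classes are described as successive relaxations of "very good".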

Why “very good with gaps”, and how did we arrive at it?
Align RefSeqs to hChr22 and mChr5
Compare consistent (very good) alignments to annotations at UCSC
  False positives: close to 0
  False negatives: ~18% and ~35%
With new quality filter, false negatives reduced to ~15% and ~13%

Why “good”, and how did we arrive at it?
If a RefSeq has a very good (consistent) alignment, does the assembly containing that RefSeq have one too?
  hChr22: 98/255 (38%) did not
  mChr5: 109/271 (40%) did not
  Mostly due to minor problems at end(s)
With new filter, false negatives reduced to 25/255 (~10%) and 33/271 (~12%)
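The percentages on this slide follow directly from the quoted counts. The short check below recomputes them; the dictionary layout and the `percent` helper are just illustrative names, and only the four fractions given on the slide are used.

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Counts quoted on the slide: assemblies lacking a qualifying alignment,
# before and after the relaxed ("good") filter.
counts = {
    "hChr22": {"total": 255, "missed_before": 98, "missed_after": 25},
    "mChr5": {"total": 271, "missed_before": 109, "missed_after": 33},
}

for region, c in counts.items():
    before = percent(c["missed_before"], c["total"])
    after = percent(c["missed_after"], c["total"])
    print(f"{region}: {before:.1f}% -> {after:.1f}%")
```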

Some alignment statistics
hDoTS (08/02) vs hGenome (GP 06/02)
Total DoTS sequences: 859,545
Alignments loaded: 5,544,300 / 8,975,529

Quality   #Align. (non-singleton)   #Seq. (non-singleton)   #Align. / #Seq.
1           343,819 (119,301)       303,555 (107,818)        1.13 (1.11)
2             3,554 (  1,108)         3,305 (  1,069)        1.08 (1.04)
3           262,885 ( 78,228)       195,248 ( 60,596)        1.35 (1.29)
1,2,3       610,258 (198,637)       494,023 (166,960)        1.24 (1.19)
4         4,934,042 (809,600)       292,320 ( 79,155)       16.9  (10.23)

Some alignment statistics II
mDoTS (07/02) vs mGenome (GP 02/02)
Total DoTS sequences: 579,906
Alignments loaded: 3,208,572 / 4,663,903

Quality   #Align. (non-singleton)   #Seq. (non-singleton)   #Align. / #Seq.
1           163,270 ( 57,993)       155,444 (56,035)         1.05 (1.03)
2            64,062 ( 23,556)        25,470 ( 8,524)         2.52 (2.76)
3           140,542 ( 40,063)       101,565 (32,932)         1.38 (1.22)
1,2,3       367,874 (121,612)       271,476 (93,695)         1.36 (1.30)
4         2,840,698 (883,619)       300,595 (52,493)         9.45 (16.8)

“Gene” creation algorithm
Select BLAT alignments
  Parameters: min quality, genomic region
Merge overlapping alignments
Merge nearby alignments with at least one EST sequence in each assembly from common clone
  Parameter: max distance (default 20kb)
Merge nearby alignments
  Parameter: max distance (default off)
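The three merge rules can be sketched as a single left-to-right sweep over sorted alignment spans. This is a simplified illustration under stated assumptions, not the dotsAlignment2Gene.pl logic: the `Seed` class and `merge_seeds` function are hypothetical names, coordinates are treated as half-open intervals on one chromosome and one strand, and each span is compared only with the most recently built gene.

```python
from dataclasses import dataclass, field


@dataclass
class Seed:
    start: int
    end: int
    clones: set = field(default_factory=set)  # clone IDs of member ESTs


def merge_seeds(seeds, clone_max_dist=20_000, proximity_max_dist=None):
    """Merge alignment spans into putative genes.

    Rules, in the spirit of the slide:
      - overlapping spans are merged;
      - spans within clone_max_dist that share a clone are merged;
      - spans within proximity_max_dist are merged (off when None).
    """
    genes = []
    for s in sorted(seeds, key=lambda x: x.start):
        if genes:
            last = genes[-1]
            gap = s.start - last.end
            overlaps = gap < 0
            shares_clone = gap <= clone_max_dist and bool(last.clones & s.clones)
            nearby = proximity_max_dist is not None and gap <= proximity_max_dist
            if overlaps or shares_clone or nearby:
                last.end = max(last.end, s.end)
                last.clones |= s.clones
                continue
        # Copy so the caller's Seed objects are never modified.
        genes.append(Seed(s.start, s.end, set(s.clones)))
    return genes


# Invented spans, for illustration only.
seeds = [
    Seed(100, 500, {"c1"}),
    Seed(400, 900),
    Seed(15_000, 16_000, {"c1"}),
    Seed(30_000, 31_000, {"c2"}),
]
default_genes = merge_seeds(seeds)
loose_genes = merge_seeds(seeds, proximity_max_dist=15_000)
print([(g.start, g.end) for g in default_genes])
print([(g.start, g.end) for g in loose_genes])
```

With the defaults, the first three spans collapse into one gene (overlap, then shared clone `c1`) and the fourth stays separate; turning on the proximity rule at 15 kb joins all four. One limitation of the single sweep is that a clone link across an unrelated intervening gene would be missed.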

“Gene” creation algorithm II
~/dots2gene]$ ./dotsAlignment2Gene.pl -h
Usage: dotsAlignment2Gene.pl --sp sp --chr chr --start start --end end --qf qf --xs xs --am am --cm cm --lm lm --mis mis --of out [--test] [--debug debug] [--help], where...
  sp: scientific/common name of species, e.g. human, Mus musculus
  chr: chromosome of interest, e.g. 5, 22, X, 3_random
  start: start genomic position, default to 1
  end: end genomic position, default to chromosome length
  qf: quality filter to select blat alignments for gene creation
  xs: exclude blat alignments of singletons
  am: merge by genomic alignment overlap
  cm: merge by shared clone info between gene seeds within specified distance
  lm: merge by alignment proximity (within specified distance)
  mis: min intron size for a gene to be kept
  out: output format, one of s[ummary], v[erbose] or gff
  test: test case using DiGeorge region
  debug: specify level of debug output, can be 1, 2, 3, and 4+

Initial Algorithm Calibration
Human chr22q (~34Mb) as test case
  Sanger annotation release 2.3: 832 genes (341 gene, 118 gene_segment, 112 related, 109 predicted, 152 pseudogenes)
Focus on DiGeorge Region
  DGCR6 to ZNF74 (~1.6Mb)
  Contains genes based on literature (Sanger: 44 genes with 33 known)
* Used DoTS 02/02 release vs Golden Path 12/01 release, and old BlatAlignment table (limited quality classes).

Choosing initial parameters
# DiGeorge Chromosome Region (DGCR6 - ZNF74, 1.6Mb)
# CBIL Gene Param     Num CBIL  Num Sanger  Num Overlap  Avg %overlap
qf=4, am, cm=10k      27/ / vs 71.3
qf=4, am, cm=20k      24/ / vs 75.5
qf=4, am, cm=50k      20/ / vs 77.6
qf=6, am, cm=10k      26/ / vs 75.9
qf=6, am, cm=20k      25/ / vs 80.5
qf=6, am, cm=50k      17/ / vs 87.4
# Chr22 (Chr22q ~34M)
qf=4, am, cm=20k      / / vs 72.4
qf=6, am, cm=20k      / / vs 81.0
* Sanger annotation in different coordinate system, did approximate translation

Initial parameters
Derived from old alignments
  qf=4: “spliced” and “consistent” alignments
  xs=off: do not exclude alignments of singleton assemblies
  am=on: merge by alignment overlap
  cm=20K: merge by shared clone
  lm=off: merge by genomic location proximity
  mis=15: filter putative genes by max “intron” size
Adjusted for new alignment quality categorization
  qf=7: “spliced” alignment of quality id 1, 2 or 3

Preliminary results
Applied algorithm to new alignments of hChr22 and mChr5 [4-6 hours each]
Displayed as custom tracks at UCSC genome browser
  DiGeorge region: CBIL and Sanger genes
  Human chr22: CBIL and Sanger genes
  Mouse chr5: CBIL genes

Preliminary full genome runs
Excluded singletons
Tried lm parameter at off, 30, 50
~24 hours per run, with some stress on database server
Per chromosome statistics for mouse
  See next slide

Lm off, qf7/am/xs/cm20k/mis15
Lm qf7/am/xs/cm20k/mis15 (* publicly visible DoTS only)
Lm 50
mChr #assm. #gene ratio Ratio
1 2599 1705 1.52 2246 1449 1.55 1448
2 3525 2097 1.68 3024 1777 1.70 3025 1774
3 2162 1367 1.58 1860 1159 1.60
4 2665 1621 1.64 2279 1390 1389
5 2843 1747 1.63 2429 1487 2430 1485
6 2245 1409 1.59 1954 1200
7 3325 1997 1.66 2864 1696 1.69 1695
8 2216 1364 1.62 1912 1150 1148 1.67
9 2478 1503 1.65 2118 1287 1286
10 2063 1274 1784 1095
11 3723 2188 3160 1821 1.74
12 1541 958 1.61 1358 822 1359
13 1501 1008 1.49 1315 878 1.50 1316 877
14 1548 982 1355 836
15 1907 1129 1652 893 1.85 891
16 1489 927 1322 758 757 1.75
17 2194 1356 1910 1054 1.81 1053
18 1236 731 1044 595 593 1.76
19 1544 921 769 1.71
X 1320 867 1108 753 1.47
Un 1874 2306 0.81 1713 1665 1.03 1659
Total <44124 27151 <38009 22869 <38013 22851 128615 90829 1.42

Directions
Assessment of results in selected 14Mb on mouse chr5 (Maja Bucan lab)

Directions II
Quantitative evaluation of results
  Correlation coefficients: next slide (Kapranov et al., Science 296 (5569): 916)

Directions III
Fine tune/revise algorithm parameters
  qf: recruit more alignments
  cm: widen from 20K to 500K?
  lm: turn on (lm<75bp or 0<lm<75bp)?
  mis: intron size distribution model?
Evidence suggesting new parameters
  See next slide
  Conservatively assume UCSC ref genes uniformly sample 1/3 of all genes on chr22

Genomic Sizes
        hChr21 RefGenes      hChr22 RefGenes      hChr22 gene bounds
        +         -          +         -          +         -
#       88        95         195       182        370       380
Min     189       408        675       309        190       343
Med     34619     20084      16903     17225      10744     9875
Avg     53134     59300      31290     35909      31999     27700
Max     259519    834228     288888    647340     715044    647564

Distances (<0, 0-10, >10): 11 22 19 14 -47831 -17229 -19026 -20601 -46312 -33082 -60639 -26611 -29580 -57429 -76554 -7955 -5807 -1752 -476 -3179 -8554 1 6 76 83 172 162 354 366 1138 223 216 108 109 104203 144877 62721 67898 27971 34914 372763 379780 161200 174314 70366 68781 861971 889255

Directions IV
Problem fixes
  Handle assemblies on the wrong strand - see next slide
Error rate estimations
  Simulate effects of sequence/assembly errors on BLAT
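Both items lend themselves to small sequence utilities. The sketch below shows one possible starting point, under loud assumptions: `reverse_complement` and `add_substitution_errors` are hypothetical helpers, only substitution errors are modeled (no insertions, deletions, or assembly-level errors), and input is assumed to be uppercase A/C/G/T. How the mutated sequences would then be fed to BLAT is not covered here.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement, e.g. to re-orient a wrong-strand assembly."""
    return seq.translate(COMPLEMENT)[::-1]


def add_substitution_errors(seq: str, rate: float, rng: random.Random) -> str:
    """Replace each base with a different base with probability `rate`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(0)  # fixed seed so runs are repeatable
original = "ACGTACGTTTGACCAGGTAC"  # invented sequence
noisy = add_substitution_errors(original, rate=0.05, rng=rng)
mismatches = sum(a != b for a, b in zip(original, noisy))
print(reverse_complement(original), noisy, mismatches)
```

Sweeping `rate` over a range and aligning each mutated copy would give a rough picture of how alignment quality degrades with sequence error, which is the kind of estimate the slide proposes.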

Ongoing work
Combine alignments with other sequence signals (Artemis)
Detailed examination of regions of interest on mouse chr. 5 (Maja)
Incorporate alignments into DoTS assembly process

Acknowledgements
Alignments
  Yongchang Gan (see poster!)
Database of Transcribed Sequences
  Brian Brunk
  Steve Fischer
  Deborah Pinney
Manual annotation
  Joan Mazzarelli
  Kolchanov group
