Evolution and the Santa Cruz Genome Browser. Jim Kent and the Genome Bioinformatics Group, University of California Santa Cruz; Pennsylvania State University.

Typical Gene Level View: Sialic Acid Binding/Ig-like Lectin 7

Known Gene Details Page

PDB Ribbon Diagram 4 clicks away by the wonder of the world wide web

Hox A Cluster, Many Tracks

Track Controls are Now Grouped

Packed mode saves space, makes labels easier to find.

Squished mode is ideal for ESTs and mouse/human homology

ESTs hint at a smaller version of exon2

Publication Quality Output

Comparative Genomics

Chaining Alignments Chaining bridges the gulf between syntenic blocks and base-by-base alignments. Local alignments tend to break at transposon insertions, inversions, duplications, etc. Global alignments tend to force non-homologous bases to align. Chaining is a rigorous way of joining together local alignments into larger structures.

Chains join together related local alignments Protease Regulatory Subunit 3

Affine penalties are too harsh for long gaps Log count of gaps vs. size of gaps in mouse/human alignment correlated with sizes of transposon relics. Affine gap scores model red/blue plots as straight lines.
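One way to read the "straight lines" remark, assuming the usual log-odds interpretation of alignment scores (the slide does not spell this out, so treat it as an assumption): an affine penalty with hypothetical open cost $o$, extension cost $e$, and scale factor $\lambda$ corresponds to a geometric gap-length distribution, whose log frequency is linear in the gap length $g$.

```latex
\[
\mathrm{gap}(g) = o + e\,g
\quad\Longrightarrow\quad
P(g) \propto \exp\bigl(-\lambda\,(o + e\,g)\bigr)
\quad\Longrightarrow\quad
\log P(g) = c - \lambda e\,g .
\]
```

If the observed log counts fall off more slowly than any such line for long gaps, a penalty fitted to short gaps overcharges long ones, which is the sense in which affine penalties are "too harsh".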

Gaps are needed in both sequences in the general case of pairwise alignment; otherwise non-homologous bases can be forced to pair.

2-D histogram of observed gaps. The horizontal axis is gaps in human, the vertical axis is gaps in mouse. The logarithms of counts of gaps in bins of 10 (left) and bins of 500 (right) are plotted as levels of gray, with black representing the highest counts. Note the concentration of gaps along the axes, particularly for shorter gaps.

Before and After Chaining

Chaining Algorithm Input: blocks of gapless alignments from blastz. Dynamic program based on the recurrence relation $\mathrm{score}(B_i) = \max_{j<i}\bigl(\mathrm{score}(B_j) + \mathrm{match}(B_i) - \mathrm{gap}(B_i, B_j)\bigr)$. Uses Miller’s KD-tree algorithm to limit which parts of the dynamic programming graph are traversed. Running time is $O(N \log N)$, where N is the number of blocks (in the hundreds of thousands).
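The recurrence can be illustrated with a minimal Python sketch. This is the plain quadratic dynamic program, not the KD-tree-accelerated version the slide describes, and the `Block` layout, the `gap_cost` constants, and the function names are all hypothetical choices made for illustration.

```python
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    """A gapless alignment block (hypothetical layout for this sketch)."""
    t_start: int   # start in the target sequence (e.g. human)
    q_start: int   # start in the query sequence (e.g. mouse)
    size: int      # block length, identical in both sequences
    score: float   # match score of the block, i.e. match(B_i)


def gap_cost(dt: int, dq: int) -> float:
    """Placeholder gap cost; the constants are assumptions, not from the slides."""
    if dt == 0 and dq == 0:
        return 0.0
    return 10.0 + 0.5 * (dt + dq)


def best_chain(blocks: List[Block]) -> Tuple[float, List[Block]]:
    """Return the best chain score and its blocks using an O(N^2) scan."""
    if not blocks:
        return 0.0, []
    # Sorting by target start guarantees every valid predecessor comes first.
    order = sorted(blocks, key=lambda b: (b.t_start, b.q_start))
    score = [b.score for b in order]   # a chain may start at any block
    prev = [-1] * len(order)
    for i, bi in enumerate(order):
        for j in range(i):
            bj = order[j]
            dt = bi.t_start - (bj.t_start + bj.size)
            dq = bi.q_start - (bj.q_start + bj.size)
            if dt < 0 or dq < 0:
                continue  # B_j must end before B_i starts in both sequences
            cand = score[j] + bi.score - gap_cost(dt, dq)
            if cand > score[i]:
                score[i], prev[i] = cand, j
    # Trace back from the highest-scoring end block.
    k = max(range(len(order)), key=score.__getitem__)
    total, chain = score[k], []
    while k != -1:
        chain.append(order[k])
        k = prev[k]
    chain.reverse()
    return total, chain


blocks = [Block(0, 0, 10, 100.0), Block(20, 15, 10, 100.0),
          Block(5, 50, 10, 30.0), Block(40, 40, 10, 100.0)]
total, chain = best_chain(blocks)
```

In this toy input the block at (5, 50) is out of order with respect to the others in the query, so it is left out of the best chain. A quadratic scan like this would be far too slow for hundreds of thousands of blocks, which is the reason the slide mentions a KD-tree to prune candidate predecessors.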

Netting Alignments Commonly, multiple mouse alignments can be found for a particular human region, particularly for coding regions. Net finds the best mouse match for each human region. Highest-scoring chains are used first. Lower-scoring chains fill in gaps within chains, inducing a natural hierarchy.
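The greedy idea above can be sketched in a few lines of Python. This is a deliberately simplified illustration under stated assumptions, not the actual netting program: chains are reduced to lists of aligned intervals on the human (reference) coordinate, there are no minimum-size or minimum-score thresholds, and the names `Chain`, `Fill`, and `build_net` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the reference


@dataclass
class Chain:
    name: str
    score: float
    blocks: List[Interval]  # aligned blocks, sorted and non-overlapping


@dataclass
class Fill:
    chain: str
    level: int   # 1 = top level; deeper levels sit inside gaps of a parent
    start: int
    end: int


def build_net(chains: List[Chain], ref_len: int) -> List[Fill]:
    """Place chains best-first; lower-scoring chains only fill what is left."""
    # Each free region records the level a chain placed there would receive.
    free: List[Tuple[int, int, int]] = [(0, ref_len, 1)]
    fills: List[Fill] = []
    for ch in sorted(chains, key=lambda c: -c.score):
        new_free: List[Tuple[int, int, int]] = []
        for fs, fe, lvl in free:
            # Keep only the parts of the chain's blocks inside this region.
            clipped = [(max(s, fs), min(e, fe)) for s, e in ch.blocks
                       if min(e, fe) > max(s, fs)]
            if not clipped:
                new_free.append((fs, fe, lvl))
                continue
            cs, ce = clipped[0][0], clipped[-1][1]
            fills.append(Fill(ch.name, lvl, cs, ce))
            if fs < cs:
                new_free.append((fs, cs, lvl))       # left remainder
            for (_, e1), (s2, _) in zip(clipped, clipped[1:]):
                if s2 > e1:
                    new_free.append((e1, s2, lvl + 1))  # gap inside the chain
            if ce < fe:
                new_free.append((ce, fe, lvl))       # right remainder
        free = new_free
    return fills


chains = [Chain("A", 1000.0, [(0, 100), (300, 400)]),
          Chain("B", 500.0, [(150, 250)]),
          Chain("C", 200.0, [(50, 80), (500, 600)])]
net = build_net(chains, ref_len=1000)
```

Here chain A is placed first at the top level, chain B lands one level down inside A's gap, and only the part of chain C that does not overlap A survives, again at the top level.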

Net Focuses on Ortholog

Net highlights rearrangements A large gap in the top level of the net is filled by an inversion containing two genes. Numerous smaller gaps are filled in by local duplications and processed pseudo-genes.

Useful in finding pseudogenes Ensembl and Fgenesh++ automatic gene predictions are confounded by numerous processed pseudogenes. The domain structure of the resulting predicted protein must be interesting!

Mouse/Human Rearrangement Statistics Number of rearrangements of given type per megabase.

A Rearrangement Hot Spot Rearrangements are not evenly distributed. Roughly 5% of the genome is in hot spots of rearrangements such as this one. This 350,000 base region is between two very long chains on chromosome 7.

year of the rat Rat Genome

Rat/Mouse/Human Genome-Wide Multiz Alignments Available Eye lens protein gamma crystallin a. Upstream region (on right) is highly conserved but not a CpG island. Alignments are interrupted by numerous recent transposon insertions.

Details page offers quick access to browsers on corresponding regions of other genomes. It also highlights exons in base-by-base alignments.

Zoom to Base Level Detail near translation start of tubulin 8

Zoom to Base Level Intron consensus sequence visible.

Zoom to Base Level Possible alt-splice not consensus and not conserved.

Tiling the genome in Microarrays New genes on 21 and 22?

Cross-hybridization at Work Zoomed in on right side:

200 Bases Upstream of Known Genes 5’ Extended by RNA/EST clusters
>hg15_rnaCluster_chr range=chr22: 'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
aactccgcctcggggccccggggcgccgcctctctcccccggggcgccgc
ctctctcccccggggcgccgcctccctccgccgcggccgtcgagccgcgg
agcgcctcttccgcggagccgccgcctgccaggattccagcgccgcagct
gcggccgcagccattggtctctgacgtcagcggcgtgcggcgcactcggc
>hg15_rnaCluster_chr range=chr22: 'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
ccagggcagggcgaggagcgcggggaggggccgcggggacccgggccgct
ggggccgtggggcccgcccggccgccggccggctccctggggcgcgggcg
gctgcgtcagcggggggcggagacgcggcgctgcttccgctcacgcgcgc
cctgctccctcctcccagtcgtcctggtccgcggcgcccaacggggaaga
>hg15_rnaCluster_chr range=chr22: 'pad=0 3'pad=0 revComp=FALSE strand=+ repeatMasking=none
gccctcccggtccgggggcggggcttggcctggggcggggcttggctggg
gtgctcagcccaattttccgtgtagggagcgggcggcggcgggggaggca
gaggcggaggcggagtcaagagcgcaccgccgcgcccgccgtgccgggcc
tgagctggagccgggcgtgagtcgcagcaggagccgcagccggagtcaca
>hg15_rnaCluster_chr range=chr22: 'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
actcagaagctaagataccgacggtgttcctctgaacttcttccaatggc
taaaagctacaagcgcctcagatataaaagactcctggacggattttcat
ccagcacagagcagctgaatccatatttggcagctagtggatgggataag
aggcctaacagtaagcccatggcactttattctctcgaatccatcaagat
>hg15_rnaCluster_chr range=chr22: 'pad=0 3'pad=0 revComp=TRUE strand=- repeatMasking=none
ggccccgcgccccaggccggggcgaggccttttccggcgcttctttcccg
cggagccgcgggcgggcggcgcaggccctgggggagagcgcgccgcggcc
ggttgcagccccccccgcgccgccgcgttcggcgcccggcccggccagtc
tgctcctgccccgccgccgcgccggagcccgggcgcccgaagctgggggc

Acknowledgements
Institutions: NHGRI, The Wellcome Trust, HHMI, NCI, Taxpayers in the US and worldwide. Baylor, Sanger, Wash U, Whitehead, Stanford, JGI/DOE, Oklahoma U and the international sequencing centers. UCSC, NCBI, EBI, Ensembl, Genoscope, MGC, Intel, TIGR, Jackson Labs, Affymetrix, SwissProt.
Individuals: Webb Miller, Chuck Sugnet, Robert Baertsch, Scott Schwartz, Fan Hsu, Terry Furey, Ross Hardison, David Haussler, Richard Gibbs, Bob Waterston, Eric Lander, Francis Collins, LaDeana Hillier, Roderic Guigo, Michael Brent, Olivier Jaillon, David Kulp, Victor Solovyev, Ewan Birney, James Gilbert, Greg Schuler, Deanna Church, the Gene Cats. Everyone else!

THE END

A Cautionary Note Infant digestive systems are very permeable and take up antibodies. About 10% of infants are allergic to cow’s-milk-based formula. These infants get soy/corn-based formula. As we engineer plants, let’s be careful what we put in infant formula.

New Algorithms and Data ‘Chaining’ and ‘netting’ of mouse/human alignments precisely define orthology and quantify rearrangements. Rat genome is browsable and used in rat/mouse/human multiple alignments. Cross-hybridization potential of Affymetrix-style microarrays calculated and displayed.

Ideal Gap Penalties Would allow gaps in both sequences at once. Would penalize long gaps less than affine gap scores do. Would still be quick to compute. We use a piecewise linear function of the sum of gap sizes, plus a substantial penalty for gaps that are in both sequences at once.
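A minimal sketch of such a penalty follows, assuming made-up breakpoints and constants (`SIZES`, `COSTS`, `BOTH_PENALTY`, and the affine parameters are illustrative values, not the ones used in the actual alignments). It interpolates linearly between breakpoints on the summed gap size and adds a fixed surcharge when both sequences have a gap.

```python
from bisect import bisect_right

# Hypothetical breakpoints: summed gap size -> cost. Note the flattening slope.
SIZES = [1, 10, 100, 1000, 10000]
COSTS = [30.0, 60.0, 120.0, 200.0, 300.0]
BOTH_PENALTY = 100.0  # assumed surcharge for a simultaneous gap in both sequences


def piecewise_linear(x: int) -> float:
    """Linearly interpolate the cost table; extend the last segment beyond it."""
    if x <= SIZES[0]:
        return COSTS[0]
    if x >= SIZES[-1]:
        slope = (COSTS[-1] - COSTS[-2]) / (SIZES[-1] - SIZES[-2])
        return COSTS[-1] + slope * (x - SIZES[-1])
    k = bisect_right(SIZES, x) - 1          # segment containing x
    x0, x1 = SIZES[k], SIZES[k + 1]
    y0, y1 = COSTS[k], COSTS[k + 1]
    return y0 + (x - x0) / (x1 - x0) * (y1 - y0)


def gap_cost(dt: int, dq: int) -> float:
    """Cost of a gap of dt bases in one sequence and dq bases in the other."""
    if dt == 0 and dq == 0:
        return 0.0
    cost = piecewise_linear(dt + dq)
    if dt > 0 and dq > 0:
        cost += BOTH_PENALTY
    return cost


def affine_cost(size: int, gap_open: float = 30.0, gap_extend: float = 3.0) -> float:
    """Conventional affine penalty, shown only for comparison."""
    return gap_open + gap_extend * (size - 1)


single = gap_cost(55, 0)    # gap in one sequence only
double = gap_cost(5, 5)     # gap in both sequences at once
long_gap = gap_cost(1000, 0)
```

With these illustrative numbers a 1000-base gap costs 200 under the piecewise function but over 3000 under the affine one, while a lookup remains a binary search over a handful of breakpoints, so it stays quick to compute.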