Tools for understanding the sequence, evolution, and function of the human genome. Jim Kent and the Genome Bioinformatics Group, University of California Santa Cruz.

Presentation transcript:

Tools for understanding the sequence, evolution, and function of the human genome. Jim Kent and the Genome Bioinformatics Group University of California Santa Cruz

The Goal Make the human genome understandable by humans.

Step 1 Sequence the human genome

Idealized Hierarchical Shotgun Sequencing

Mapping 300,000 BAC clones were digested and run on agarose gels. Cari Soderlund’s FPC and Wash U Pathfinders made fingerprint map contigs. Genetic and radiation hybrid maps placed contigs on chromosomes. Bob Waterston escaping management.

Sequence and Assembly BAC Clones shotgun sequenced at high throughput to 4x ‘draft’. Assembled with Phil Green’s Phrap

GigAssembler Jim Kent David Haussler (meanwhile Celera working on whole genome shotgun version)

The Truth +-?++?+-?--?+-?+-?++?+-?--?+-? Keeping strands straight is the hard part + light - darkness

“Finishing” Sequence Using primers at the ends of contigs to close gaps. Checking automatic assembly, especially near tandem repeats. Checking that the in-silico restriction digest of a BAC matches the actual digest. Time consuming - 1 year to ‘draft’ genome, 2 years to ‘finish’. Human finished. Mouse will be finished (currently half finished). Other genomes may stay at draft stage, though draft stage can be very good these days.
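
The in-silico digest check mentioned above can be sketched in a few lines. This is a minimal illustration, not the finishing pipeline actually used: the function names, the 5% size tolerance, and the toy sequence are assumptions. The EcoRI recognition site GAATTC, cut after the first G, is used only as a familiar example enzyme.

```python
def digest(sequence: str, site: str, cut_offset: int) -> list[int]:
    """Return fragment lengths from cutting a linear sequence at every site.

    cut_offset is the position inside the recognition site where the
    enzyme cuts (1 for EcoRI, which cuts G^AATTC).
    """
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


def digests_agree(predicted: list[int], observed: list[int],
                  tolerance: float = 0.05) -> bool:
    """Compare predicted and gel-measured fragment sizes, ignoring order.

    The tolerance is a hypothetical allowance for gel sizing error.
    """
    if len(predicted) != len(observed):
        return False
    pairs = zip(sorted(predicted), sorted(observed))
    return all(abs(p - o) <= tolerance * max(p, 1) for p, o in pairs)


toy_bac = "AAGAATTCCCCCGAATTCTT"  # toy stand-in for an assembled BAC
fragments = digest(toy_bac, "GAATTC", 1)
```

If the assembled sequence predicts a fragment pattern that the gel does not show, the assembly near that site is a candidate for manual review.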

Now What? TGGCTTTTGAAGGGAGTTCTGTTTATATATACGTCAACATCCAGTTGGAGGTGAAAAGGTTAGCACTTGACCCAGGAAGTATCCATGTTTGTTTCAAAAA TAAATCTGCTTCATAAATTTCTTCATCAGTCTTTTTTTCCATTATGAGCTTTGATTATAATAAAGGAGCTGTTATTAACTTTTATTCAAGAAAAGGCCCA TCTCTTTGAAAATATTTACCACCCTTCTCCCTTTCCCCTCATGAAATGTGCCAACTTCATAGGAATTAACAAATTGTAGCCCAGCCAAATACACGGATGC TTAAGCATACCTGAAACTTGAGTATATTTATTTATTACAGACATCCTAAGACCCGTAAACTCTGCTCTGGATCATATCACTCCAGGATCTCAGAGCTGTT CATGATTGTACAGGAAATGGGGAATATCATAGGCTCACAAAGGATAACTGATAGAACTCAGTGTGGTACTTTGGGGACATCAAACATTGTGCGACATGCA AAAGACTATTCACGAATAACACAAAATATACATTCATTGTGCCATCCATCACATTAACAATTGAGCTGAAAATACATTATATCCAGCTAAGATAACTGTG GAAGGAAGAAATTGGTTTGAATAATACTTTTAGGTTCTGAATAACCCAGCACAAATTTTAAACAGAGGGTGGCCCGAGAAGAAAGGGGTAGAGATTGGGA AAGACTTAGCACAGGAAGCCGGGTTTCTGAAGTTTGTGCTCTGCAGGGCTTCTTAACTGTAAGAACAAATCAAGGCTACCCTCTGAGGCATCTGATTGGG TTTAAATGAGGGAATTTTTTCTTTCACCTATAAAATTGTACCAGTTTAGAGAGTTTGCCCACCCTGTTTTAGTAACCTAAACATTTCTAGAAAATCTGTA TAAAGATAAATCTCTTAGGACAAAGTATTTACAACCAGCAAACTCACACACATGAAAATGACTTAAATTAAGGGATGAATTAATTGTGTAAACATATAGT GCATCTCTTCTTCCTGAGCTCCTGGACTCGCCTTTCGCTATATCCTACTTTCAAGGACAAGGGAGGGGAGAGCTGTACATATAGTTAGATAAAAGATGAG AAGATTCCTTCTGGCATGTTTCTGTTGGCAAAGGGAACTATTTTCCAAAAGGTCATCTGAAAGGAACAGTAGGTTCTGTGAATTCTCCTAAAAGCAGGAG GGATGTTAAGGCCCACCAGAAAATGTATGCTGGCACCCAATCTGGATGAAGGTGTTAACCCCGCACCAAGTCTCTGGTCCAGAATTATCTGCAAATATAT TATCCTGGCCAGGAGCTCCCCAGATAGGATTAGAAAGGAAGAAAGAGACTGTAAATGGAAAGAAAGATAAGCTAAGCATGTGCTTTGGGTAAGAAGTCCC AGCCCAAGGAGATGCCTGGGCTGTTGTCTGGGGCTGGAGCCGCCTCAGTGGGAGGTAGTCAGAGTGTCTGAGGTAGAAGACCCCGGGGAAGGAACGCAGG GCGAAGAGCTGGACTTCTCTGAGGATTCCTCGGCCTTCTCGTCGTTTCCTGGCGGGGTGGCCGGAGAGATGGGCAAGAGACCCTCCTTCTCACGTTTCTT TTGCTTCATTCGGCGGTTCTGGAACCAGATCTTCACTTGGGTCTCGTTGAGCTGCAGGGATGCAGCGATCTCCACCCTGCGGGCGCGCGTCAGGTACTTG TTGAAGTGGAACTCCTTCTCCAGTTCCGTGAGCTGCTTGGTAGTGAAGTTGGTGCGCACCGCGTTGGGTTGACCCAGGTAGCCGTACTCTCCAACTTTCC CTGGGGCAAAGTGGGAAGCCATGAGACGGAAATGTAAAAATTTTTAAATCGACTTGAGATTCCCCACACGCTTCATGGCAACACTCAGGTAAAGAAAAGA 
TCAAGAACTCAGCACAAATCGGGCTGTGGAGGGTGAGTGATGAGGTGTAAAGTGTTAACCTGATGTAAACCATTAGCATGGTCAGACCGGTGATTAATGG AGCCTCAAGATATTAACAGAACACTACCGTCACAATAACCACCCCCACATACTTCCTATTTCCCAAATGTATAAAATCCTTGAAAACACACCAATCCCTG AGACTTCTTTGCCCCAACACCTCTGGGCACCCTCTCCATGCACTACAACACTAGTCTGATACAAAAGCCTTTTAAAAAAAAGATCATTATTAATTTCCTT GGAAATTAAGCATACCAGCTCCTTCCAGAATAATCAAGGAGCATCCACCAACCAGCAGGACTGACCTGTTTTGGGAGGGTTTCTTTTGACTTTCATCCAG TCAAAAGTCTGCGCTGGAGAAGATGTCTCCGATGCGGGGGAGCGACAGGCTTCTTGGTGGCTGGCGTGGAGAGGGGACAAGGAGTTATTATACGTAGCCA GGGCCAGGCTCTGGTGCTCCTGTCCATATGAGTGGTGAATGTATTGAGGCGAGCCCACCGCGCCCCCAGCATAACCCTGGTGGTGGTGGTGATGCTGGAC CATGGGAGATGAGAGATTTCCAGAGTAAACAGCGGGAGCGCACTGGGGGTACCCACCACTTACGTCTGCTTCCTGATTTAACGCGTAGGGGCTGTAAGGC GCACTGAAGTTCTGTGAGCCATAGCTTGGACCACAACTTGAGTGGGAGTAGGACACCCCCAGGTTCCCGGAAGTCTGGTAGGTAGCCGGCTGGGGGTGGC GATGGTGGTGGTGGTGGTGGTGGTGGGGCGAACCGATCTGCACCCCCCTGCCCACTAGGAAGCGGTCGTCGCCGCCGCAACTGTTGGCGCTGACCGCGCA CGACTGGAAAGTTGTAATCCTATGGTCCGAGGGGTAGGCTCGGGCTGAGCAGGTCCCCGAGTCGCCACTGCTAAGTATGGGGTATTCCAGGAAGGAGTTC ATTCTTGCATTGTCCATCTGTCACTGAGTGACCTGGTCCTGCGAAGCCCGGCGTGACTGTGCCAACTTTCTCACTTCCTC

Finding the Genes Dr. Blat helping a gene find itself.

SIGLEC7 - a gene with some transcriptional complexity. Sialic Acid Binding/Ig-like Lectin 7 displayed in UCSC Genome Browser

Genes: Lines of Evidence Full length human mRNA (the best!). Protein homology with other species. EST evidence - 1st step for much mRNA. Evidence from genome/genome alignments. HMM based gene finders.

Transferrin Receptor in UCSC Genome Browser

Clicking on a “known gene” brings up a large page of information on the gene. Transferrin

Current state of human genome ~99% of human genome sequenced. Last 1% will still be a challenge. ~85% of human genes located. Substantial resources are being devoted to last 15%. ~20% of human genes with any depth of functional annotation. Curation and integrated database are key to progress. <1% of human regulatory regions located.

Transferrin Receptor Note peaks of conservation in 3’ UTR. These include iron response elements which regulate translation of this gene.

Comparative Genomics Webb Miller

Comparative Genomics at BMP10

Conservation of Gene Features Conservation pattern across 3165 mappings of human RefSeq mRNAs to the genome. A program sampled 200 evenly spaced bases across 500 bases upstream of transcription, the 5’ UTR, the first coding exon, introns, middle coding exons, introns, the 3’ UTR and 500 bases after polyadenylation. There are peaks of conservation at the transition from one region to another.
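
Sampling a fixed number of evenly spaced bases is what lets regions of different lengths be averaged position by position. The sketch below shows one way this could be done. The function names and the per-base score lists are assumptions for illustration, not the program used for the slide.

```python
def sample_evenly(scores: list[float], n_samples: int = 200) -> list[float]:
    """Pick n_samples values at evenly spaced positions, first to last base."""
    if not scores:
        raise ValueError("region has no bases")
    if n_samples == 1:
        return [scores[0]]
    last = len(scores) - 1
    # Integer arithmetic keeps index 0 and index `last` as the end points.
    return [scores[(i * last) // (n_samples - 1)] for i in range(n_samples)]


def mean_profile(per_gene: list[list[float]]) -> list[float]:
    """Average equal-length sampled profiles column by column."""
    return [sum(column) / len(column) for column in zip(*per_gene)]


# Two hypothetical regions of different length, reduced to 3 samples each.
short_region = sample_evenly([0.0, 1.0, 2.0, 3.0, 4.0], 3)
long_region = sample_evenly([float(x) for x in range(9)], 3)
profile = mean_profile([short_region, long_region])
```

After resampling, a 300-base 5’ UTR and a 3,000-base 5’ UTR each contribute the same number of points to the averaged curve.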

Chaining Alignments Chaining bridges the gulf between syntenic blocks and base-by-base alignments. Local alignments tend to break at transposon insertions, inversions, duplications, etc. Global alignments tend to force non-homologous bases to align. Chaining is a rigorous way of joining together local alignments into larger structures.

Chains join together related local alignments Protease Regulatory Subunit 3

Affine penalties are too harsh for long gaps Log count of gaps vs. size of gaps in mouse/human alignment correlated with sizes of transposon relics. Affine gap scores model red/blue plots as straight lines.
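
An affine penalty grows linearly with gap length, which is why it appears as a straight line on a log-count plot. The sketch below contrasts it with a penalty that grows with the logarithm of gap length. The numeric constants and the logarithmic form are illustrative assumptions, not the scoring scheme used to build the chains.

```python
import math


def affine_gap_cost(length: int, open_cost: float = 400.0,
                    extend_cost: float = 30.0) -> float:
    """Affine penalty: fixed opening cost plus a constant cost per base."""
    return open_cost + extend_cost * length


def log_gap_cost(length: int, open_cost: float = 400.0,
                 scale: float = 300.0) -> float:
    """Hypothetical concave penalty: long gaps cost only slightly more."""
    return open_cost + scale * math.log(length)


for gap_length in (1, 100, 10_000):
    print(gap_length,
          round(affine_gap_cost(gap_length)),
          round(log_gap_cost(gap_length)))
```

With these assumed constants, a 10,000-base gap, such as one left by a transposon insertion, costs about a hundred times more under the affine model than under the concave one. This is the sense in which affine penalties are too harsh for long gaps.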

Before and After Chaining

Chaining Algorithm Input - blocks of gapless alignments from blastz. Dynamic program based on the recurrence relationship: $\mathrm{score}(B_i) = \max_{j<i}\left(\mathrm{score}(B_j) + \mathrm{match}(B_i) - \mathrm{gap}(B_i, B_j)\right)$. Uses Miller’s KD-tree algorithm to minimize which parts of dynamic programming graph to traverse. Timing is O(N log N), where N is number of blocks (which is in hundreds of thousands).
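
The recurrence can be made concrete with a small dynamic program. This sketch is a plain quadratic-time version that checks every earlier block. It does not use the KD-tree speedup mentioned on the slide. The `Block` type, the `gap_cost` function and its constants are assumptions for illustration. It also lets a block start a new chain on its own, a base case the slide leaves implicit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """One gapless alignment block; coordinates are half-open [start, end)."""
    t_start: int
    t_end: int
    q_start: int
    q_end: int
    match_score: float


def gap_cost(prev: Block, cur: Block) -> float:
    """Hypothetical gap penalty: a fixed charge plus a per-base charge."""
    skipped = (cur.t_start - prev.t_end) + (cur.q_start - prev.q_end)
    return 1.0 + 0.1 * skipped


def best_chain(blocks: list[Block]) -> tuple[float, list[Block]]:
    """Return the score and the blocks of the highest-scoring chain."""
    if not blocks:
        return 0.0, []
    ordered = sorted(blocks, key=lambda b: (b.t_start, b.q_start))
    score = [b.match_score for b in ordered]  # each block may start a chain
    back = [-1] * len(ordered)
    for i, cur in enumerate(ordered):
        for j in range(i):
            prev = ordered[j]
            # prev must end before cur starts on both sequences
            if prev.t_end <= cur.t_start and prev.q_end <= cur.q_start:
                candidate = score[j] + cur.match_score - gap_cost(prev, cur)
                if candidate > score[i]:
                    score[i], back[i] = candidate, j
    end = max(range(len(ordered)), key=score.__getitem__)
    total = score[end]
    chain = []
    while end != -1:  # walk the back-pointers to recover the chain
        chain.append(ordered[end])
        end = back[end]
    chain.reverse()
    return total, chain


a = Block(0, 10, 0, 10, 10.0)
b = Block(20, 30, 20, 30, 10.0)
off_diagonal = Block(15, 25, 100, 110, 3.0)
total, chain = best_chain([b, off_diagonal, a])
```

In this toy input the two collinear blocks are joined, and the weak off-diagonal block is left out because the gap needed to reach it costs more than it scores.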

Netting Alignments Commonly multiple mouse alignments can be found for a particular human region, particularly for coding regions. Net finds the best mouse match for each human region. Highest scoring chains are used first. Lower scoring chains fill in gaps within chains, inducing a natural hierarchy.
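
The "highest scoring chains first, lower scoring chains fill the gaps" idea can be sketched as a greedy assignment of human bases to chains. This is a deliberately flattened simplification: it records only which chain owns each base and does not build the nested levels of a real net. The tuple layout and the chain names are assumptions for illustration.

```python
from typing import Optional

# A chain here is (name, score, blocks), where blocks are half-open
# [start, end) intervals on the human sequence. Hypothetical layout.
Chain = tuple[str, float, list[tuple[int, int]]]


def net(region_length: int, chains: list[Chain]) -> list[Optional[str]]:
    """Assign each human base to the best-scoring chain that covers it."""
    owner: list[Optional[str]] = [None] * region_length
    # Process chains from highest to lowest score.
    for name, _score, blocks in sorted(chains, key=lambda c: c[1],
                                       reverse=True):
        for start, end in blocks:
            for pos in range(max(start, 0), min(end, region_length)):
                if owner[pos] is None:  # only fill bases still unclaimed
                    owner[pos] = name
    return owner


chains = [
    ("pseudogene", 200.0, [(2, 10)]),
    ("ortholog", 1000.0, [(0, 4), (8, 12)]),
]
assignment = net(12, chains)
```

Here the high-scoring chain claims its two blocks first, and the lower-scoring chain ends up owning only the gap between them.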

Net Focuses on Ortholog

Net highlights rearrangements A large gap in the top level of the net is filled by an inversion containing two genes. Numerous smaller gaps are filled in by local duplications and processed pseudo-genes.

Useful in finding pseudogenes Ensembl and Fgenesh++ automatic gene predictions confounded by numerous processed pseudogenes. Domain structure of resulting predicted protein must be interesting!

Mouse/Human Rearrangement Statistics Number of rearrangements of given type per megabase excluding known transposons.

A Rearrangement Hot Spot Rearrangements are not evenly distributed. Roughly 5% of the genome is in hot spots of rearrangements such as this one. This 350,000 base region is between two very long chains on chromosome 7.

Reconstructed ancestral (boreutherian) genome for one chromosome

Finding Function We’ve located 85% of the genes, on track for 95% in a year or two. We have SOME idea of what 30% of the genes do. We have virtually NO idea of what the rest do.

How to Find Function Homology - guilt by association. Orthologs very valuable. Genetics/knockouts - what happens when a gene gets broken? –RNAi is speeding this up amazingly in worms and other model organisms. Expression - when and where is gene used? –Microarrays, in situs, GFP fusions. Interactions - what molecules are touching? –Yeast 2 hybrid, Immunoprecipitations Literature - finding out what we already know.

Data Mining

Gene Sorter - info on sets of genes

Sorted by homology

Sorted by genome distance

Coping with Bioinformatics Tower of Babel

Up in Testes, Down in Brain

VisiGene Image browser for in-situ and other gene-oriented pictures. Hopefully in the long run it will have a million images covering almost all vertebrate genes. Currently has 6000 images covering 1000 mouse transcription factors courtesy of Paul Gray et al.

Gene Browser Staff Programming: Hiram Clawson, Mark Diekhans, Rachel Harte, Angie Hinrichs, Fan Hsu, Andy Pohl, Kate Rosenbloom, Chuck Sugnet. Docs, quality, support: Gill Barber, Ron Chao, Jennifer Jackson, Donna Karolchik, Bob Kuhn, Crystal Lynch, Ali Sultan-Qurraie, Heather Trumbower. Computer systems: Jorge Garcia, Patrick Gavin, Paul Tatarsky.

Comparative Genomics UCSC - Robert Baertsch, Gill Bejerano, Yontoa Lu, Jacob Pedersen, Katie Pollard, Adam Siepel, Daryl Thomas, David Haussler. PSU - Laura Elnitski, Belinda Giardine, Ross Hardison, Minmei Hou, Scott Schwartz, Webb Miller.

Data Contributors Human Genome Project. GenBank/DDBJ/EMBL contributors. Novartis GNF foundation. Affymetrix, Perlegen, SNP Consortium. SwissProt, Ensembl, EBI and NCBI. Jackson Labs, RGD, WormBase, FlyBase. Many contributors of gene prediction and other tracks.

Funding National Human Genome Research Institute. Howard Hughes Medical Institute. Taxpayers in the USA and California.

THE END

Confounded Pseudogenes! Pseudogenes confound HMM and homology based gene prediction. Processed pseudogenes can be identified by: –Lack of introns (but ~20% of real genes lack introns). –Not being the best place in the genome an mRNA aligns (be careful not to filter out real paralogs). –Being inserted from another chromosome since the dog/human common ancestor (breaking synteny). –High rate of mutation (Ka/Ks ratio). Robert Baertsch at UCSC has produced a processed pseudogene track. Yontoa Lu is working on a non-processed pseudogene track.
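
The four criteria above can be collected as independent lines of evidence rather than a single yes/no test, since each one has known exceptions. The sketch below is only an illustration of that idea. The field names and the Ka/Ks threshold of 0.5 are assumptions, and this is not the method behind the processed pseudogene track.

```python
from dataclasses import dataclass


@dataclass
class AlignmentEvidence:
    """Hypothetical summary of one mRNA-to-genome alignment."""
    has_introns: bool
    is_best_alignment: bool
    breaks_synteny: bool
    ka_ks: float


def pseudogene_evidence(evidence: AlignmentEvidence,
                        ka_ks_threshold: float = 0.5) -> list[str]:
    """List which of the slide's criteria point toward a processed pseudogene."""
    reasons = []
    if not evidence.has_introns:
        reasons.append("no introns")
    if not evidence.is_best_alignment:
        reasons.append("not the best alignment")
    if evidence.breaks_synteny:
        reasons.append("breaks synteny")
    if evidence.ka_ks > ka_ks_threshold:  # threshold is an assumption
        reasons.append("high Ka/Ks")
    return reasons


suspect = pseudogene_evidence(AlignmentEvidence(False, False, True, 0.9))
likely_gene = pseudogene_evidence(AlignmentEvidence(True, True, False, 0.1))
```

Returning the list of reasons rather than a boolean leaves room for the caveats on the slide: an intronless alignment alone is weak evidence, while several criteria together are more convincing.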

Close up of two processed pseudogenes

Detail Near Translation Start Note the relatively conserved base three positions before the translation start (constrained to be a G or an A by the Kozak consensus sequence), and the first three translated bases (ATG).
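
The two constraints named on this slide are easy to state as a check on an mRNA sequence. The sketch below tests only those two positions and assumes an uppercase DNA-alphabet string and a 0-based start index. The function name is hypothetical, and the fuller Kozak consensus has further positions not covered here.

```python
def has_kozak_core(mrna: str, start: int) -> bool:
    """Check for A or G at position -3 and an ATG at the translation start.

    `start` is the 0-based index of the A in the candidate start codon.
    """
    if start < 3 or start + 3 > len(mrna):
        return False  # not enough sequence on one side to test
    return mrna[start - 3] in "AG" and mrna[start:start + 3] == "ATG"


strong_context = has_kozak_core("GCCACCATGG", 6)  # A at position -3
weak_context = has_kozak_core("GCCTCCATGG", 6)    # T at position -3
```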

Normalized eScores

Table browser - text-oriented browsing and data analysis of genome browser database.