Genome Browsers UCSC (Santa Cruz, California) and Ensembl (EBI, UK)
Eukaryotic Genomes: Not only collections of genes
Protein-coding genes
RNA genes (rRNA, snRNA, snoRNA, miRNA, tRNA)
Structural DNA (centromeres, telomeres)
Regulation-related sequences (promoters, enhancers, silencers, insulators)
Parasite sequences (transposons)
Pseudogenes (non-functional gene-like sequences)
Simple sequence repeats
Eukaryotic Genomes: High fraction of non-coding DNA
Figure legend: blue = prokaryotes; black = unicellular eukaryotes; other colors = multicellular eukaryotes (red = vertebrates)
Source: Mattick, NRG, 2004
Human Genome
3 billion base pairs (3 Gb)
22 chromosome pairs + X and Y chromosomes
Chromosome length varies from ~50 Mb to ~250 Mb
Protein-coding genes
–compare with ~14,000 for the fruit fly and ~19,000 for the nematode C. elegans
Human genome
Only 1.2% codes for proteins; 3.5-5% is under selection
Long introns, short exons
Large spaces between genes
More than half consists of repetitive DNA
Source: Molecular Biology of the Cell (4th edition) (Alberts et al., 2002)
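To make the percentages above concrete, the following minimal sketch converts them into base pairs. It assumes the round figure of 3 Gb from the Human Genome slide; the function name is illustrative only.

```python
# Approximate genome size from the Human Genome slide (~3 Gb); a round figure.
GENOME_SIZE_BP = 3_000_000_000


def bases_for_percentage(percent, genome_size_bp=GENOME_SIZE_BP):
    """Return how many base pairs a given percentage of the genome covers."""
    return genome_size_bp * percent / 100


coding_bp = bases_for_percentage(1.2)          # protein-coding fraction
selected_low_bp = bases_for_percentage(3.5)    # lower bound, under selection
selected_high_bp = bases_for_percentage(5.0)   # upper bound, under selection

print(f"Protein-coding: about {coding_bp / 1e6:.0f} Mb")
print(f"Under selection: about {selected_low_bp / 1e6:.0f} to "
      f"{selected_high_bp / 1e6:.0f} Mb")
```

Under these assumptions, roughly 36 Mb of the genome codes for proteins, while about 105 to 150 Mb is under selection.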
Variation Along the Genome Sequence
Nucleotide usage varies along chromosomes
–Protein-coding regions tend to have high GC levels
Genes are not equally distributed across the chromosomes
–Housekeeping genes are generally in gene-dense areas
–Gene-poor areas tend to have many tissue-specific genes
Source: Ensembl
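Variation in nucleotide usage along a chromosome can be illustrated by computing GC content in consecutive windows. The sketch below uses a made-up toy sequence and hypothetical function names; real analyses would use much larger windows on actual chromosome sequence.

```python
def gc_content(seq):
    """Return the fraction of G and C bases in a DNA string (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window, step=None):
    """Return a list of (start, gc_fraction) for windows along seq.

    Windows do not overlap unless a smaller step is given.
    """
    step = step or window
    result = []
    for start in range(0, len(seq) - window + 1, step):
        result.append((start, gc_content(seq[start:start + window])))
    return result


# Toy sequence: an AT-rich stretch, a GC-rich stretch, and a mixed stretch.
toy = "ATATATATAT" + "GCGCGCGGCC" + "ATGCATGCAT"
profile = windowed_gc(toy, window=10)
for start, gc in profile:
    print(start, f"{gc:.2f}")
```

On this toy input the three windows have GC fractions of 0.00, 1.00 and 0.40, mimicking how GC levels can differ between neighbouring regions.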
Chromosome organisation
DNA is packed in chromatin
Active genes are in less dense chromatin (beads-on-a-string)
Non-active genes are often in densely packed chromatin (30-nm fiber)
Gene regulation by changing chromatin density and by methylation/acetylation of the histones
Limited availability of chromatin information in genome browsers (post-translational histone modifications are currently under investigation with ChIP-on-chip experiments)
Source: Lodish (4th edition)
Genome browsers: UCSC, NCBI, Ensembl
Genome Browsing With the UCSC Genome Browser
UCSC Genome browser
Choose a species, an assembly and a gene
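Instead of clicking through the start page, a browser view can also be opened directly via a URL. The sketch below builds such a link for the UCSC `hgTracks` page using its `db` (assembly) and `position` parameters. The assembly name `hg38` and the coordinates are arbitrary examples, not taken from the slides, and the helper function name is hypothetical.

```python
from urllib.parse import urlencode

# Entry point of the UCSC Genome Browser track display.
UCSC_TRACKS_URL = "https://genome.ucsc.edu/cgi-bin/hgTracks"


def ucsc_browser_url(assembly, position):
    """Build a UCSC Genome Browser link for an assembly and a position.

    assembly: UCSC assembly identifier (here assumed to be e.g. "hg38").
    position: a region such as "chr17:43044295-43125483".
    """
    query = urlencode({"db": assembly, "position": position})
    return f"{UCSC_TRACKS_URL}?{query}"


# Arbitrary example region on an assumed human assembly.
url = ucsc_browser_url("hg38", "chr17:43044295-43125483")
print(url)
```

Note that `urlencode` percent-encodes the colon in the position, which is why the printed link contains `%3A`.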
Gene search results
Genome browser
Genomic Datatypes (Tracks)
Transcription data rather complicated
Browser → Gene record
Gene record
Gene record (2)
Gene record (3)
Gene record (4) “best hit”
Gene record (5)
Genomic elements
Genome browsers can also be used to examine other things
–Genomic sequence conservation
–Pseudogenes
–Duplications and deletions of pieces of chromosome (Copy Number Variations, CNVs)
Genomic Sequence Conservation
Not only protein-coding regions are conserved in evolution
Conserved non-coding genomic sequences can be involved in gene regulation (enhancers, silencers, insulators)
With the UCSC browser one can examine genomic conservation
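The idea of sequence conservation can be illustrated with a very simple per-column score on a multiple alignment: the fraction of sequences that share the most common character in each column. This is only a toy measure on invented sequences, not the scoring used by the browser's conservation tracks, and gap characters would be counted like any other character.

```python
from collections import Counter


def column_conservation(aligned_seqs):
    """Return, per alignment column, the fraction of sequences that carry
    the most common character in that column (1.0 = fully conserved)."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*aligned_seqs):
        most_common_count = Counter(column).most_common(1)[0][1]
        scores.append(most_common_count / len(column))
    return scores


# Invented alignment of the same region in three hypothetical species.
alignment = [
    "ACGTACGT",
    "ACGTTCGA",
    "ACGTGCGC",
]
scores = column_conservation(alignment)
conserved_columns = [i for i, s in enumerate(scores) if s == 1.0]
print(conserved_columns)
```

In this toy alignment columns 4 and 7 vary between species, while all other columns are fully conserved.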
Genomic Conservation (UCSC)
Pseudogenes
Pseudogenes “look” like (are homologous to) protein-coding genes, but are non-functional
Two types:
–Unprocessed pseudogenes (loss of function)
–Processed pseudogenes (mRNAs that are reverse-transcribed back into the genome; they lack introns and sometimes have a poly(A) tail)
The UCSC browser contains various databases of pseudogenes:
–Yale pseudogenes (both types of pseudogenes)
–Vega pseudogenes (both types of pseudogenes)
–Retroposed genes (only processed pseudogenes)
Pseudogenes (UCSC)
Copy Number Variation
People do not only vary at the nucleotide level (SNPs); short pieces of the genome can also be present in a varying number of copies (Copy Number Polymorphisms (CNPs) or Copy Number Variants (CNVs))
When there are genes in the CNV areas, this can lead to variation in the number of gene copies between individuals
With the UCSC browser CNVs can be examined
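Whether a CNV affects gene copy number comes down to an interval-overlap question: which genes lie in the CNV region? The sketch below uses invented gene names and coordinates, and assumes a baseline of two copies per gene (one per chromosome of a pair) that the CNV raises or lowers.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """Return True if two half-open intervals [start, end) overlap."""
    return a_start < b_end and b_start < a_end


def gene_copy_numbers(genes, cnvs, baseline=2):
    """Return {gene_name: copy_number} for one individual.

    genes: list of (name, chrom, start, end)
    cnvs:  list of (chrom, start, end, copy_change), where copy_change is
           e.g. +1 for a duplication or -1 for a deletion (an assumption
           of this sketch, not a browser data format).
    """
    copies = {}
    for name, g_chrom, g_start, g_end in genes:
        total = baseline
        for c_chrom, c_start, c_end, change in cnvs:
            if g_chrom == c_chrom and overlaps(g_start, g_end, c_start, c_end):
                total += change
        copies[name] = total
    return copies


# Invented example data.
genes = [
    ("geneA", "chr1", 100, 200),
    ("geneB", "chr1", 500, 800),
    ("geneC", "chr2", 100, 200),
]
cnvs = [("chr1", 150, 600, +1)]  # a duplication covering parts of geneA and geneB

copies = gene_copy_numbers(genes, cnvs)
print(copies)
```

Here the duplication on chr1 touches geneA and geneB, so this individual would carry three copies of each, while geneC on chr2 stays at two. A partial overlap is counted as a full extra copy, which is a simplification.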
Copy Number Variation (UCSC)
Finding a sequence in the genome
BLAT – Search page
BLAT - Results
BLAT – “Details”
BLAT – “Browser”
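The general idea behind fast sequence lookup in a genome is to index short words (k-mers) once and then use that index to jump to candidate locations. The sketch below is a toy exact-match version of that idea on an invented sequence. It is not BLAT's actual algorithm, which also handles mismatches and gaps; all function names are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


def find_exact(query, genome, index, k):
    """Return all start positions where query occurs exactly in genome.

    The first k-mer of the query is looked up in the index, and each
    candidate position is then verified against the full query.
    """
    if len(query) < k:
        raise ValueError("query must be at least k bases long")
    hits = []
    for pos in index.get(query[:k], []):
        if genome[pos:pos + len(query)] == query:
            hits.append(pos)
    return hits


# Invented toy "genome" containing the query twice.
genome = "TTGACCGTAGGCTAACCGTAGGA"
index = build_kmer_index(genome, k=4)
hits = find_exact("CCGTAGG", genome, index, k=4)
print(hits)
```

The index is built once per genome, so each additional query only costs a dictionary lookup plus verification of the few candidate positions.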
Genome browsers: UCSC, Ensembl
Genome Browsing With the Ensembl Genome browser
Ensembl Genome browser
The Human Genome
MapView – Overview chromosome
ContigView – Zooming in (compare UCSC)
ContigView (2)
GeneView – Gene record
TransView - mRNA Transcript
TransView - mRNA Transcript (2)
Alternative Transcripts Source: Wikipedia
GeneView - Show Alternative Transcripts
GeneSpliceView - Alternative Transcripts
Single Nucleotide Polymorphisms (SNPs)
Sequence variations within a species
Similar to mutations, but the variants are simultaneously present in the population and generally have little effect
Used as genetic markers (e.g. a genetic disease can be associated with a SNP)
Ensembl offers a nice SNP view
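That SNP variants are "simultaneously present in the population" can be shown by tabulating allele frequencies per position across several individuals. The sketch below uses invented, equal-length sequences and reports every position at which more than one nucleotide occurs; it is a toy illustration, not how Ensembl derives its SNP data.

```python
from collections import Counter


def polymorphic_sites(sequences):
    """Return {position: {allele: frequency}} for positions where more
    than one nucleotide occurs among the given equal-length sequences."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    sites = {}
    for pos, column in enumerate(zip(*sequences)):
        counts = Counter(column)
        if len(counts) > 1:
            sites[pos] = {allele: n / len(column) for allele, n in counts.items()}
    return sites


# Invented sequences of the same region in four individuals.
population = [
    "ACGTTGCA",
    "ACGTCGCA",
    "ACGTTGCA",
    "ACGTTGCA",
]
sites = polymorphic_sites(population)
print(sites)
```

In this toy population only position 4 is polymorphic, with allele T at frequency 0.75 and allele C at frequency 0.25.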
GeneView - Show SNPs
GeneSNPView - SNPs
GeneView - Show Protein
ProtView - Protein
ProtView - Protein Sequence
ProtView – Search proteins with the same domains
DomainView – Proteins with a certain domain (InterPro = SMART + Pfam + others)
ProtView - Find Proteins In the Same Protein Family
FamilyView – Alignments of homologous proteins
Finding Human Genes
Finding a human gene (2)
Blast
Blast (2)
UCSC vs Ensembl: Which is better?
They contain more or less the same information
UCSC is a bit easier to use
Ensembl gives more detailed information and more flexible data export
Other small differences in data (e.g. UCSC has more extensive genomic conservation data)
Use whichever you are familiar with!