Introduction to Genes and Genomes with Ensembl
Large amounts of raw DNA sequence data
Making Sense out of Sequence …
• http://www.ensembl.org
• http://www.ncbi.nlm.nih.gov/mapview
• http://genome.ucsc.edu
The Ensembl genome browser: making it interesting
The ENCODE (ENCyclopedia Of DNA Elements) project, Science 306: 636-640 (2004)
• Genes
• Variation
• Regulatory elements
Vertebrate species on Ensembl Mostly vertebrates
Non-vertebrates on Ensembl Genomes (www.ensemblgenomes.org)
• Fungi
• Bacteria
• Protists
• Metazoa
• Plants
Ensembl and EnsemblGenomes
Ensembl gene models
• Automatic annotation
• Manual annotation
Automatic gene annotation
• Genome-wide determination using the Ensembl automated pipeline
• Predictions based on the genomic sequence (ab initio)
• Predictions based on experimental (biological) data: ESTs, RNAseq data, cDNA and protein alignments (from sequence DBs)
Biological Evidence
• International Nucleotide Sequence databases
• Protein sequence databases
  Swiss-Prot: manually curated
  TrEMBL: unreviewed translations
• NCBI RefSeq: manually annotated proteins and mRNAs (NP, NM)
Manual gene annotation
• Gene determination on a case-by-case basis by a curator
• Genome-wide Genes list
Ensembl automatic annotation
Automatic annotation: many species (>60), genome-wide at once
Manual annotation: few species (Hs, Mm, Dr: human, mouse, zebrafish), gene-by-gene
Golden transcripts
• Identical annotation (the automatic and manual annotation agree)
• Higher confidence and quality
Figure labels: 5’ UTR, exon, intron, 3’ UTR. Exons are drawn as boxes: filled boxes are coding and unfilled boxes are untranslated. Introns are drawn as lines.
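The drawing convention above can be mimicked in a few lines of Python. This is only a sketch under stated assumptions: the coordinates are made up, they are treated as 1-based and inclusive on the forward strand, and the function names are hypothetical rather than part of any Ensembl tool.

```python
def introns_from_exons(exons):
    """Return the gaps between consecutive exons (the 'lines' in the drawing)."""
    ordered = sorted(exons)
    return [(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(ordered, ordered[1:])]


def split_exons(exons, cds_start, cds_end):
    """Label each exon piece as 'UTR' (unfilled box) or 'coding' (filled box)."""
    pieces = []
    for start, end in sorted(exons):
        if start < cds_start:
            # Part of the exon lying before the coding region
            pieces.append(("UTR", start, min(end, cds_start - 1)))
        if end >= cds_start and start <= cds_end:
            # Part of the exon overlapping the coding region
            pieces.append(("coding", max(start, cds_start), min(end, cds_end)))
        if end > cds_end:
            # Part of the exon lying after the coding region
            pieces.append(("UTR", max(start, cds_end + 1), end))
    return pieces


# Made-up transcript: three exons, coding region from 150 to 550
exons = [(100, 200), (300, 400), (500, 600)]
print(introns_from_exons(exons))
print(split_exons(exons, 150, 550))
```

Whether a UTR piece is the 5’ or the 3’ UTR depends on the strand of the transcript, which this sketch deliberately leaves out.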
CCDS transcripts
• Consensus coding DNA sequence set
• Agreement between EBI, WTSI, UCSC and NCBI
• http://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi
Higher quality transcripts
• CCDS transcripts (protein-coding only)
• Ensembl/Havana merged transcripts
Both are available for a limited number of species only.
Ensembl stable IDs
ENSG########### Ensembl Gene ID
ENST########### Ensembl Transcript ID
ENSP########### Ensembl Peptide ID
ENSE########### Ensembl Exon ID
For non-human species a species code is inserted after ENS:
MUS (Mus musculus) for mouse: ENSMUSG###
DAR (Danio rerio) for zebrafish: ENSDARG###
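The ID layout can be checked with a small Python sketch. It assumes exactly the pattern shown on the slide: ENS, an optional three-letter species code (absent for human), one feature letter, then 11 digits. The function name is hypothetical, the IDs are used only as illustrations of the format, and only the species codes named on the slide are translated.

```python
import re

# ENS + optional three-letter species code + feature letter + 11 digits
STABLE_ID = re.compile(r"^ENS([A-Z]{3})?([GTPE])(\d{11})$")

FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "peptide", "E": "exon"}

# Only the codes named on the slide; any other code is returned unchanged.
SPECIES_CODES = {None: "human", "MUS": "mouse", "DAR": "zebrafish"}


def parse_stable_id(stable_id):
    """Split an Ensembl stable ID into species, feature type and number."""
    match = STABLE_ID.match(stable_id)
    if match is None:
        raise ValueError(f"not a recognised Ensembl stable ID: {stable_id!r}")
    code, letter, digits = match.groups()
    return {
        "species": SPECIES_CODES.get(code, code),
        "feature": FEATURE_TYPES[letter],
        "number": int(digits),
    }


print(parse_stable_id("ENSG00000139618"))
print(parse_stable_id("ENSMUSG00000017167"))
```

IDs that do not fit this layout, including any feature types not listed on the slide, are rejected with a ValueError rather than guessed at.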
NCBI http://www.youtube.com/ncbinlm Go to www.youtube.com Search “NCBI tutorial general”
The National Center for Biotechnology Information
Bethesda, MD
Created in 1988 as a part of the National Library of Medicine at NIH
• Establish public databases
• Research in computational biology
• Develop software tools for sequence analysis
• Disseminate biomedical information
Three international nucleotide sequence databases
Selected NCBI Databases
Biomedical literature
• PubMed: free Medline
• PubMed Central: full text online access
• NCBI Bookshelf: online biomedical textbooks
Biomolecular Databases
• Nucleotide
  GenBank: submitted sequence records
  RefSeq: curated NCBI reference sequences
• Protein: GenBank and RefSeq translations, outside protein
• dbSNP: small scale genetic variations
• Structure: biomolecular 3-D structures
• MMDB: NCBI’s 3D structure database
• GEO: microarray expression data
• SRA: next-generation sequence data
GenBank & RefSeq
RefSeq: NCBI’s Derivative Sequence Database
• Experimentally verified / curated transcripts and proteins: NM_, NP_ accession numbers
• Model transcripts and proteins: XM_, XP_ accession numbers
• Assembled genomic regions (contigs): NT_, NW_ accession numbers
• Chromosome records: NC_, AC_ accession numbers
• RefSeqGene records: NG_ accession numbers (NG_ is also used for pseudogenes and other fixed genomic sequences)
• Draft whole genome shotgun assemblies (microbial): NZ_ accession numbers
• Microbial proteins: NP_, YP_, ZP_ accessions
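The prefix list lends itself to a simple lookup table. The Python sketch below only transcribes the slide; the function name and the example accessions are made up for illustration, and no NCBI service is queried.

```python
# Prefix-to-category table transcribed from the RefSeq slide.
# NP_ appears twice on the slide (curated proteins and microbial proteins),
# so the label here covers both readings.
REFSEQ_PREFIXES = {
    "NM_": "curated transcript",
    "NP_": "curated protein (also used for microbial proteins)",
    "XM_": "model transcript",
    "XP_": "model protein",
    "NT_": "assembled genomic region (contig)",
    "NW_": "assembled genomic region (contig)",
    "NC_": "chromosome record",
    "AC_": "chromosome record",
    "NG_": "RefSeqGene record or other fixed genomic sequence",
    "NZ_": "draft whole genome shotgun assembly (microbial)",
    "YP_": "microbial protein",
    "ZP_": "microbial protein",
}


def classify_refseq(accession):
    """Return the slide's category for a RefSeq-style accession string."""
    prefix = accession.strip()[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "unknown prefix")


# Made-up accessions, used only to show the lookup
for accession in ("NM_000000.1", "XP_000000.1", "NC_000000.1", "AB000000"):
    print(accession, "->", classify_refseq(accession))
```

Accessions without one of the listed prefixes, such as a plain GenBank accession, fall through to "unknown prefix".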
UCSC Genome Browser https://genome.ucsc.edu/
GeneCards http://www.genecards.org/