Genome Annotation and the Human Genome

BI420 – Introduction to Bioinformatics. Genome Annotation and the Human Genome. Fall 2013. Gabor Marth, Department of Biology, Boston College

The landscape of the human genome

Goal of Genome Annotation: Identify all distinct elements within a genome. Annotation tends to focus on functional elements such as protein coding genes and RNA genes, but may also include non-functional sequences, including repetitive elements. Element types shown: protein coding genes, RNA genes, repetitive elements.

The starting material AGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGA CCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTT GAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTG GTGTAGATGGAGATCGCGTAGCGTGGTAGCGCGAGTTTGCGAGCT AGCTAGGCTCCGGATGCGACCAGCTTTGATAGATGAATATAGTGT GCGCGACTAGCTGTGTGTTGAATATATAGTGTGTCTCTCGATATGT AGTCTGGATCTAGTGTTGGTGTAGATGGAGATCGCGTGCTTGAG TCGTTCGTTTTTTTATGCTGATGATATAAATATATAGTGTTGGTG GGGGGTACTCTACTCTCTCTAGAGAGAGCCTCTCAAAAAAAAAGCT CGGGGATCGGGTTCGAAGAAGTGAGATGTACGCGCTAGXTAGTAT ATCTCTTTCTCTGTCGTGCTGCTTGAGATCGTTCGTTTTTTTATGCT GATGATATAAATATATAGTGTTGGTGGGGGGTACTCTACTCTCTCT AGAGAGAGCCTCTCAAAAAAAAAGCTCGGGGATCGGGTTCGAAGA AGTGAGATGTACGCGCTAGXTAGTATATCTCTTTCTCTGTCGTGCT

Coding genes: ATGGCACCACCGATGTCTACGTGGTAGGGGACTATAAAAAAAAAAA (figure labels: start codon, stop codon, polyA signal; the stretch from the start codon to the stop codon is the Open Reading Frame, or ORF). Ab initio is Latin for “from the beginning.” Ab initio gene predictions are those based on computational sequence analysis. Simple approach to gene prediction: look for start codons and stop codons.
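
The "look for start codons and stop codons" approach can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any of the gene finders named in these slides: it scans only the forward strand, reports one ORF per first in-frame ATG, and the function name `find_orfs` and the `min_len` parameter are made up for this example.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_len=0):
    """Return (start, end, orf) tuples for forward-strand ORFs.

    Coordinates are 0-based, end-exclusive, and the stop codon is
    included in the reported ORF. min_len is a hypothetical filter
    on ORF length in nucleotides.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG in the current open frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                end = i + 3
                if end - start >= min_len:
                    orfs.append((start, end, seq[start:end]))
                start = None  # resume looking for the next ATG
    return orfs

# The example sequence from the slide
example = "ATGGCACCACCGATGTCTACGTGGTAGGGGACTATAAAAAAAAAAA"
for start, end, orf in find_orfs(example):
    print(start, end, orf)
```

On real genomes this naive scan produces many spurious hits, which is one reason practical gene finders add statistical models of coding sequence on top of it.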

Typical structure of bacterial and eukaryotic genes: Eukaryotic genes have introns, while bacterial genes generally do not.

Ab initio predictions of exons: …AGAATAGGGCGCGTACCTTCCAACGAAGACTGGG… (figure labels: splice donor site, splice acceptor site)
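
As a rough illustration of how splice sites can be searched for, the sketch below lists candidate donor and acceptor positions using the canonical GT...AG rule (most introns start with GT and end with AG). The slide itself does not state this rule, so treat it as an added assumption. The helper names and the `min_len` parameter are hypothetical, and real predictors score much longer sequence contexts around each site.

```python
def find_dinucleotide(seq, motif):
    """Return 0-based start positions of a two-base motif in seq."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == motif]

def candidate_introns(seq, min_len=4):
    """Pair every GT (donor) with every downstream AG (acceptor).

    Returns (start, end) pairs, 0-based and end-exclusive, so that
    seq[start:end] begins with GT and ends with AG. min_len is a
    hypothetical lower bound on intron length.
    """
    donors = find_dinucleotide(seq, "GT")
    acceptors = find_dinucleotide(seq, "AG")
    return [(d, a + 2) for d in donors for a in acceptors
            if a + 2 - d >= min_len]

# The sequence fragment from the slide
fragment = "AGAATAGGGCGCGTACCTTCCAACGAAGACTGGG"
print(find_dinucleotide(fragment, "GT"))  # candidate donor positions
print(find_dinucleotide(fragment, "AG"))  # candidate acceptor positions
print(candidate_introns(fragment))
```

Because GT and AG occur by chance roughly once every 16 bases each, almost all such candidates are false positives; the dinucleotides are a necessary signal, not a sufficient one.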

Software for ab initio gene predictions: Genscan, Grail, Genie, GeneFinder, Glimmer, etc. Software for aligning ESTs/cDNAs to genomic sequence: EST_genome, Sim4, Spidey.

Homology based predictions (figure labels: known coding sequence from another organism; expressed sequence ACGGAAGTCT GGACTATAAA; ATGGCACCACCGATGTCTACGTGGTAGGGGACTATAAAAAAAAAAA; genes predicted by homology). Software: Genomescan, Twinscan.

Alternative splicing is difficult to predict ab initio

Ab initio analysis and EST data are integrated for current gene annotations. Tools and resources shown: Sim4, dbEST, Genewise, Grail, Genscan, FgenesH, Ensembl, Otto.

The current leading tool: Maker 2

Available EST data

Available EST data: Examples

Noncoding RNA genes: Prediction based on structure (e.g. tRNAs): scan the genome and try to fold sequences into shapes corresponding to tRNAs. For other novel ncRNAs, only homology-based predictions have been successful, i.e. look for sequences that resemble known ncRNAs.

Noncoding RNAs identified in the Original Human Genome Project (2001)

Large intervening noncoding RNAs (lincRNAs, also called long intergenic noncoding RNAs). Figure: protein-coding gene, lincRNA, protein-coding gene. About 3000 lincRNAs are known in mammalian genomes. Nature 458, 223-227 (12 March 2009). This is based on methylation signatures of histones and expression profiling: histone H3 lysine 4 trimethylation (H3K4me3) and histone H3 lysine 36 trimethylation (H3K36me3).

Types of repeat elements

Types of repeat elements Repetitive sequences make up about half the human genome.

How to annotate repeats: Repeat annotations are based on sequence similarity to known repetitive elements in a repeat sequence library.
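
The idea of library-based repeat annotation can be sketched as follows. This toy version marks any stretch of the genome that shares an exact k-mer with a library entry and soft-masks it (lowercase). It is only an illustration under simplifying assumptions: production repeat annotators use sensitive alignment rather than exact k-mer matches, and the function names, the library entry `alu_like`, and the value of `k` are all made up for this example.

```python
def library_kmers(library, k):
    """Collect every k-mer that occurs in any library sequence."""
    kmers = set()
    for seq in library.values():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return kmers

def soft_mask(genome, library, k=8):
    """Lowercase every base covered by a k-mer shared with the library."""
    genome = genome.upper()
    kmers = library_kmers(library, k)
    masked = [False] * len(genome)
    for i in range(len(genome) - k + 1):
        if genome[i:i + k] in kmers:
            for j in range(i, i + k):
                masked[j] = True
    return "".join(b.lower() if m else b for b, m in zip(genome, masked))

# Hypothetical one-entry repeat library and a toy genome fragment
library = {"alu_like": "GGCCGGGCGCGGTGGCTCA"}
genome = "TTTT" + "GGCCGGGCGCGG" + "AAAA"
print(soft_mask(genome, library, k=8))
```

Exact matching misses diverged repeat copies, which is why similarity search with mismatches and gaps is used in practice.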

Some facts about the human genome (based on the 2001 Human Genome paper in Nature)

Gene annotations – # of coding genes. Note: as of 2013, the estimated number of protein-coding genes in the human genome is between 19,000 and 20,000.

Gene annotations – gene length. A typical human gene has ~7 exons and ~1,100 bp of coding sequence; because of introns, the gene spans a much longer stretch of genomic DNA.

Base composition of a sequence: A: 5113, C: 5192, G: 2180, T: 4086
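
Counts like these are simple to compute. The sketch below tallies the four bases and derives the GC fraction from the counts; the function names are chosen for this example only, and characters other than A, C, G, T (such as N) are ignored.

```python
from collections import Counter

def base_composition(seq):
    """Count A, C, G and T in a sequence, ignoring other characters."""
    counts = Counter(seq.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}

def gc_fraction(composition):
    """Fraction of counted bases that are G or C."""
    total = sum(composition.values())
    if total == 0:
        return 0.0
    return (composition["G"] + composition["C"]) / total

# Counts from the slide: 7372 G+C out of 16571 bases
slide_counts = {"A": 5113, "C": 5192, "G": 2180, "T": 4086}
print(f"GC fraction: {gc_fraction(slide_counts):.3f}")
```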

Genes tend to be in regions of higher GC content: The human genome is approximately 40% GC. Human genes are biased toward regions of higher GC.
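
Regional GC content is usually examined in windows along the sequence. The following sketch computes GC fraction in fixed-size windows so that GC-rich and GC-poor regions can be compared with gene locations. The function name and the window and step sizes are illustrative assumptions; genome-scale analyses use far larger windows.

```python
def windowed_gc(seq, window=10, step=10):
    """Return (window_start, gc_fraction) for each full window in seq."""
    seq = seq.upper()
    results = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = sum(1 for base in chunk if base in "GC")
        results.append((start, gc / window))
    return results

# Toy sequence: a GC-rich stretch, an AT-rich stretch, then a mixed one
toy = "GGGGGCCCCC" + "AAAAATTTTT" + "GCGCATATAT"
for start, gc in windowed_gc(toy, window=10, step=10):
    print(start, gc)
```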

Human genes often have similar, so-called duplicate genes

Comparison of tRNAs across species Humans and other eukaryotes have redundant copies of tRNAs.

Comparison of gene repertoires Humans have a large number of genes involved in transcription/translation. Yeasts have a higher fraction of their genes involved in metabolism.

Gene annotations – gene function

Gene conservation across organisms: ~1/4 of known human genes occur only in vertebrates; <1% of known human genes have homologs only in prokaryotes.

“Conclusion” of the Human Genome Paper

The impact: genome anatomy. The genome sequence provided the superstructure on which to layer genomic, biological, and medical information. Better understanding of the landscape of the human genome (e.g. segmental duplications). Accurate tabulation of protein coding genes. Better understanding of the number and role of non-coding genes.

The impact: genomic variation. The genome sequence provided a substrate on which to organize DNA sequences from other human samples. True extent of single-nucleotide variation. Linkage disequilibrium. Copy number variation. Larger structural variation.

The impact: medicine. Mendelian diseases: 1,000s of single-gene disorders mapped. Chromosomal disorders: high-density genomic technologies (e.g. microarrays) made it easier to detect even smaller chromosomal abnormalities. Common disease: GWAS found disease genes; gene lists provide insight into disease pathways. Cancer: over 150 genes with somatic mutations playing a role in tumorigenesis, response to cancer drugs, and recurrence.

The impact: human history. Demographic history and population migrations refined. Admixture mapped out on a fine scale. Positive selection examined. Contribution from Neanderthal DNA.

The road ahead: New high-throughput sequencing technologies permit sequencing of 1,000s of human genomes. Focus on the extent and functional impact of rare, structural, and complex variation. Routine use of genetic information in the clinic. Routine whole-genome sequencing in the clinic.