1
Genome Annotation and the Human Genome
BI420 – Introduction to Bioinformatics, Fall 2012
Gabor Marth, Department of Biology, Boston College
2
The landscape of the human genome
3
Goal of Genome Annotation
Identify all distinct elements within a genome. Annotation tends to focus on functional elements such as protein-coding genes and RNA genes, but may also include non-functional sequences such as repetitive elements.
Figure labels: protein-coding genes, repetitive elements, RNA genes
4
The starting material
AGCGTGGTAGCGCGAGTTTGCGAGCTAGCTAGGCTCCGGATGCGA
CCAGCTTTGATAGATGAATATAGTGTGCGCGACTAGCTGTGTGTT GAATATATAGTGTGTCTCTCGATATGTAGTCTGGATCTAGTGTTG GTGTAGATGGAGATCGCGTAGCGTGGTAGCGCGAGTTTGCGAGCT AGCTAGGCTCCGGATGCGACCAGCTTTGATAGATGAATATAGTGT GCGCGACTAGCTGTGTGTTGAATATATAGTGTGTCTCTCGATATGT AGTCTGGATCTAGTGTTGGTGTAGATGGAGATCGCGTGCTTGAG TCGTTCGTTTTTTTATGCTGATGATATAAATATATAGTGTTGGTG GGGGGTACTCTACTCTCTCTAGAGAGAGCCTCTCAAAAAAAAAGCT CGGGGATCGGGTTCGAAGAAGTGAGATGTACGCGCTAGXTAGTAT ATCTCTTTCTCTGTCGTGCTGCTTGAGATCGTTCGTTTTTTTATGCT GATGATATAAATATATAGTGTTGGTGGGGGGTACTCTACTCTCTCT AGAGAGAGCCTCTCAAAAAAAAAGCTCGGGGATCGGGTTCGAAGA AGTGAGATGTACGCGCTAGXTAGTATATCTCTTTCTCTGTCGTGCT
5
Coding genes
ATGGCACCACCGATGTCTACGTGGTAGGGGACTATAAAAAAAAAAA
Figure labels: start codon, stop codon, polyA signal
Open Reading Frame = ORF
Ab initio - Latin for “from the beginning.” Ab initio gene predictions are those based on computational sequence analysis.
Simple approach to gene prediction: look for start codons and stop codons.
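The "look for start codons and stop codons" idea can be sketched in a few lines of Python. This is a minimal illustration, not how production gene finders work: the function name `find_orfs` is hypothetical, only the forward strand is scanned, and the standard stop codons TAA, TAG and TGA are assumed.

```python
# Standard stop codons (assumption: standard genetic code).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq):
    """Return (start, end) pairs for forward-strand ORFs.

    Each ORF begins at an ATG and ends just after the first in-frame
    stop codon; `end` is exclusive, so seq[start:end] includes the stop.
    """
    seq = seq.upper()
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Step through codons in the same reading frame as the ATG.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                orfs.append((start, pos + 3))
                break
    return orfs


# The example sequence from the slide.
example = "ATGGCACCACCGATGTCTACGTGGTAGGGGACTATAAAAAAAAAAA"
for start, end in find_orfs(example):
    print(start, end, example[start:end])
```

On the slide's sequence this reports two ORFs that share the same stop codon (TAG), one from the first ATG and one from the in-frame ATG inside it. Real eukaryotic genes need more than this because of introns.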
6
Typical structure of bacterial and eukaryotic genes
Eukaryotic genes have introns while bacterial genes do not.
7
Ab initio predictions of exons
…AGAATAGGGCGCGTACCTTCCAACGAAGACTGGG…
Figure labels: splice donor site, splice acceptor site
8
Software for ab initio gene predictions
Genscan, Grail, Genie, GeneFinder, Glimmer, etc.
EST_genome, Sim4, Spidey
9
Homology based predictions
known coding sequence from another organism
expressed sequence
ACGGAAGTCT GGACTATAAA
ATGGCACCACCGATGTCTACGTGGTAGGGGACTATAAAAAAAAAAA
genes predicted by homology
Genomescan, Twinscan
10
Alternative splicing is difficult to predict ab initio
11
Ab initio analysis and EST data are integrated for current gene annotations
Sim4, dbEST, Genewise, Grail, Genscan, FgenesH, Ensembl, Otto
12
Available EST data
13
Available EST data: Examples
14
Noncoding RNA genes
Prediction based on structure (e.g. tRNAs): scan the genome and try to fold sequences into shapes corresponding to tRNAs.
For other novel ncRNAs, only homology-based predictions have been successful, i.e. look for sequences that resemble known ncRNAs.
15
Noncoding RNAs identified in the Original Human Genome Project (2001)
16
Long Intergenic NonCoding RNAs
Figure labels: protein-coding gene, LINC RNA, protein-coding gene
~3000 Long Intergenic NonCoding (LINC) RNAs are known in mammalian genomes. Nature 458, (12 March 2009).
Identification is based on histone methylation signatures and expression profiling:
Histone H3 lysine 4 trimethylation (H3K4me3)
Histone H3 lysine 36 trimethylation (H3K36me3)
17
Types of repeat elements
18
Types of repeat elements
Repetitive sequences make up about half the human genome.
19
How to annotate repeats
Repeat annotations are based on sequence similarity to known repetitive elements in a repeat sequence library
20
Some facts about the human genome (2001)
21
Gene annotations – # of coding genes
Note: as of 2011, the estimated number of protein-coding genes in the human genome is between and 20000
22
Gene annotations – gene length
Human genes typically have ~7 exons and ~1,100 bp of coding sequence.
23
Base Composition
Base composition of a sequence
A: 5113
C: 5192
G: 2180
T: 4086
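As a rough sketch of how such a tally is produced and turned into a GC fraction, the Python below counts bases in a sequence and computes (G + C) / total. The function names `base_composition` and `gc_fraction` are hypothetical helpers; the counts are the ones listed on the slide.

```python
from collections import Counter


def base_composition(seq):
    """Count A, C, G and T in a DNA sequence (other symbols are ignored)."""
    counts = Counter(seq.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}


def gc_fraction(counts):
    """Fraction of counted bases that are G or C."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no bases counted")
    return (counts["G"] + counts["C"]) / total


# Counts taken from the slide.
slide_counts = {"A": 5113, "C": 5192, "G": 2180, "T": 4086}
print(f"GC fraction: {gc_fraction(slide_counts):.3f}")

# Counting directly from a short sequence.
print(base_composition("AGCGTGGTAGCG"))
```

For the slide's counts the GC fraction is 7372 / 16571, or roughly 44%.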
24
Genes tend to be in regions of higher GC content
The human genome is approximately 40% GC. Human genes are biased toward regions of higher GC.
25
Human genes often have similar, so-called duplicate genes
26
Comparison of tRNAs across species
Humans and other eukaryotes have redundant copies of tRNAs.
27
Comparison of gene repertoires
Humans have a large number of genes involved in transcription/translation. Yeasts have a higher fraction of their genes involved in metabolism.
28
Gene annotations – gene function
29
Gene conservation across organisms
~1/4 of known human genes occur only in vertebrates.
<1% of known human genes have homologs only in prokaryotes.
30
“Conclusion” of the Human Genome Paper
31
The impact: genome anatomy
The genome sequence provided the superstructure on which to layer genomic, biological, and medical information.
Better understanding of the landscape of the human genome (e.g. segmental duplications)
Accurate tabulation of protein-coding genes
Better understanding of the number and role of non-coding genes
32
The impact: genomic variation
The genome sequence provided a substrate on which to organize DNA sequences from other human samples.
True extent of single-nucleotide variation
Linkage disequilibrium
Copy number variation
Larger structural variation
33
The impact: medicine
Mendelian diseases: 1,000s of single-gene disorders mapped
Chromosomal disorders: high-density genomic technologies (e.g. microarrays) made it easier to detect even smaller chromosomal abnormalities
Common disease: GWAS studies found disease genes; gene lists provide insight into disease pathways
Cancer: over 150 genes with somatic mutations playing a role in tumorigenesis, response to cancer drugs, and recurrence
34
The impact: human history
Demographic history, population migrations refined
Admixture mapped out on a fine scale
Positive selection examined
Contribution from Neanderthal DNA
35
The road ahead
New high-throughput sequencing technologies permit sequencing of 1,000s of human genomes
Focus on the extent and functional impact of rare, structural, and complex variation
Routine use of genetic information in the clinic
Routine whole-genome sequencing in the clinic
36
Mathematical Models of Sequences
37
Sequences and complementarity
DNA sequences are conventionally listed in the 5’ to 3’ direction.
5’ ATGCATGC 3’
This is complementary to the sequence
3’ TACGTACG 5’
Since DNA is double-stranded you could in principle list either sequence, but by convention, the 5’->3’ strand is always the one described.
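The pairing rule above (A with T, C with G) is easy to express in code. The following is a small sketch; `complement` and `reverse_complement` are hypothetical helper names, and only the four unambiguous bases are handled.

```python
# Translation table for Watson-Crick pairing: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq):
    """Complement base by base, keeping the same left-to-right order.

    If `seq` is written 5'->3', the result reads 3'->5'.
    """
    return seq.upper().translate(COMPLEMENT)


def reverse_complement(seq):
    """Complementary strand rewritten in the conventional 5'->3' direction."""
    return complement(seq)[::-1]


print(complement("ATGCATGC"))          # TACGTACG, read 3'->5'
print(reverse_complement("ATGCATGC"))  # GCATGCAT, read 5'->3'
```

Reversing the complement gives the opposite strand in the conventional 5’->3’ orientation, which is the form usually compared against sequence databases.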
38
Probability of a sequence
Independent, identically distributed (IID) model: all positions in a sequence behave identically and independently. This is the simplest model for a sequence of events.
Example: There is a 50% chance of sunshine each day, 30% chance of clouds, and 20% chance of rain. What is the probability of sun on Sunday, clouds on Monday, rain on Tuesday, and rain again on Wednesday?
P(sun,cloud,rain,rain) = P(sun)P(cloud)P(rain)P(rain) = 0.5 * 0.3 * 0.2 * 0.2 = 0.006
39
IID Model for DNA Sequences
The probability of an A, C, G or T at a given location is independent of the location.
P(AGCCA) = P(A)P(G)P(C)P(C)P(A)
s = AGCCA: s(1) = A, s(2) = G, s(3) = C, s(4) = C, s(5) = A
P(s) = P(s(1)) P(s(2)) P(s(3)) P(s(4)) P(s(5))
Example: suppose P(A) = 0.2, P(T) = 0.2, P(C) = 0.3, P(G) = 0.3. What is the probability of the sequence AGCCA?
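One way to check the example is to multiply the per-base probabilities directly. The sketch below uses a hypothetical helper `iid_probability`; the probabilities are the ones given in the example, and the same function reproduces the earlier weather calculation.

```python
import math


def iid_probability(events, probs):
    """Probability of a sequence under the IID model: the product of
    the individual probabilities of each event."""
    return math.prod(probs[e] for e in events)


# Base probabilities from the example.
base_probs = {"A": 0.2, "T": 0.2, "C": 0.3, "G": 0.3}
print(iid_probability("AGCCA", base_probs))

# The weather example uses the same rule.
weather_probs = {"sun": 0.5, "cloud": 0.3, "rain": 0.2}
print(iid_probability(["sun", "cloud", "rain", "rain"], weather_probs))
```

Worked by hand, P(AGCCA) = 0.2 * 0.3 * 0.3 * 0.3 * 0.2 = 0.00108.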
40
Markov models
The human genome is approximately 40% GC.
Markov model: a model in which the probability of a base depends on the previous base, so positions are not independent.
Analogy: If it is cloudy today it is more likely to rain tomorrow.
P(sun,cloud,rain,rain) = π(sun)P(cloud|sun)P(rain|cloud)P(rain|rain)
Here π is defined as the probability for day one. The P values are the conditional probabilities. For example, P(cloud|sun) is the probability it is cloudy today given that it was sunny yesterday.
41
Weather example using a Markov model
Suppose on day one there is a 50% chance of sun, 30% chance of clouds, and 20% chance of rain. Afterwards, the probabilities are given by

Prev\Next   Sun   Cloud   Rain
Sun         0.8   0.1     0.1
Cloud       0.2   0.4     0.4
Rain        0.3   0.3     0.4

P(sun,cloud,rain,rain) = π(sun)P(cloud|sun)P(rain|cloud)P(rain|rain) = 0.5 * 0.1 * 0.4 * 0.4 = 0.008
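The calculation can be reproduced with a short Python sketch. The helper name `markov_probability` is hypothetical. Only P(cloud|sun), P(rain|cloud) and P(rain|rain) are confirmed by the worked calculation; the remaining transition entries below are an assumption, chosen so that each row sums to 1.

```python
# Day-one probabilities (from the slide).
initial = {"sun": 0.5, "cloud": 0.3, "rain": 0.2}

# transition[prev][next] = P(next | prev).
# Entries not used in the slide's calculation are ASSUMED so rows sum to 1.
transition = {
    "sun":   {"sun": 0.8, "cloud": 0.1, "rain": 0.1},
    "cloud": {"sun": 0.2, "cloud": 0.4, "rain": 0.4},
    "rain":  {"sun": 0.3, "cloud": 0.3, "rain": 0.4},
}


def markov_probability(states, initial, transition):
    """Probability of a state sequence under a first-order Markov model."""
    prob = initial[states[0]]
    # Multiply in one conditional probability per consecutive pair.
    for prev, nxt in zip(states, states[1:]):
        prob *= transition[prev][nxt]
    return prob


print(markov_probability(["sun", "cloud", "rain", "rain"], initial, transition))
```

Compare with the IID answer of 0.006 for the same four days: the two models assign different probabilities because the Markov model conditions each day on the previous one.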
42
Markov model for a DNA sequence
P(AGCCA) = π(A)P(G|A)P(C|G)P(C|C)P(A|C)
P(x|y) is the probability that a base is x given that the previous base was y. Note that we have implicitly assumed the sequence is generated from left to right.
Example of a Markov model for sequences
base π
A 0.3
C 0.2
G
T
Prev\Next A C G T
0.4 0.2 0.3 0.1 0.7
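A sketch of the same calculation for DNA follows. The parameter table in this transcript is incomplete, so apart from π(A) = 0.3 and π(C) = 0.2 every number below is a placeholder assumption chosen only so that the probabilities sum to 1; none of it should be read as the lecture's actual model. The function names are hypothetical. Probabilities are summed in log space, a common choice for long sequences where a plain product would underflow.

```python
import math

# Initial probabilities: A and C from the slide; G and T are ASSUMED.
INITIAL = {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2}

# TRANSITION[prev][next] = P(next | prev). All values are HYPOTHETICAL.
TRANSITION = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.3, "T": 0.1},
    "C": {"A": 0.2, "C": 0.3, "G": 0.1, "T": 0.4},
    "G": {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    "T": {"A": 0.3, "C": 0.2, "G": 0.3, "T": 0.2},
}


def check_model(initial, transition):
    """Verify that the initial distribution and every row sum to 1."""
    if not math.isclose(sum(initial.values()), 1.0):
        raise ValueError("initial probabilities must sum to 1")
    for prev, row in transition.items():
        if not math.isclose(sum(row.values()), 1.0):
            raise ValueError(f"row for {prev} must sum to 1")


def markov_log_probability(seq, initial, transition):
    """Natural-log probability of seq, generated left to right."""
    if not seq:
        raise ValueError("sequence must not be empty")
    logp = math.log(initial[seq[0]])
    for prev, nxt in zip(seq, seq[1:]):
        logp += math.log(transition[prev][nxt])
    return logp


check_model(INITIAL, TRANSITION)
logp = markov_log_probability("AGCCA", INITIAL, TRANSITION)
print(logp, math.exp(logp))
```

With these placeholder values, P(AGCCA) = 0.3 * 0.3 * 0.7 * 0.3 * 0.2 = 0.00378; substituting the real table would change the number but not the procedure.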