Genomics 101 DNA sequencing Alignment Gene identification

Further topics: gene expression, genome evolution, …

In this course much of our focus will be on biological sequences, and especially DNA, which is the subject of genomics and the key to understanding life at the molecular level. An average human is composed of trillions of cells, with some small variations across humans. The same holds for our close relatives. Nearly every cell contains a nucleus, which holds a copy of our entire DNA….

Next Few Topics
Gene Recognition
  Finding genes in DNA with computational methods
Large-scale alignment & multiple alignment
  Comparing whole genomes, or large families of genes
Gene Expression and Regulation
  Measuring the expression of many genes at a time
  Finding elements in DNA that control the expression of genes

Gene Recognition
Credits for slides: Marina Alexandersson, Lior Pachter, Serge Saxonov

Reading: GENSCAN, EasyGene, SLAM, Twinscan. Optional: Chris Burge’s Thesis

Gene expression
DNA -> RNA -> Protein (transcription, then translation)
  DNA:     CCTGAGCCAACTATTGATGAA
  RNA:     CCUGAGCCAACUAUUGAUGAA
  Protein: PEPTIDE

But first some quick genetics to make sure that we are all on the same page. Genes are expressed when the DNA in our chromosomes is transcribed into RNA: an enzyme attaches to the DNA molecule and copies it, creating an RNA molecule. The RNA is then translated, one triplet at a time, into amino acids, creating a peptide, which in its finished form becomes a protein. The protein is the end product of the gene, and thus the expression of the gene.
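The two steps can be sketched in a few lines of Python using the slide's own example sequence. This is only an illustration: the codon table is deliberately partial (it holds just the codons needed here), the function names are made up for this sketch, and transcription is simplified to rewriting the coding strand with U in place of T.

```python
# Partial codon table: only the RNA codons that occur in the slide's example
# (standard genetic code). A full table would have 64 entries.
CODON_TO_AMINO_ACID = {
    "CCU": "P", "CCA": "P",   # proline
    "GAG": "E", "GAA": "E",   # glutamic acid
    "ACU": "T",               # threonine
    "AUU": "I",               # isoleucine
    "GAU": "D",               # aspartic acid
}


def transcribe(dna):
    """Return the RNA version of a coding-strand DNA sequence (T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Translate RNA into one-letter amino acids, reading whole triplets from position 0."""
    usable = len(rna) - len(rna) % 3
    return "".join(CODON_TO_AMINO_ACID[rna[i:i + 3]] for i in range(0, usable, 3))


dna = "CCTGAGCCAACTATTGATGAA"
rna = transcribe(dna)
print(rna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(rna))  # PEPTIDE
```

The example sequence was evidently chosen so that its seven codons spell out the word PEPTIDE in one-letter amino acid codes.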

Gene structure
exon1 - intron1 - exon2 - intron2 - exon3
Steps: transcription, splicing, translation
exon = protein-coding
intron = non-coding
Codon: A triplet of nucleotides that is converted to one amino acid

Genes are structured as coding stretches called exons, which are the parts that become amino acids, separated by non-coding stretches of sequence called introns. When the gene is transcribed, the whole thing becomes one RNA molecule, including the non-coding material between the exons; the introns are then cut out in a process called splicing. The remaining exons are joined together and translated into a protein.
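Splicing amounts to keeping the exon intervals of the transcript and joining them. The sketch below assumes a hypothetical two-exon toy gene (the sequences and the `splice` helper are invented for illustration) and writes the transcript with DNA letters for readability.

```python
def splice(pre_mrna, exons):
    """Join exon intervals (0-based start, end-exclusive) taken from the transcript."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Hypothetical toy gene: two exons around one intron that begins with GT and ends with AG.
EXON1, INTRON1, EXON2 = "ATGGCC", "GTAAGTTTCAG", "GATTAA"
pre_mrna = EXON1 + INTRON1 + EXON2

exon_coords = [
    (0, len(EXON1)),
    (len(EXON1) + len(INTRON1), len(pre_mrna)),
]
mature = splice(pre_mrna, exon_coords)
print(mature)  # ATGGCCGATTAA
```

Gene prediction is the inverse problem: given only `pre_mrna` (in practice the genomic DNA), recover `exon_coords`.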

Where are the genes?

In humans:
  ~22,000 genes
  ~1.5% of human DNA

Finding Genes
Exploit the regular gene structure
  ATG - Exon1 - Intron1 - Exon2 - … - ExonN - STOP
Recognize “coding bias”
  CAG-CGA-GAC-TAT-TTA-GAT-AAC-ACA-CAT-GAA-…
Recognize splice sites
  Intron - cAGt - Exon - gGTgag - Intron
Model the duration of regions
  Introns tend to be much longer than exons, in mammals
  Exons are biased to have a given minimum length
Use cross-species comparison
  Gene structure is conserved in mammals
  Exons are more similar (~85%) than introns

Approaches to gene finding
Homology: BLAST, Procrustes.
Ab initio: Genscan, Genie, GeneID.
Hybrids: GenomeScan, GenieEST, Twinscan, SGP, ROSETTA, CEM, TBLASTX, SLAM.

Annotating features of biological importance is relatively straightforward for organisms with compact genomes, such as bacteria and yeast, because exons tend to be large and introns small or non-existent. In large genomes, such as those of mammals, the coding signal is scattered in a vast sea of non-coding noise.

The crudest, yet often most reliable, approach has been to look for exons manually by integrating and filtering various sources of homology information. This works because exons and regulatory regions tend to be more strongly conserved by evolution than random genomic sequence. The tool most frequently used for this is BLAST. Procrustes is a program that attempts to integrate BLAST-based gene finding with ab initio methods.

Ab initio approaches use only information in the input sequence itself to identify likely splice sites and to detect differences in sequence composition between coding and non-coding DNA. One of the best of this kind is GENSCAN, which uses a hidden Markov model to scan large genomic sequences.

Then there are various forms of hybrids: software that takes extra information from a second sequence or from a database. GenomeScan and GenieEST use protein and EST databases, respectively, to confirm potential exons. Twinscan and SGP both build on a single-species algorithm, take two sequences as input, and use the second to boost the exon probability where the alignment is good. ROSETTA is the first cross-species gene finder; it uses an anchor-based approach to determine homologous exons and then a heuristic scoring scheme for the final prediction. CEM is similar to ROSETTA. TBLASTX is a variant of BLAST that takes a nucleotide query and a nucleotide database as input, but translates both into peptides in all reading frames before searching for matches. SLAM uses a generalized pair HMM, that is, a merging of the Genscan HMM with a pair HMM of the kind usually used for alignment.

1. Exploit the regular gene structure
5’ - Exon 1 - Intron 1 - Exon 2 - Intron 2 - Exon 3 - 3’
Start codon: ATG
Stop codon: TAG/TGA/TAA
Splice sites: the exon-intron boundaries

Predicting a gene means giving coordinates for its exon boundaries. The first kind of information that prediction algorithms use is the regular structure of a gene. The coding region of every gene starts with an ATG codon, and then exons alternate with introns; at the exon-intron boundaries, the splice sites, there are short words that are approximately preserved.

Next Exon: Frame 0 Next Exon: Frame 1

2. Recognize “coding bias”
Each exon can be in one of three frames
  ag-gattacagattacagattaca-gtaag   Frame 0
  ag-gattacagattacagattaca-gtaag   Frame 1
  ag-gattacagattacagattaca-gtaag   Frame 2
Frame of next exon depends on how many nucleotides are left over from previous exon
Codons “tag”, “tga”, and “taa” are STOP
  No STOP codon appears in-frame, until end of gene
  Absence of STOP is called open reading frame (ORF)
Different codons appear with different frequencies: coding bias
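The frame bookkeeping and the in-frame STOP test can be sketched as follows. The frame convention used here is an assumption made for this sketch, not something the slides fix: frame f means that the first f bases of the exon complete a codon started in the previous exon. The function names are likewise invented.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}


def has_in_frame_stop(seq, frame):
    """True if a STOP codon occurs when reading seq in triplets from offset `frame`."""
    seq = seq.upper()
    return any(seq[i:i + 3] in STOP_CODONS for i in range(frame, len(seq) - 2, 3))


def open_frames(seq):
    """Frames (0, 1, 2) in which seq contains no STOP codon, i.e. stays an ORF."""
    return [frame for frame in range(3) if not has_in_frame_stop(seq, frame)]


def next_exon_frame(frame, exon_length):
    """Leftover bases (0, 1 or 2) carried into the next exon.

    Assumed convention: `frame` counts the bases of a codon already consumed
    when this exon starts.
    """
    return (frame + exon_length) % 3


exon = "gattacagattacagattaca"   # the 21-base exon from the slide
print(open_frames(exon))
print(next_exon_frame(0, len(exon)))
```

An exon candidate that has a STOP in a given frame can be ruled out for that frame unless it is the last exon of the gene, which is one reason long ORFs are informative.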

2. Recognize “coding bias”
Amino Acid      SLC    DNA codons
Isoleucine      I      ATT, ATC, ATA
Leucine         L      CTT, CTC, CTA, CTG, TTA, TTG
Valine          V      GTT, GTC, GTA, GTG
Phenylalanine   F      TTT, TTC
Methionine      M      ATG
Cysteine        C      TGT, TGC
Alanine         A      GCT, GCC, GCA, GCG
Glycine         G      GGT, GGC, GGA, GGG
Proline         P      CCT, CCC, CCA, CCG
Threonine       T      ACT, ACC, ACA, ACG
Serine          S      TCT, TCC, TCA, TCG, AGT, AGC
Tyrosine        Y      TAT, TAC
Tryptophan      W      TGG
Glutamine       Q      CAA, CAG
Asparagine      N      AAT, AAC
Histidine       H      CAT, CAC
Glutamic acid   E      GAA, GAG
Aspartic acid   D      GAT, GAC
Lysine          K      AAA, AAG
Arginine        R      CGT, CGC, CGA, CGG, AGA, AGG
Stop codons     Stop   TAA, TAG, TGA

Can map 61 non-stop codons to frequencies & take log-odds ratios
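One way to turn codon frequencies into a score is sketched below. The details are assumptions made for illustration: frequencies are estimated from a set of known coding sequences with a pseudocount, the background is taken to be uniform over the 61 non-stop codons, and the score is a sum of base-2 log-odds. Real gene finders estimate the background from non-coding sequence and use richer models.

```python
import math
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}
SENSE_CODONS = [c for c in ("".join(t) for t in product("ACGT", repeat=3))
                if c not in STOP_CODONS]
SENSE_SET = set(SENSE_CODONS)

# Assumed background: every non-stop codon equally likely.
UNIFORM_BACKGROUND = {c: 1.0 / len(SENSE_CODONS) for c in SENSE_CODONS}


def codon_frequencies(sequences, pseudocount=1.0):
    """Estimate frequencies of the 61 non-stop codons from in-frame coding sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in SENSE_SET:
                counts[codon] += 1
    total = sum(counts.values()) + pseudocount * len(SENSE_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in SENSE_CODONS}


def log_odds_score(seq, coding_freq, background_freq):
    """Sum of log2(coding / background) over the in-frame codons of seq."""
    seq = seq.upper()
    score = 0.0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in coding_freq:   # stop codons and ambiguous bases are skipped
            score += math.log2(coding_freq[codon] / background_freq[codon])
    return score
```

A positive score means the window uses codons that are over-represented in the training genes; scoring the same window in all three frames shows which frame, if any, looks coding.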

atg caggtg ggtgag cagatg ggtgag cagttg ggtgag caggcc ggtgag tga

Biology of Splicing (

3. Recognize splice sites
How much info?
  Donor: 7.9 bits
  Acceptor: 9.4 bits
  (Stephens & Schneider, 1996)
Find branch sites (
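The information content of a signal such as a splice site can be estimated from a set of aligned example sites. The sketch below is a simplified version that assumes a uniform background (2 bits maximum per DNA position) and applies no small-sample correction, so it will not reproduce the published figures exactly; the function name is made up for this sketch.

```python
import math
from collections import Counter


def information_content(sites):
    """Total information (bits) of equal-length aligned sites, summed over positions.

    Per position: 2 - H, where H is the Shannon entropy of the observed bases.
    Assumes a uniform A/C/G/T background and no small-sample correction.
    """
    length = len(sites[0])
    total = 0.0
    for pos in range(length):
        counts = Counter(site[pos].upper() for site in sites)
        n = sum(counts.values())
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        total += 2.0 - entropy
    return total
```

A fully conserved position contributes 2 bits and a position where all four bases are equally common contributes none, so a total of 7.9 bits corresponds to roughly four fully conserved positions' worth of signal.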

[Figure: donor site, nucleotide frequencies (%) by position, 5’ to 3’]

WMM: weight matrix model = PSSM (Staden 1984)
WAM: weight array model = 1st order Markov (Zhang & Marr 1993)
MDD: maximal dependence decomposition (Burge & Karlin 1997)
  Decision-tree algorithm to take pairwise dependencies into account
  For each position i, calculate $S_i = \sum_{j \neq i} \chi^2(C_i, X_j)$, where $C_i$ indicates whether position i matches the consensus and $X_j$ is the nucleotide at position j
  Choose i* such that $S_{i^*}$ is maximal and partition into two subsets, until
    No significant dependencies left, or
    Not enough sequences in subset
  Train separate WMM models for each subset

Decision tree for donor sites:
All donor splice sites
  not G5
  G5
    G5 not G-1
    G5G-1
      G5G-1 not A2
      G5G-1A2
        G5G-1A2 not U6
        G5G-1A2U6
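The MDD statistic can be sketched as below, assuming that C_i is a match/mismatch indicator against the consensus at position i and that each chi-square is the Pearson statistic of the resulting 2x4 contingency table. The helper names and the toy sites are invented for illustration; significance testing and the recursive partitioning are left out.

```python
NUCLEOTIDES = "ACGT"


def chi_square(table):
    """Pearson chi-square statistic of a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            expected = row_totals[r] * col_totals[c] / n
            if expected > 0:               # empty rows or columns contribute nothing
                stat += (observed - expected) ** 2 / expected
    return stat


def mdd_scores(sites, consensus):
    """S_i = sum over j != i of chi2(C_i, X_j) for aligned, equal-length sites."""
    length = len(consensus)
    scores = []
    for i in range(length):
        s_i = 0.0
        for j in range(length):
            if j == i:
                continue
            # Row 0: sites matching the consensus at i; row 1: the rest.
            table = [[0] * 4 for _ in range(2)]
            for site in sites:
                row = 0 if site[i] == consensus[i] else 1
                table[row][NUCLEOTIDES.index(site[j])] += 1
            s_i += chi_square(table)
        scores.append(s_i)
    return scores


# Toy data: positions 0 and 1 always change together, position 2 is independent.
toy_sites = ["GGA", "GGC", "ATA", "ATC"]
scores = mdd_scores(toy_sites, consensus="GGA")
best_position = max(range(len(scores)), key=scores.__getitem__)
```

The position with the largest S_i is the one whose consensus match tells you the most about the other positions, which is why the tree splits on it first.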

4. Model the duration of regions
The algorithm has to be generalized because in a standard HMM each step outputs one base, which leads to state durations that are geometrically distributed. If we look at empirical data, such as these plots, we see that intron lengths follow the geometric distribution fairly well, but for exons it would be a pretty bad model. So in our state space the exons are generalized states, which choose a length from a general distribution and output the entire exon in one step, while the intron and intergene states still output one base at a time and follow the geometric distribution.
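The geometric duration follows directly from the self-transition: staying d - 1 times and then leaving has probability p^(d-1) (1 - p). A minimal sketch, with `p_stay` as an assumed name for the self-transition probability:

```python
def geometric_duration_pmf(d, p_stay):
    """P(state lasts exactly d steps) for a state with self-transition probability p_stay."""
    if d < 1:
        return 0.0
    return p_stay ** (d - 1) * (1.0 - p_stay)


def mean_duration(p_stay):
    """Expected number of steps spent in the state."""
    return 1.0 / (1.0 - p_stay)
```

Whatever `p_stay` is, the most likely duration is always d = 1 and the probability only decreases from there. A distribution with a peak at some typical exon length cannot be expressed this way, which is the motivation for giving exon states an explicit length distribution.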

Hidden Markov Models for Gene Finding
States: First Exon State, Intergene State, Intron State (exon, intron, intergene)

And that’s what we want to do when predicting genes. A Markov chain is a random process in which the next step depends only on where you are at the moment. For the dice, which die to roll next depended only on the one you just rolled, not on the history. A hidden Markov model simply means that you don’t observe the state sequence (the sequence of rolled dice) directly, but something that depends on it, so the state sequence is hidden from you. In the gene-finding context you observe the DNA sequence, and the state sequence you want to determine consists of exon, intron and intergene states. That is, you want to determine for each DNA base which state it belongs to, and thus predict the exon boundaries.

Example sequence: GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
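Assigning a state to every base is what the Viterbi algorithm does for a standard HMM. The sketch below works in log space; the three-state model and all of its probabilities are hypothetical numbers chosen only so the code runs, not trained parameters of any real gene finder.

```python
import math


def safe_log(p):
    """Natural log that maps probability 0 to -infinity instead of raising."""
    return math.log(p) if p > 0 else -math.inf


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for obs under a standard HMM."""
    scores = [{s: safe_log(start_p[s]) + safe_log(emit_p[s][obs[0]]) for s in states}]
    backpointers = []
    for symbol in obs[1:]:
        prev = scores[-1]
        row, pointers = {}, {}
        for s in states:
            best_prev = max(states, key=lambda r: prev[r] + safe_log(trans_p[r][s]))
            row[s] = (prev[best_prev] + safe_log(trans_p[best_prev][s])
                      + safe_log(emit_p[s][symbol]))
            pointers[s] = best_prev
        scores.append(row)
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical toy model (illustrative numbers only).
STATES = ("intergene", "exon", "intron")
START = {"intergene": 0.8, "exon": 0.1, "intron": 0.1}
TRANS = {
    "intergene": {"intergene": 0.9, "exon": 0.1, "intron": 0.0},
    "exon":      {"intergene": 0.1, "exon": 0.8, "intron": 0.1},
    "intron":    {"intergene": 0.0, "exon": 0.1, "intron": 0.9},
}
EMIT = {
    "intergene": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "exon":      {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "intron":    {"A": 0.30, "C": 0.15, "G": 0.15, "T": 0.40},
}

toy_path = viterbi("GTCAGATGAGCAAAGTAG", STATES, START, TRANS, EMIT)
```

Exon boundaries are then read off as the positions where the path enters or leaves the exon state.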

Duration HMM for Gene Finding
Duration Modeling
  Introns: regular HMM states, geometric duration
  Exons: special duration model

$$V_{E_{0,0}}(i) = \max_{d=1 \ldots D} \left\{ \mathrm{Prob}[\mathrm{duration}(E_{0,0}) = d] \cdot a_{\mathrm{Intron}_0, E_{0,0}} \cdot \prod_{j=i-d+1}^{i} e_{E_{0,0}}(x_j) \cdot V_{\mathrm{Intron}_0}(i-d) \right\}$$

  where i is an admissible exon-ending position, D is restricted by the longest ORF

GENSCAN: Chris Burge and Sam Karlin, 1997
  Best performing de novo gene finder
  HMM with duration modeling for Exon states

This is the state space of the generalized HMM used by, for instance, Genscan and Genie, which perform single-species gene finding. The hidden states are the exons, in red, and the introns and intergene, in green. We have so many states because the sequence is translated into protein in triplets, so there are three different ways to translate the same sequence, that is, three different reading frames. The end product has to be a sequence whose length is divisible by three, and if one exon ends in the middle of a codon, that codon has to be finished at the beginning of the next exon. Thus in each exon we would have to remember the number of extra bases in the previous exon, that is, two states ago, which violates the Markov assumption that the transition to the next state depends only on the current state. If we instead have one intron state for each reading frame, we carry the information with us, so to speak, preserving the Markov property.

[Figure labels: duration; Exon1, Exon2, Exon3; T, A, G, C]
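One step of the exon recursion can be sketched in log space as follows. The sketch assumes the usual form of a duration-HMM Viterbi update, in which the score of the best path ending in the preceding intron state at position i - d is carried into the maximum. All argument names are hypothetical, and a single emission table stands in for the exon model.

```python
import math


def exon_state_value(i, x, log_v_intron, log_duration, log_a, log_emit, max_d):
    """Best log score of an exon ending at position i (1-based), maximised over its length d.

    log_v_intron[k] : best log score of the preceding intron state ending at position k
    log_duration(d) : log Prob[duration = d] for this exon state
    log_a           : log transition probability intron -> exon
    log_emit[base]  : log emission probability of a base in this exon state
    max_d           : largest exon length considered (D in the formula)
    """
    best = -math.inf
    for d in range(1, min(max_d, i) + 1):
        # Bases x_{i-d+1} .. x_i in 1-based terms are x[i-d:i] in Python slicing.
        emission = sum(log_emit[base] for base in x[i - d:i])
        candidate = log_v_intron[i - d] + log_a + log_duration(d) + emission
        best = max(best, candidate)
    return best
```

The inner loop over d is what makes duration HMMs more expensive than standard ones, and it is why D is capped, for example by the longest open reading frame ending at i.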

HMM-based Gene Finders
GENSCAN (Burge 1997)
  Big jump in accuracy of de novo gene finding
  Currently, one of the best
  HMM with duration modeling for Exon states
FGENESH (Solovyev 1997)
  Currently one of the best
HMMgene (Krogh 1997)
GENIE (Kulp 1996)
GENMARK (Borodovsky & McIninch 1993)
VEIL (Henderson, Salzberg, & Fasman 1997)

Better way to do it: negative binomial
EasyGene: Prokaryotic gene-finder (Larsen TS, Krogh A)
  Negative binomial with n = 3
Non-parametric duration density is too expensive
Better to use some alternative density, such as negative binomial
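A negative binomial duration arises when n identical geometric stages are chained, so it can be built from ordinary HMM states. A minimal sketch, where `p` is assumed to be the per-step probability of leaving a stage and the function name is invented:

```python
from math import comb


def negative_binomial_pmf(d, n, p):
    """P(total duration = d) for n chained stages, each left with probability p per step."""
    if d < n:
        return 0.0
    return comb(d - 1, n - 1) * p ** n * (1.0 - p) ** (d - n)
```

With n = 1 this is the plain geometric distribution; with n = 3 the probability rises to a peak before decaying, which fits gene and exon lengths better while needing only two parameters instead of a full non-parametric table.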

GENSCAN’s hidden weapon
C+G content is correlated with:
  Gene content (+)
  Mean exon length (+)
  Mean intron length (-)
These quantities affect parameters of model
Solution
  Train parameters of model in four different C+G content ranges!
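Choosing a parameter set by C+G content can be sketched as a simple binning step. The three cut-offs below are illustrative assumptions that split sequences into four ranges; they are not taken from the slides, and the function names are hypothetical.

```python
import bisect

# Assumed, illustrative boundaries between the four C+G ranges.
GC_BOUNDARIES = (0.43, 0.51, 0.57)


def gc_content(seq):
    """Fraction of G and C among the A/C/G/T bases of seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def parameter_set_index(seq, boundaries=GC_BOUNDARIES):
    """Index (0 to 3) of the C+G range, and hence the parameter set, that seq falls in."""
    return bisect.bisect_right(boundaries, gc_content(seq))
```

At prediction time the input sequence's C+G content is measured first, and the model then runs with the transition and length parameters trained on sequences from the same range.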

Evaluation of Accuracy
                        Actual: Coding   Actual: No Coding
Predicted: Coding            TP                FP
Predicted: No Coding         FN                TN

Sensitivity (Sn)
  Exon level: fraction of actual exons whose boundaries are predicted exactly
  Nucleotide level: fraction of coding nucleotides that are predicted as coding
Specificity (Sp)
  Exon level: fraction of the predicted exons that are exactly correct
  Nucleotide level: fraction of the predicted coding nucleotides that are coding
Correlation Coefficient (CC)
  Combined measure of Sensitivity & Specificity
  Range: -1 (always wrong) to +1 (always right)
(Slide by NF Samatova)
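At the nucleotide level the three measures can be computed from the four counts as sketched below. Two assumptions are made here: Sp follows the slide's definition (the fraction of predictions that are correct, TP / (TP + FP)), and CC is taken to be the usual Pearson-type correlation of the 2x2 table. The 'C'/'N' label strings and the function names are a made-up input format for this sketch.

```python
import math


def nucleotide_counts(actual, predicted):
    """Return (TP, FP, TN, FN) from per-base labels: 'C' = coding, 'N' = non-coding."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == "C" and p == "C" for a, p in pairs)
    fp = sum(a == "N" and p == "C" for a, p in pairs)
    tn = sum(a == "N" and p == "N" for a, p in pairs)
    fn = sum(a == "C" and p == "N" for a, p in pairs)
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Return (Sn, Sp, CC); a measure is NaN when its denominator is zero."""
    nan = float("nan")
    sn = tp / (tp + fn) if tp + fn else nan
    sp = tp / (tp + fp) if tp + fp else nan
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom if denom else nan
    return sn, sp, cc
```

Because most of a mammalian genome is non-coding, TN dominates the table, which is why gene-finding papers report TP / (TP + FP) rather than the TN-based specificity used in other fields.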

Results of GENSCAN
On the initial test dataset (Burset & Guigo)
  80% exact exon detection
  10% partial exons
  10% wrong exons
In general HMMs have been best in de novo prediction
In practice they overpredict human genes by ~2x
