Genomics 101 DNA sequencing Alignment Gene identification Gene expression Genome evolution …


Next Few Topics
Gene Recognition: finding genes in DNA with computational methods
Large-scale alignment & multiple alignment: comparing whole genomes, or large families of genes
Gene Expression and Regulation: measuring the expression of many genes at a time; finding elements in DNA that control the expression of genes

Gene Recognition Credits for slides: Marina Alexandersson Lior Pachter Serge Saxonov

Reading: GENSCAN, EasyGene, SLAM, Twinscan. Optional: Chris Burge's thesis

Gene expression
DNA:     CCTGAGCCAACTATTGATGAA
  (transcription)
RNA:     CCUGAGCCAACUAUUGAUGAA
  (translation)
Protein: PEPTIDE
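The two steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and reading the coding strand directly; the names `transcribe`, `translate`, and `CODON_TABLE` are illustrative and not from the lecture.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
# (last base varies fastest). '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate coding-strand DNA codon by codon, starting at position 0."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


print(transcribe("CCTGAGCCAACTATTGATGAA"))  # CCUGAGCCAACUAUUGAUGAA
print(translate("CCTGAGCCAACTATTGATGAA"))   # PEPTIDE
```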

Gene structure
A gene is laid out as exon1, intron1, exon2, intron2, exon3. It is expressed through transcription, splicing, and translation.
exon = protein-coding
intron = non-coding
Codon: a triplet of nucleotides that is converted to one amino acid

Where are the genes?

In humans: ~22,000 genes, ~1.5% of human DNA

Finding Genes
1. Exploit the regular gene structure
   ATG-Exon1-Intron1-Exon2-…-ExonN-STOP
2. Recognize "coding bias"
   CAG-CGA-GAC-TAT-TTA-GAT-AAC-ACA-CAT-GAA-…
3. Recognize splice sites
   Intron-cAGt-Exon-gGTgag-Intron
4. Model the duration of regions
   Introns tend to be much longer than exons, in mammals
   Exons are biased to have a given minimum length
5. Use cross-species comparison
   Gene structure is conserved in mammals
   Exons are more similar (~85%) than introns

Approaches to gene finding
Homology: BLAST, Procrustes.
Ab initio: Genscan, Genie, GeneID.
Hybrids: GenomeScan, GenieEST, Twinscan, SGP, ROSETTA, CEM, TBLASTX, SLAM.

1. Exploit the regular gene structure
5' - Start codon (ATG) - Exon 1 - Intron 1 - Exon 2 - Intron 2 - Exon 3 - Stop codon (TAG/TGA/TAA) - 3'
Splice sites mark the exon/intron boundaries.

Next Exon: Frame 0 Next Exon: Frame 1

2. Recognize "coding bias"
Each exon can be in one of three frames:
ag-gattacagattacagattaca-gtaag   Frame 0
ag-gattacagattacagattaca-gtaag   Frame 1
ag-gattacagattacagattaca-gtaag   Frame 2
Frame of next exon depends on how many nucleotides are left over from previous exon
Codons "tag", "tga", and "taa" are STOP
  - No STOP codon appears in-frame, until end of gene
  - Absence of STOP is called open reading frame (ORF)
Different codons appear with different frequencies: coding bias
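The frame and open-reading-frame ideas above can be sketched as follows. This is an illustration under one assumed convention: `offset` is the index of the first complete codon within the exon (0, 1, or 2), which may not match the frame numbering used in the lecture figures. The function names are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def first_inframe_stop(seq, offset):
    """Return the index of the first in-frame stop codon, or None if the
    sequence is open (stop-free) when read from `offset`."""
    seq = seq.upper()
    for i in range(offset, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            return i
    return None


def next_exon_offset(exon_len, offset):
    """Nucleotides left over at the end of this exon determine where the
    first complete codon of the next exon begins."""
    leftover = (exon_len - offset) % 3
    return (3 - leftover) % 3


exon = "gattacagattacagattaca"  # exon body from the slide, 21 nt
for offset in range(3):
    print(offset, first_inframe_stop(exon, offset), next_exon_offset(len(exon), offset))
```

For the slide's example exon, no reading offset contains a stop codon, so the sequence is open in all three frames.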

2. Recognize "coding bias"

Amino acid      SLC   DNA codons
Isoleucine      I     ATT, ATC, ATA
Leucine         L     CTT, CTC, CTA, CTG, TTA, TTG
Valine          V     GTT, GTC, GTA, GTG
Phenylalanine   F     TTT, TTC
Methionine      M     ATG
Cysteine        C     TGT, TGC
Alanine         A     GCT, GCC, GCA, GCG
Glycine         G     GGT, GGC, GGA, GGG
Proline         P     CCT, CCC, CCA, CCG
Threonine       T     ACT, ACC, ACA, ACG
Serine          S     TCT, TCC, TCA, TCG, AGT, AGC
Tyrosine        Y     TAT, TAC
Tryptophan      W     TGG
Glutamine       Q     CAA, CAG
Asparagine      N     AAT, AAC
Histidine       H     CAT, CAC
Glutamic acid   E     GAA, GAG
Aspartic acid   D     GAT, GAC
Lysine          K     AAA, AAG
Arginine        R     CGT, CGC, CGA, CGG, AGA, AGG
Stop codons     Stop  TAA, TAG, TGA

Can map 61 non-stop codons to frequencies & take log-odds ratios
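A minimal sketch of the log-odds idea: estimate codon frequencies from known coding sequence, compare them to a background, and sum the log ratios over a candidate region. The uniform background over the 61 non-stop codons, the pseudocount, and the toy training string are assumptions made for illustration; a real gene finder would train on many annotated genes and use a background estimated from non-coding DNA.

```python
import math
from collections import Counter
from itertools import product

STOPS = {"TAA", "TAG", "TGA"}
SENSE = {"".join(c) for c in product("ACGT", repeat=3)} - STOPS  # 61 codons


def codons(seq):
    """Split a sequence into consecutive triplets, starting at position 0."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def codon_log_odds(coding_seqs, pseudocount=1.0):
    """log2( P(codon | coding) / P(codon | background) ) for each sense codon.
    Background is assumed uniform over the 61 sense codons."""
    counts = Counter(c for s in coding_seqs for c in codons(s) if c in SENSE)
    total = sum(counts.values()) + pseudocount * len(SENSE)
    background = 1.0 / len(SENSE)
    return {c: math.log2(((counts[c] + pseudocount) / total) / background)
            for c in SENSE}


def coding_score(seq, table):
    """Sum of per-codon log-odds; stop codons contribute nothing here."""
    return sum(table.get(c, 0.0) for c in codons(seq))


# Toy training data only: a single made-up coding fragment.
table = codon_log_odds(["GAGGAGGAGGAAGAT"])
print(coding_score("GAGGAAGAG", table) > coding_score("CCCCCCCCC", table))
```

A positive total score suggests the region uses codons the way the training genes do; a negative score suggests it does not.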

atg tga ggtgag caggtg cagatg cagttg caggcc ggtgag

Biology of Splicing

3. Recognize splice sites
Donor: 7.9 bits; Acceptor: 9.4 bits (Stephens & Schneider, 1996)

3. Recognize splice sites
[Figure: donor site, 5' to 3', by position]

3. Recognize splice sites
WMM: weight matrix model = PSSM (Staden 1984)
WAM: weight array model = 1st-order Markov (Zhang & Marr 1993)
MDD: maximal dependence decomposition (Burge & Karlin 1997)
  - Decision-tree algorithm to take pairwise dependencies into account
    For each position $i$, calculate $S_i = \sum_{j \neq i} \chi^2(C_i, X_j)$
    Choose $i^*$ such that $S_{i^*}$ is maximal and partition into two subsets, until
      no significant dependencies are left, or
      not enough sequences remain in a subset
  - Train separate WMM models for each subset

MDD decision tree:
All donor splice sites
  G5
    G5 G-1
      G5 G-1 A2
        G5 G-1 A2 U6
        G5 G-1 A2 not U6
      G5 G-1 not A2
    G5 not G-1
  not G5
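The simplest of the three models, the WMM, can be sketched as below. It treats each position independently and scores a candidate site by summed log-odds against a background. The uniform 0.25 background, the pseudocount, and the four toy donor sequences are assumptions for illustration and are not data from the lecture; MDD would train one such matrix per leaf of its decision tree.

```python
import math


def train_wmm(sites, pseudocount=1.0):
    """Build a weight matrix (one dict per position) of log2 odds versus a
    uniform background, from equal-length aligned uppercase sites."""
    length = len(sites[0])
    wmm = []
    for pos in range(length):
        column = [s[pos] for s in sites]
        total = len(column) + 4 * pseudocount
        wmm.append({b: math.log2(((column.count(b) + pseudocount) / total) / 0.25)
                    for b in "ACGT"})
    return wmm


def score_wmm(wmm, site):
    """Sum the per-position log-odds for one candidate site."""
    return sum(column[base] for column, base in zip(wmm, site))


# Toy aligned donor sites (made up): 3 exon bases followed by 6 intron bases.
toy_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTAAGT"]
wmm = train_wmm(toy_sites)
print(score_wmm(wmm, "CAGGTAAGT"), score_wmm(wmm, "CCCCCCCCC"))
```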

4. Model the duration of regions

Hidden Markov Models for Gene Finding
Sequence: GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
Labels: intergene, exon, intron
States: Intergene State, First Exon State, Intron State

Duration HMM for Gene Finding
[Figure: sequence TAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTGGGGGGGGGGGGGGGCCCCCCC segmented into Exon1, Exon2, Exon3, each with its own duration]

Duration Modeling
Introns: regular HMM states, geometric duration
Exons: special duration model

$V_{E_{0,0}}(i) = \max_{d = 1 \ldots D} \left\{ \mathrm{Prob}[\mathrm{duration}(E_{0,0}) = d] \times a_{\mathrm{Intron}_0, E_{0,0}} \times \prod_{j = i-d+1}^{i} e_{E_{0,0}}(x_j) \right\}$

where $i$ is an admissible exon-ending state, and $D$ is restricted by the longest ORF

GENSCAN: Chris Burge and Sam Karlin, 1997
Best performing de novo gene finder
HMM with duration modeling for Exon states
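The maximization over exon durations can be sketched in log space as below. This follows the slide's formula as written, which shows only the duration, transition, and emission terms; a full Viterbi recurrence would also carry the best score of the preceding state. The function name, the argument layout, and the toy numbers are assumptions for illustration.

```python
import math


def best_exon_duration(log_emit, log_dur, log_trans, i, max_d):
    """Maximize over d = 1..max_d the quantity
       log P[duration = d] + log a(intron -> exon) + sum_{j=i-d+1..i} log e(x_j).
    log_emit[j] is the exon-state log emission probability of symbol j (0-based),
    log_dur maps an allowed duration d to its log probability.
    Returns (best log score, best duration)."""
    best_score, best_d = -math.inf, None
    emit_sum = 0.0
    for d in range(1, min(max_d, i + 1) + 1):
        emit_sum += log_emit[i - d + 1]      # extend the exon one base to the left
        if d in log_dur:                     # durations with zero probability are skipped
            score = log_dur[d] + log_trans + emit_sum
            if score > best_score:
                best_score, best_d = score, d
    return best_score, best_d


# Toy example: six positions, each emitted with probability 0.5 by the exon state.
log_emit = [math.log(0.5)] * 6
log_dur = {3: math.log(0.7), 6: math.log(0.3)}
print(best_exon_duration(log_emit, log_dur, math.log(1.0), i=5, max_d=6))
```

The running sum keeps the cost at O(D) per end position instead of recomputing the emission product for every duration.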

HMM-based Gene Finders
GENSCAN (Burge 1997)
  - Big jump in accuracy of de novo gene finding
  - Currently, one of the best
  - HMM with duration modeling for Exon states
FGENESH (Solovyev 1997)
  - Currently one of the best
HMMgene (Krogh 1997)
GENIE (Kulp 1996)
GENMARK (Borodovsky & McIninch 1993)
VEIL (Henderson, Salzberg, & Fasman 1997)

Better way to do it: negative binomial
EasyGene: prokaryotic gene-finder (Larsen TS, Krogh A)
Negative binomial with n = 3
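To see why a negative binomial is a better length model than a single geometric state, the two distributions can be compared directly. The sketch below uses the standard form in which a chain of n identical geometric states gives a negative binomial over the total duration; the value p = 0.1 is an arbitrary choice for illustration, not a parameter from EasyGene.

```python
import math


def geometric_pmf(d, p):
    """Duration of a single self-looping HMM state: P(d) = (1-p)^(d-1) * p, d >= 1."""
    return (1 - p) ** (d - 1) * p


def neg_binomial_pmf(d, n, p):
    """Total duration of n such states in series:
    P(d) = C(d-1, n-1) * p^n * (1-p)^(d-n), for d >= n."""
    if d < n:
        return 0.0
    return math.comb(d - 1, n - 1) * p ** n * (1 - p) ** (d - n)


p = 0.1
lengths = range(1, 200)
geo_mode = max(lengths, key=lambda d: geometric_pmf(d, p))
nb_mode = max(lengths, key=lambda d: neg_binomial_pmf(d, 3, p))
print(geo_mode, nb_mode)
```

The geometric distribution always puts its peak at the shortest length, whereas the negative binomial with n = 3 peaks at a longer length, which is closer to how real gene lengths behave.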

GENSCAN's hidden weapon
C+G content is correlated with:
  - Gene content (+)
  - Mean exon length (+)
  - Mean intron length (-)
These quantities affect parameters of model
Solution
  - Train parameters of model in four different C+G content ranges!

Evaluation of Accuracy (Slide by NF Samatova)
Each measure is defined at the exon level, with the nucleotide-level version in parentheses.
Sensitivity (Sn): fraction of exons whose boundaries are predicted exactly (fraction of coding nucleotides that are predicted as coding)
Specificity (Sp): fraction of the predicted exons that are exactly correct (fraction of predicted coding nucleotides that are coding)
Correlation Coefficient (CC): combined measure of Sensitivity & Specificity
  Range: -1 (always wrong) to +1 (always right)

                      Actual coding   Actual no coding
Predicted coding      TP              FP
Predicted no coding   FN              TN
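At the nucleotide level these measures reduce to simple formulas over the four cells of the table. The sketch below uses the definitions customary in the gene-finding literature, where "specificity" is TP / (TP + FP); the counts in the example are made up.

```python
import math


def sensitivity(tp, fn):
    """Fraction of truly coding nucleotides that were predicted as coding."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """Fraction of predicted coding nucleotides that are truly coding
    (the gene-finding convention; elsewhere this is called precision)."""
    return tp / (tp + fp)


def correlation_coefficient(tp, tn, fp, fn):
    """Combined measure ranging from -1 (always wrong) to +1 (always right)."""
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up counts for a 1,000-nucleotide test sequence.
tp, fn, fp, tn = 80, 20, 20, 880
print(sensitivity(tp, fn), specificity(tp, fp), correlation_coefficient(tp, tn, fp, fn))
```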

Results of GENSCAN
On the initial test dataset (Burset & Guigó)
  - 80% exact exon detection
  - 10% partial exons
  - 10% wrong exons
In general
  - HMMs have been best in de novo prediction
  - In practice they overpredict human genes by ~2x