Gene Recognition Credits for slides: Marina Alexandersson Lior Pachter Serge Saxonov.

Slides:



Advertisements
Similar presentations
HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY CS 594: An Introduction to Computational Molecular Biology BY Shalini Venkataraman Vidhya Gunaseelan.
Advertisements

Profiles for Sequences
ATG GAG GAA GAA GAT GAA GAG ATC TTA TCG TCT TCC GAT TGC GAC GAT TCC AGC GAT AGT TAC AAG GAT GAT TCT CAA GAT TCT GAA GGA GAA AAC GAT AAC CCT GAG TGC GAA.
Supplementary Fig.1: oligonucleotide primer sequences.
Transcription & Translation Worksheet
CS262 Lecture 9, Win07, Batzoglou Gene Recognition.
HMM Sampling and Applications to Gene Finding and Alignment European Conference on Computational Biology 2003 Simon Cawley * and Lior Pachter + and thanks.
Gene Recognition Credits for slides: Marina Alexandersson Lior Pachter Serge Saxonov.
Introduction to bioinformatics Lecture 2 Genes and Genomes.
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA exon intron intergene Find Gene Structures in DNA Intergene State First Exon State Intron State.
Hidden Markov Models Sasha Tkachev and Ed Anderson Presenter: Sasha Tkachev.
CS262 Lecture 9, Win07, Batzoglou Rapid Global Alignments How to align genomic sequences in (more or less) linear time.
Picking Alignments from (Steiner) Trees Fumei Lam Marina Alexandersson Lior Pachter.
Genomics 101 DNA sequencing Alignment Gene identification Gene expression Genome evolution …
CSE182-L10 Gene Finding.
Crick’s early Hypothesis Revisited. Or The Existence of a Universal Coding Frame Axel Bernal UPenn Center for Bioinformatics Jean-Louis Lassez Coastal.
Comparative ab initio prediction of gene structures using pair HMMs
CS262 Lecture 15, Win07, Batzoglou Multiple Sequence Alignments.
Finding genes in human using the mouse Finding genes in mouse using the human Lior Pachter Department of Mathematics U.C. Berkeley.
Gene Recognition Credits for slides: Serafim Batzoglou Marina Alexandersson Lior Pachter Serge Saxonov.
Gene Prediction: Statistical Approaches Lecture 22.
Multiple Sequence Alignments Algorithms. MLAGAN: progressive alignment of DNA Given N sequences, phylogenetic tree Align pairwise, in order of the tree.
Introduction to Molecular Biology. G-C and A-T pairing.
1 Essential Computing for Bioinformatics Bienvenido Vélez UPR Mayaguez Lecture 5 High-level Programming with Python Part II: Container Objects Reference:
 Genetic information, stored in the chromosomes and transmitted to the daughter cells through DNA replication is expressed through transcription to RNA.
Dictionaries.
IGEM Arsenic Bioremediation Possibly finished biobrick for ArsR by adding a RBS and terminator. Will send for sequencing today or Monday.
Nature and Action of the Gene
Biological Dynamics Group Central Dogma: DNA->RNA->Protein.
Gene Prediction in silico Nita Parekh BIRC, IIIT, Hyderabad.
More on translation. How DNA codes proteins The primary structure of each protein (the sequence of amino acids in the polypeptide chains that make up.
Genes: Regulation and Structure Many slides from various sources, including S. Batzoglou,
Undifferentiated Differentiated (4 d) Supplemental Figure S1.
Wellcome Trust Workshop Working with Pathogen Genomes Module 2 Gene Prediction.
Supplemental Table S1 For Site Directed Mutagenesis and cloning of constructs P9GF:5’ GAC GCT ACT TCA CTA TAG ATA GGA AGT TCA TTT C 3’ P9GR:5’ GAA ATG.
Lecture 10, CS5671 Neural Network Applications Problems Input transformation Network Architectures Assessing Performance.
Fig. S1 siControl E2 G1: 45.7% S: 26.9% G2-M: 27.4% siER  E2 G1: 70.9% S: 9.9% G2-M: 19.2% G1: 57.1% S: 12.0% G2-M: 30.9% siRNF31 E2 A B siRNF31 siControl.
PART 1 - DNA REPLICATION PART 2 - TRANSCRIPTION AND TRANSLATION.
TRANSLATION: information transfer from RNA to protein the nucleotide sequence of the mRNA strand is translated into an amino acid sequence. This is accomplished.
Mark D. Adams Dept. of Genetics 9/10/04
Prodigiosin Production in E. Coli Brian Hovey and Stephanie Vondrak.
Passing Genetic Notes in Class CC106 / Discussion D by John R. Finnerty.
Supplementary materials
JIGSAW: a better way to combine predictions J.E. Allen, W.H. Majoros, M. Pertea, and S.L. Salzberg. JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the.
Dictionaries. A “Good morning” dictionary English: Good morning Spanish: Buenas días Swedish: God morgon German: Guten morgen Venda: Ndi matscheloni Afrikaans:
An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Gene Prediction: Statistical Approaches.
Gene Structure Prediction (Gene Finding) I519 Introduction to Bioinformatics, 2012.
Suppl. Figure 1 APP23 + X Terc +/- Terc +/-, APP23 + X Terc +/- G1Terc -/-, APP23 + X G1Terc -/- G2Terc -/-, APP23 + X G2Terc -/- G3Terc -/-, APP23 + and.
Gene prediction 10. June 2004 Irmtraud Meyer University of Oxford
RA(4kb)- Atggagtccgaaatgctgcaatcgcctcttctgggcctgggggaggaagatgaggc……………………………………………….. ……………………………………………. ……………………….,……. …tactacatctccgtgtactcggtggagaagcgtgtcagatag.
Example 1 DNA Triplet mRNA Codon tRNA anticodon A U A T A U G C G
Name of presentation Month 2009 SPARQ-ed PROJECT Mutations in the tumor suppressor gene p53 Pulari Thangavelu (PhD student) April Chromosome Instability.
DNA, RNA and Protein.
Genomics 101 DNA sequencing Alignment Gene identification
bacteria and eukaryotes
Modelling Proteomes.
Supplementary information Table-S1 (Xiao)
Sequence – 5’ to 3’ Tm ˚C Genome Position HV68 TMER7 Δ mt. Forward
Supplemental Table 3. Oligonucleotides for qPCR
Supplementary Figure 1 – cDNA analysis reveals that three splice site alterations generate multiple RNA isoforms. (A) c.430-1G>C (IVS 6) results in 3.
Huntington Disease (HD)
DNA By: Mr. Kauffman.
Eukaryotic Gene Finding
Gene architecture and sequence annotation
More on translation.
Pair Hidden Markov Model
Fundamentals of Protein Structure
Python.
6.096 Algorithms for Computational Biology Lecture 2 BLAST & Database Search Manolis Piotr Indyk.
Presentation transcript:

Gene Recognition Credits for slides: Marina Alexandersson Lior Pachter Serge Saxonov

Lecture 13, Thursday May 15, 2003 Reading GENSCAN SLAM Twinscan Optional: Chris Burge’s Thesis

Lecture 13, Thursday May 15, 2003 Gene expression Protein RNA DNA transcription translation CCTGAGCCAACTATTGATGAA PEPTIDEPEPTIDE CCUGAGCCAACUAUUGAUGAA

Lecture 13, Thursday May 15, 2003 Gene structure exon1 exon2exon3 intron1intron2 transcription translation splicing exon = coding intron = non-coding

Lecture 13, Thursday May 15, 2003 Finding genes Start codon ATG 5’ 3’ Exon 1 Exon 2 Exon 3 Intron 1Intron 2 Stop codon TAG/TGA/TAA Splice sites

Lecture 13, Thursday May 15, 2003 Approaches to gene finding Homology –BLAST, Procrustes. Ab initio –Genscan, Genie, GeneID. Hybrids –GenomeScan, GenieEST, Twinscan, SGP, ROSETTA, CEM, TBLASTX, SLAM.

HMMs for single species gene finding: Generalized HMMs

Lecture 13, Thursday May 15, 2003 HMMs for gene finding GTCAGAGTAGCAAAGTAGACACTCCAGTAACGC exon intron intergene

Lecture 13, Thursday May 15, 2003 GHMM for gene finding TAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTGGGGGGGGGGGGGGGCCCCCCC Exon1Exon2Exon3

Lecture 13, Thursday May 15, 2003 Observed duration times

Lecture 13, Thursday May 15, 2003 Biology of Splicing (

Lecture 13, Thursday May 15, 2003 Consensus splice sites ( Donor: 7.9 bits Acceptor: 9.4 bits (Stephens & Schneider, 1996)

Lecture 13, Thursday May 15, 2003 Splice Site Models WMM: weight matrix model = PSSM (Staden 1984) WAM: weight array model = 1 st order Markov (Zhang & Marr 1993) MDD: maximal dependence decomposition (Burge & Karlin 1997) decision-tree like algorithm to take significant pairwise dependencies into account

Lecture 13, Thursday May 15, 2003 Splice site detection 5’ 3’ Donor site Position 

atg tga ggtgag caggtg cagatg cagttg caggcc ggtgag

Lecture 13, Thursday May 15, 2003 Coding potential Amino AcidSLCDNA codons IsoleucineIATT, ATC, ATA LeucineLCTT, CTC, CTA, CTG, TTA, TTG ValineVGTT, GTC, GTA, GTG PhenylalanineFTTT, TTC MethionineMATG CysteineCTGT, TGC AlanineAGCT, GCC, GCA, GCG GlycineGGGT, GGC, GGA, GGG ProlinePCCT, CCC, CCA, CCG ThreonineTACT, ACC, ACA, ACG SerineSTCT, TCC, TCA, TCG, AGT, AGC TyrosineYTAT, TAC TryptophanWTGG GlutamineQCAA, CAG AsparagineNAAT, AAC HistidineHCAT, CAC Glutamic acidEGAA, GAG Aspartic acidDGAT, GAC LysineKAAA, AAG ArginineRCGT, CGC, CGA, CGG, AGA, AGG Stop codons StopTAA, TAG, TGA

Lecture 13, Thursday May 15, 2003 Comparison of 1196 orthologous genes (Makalowski et al., 1996) Sequence identity: –exons: 84.6% –protein: 85.4% –introns: 35% –5’ UTRs: 67% –3’ UTRs: 69% 27 proteins were 100% identical.

Lecture 13, Thursday May 15, 2003 HumanMouse Human-mouse homology

Lecture 13, Thursday May 15, 2003

Not always: HoxA human-mouse

Lecture 13, Thursday May 15, :. :. :. :. : 247 GGTGAGGTCGAGGACCCTGCA CGGAGCTGTATGGAGGGCA AGAGC |: || ||||: |||| --:|| ||| |::| |||---|||| 368 GAGTCGGGGGAGGGGGCTGCTGTTGGCTCTGGACAGCTTGCATTGAGAGG 100. :. :. :. :. : 292 TTC CTACAGAAAAGTCCCAGCAAGGAGCCACACTTCACTG ||| || | |::| |: ||||::|:||:-|| ||:| | 418 TTCTGGCTACGCTCTCCCTTAGGGACTGAGCAGAGGGCT CAGGTCGCGG 150. :. :. :. :. : 332 ATGTCGAGGGGAAGACATCATTCGGGATGTCAGTG ||||||||||||||||||||||:|||||||||||| 467 TGGGAGATGAGGCCAATGTCGAGGGGAAGACATCATTTGGGATGTCAGTG 200. :. :. :. :. : 367 TTCAACCTCAGCAATGCCATCATGGGCAGCGGCATCCTGGGACTCGCCTA |||||:||||||||:||||||||||||||:|| ||:|||||:|||||||| 517 TTCAATCTCAGCAACGCCATCATGGGCAGTGGAATTCTGGGGCTCGCCTA Alignment

Lecture 13, Thursday May 15, 2003 Twinscan Twinscan is an augmented version of the Gencscan HMM. E I transitions duration emissions ACUAUACAGACAUAUAUCAU

Lecture 13, Thursday May 15, 2003 Twinscan Algorithm 1.Align the two sequences (eg. from human and mouse) 2.Mark each human base as gap ( - ), mismatch ( : ), match ( | ) New “alphabet”: 4 x 3 = 12 letters  = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }

Lecture 13, Thursday May 15, 2003 Twinscan Algorithm (cont’d) 3.Run Viterbi using emissions e k (b) where b  { A-, A:, A|, …, T| } Note: Emission distributions e k (b) estimated from real genes from human/mouse e I (x|) < e E (x|): matches favored in exons e I (x-) > e E (x-): gaps (and mismatches) favored in introns

Lecture 13, Thursday May 15, 2003 Example Human : ACGGCGACUGUGCACGU Mouse : ACUGUGAC GUGCACUU Align : ||:|:|||-||||||:| Input to Twinscan HMM: A| C| G: G| C: G| A| C| U- G| U| G| C| A| C| G: U| Recall, e E (A|) > e I (A|) e E (A-) < e I (A-) Likely exon

HMMs for simultaneous alignment and gene finding: Generalized Pair HMMs

Lecture 13, Thursday May 15, 2003 A Pair HMM for alignments M P(x i, y j ) I P(x i ) J P(y j )  1-  - 2        BEGIN END M J I

Lecture 13, Thursday May 15, 2003 Cross-species gene finding 5’ 3’ Exon1 Exon2 Exon3 Intron1Intron2 CNS [human] [mouse] Exon = coding Intron = non-coding CNS = conserved non-coding

Lecture 13, Thursday May 15, 2003 Generalized Pair HMMs

Lecture 13, Thursday May 15, 2003 Ingredients in exon scores Splice site detection (VLMM) Length distribution (generalized) Coding potential (codon freq. tables) GC-stratification

Lecture 13, Thursday May 15, 2003 Exon GPHMM d e 1.Choose exon lengths (d,e). 2.Generate alignment of length d+e.

Lecture 13, Thursday May 15, 2003 The SLAM hidden Markov model

Lecture 13, Thursday May 15, 2003 Example: HoxA2 and HoxA3 SLAM SGP-2 Twinscan Genscan TBLASTX SLAM CNS VISTA RefSeq

Lecture 13, Thursday May 15, 2003 no. states max duration length seq1 length seq2 Computational complexity

Lecture 13, Thursday May 15, 2003 Approximate alignment Reduces TU -factor to hT

Lecture 13, Thursday May 15, 2003 Measuring Performance Definition: –Sensitivity (SN): (# correctly predicted)/(# true) –Specificity (SP): (# correctly predicted)/(# total predicted)

Lecture 13, Thursday May 15, 2003 Measuring Performance