Predictive methods using DNA sequences Unit 11 BIOL221T: Advanced Bioinformatics for Biotechnology Irene Gabashvili, PhD.

Reminders from last week
- Polymorphism and mutations
- Mapping and Sequencing
- Genomic Map Elements
- Types of Maps
- Resources
- Practical Use

Polymorphism - Types of variation
- SNP[snp_class]: true single nucleotide polymorphism
- in-del: insertion/deletion polymorphism ('-'/'+')
- Microsatellite/simple sequence repeat
- [FUNC] = Function_Class:
  - "coding nonsynonymous"
  - locus region, intron, exception
  - mrna, utr, splice site
  - "coding synonymous"

Nonsynonymous Mutations
- Missense: a type of nonsynonymous mutation (a different amino acid in the product of the mutated gene)
- EXAMPLE: sickle-cell disease. The replacement of A by T at the 17th nucleotide of the gene for the beta chain of hemoglobin changes the codon GAG (for glutamic acid) to GTG (which encodes valine). Thus the 6th amino acid in the chain becomes valine instead of glutamic acid.

Nonsynonymous Mutations
- An example of a nonsense mutation: in one patient with cystic fibrosis (Patient B), the substitution of a T for a C at nucleotide 1609 converted a glutamine codon (CAG) to a STOP codon (TAG). The protein produced by this patient had only the first 493 amino acids of the normal chain of 1480 and could not function.

Nonsynonymous Mutations
- The new nucleotide changes a codon that specified an amino acid to one of the STOP codons (TAA, TAG, or TGA).
- Therefore, translation of the messenger RNA transcribed from this mutant gene will stop prematurely.
- The earlier in the gene this occurs, the more truncated the protein product and the more likely that it will be unable to function.
- Mutations of this type are called nonsense mutations.

Insertions and Deletions (Indels)
- Example search: ADRB1[gene] AND human[orgn] AND "in-del"[snp_class]
- Base pairs may be added (insertions) or removed (deletions) from the DNA of a gene. The number can range from one to thousands.
- As a result, translation of the gene can be "frameshifted". Indels of three nucleotides or multiples of three may be less serious.
- Huntington's disease and the fragile X syndrome are examples of trinucleotide repeat diseases caused by insertion.
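The effect of indel length on the reading frame can be illustrated with a minimal Python sketch. The sequence `ref` and the helper names are hypothetical, chosen only for illustration.

```python
def codons(seq):
    """Split a sequence into complete codons, dropping a trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def delete(seq, pos, length):
    """Return seq with `length` bases removed starting at 0-based index `pos`."""
    return seq[:pos] + seq[pos + length:]


def is_frameshift(indel_length):
    """An indel shifts the reading frame unless its length is a multiple of three."""
    return indel_length % 3 != 0


ref = "ATGGAGGAGAAGTCTGCC"        # hypothetical coding sequence
one_base = delete(ref, 3, 1)      # 1-bp deletion: downstream codons are re-read
three_bases = delete(ref, 3, 3)   # 3-bp deletion: one codon lost, frame preserved

print(codons(ref))
print(codons(one_base), is_frameshift(1))
print(codons(three_bases), is_frameshift(3))
```

After the one-base deletion every codon downstream of the start codon differs from the reference, whereas the three-base deletion removes a single codon and leaves the rest intact.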

Silent and splice-site mutations
- For example, if the third base in the TCT codon for serine is changed to any one of the other three bases, serine will still be encoded. Such mutations are said to be silent because they cause no change in the protein (synonymous).
- Nucleotide signals at the splice sites guide the enzymatic machinery. If a mutation alters one of these signals, then the intron is not removed and remains as part of the final RNA molecule. This alters the sequence of the protein product.
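The three outcomes of a single-codon substitution discussed in these slides (silent, missense, nonsense) can be sketched in Python. The function name `classify_substitution` is hypothetical; the table is the standard genetic code written for DNA codons, and stop-loss changes are not treated separately in this sketch.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon, alt_codon):
    """Classify a codon change as 'silent', 'nonsense' or 'missense'."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "silent"        # same amino acid (synonymous)
    if alt_aa == "*":
        return "nonsense"      # new premature stop codon
    return "missense"          # different amino acid


print(classify_substitution("GAG", "GTG"))  # sickle-cell example: Glu -> Val
print(classify_substitution("CAG", "TAG"))  # cystic fibrosis example: Gln -> stop
print(classify_substitution("TCT", "TCC"))  # third-base change in a serine codon
```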

Types of Maps (see MapViewer)
- Cytogenetic
- Genetic Linkage
- Physical
- Radiation Hybrid
- Sequence-based

Genomic Map Elements
- DNA markers, PCR-based:
  - STS
- Polymorphic markers:
  - RFLPs, VNTRs, SNPs
- DNA clones:
  - BACs and PACs

Databases & Servers
- BLAT
- MapViewer
- GeneCards
- GeneLoc
- Stanford Source
- Bioinformatics Harvester

Predictive methods using DNA sequences, B&O: chapter 5
- Gene Prediction Methods
- Gene Prediction Programs
- How good are the methods?
- Promoter Analysis
- Strategies and Considerations
- Markov models
- HMMs in Gene Prediction
- Discriminant Analysis in Gene Prediction

Sequence Signals & Gene Structure

- UCSC Genome Browser
- Ensembl
- NCBI's Gene Viewer

What is Computational Gene Finding?
Given an uncharacterized DNA sequence, find out:
- Which region codes for a protein?
- Which DNA strand is used to encode the gene?
- Which reading frame is used in that strand?
- Where does the gene start and end?
- Where are the exon-intron boundaries in eukaryotes?
- (optionally) Where are the regulatory sequences for that gene?
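For intron-free sequences, the strand and reading-frame questions above can be sketched as a naive six-frame open reading frame (ORF) scan. This is only an illustration, not how the gene finders discussed later work; the function names and the `min_codons` threshold are hypothetical, and minus-strand coordinates are reported relative to the reverse-complemented sequence.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons):
    """Yield (frame, start, end) for ATG...stop ORFs; end is exclusive."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first in-frame start codon
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    yield frame, start, i + 3  # include the stop codon
                start = None


def find_orfs(seq, min_codons=2):
    """Scan both strands and all three frames of each strand."""
    seq = seq.upper()
    found = [("+", f, s, e) for f, s, e in orfs_in_strand(seq, min_codons)]
    found += [
        ("-", f, s, e)
        for f, s, e in orfs_in_strand(reverse_complement(seq), min_codons)
    ]
    return found


print(find_orfs("CCATGGAGAAGTAACC"))
```

In eukaryotes this approach breaks down because introns interrupt the coding frame, which is why the signal and content sensors described in the following slides are needed.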

Gene Prediction Methods
1. Searching by Signal
2. Searching by Content
3. Homology-based Gene Prediction
4. Comparative Gene Prediction
Ab initio, "intrinsic", "template" (1st and 2nd) vs "extrinsic", "look-up" (3rd and 4th)

Eukaryotes vs Prokaryotes
- Genes separated by intergenic DNA, coding exons separated by large introns vs ORFs adjacent to one another

Prokaryotic vs. Eukaryotic Gene Finding
Prokaryotes:
- small genomes, 0.5 to 10 × 10^6 bp
- high coding density (>90%)
- no introns
- Gene identification relatively easy, with success rate ~99%
- Problems: overlapping ORFs; short genes; finding TSS and promoters
Eukaryotes:
- large genomes, 10^7 to 10^10 bp
- low coding density (<50%)
- intron/exon structure
- Gene identification a complex problem, gene level accuracy ~50%
- Problems: many

Gene Structure

Gene Finding: Different Approaches
- Similarity-based methods (extrinsic): use similarity to annotated sequences:
  - proteins
  - cDNAs
  - ESTs
- Comparative genomics: aligning genomic sequences from different species
- Ab initio gene-finding (intrinsic)
- Integrated approaches

Similarity-based methods
- Based on sequence conservation due to functional constraints
- Use local alignment tools (Smith-Waterman algorithm, BLAST, FASTA) to search protein, cDNA, and EST databases
- Will not identify genes that code for proteins not already in databases (can identify ~50% new genes)
- Limits of the regions of similarity not well defined

Comparative Genomics
- Based on the assumption that coding sequences are more conserved than non-coding ones
- Two approaches:
  - intra-genomic (gene families)
  - inter-genomic (cross-species)
- Alignment of homologous regions
- Difficult to define limits of higher similarity
- Difficult to find the optimal evolutionary distance (patterns of conservation differ between loci)

Summary for Extrinsic Approaches
Strengths:
- Rely on accumulated pre-existing biological data, thus should produce biologically relevant predictions
Weaknesses:
- Limited to pre-existing biological data
- Errors in databases
- Difficult to find limits of similarity

Signal Sensors
- Signal: a string of DNA recognized by the cellular machinery

Signal Sensors
- Various pattern recognition methods are used for identification of these signals:
  - consensus sequences
  - weight matrices
  - weight arrays
  - decision trees
  - Hidden Markov Models (HMMs)
  - neural networks
  - …

Example of Consensus Sequence
- Obtained by choosing the most frequent base at each position of the multiple alignment of subsequences of interest
- Aligned sites: TACGAT, TATAAT, TATAAT, GATACT, TATGAT, TATGTT
- Consensus sequence: TATAAT
- Consensus (IUPAC): TATRNT
- Leads to loss of information and can produce many false positive or false negative predictions
- Word analogy: MELON, MANGO, HONEY, SWEET, COOKY give the consensus MONEY
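The plain (non-IUPAC) consensus can be reproduced with a short Python sketch. Breaking ties alphabetically is an assumption made here so that position 4, where A and G each occur three times, resolves to A as on the slide; the function name `consensus` is hypothetical.

```python
from collections import Counter


def consensus(aligned):
    """Most frequent symbol in each column; ties go to the alphabetically first."""
    result = []
    for column in zip(*aligned):
        counts = Counter(column)
        # max() keeps the first maximal item, so sorting first gives an
        # alphabetical tie-break.
        result.append(max(sorted(counts), key=counts.get))
    return "".join(result)


sites = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
words = ["MELON", "MANGO", "HONEY", "SWEET", "COOKY"]

print(consensus(sites))
print(consensus(words))
```

Note that none of the five words equals their consensus, which illustrates the loss of information mentioned on the slide.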

Example of (Positional) Weight Matrix
- Computed by measuring the frequency of every element at every position of the site (weight)
- The score for any putative site is the sum of the matrix values (converted to probabilities) for that sequence (log-likelihood score)
- Disadvantages:
  - cut-off value required
  - assumes independence between adjacent bases

Aligned sites: TACGAT, TATAAT, TATAAT, GATACT, TATGAT, TATGTT

Position  1  2  3  4  5  6
A         0  6  0  3  4  0
C         0  0  1  0  1  0
G         1  0  0  3  0  0
T         5  0  5  0  1  6
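A minimal Python sketch of building the count matrix and scoring a candidate site follows. It uses the six aligned sites from the consensus example (TATAAT occurs twice). The pseudocount of 1 and the uniform background of 0.25 are assumptions added here to avoid taking the logarithm of zero; they are not part of the slide.

```python
import math

SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]


def count_matrix(sites):
    """Count each base at each position of the aligned sites."""
    length = len(sites[0])
    counts = {base: [0] * length for base in "ACGT"}
    for site in sites:
        for pos, base in enumerate(site):
            counts[base][pos] += 1
    return counts


def log_odds_score(candidate, counts, n_sites, pseudocount=1.0, background=0.25):
    """Sum of per-position log2(probability / background) values."""
    score = 0.0
    for pos, base in enumerate(candidate):
        prob = (counts[base][pos] + pseudocount) / (n_sites + 4 * pseudocount)
        score += math.log2(prob / background)
    return score


counts = count_matrix(SITES)
for base in "ACGT":
    print(base, counts[base])
print(log_odds_score("TATAAT", counts, len(SITES)))
print(log_odds_score("GGGGGG", counts, len(SITES)))
```

Because the score is a sum over positions, each base contributes independently of its neighbours, which is the independence assumption listed among the disadvantages; a cut-off on the score is still needed to call a site.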

Example of Decision Tree

Markov Models

Ingredients of a Markov Model
- Collection of states: $\{S_1, S_2, \ldots, S_N\}$
- State transition probabilities (transition matrix): $A_{ij} = P(q_{t+1} = S_i \mid q_t = S_j)$
- Initial state distribution: $\pi_i = P(q_1 = S_i)$
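With these ingredients, the probability of a state path is the initial probability of the first state multiplied by one transition probability per step. A minimal Python sketch, with made-up state names and probabilities; the dictionary is indexed as `trans[current][next]`, i.e. it stores $P(q_{t+1} \mid q_t)$ with the conditioning state first.

```python
def path_probability(path, init, trans):
    """P(path) = init[q1] * product of trans[q_t][q_{t+1}] over consecutive states."""
    prob = init[path[0]]
    for current, following in zip(path, path[1:]):
        prob *= trans[current][following]
    return prob


# Hypothetical two-state chain; the numbers are illustrative only.
init = {"exon": 0.5, "intron": 0.5}
trans = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.2, "intron": 0.8},
}

print(path_probability(["exon", "exon", "intron"], init, trans))  # 0.5 * 0.9 * 0.1
```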

Hidden Markov Models

Ingredients of a HMM
- Collection of states: $\{S_1, S_2, \ldots, S_N\}$
- State transition probabilities (transition matrix): $A_{ij} = P(q_{t+1} = S_i \mid q_t = S_j)$
- Initial state distribution: $\pi_i = P(q_1 = S_i)$
- Observations: $\{O_1, O_2, \ldots, O_M\}$
- Observation probabilities: $B_j(k) = P(v_t = O_k \mid q_t = S_j)$
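Given these ingredients, the most probable hidden state path for an observed sequence is usually found with the Viterbi algorithm. The sketch below is a generic log-space Viterbi in Python, not the implementation of any particular gene finder; the two states ("H" for GC-rich, "L" for AT-rich) and all probabilities are made up for illustration, and `trans[p][s]` stores the probability of moving from state `p` to state `s`.

```python
import math


def viterbi(obs, states, init, trans, emit):
    """Return the most probable state path for the observed symbols."""
    # best[s]: log-probability of the best path that ends in state s
    best = {s: math.log(init[s]) + math.log(emit[s][obs[0]]) for s in states}
    backpointers = []
    for symbol in obs[1:]:
        previous, best, pointers = best, {}, {}
        for s in states:
            # choose the predecessor that maximises the path score into s
            prev_state = max(
                states, key=lambda p: previous[p] + math.log(trans[p][s])
            )
            best[s] = (
                previous[prev_state]
                + math.log(trans[prev_state][s])
                + math.log(emit[s][symbol])
            )
            pointers[s] = prev_state
        backpointers.append(pointers)
    # trace back from the best final state
    path = [max(states, key=lambda s: best[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


states = ["H", "L"]
init = {"H": 0.5, "L": 0.5}
trans = {"H": {"H": 0.9, "L": 0.1}, "L": {"H": 0.1, "L": 0.9}}
emit = {
    "H": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "L": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

print("".join(viterbi("GGCCATAT", states, init, trans, emit)))
```

Working in log space avoids numerical underflow on long sequences, which matters for genomic-scale input.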

Examples of Gene Finders
- FGENES: linear discriminant function (DF) for content and signal sensors, and dynamic programming (DP) for finding the optimal combination of exons
- GeneMark: HMMs enhanced with ribosomal binding site recognition
- Genie: neural networks for splicing, HMMs for coding sensors, overall structure modeled by HMM
- Genscan: weight matrices (WM), weight arrays (WA) and decision trees as signal sensors, HMMs for content sensors, overall HMM
- HMMgene: HMM trained using conditional maximum likelihood
- Morgan: decision trees for exon classification, also Markov models
- MZEF: quadratic DF, predicts only internal exons

Genscan Example
- Developed by Chris Burge, 1997
- One of the most accurate ab initio programs
- Uses an explicit state duration HMM to model gene structure (different length distributions for exons)
- Different model parameters for regions with different GC content

Ab initio Gene Finding is Difficult
- Genes are separated by large intergenic regions
- Genes are not continuous, but split into a number of (small) coding exons, separated by (larger) non-coding introns
  - in humans, coding sequence comprises only a few percent of the genome and an average of 5% of each gene
- Sequence signals that are essential for elucidation of a gene structure are degenerate and highly unspecific
- Alternative splicing
- Repeat elements (>50% in humans), some of which contain coding regions

Problems with Ab initio Gene Finding
- No biological evidence
- In long genomic sequences, many false positive predictions
- Prediction accuracy high, but not sufficient

Evaluation of Gene Finding Programs
- Calculating accuracy of programs' predictions
- Many evaluation studies, among the earliest:
  - Burset and Guigó, 1996 (vertebrate sequences)
  - Pavy et al., 1999 (Arabidopsis thaliana)
  - Rogic et al., 2001 (mammalian sequences)

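Measures of Prediction Accuracy: nucleotide level accuracy
- Each nucleotide is classified by comparing REALITY with PREDICTION: TP (coding, predicted coding), FN (coding, predicted non-coding), FP (non-coding, predicted coding), TN (non-coding, predicted non-coding)
- Sensitivity: $Sn = \frac{TP}{TP + FN}$
- Specificity: $Sp = \frac{TP}{TP + FP}$

Measures of Prediction Accuracy, Part 2: exon level accuracy
- Sensitivity = number of correct exons / number of actual exons
- Specificity = number of correct exons / number of predicted exons
- [Figure: REALITY vs PREDICTION, showing a wrong exon, a correct exon and a missing exon]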
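Exon level accuracy can be sketched in Python as follows. The definitions used here are assumptions in the spirit of the slide: a predicted exon is "correct" only if both of its boundaries match an actual exon exactly, a "missing" exon is an actual exon that no predicted exon overlaps, and a "wrong" exon is a predicted exon that overlaps no actual exon. The coordinates are made up.

```python
def overlaps(a, b):
    """True if two (start, end) intervals with inclusive ends share a position."""
    return a[0] <= b[1] and b[0] <= a[1]


def exon_level_accuracy(actual, predicted):
    """Exon-level sensitivity and specificity plus missing and wrong exon counts."""
    actual, predicted = set(actual), set(predicted)
    correct = actual & predicted  # exact match of both boundaries
    missing = [a for a in actual if not any(overlaps(a, p) for p in predicted)]
    wrong = [p for p in predicted if not any(overlaps(p, a) for a in actual)]
    return {
        "sensitivity": len(correct) / len(actual),
        "specificity": len(correct) / len(predicted),
        "missing": len(missing),
        "wrong": len(wrong),
    }


actual = [(100, 200), (300, 400), (500, 600)]
predicted = [(100, 200), (300, 410), (700, 800)]
result = exon_level_accuracy(actual, predicted)
print(result)
```

In this example the second predicted exon overlaps a real exon but misses its end boundary, so it counts as neither correct, missing, nor wrong.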

Integrated Approaches for Gene Finding
- Programs that integrate results of similarity searches with ab initio techniques (GenomeScan, FGENESH+, Procrustes)
- Programs that use synteny between organisms (ROSETTA, SLAM)
- Integration of programs predicting different elements of a gene (EuGène)
- Combining predictions from several gene finding programs (combination of experts)

Combining Programs' Predictions
- The set of methods used and the way they are integrated differ between individual programs
- Different programs often predict different elements of an actual gene
  - they could complement each other, yielding a better prediction

Gene Prediction Links
- http://genome.imim.es/geneid.html
- http://genes.mit.edu/GENSCAN.html
- FGENES (commercial, but can be tried): http://www.softberry.com/berry.phtml?topic=fgenes&group=programs&subgroup=gfind

GeneID
- Hierarchical approach:
  - Splice sites and stop codons are predicted and scored using position-specific weight matrices
  - Exons are built from the identified "defining" sites and scored as the sum of the scores of the defining sites plus the score of their coding potential
  - The gene structure is assembled by maximizing the total score
- Later versions of the program add sequence similarity searches

Genscan, FGENES, Genewise
- Genscan: program based on an underlying Hidden Markov Model
- FGENES: linear discriminant analysis to identify splice sites, exons, promoter elements
- Genewise: compares a genomic sequence with a protein sequence or with HMMs representing protein sequences

How good are the methods?
- Different methods, different results. How to measure accuracy?
- Sensitivity: proportion of actual coding nucleotides, exons, or genes that are predicted correctly (true positives)
- Specificity: proportion of predicted elements or genes that are real
- Correlation coefficient combines both
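At the nucleotide level these three measures can be computed from a real and a predicted annotation, as in the Python sketch below. The annotations are made-up strings in which "C" marks a coding and "N" a non-coding position. The slide does not give a formula for the correlation coefficient; the sketch assumes the standard correlation coefficient of a 2×2 table, $CC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FN)(TN+FP)(TP+FP)(TN+FN)}}$.

```python
import math


def confusion_counts(real, predicted):
    """Count TP, FN, FP, TN over aligned 'C'/'N' annotation strings."""
    tp = fn = fp = tn = 0
    for r, p in zip(real, predicted):
        if r == "C" and p == "C":
            tp += 1
        elif r == "C":
            fn += 1
        elif p == "C":
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def nucleotide_accuracy(real, predicted):
    """Return (sensitivity, specificity, correlation coefficient)."""
    tp, fn, fp, tn = confusion_counts(real, predicted)
    sn = tp / (tp + fn)   # fraction of real coding positions that were found
    sp = tp / (tp + fp)   # fraction of predicted coding positions that are real
    cc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    )
    return sn, sp, cc


real = "NNCCCCNNNN"       # hypothetical true annotation
predicted = "NNCCCNNNCN"  # hypothetical prediction
print(nucleotide_accuracy(real, predicted))
```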

Screening Test for Occult Cancer
- SENSITIVITY: of 100 patients with occult cancer, 95 have "x" in their blood
- SPECIFICITY: of 100 patients without occult cancer, 95 do not have "x" in their blood
- PREVALENCE: 5 out of every 1000 randomly selected individuals will have occult cancer

2 X 2 Table

              Occult Cancer Present   Occult Cancer Absent   Total
"x" present   475                     4,975                  5,450
"x" absent    25                      94,525                 94,550
Total         500                     99,500                 100,000

If a patient has "x" in his blood, the chance of occult cancer is 475 / 5,450 = 8.7%
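The table can be rebuilt from the three quantities on the previous slide (sensitivity 95%, specificity 95%, prevalence 5 per 1000) for a population of 100,000. A minimal Python sketch; the function name is hypothetical, and counts are rounded to whole people.

```python
def screening_table(population, prevalence, sensitivity, specificity):
    """Return (TP, FN, FP, TN) counts for a screened population."""
    diseased = round(population * prevalence)
    healthy = population - diseased
    tp = round(diseased * sensitivity)   # diseased and test positive
    fn = diseased - tp                   # diseased but test negative
    tn = round(healthy * specificity)    # healthy and test negative
    fp = healthy - tn                    # healthy but test positive
    return tp, fn, fp, tn


tp, fn, fp, tn = screening_table(100_000, 0.005, 0.95, 0.95)
ppv = tp / (tp + fp)  # chance of disease given a positive test
print(tp, fn, fp, tn)
print(f"{ppv:.1%}")
```

Even with a 95% sensitive and 95% specific test, a positive result is usually a false positive when the condition is rare, because the 4,975 false positives outnumber the 475 true positives.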

Standard Terminology

                Disease Present         Disease Absent          Total
Test positive   True Positives (TP)     False Positives (FP)    TP + FP
Test negative   False Negatives (FN)    True Negatives (TN)     FN + TN
Total           TP + FN                 FP + TN                 Entire Population

Definitions
- SENSITIVITY $= \frac{TP}{TP + FN} = P(T^+ \mid D^+)$
- SPECIFICITY $= \frac{TN}{FP + TN} = P(T^- \mid D^-)$
- PV+ (PREDICTIVE VALUE) $= \frac{TP}{TP + FP} = P(D^+ \mid T^+)$

What is a "Positive Test"?
- All the analysis has assumed that it is clear whether a test is positive or negative
- In reality, many tests involve continuous values so that one result may be "more positive" than another
- How should one define the cut-off at which a test is judged to be abnormal?

Continuously Valued Variables
[Figure: distributions of test results for the Normal and Diseased groups with a "Normal" cutoff; regions labeled True Negatives, False Negatives, False Positives, True Positives]

Continuously Valued Variables
- A more "conservative" cutoff: fewer false positives, more false negatives
- Higher specificity, lower sensitivity

Continuously Valued Variables
- A more "aggressive" cutoff: fewer false negatives, more false positives
- Higher sensitivity, lower specificity
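The trade-off between the two cutoffs can be sketched in Python. The test values below are made up, and the sketch assumes that higher values are more abnormal, so a result counts as positive when it is at or above the cutoff. Specificity is used here in its diagnostic sense, TN / (FP + TN).

```python
def sensitivity_specificity(normal, diseased, cutoff):
    """Sensitivity and specificity when values >= cutoff are called positive."""
    true_positives = sum(value >= cutoff for value in diseased)
    true_negatives = sum(value < cutoff for value in normal)
    return true_positives / len(diseased), true_negatives / len(normal)


# Hypothetical, overlapping test results for the two groups.
normal = [1, 2, 3, 4, 5, 6]
diseased = [4, 5, 6, 7, 8, 9]

for cutoff in (4, 7):
    sens, spec = sensitivity_specificity(normal, diseased, cutoff)
    print(f"cutoff={cutoff}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

The low ("aggressive") cutoff catches every diseased case at the cost of false positives, while the high ("conservative") cutoff removes all false positives but misses half of the diseased cases.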

More on Projects:
- Vaccine development (Ramya): http://immunax.dfci.harvard.edu/PEPVAC/
- HIV & the Black Plague (Harshal)
  - CCR5: chemokine (C-C motif) receptor 5
  - HIV drug resistance: http://hivdb.stanford.edu/
- Gene Annotation (Chris)
- Pharmacogenomics (Jennifer)

More on Projects:
- Disease network (Jyoti)
- Disease networks, gout (Annie)
- Genotyping (Nancy)
- Physiological Genomics (Erin)
- Perl program for protein structure analysis? (Harmeet)

More on Projects:
- Human Genetic Variation (Priyanka)
- Cloning, humans (Parag)
- Evolution (Sukhpreet)
- Metabolic engineering (Danh)
- Protein structure (Tanzeema)
