Genomics 101 DNA sequencing Alignment Gene identification

Presentation transcript:

Genomics 101: DNA sequencing, alignment, gene identification, gene expression, genome evolution, …
In this course a lot of our focus will be on biological sequences, and especially DNA, which is the subject of genomics and really the key to understanding life at the molecular level. An average human is composed of trillions of cells, with some small variation across individuals. The same holds for our close relatives. Each cell contains a nucleus, which holds a copy of our entire DNA….

Next Few Topics
  Gene Recognition
    Finding genes in DNA with computational methods
  Large-scale alignment & multiple alignment
    Comparing whole genomes, or large families of genes
  Gene Expression and Regulation
    Measuring the expression of many genes at a time
    Finding elements in DNA that control the expression of genes

Gene Recognition Credits for slides: Marina Alexandersson Lior Pachter Serge Saxonov

Reading: GENSCAN, EasyGene, SLAM, Twinscan. Optional: Chris Burge’s thesis

Gene expression
  DNA, transcription, RNA, translation, Protein
  DNA:     CCTGAGCCAACTATTGATGAA
  RNA:     CCUGAGCCAACUAUUGAUGAA
  Protein: PEPTIDE
But first some quick genetics to make sure that we are all on the same page. Genes are expressed when the DNA in our chromosomes is transcribed into RNA. Basically an enzyme attaches to the DNA molecule and copies it, creating an RNA molecule. Then the RNA is translated, one triplet at a time, into amino acids, creating a peptide, which in its finished form becomes a protein. The protein is the end product of the gene, and thus the expression of the gene.
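
As a minimal sketch of the two steps described above, the slide's own sequence can be transcribed and translated in a few lines of Python. The function names are illustrative, and the codon table is deliberately restricted to the seven codons that appear on the slide (a full table has 64 entries). Transcription is simplified here to copying the coding strand with T replaced by U.

```python
# Subset of the standard genetic code: only the codons used on the slide.
CODON_TO_AA = {
    "CCU": "P", "GAG": "E", "CCA": "P", "ACU": "T",
    "AUU": "I", "GAU": "D", "GAA": "E",
}

def transcribe(dna):
    """Return the mRNA matching the coding strand (T replaced by U)."""
    return dna.upper().replace("T", "U")

def translate(rna):
    """Translate codon by codon; codons missing from the table become 'X'."""
    usable = len(rna) - len(rna) % 3          # ignore a trailing partial codon
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TO_AA.get(c, "X") for c in codons)

dna = "CCTGAGCCAACTATTGATGAA"
rna = transcribe(dna)
print(rna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(rna))  # PEPTIDE
```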

Gene structure
  DNA: exon1, intron1, exon2, intron2, exon3
  Steps: transcription, splicing, translation
  Codon: a triplet of nucleotides that is converted to one amino acid
  exon = protein-coding, intron = non-coding
Genes are structured as coding pieces called exons, the part that becomes amino acids, separated by non-coding stretches of sequence called introns. When the gene is transcribed the whole thing becomes an RNA molecule, including the non-coding sequence between the exons, and then these introns are cut out in a process called splicing. The remaining exons are joined together and translated into a protein.
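
Splicing as described above amounts to keeping the exon intervals and discarding everything between them. The sketch below assumes exons are given as 0-based, half-open (start, end) coordinates; the function name and the toy transcript (exons in upper case, introns in lower case) are made up for illustration.

```python
def splice(pre_mrna, exons):
    """Join the exon segments of a transcript, dropping the introns."""
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))

# Toy transcript: three exons separated by two introns (invented sequence).
pre_mrna = "ATGGCC" + "gtaagtttcag" + "GATTAC" + "gtgagtccag" + "AAATAA"
exons = [(0, 6), (17, 23), (33, 39)]

mature = splice(pre_mrna, exons)
print(mature)  # ATGGCCGATTACAAATAA
```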

Where are the genes?

In humans: ~22,000 genes; coding sequence makes up ~1.5% of human DNA

Finding Genes
  Exploit the regular gene structure
    ATG-Exon1-Intron1-Exon2-…-ExonN-STOP
  Recognize “coding bias”
    CAG-CGA-GAC-TAT-TTA-GAT-AAC-ACA-CAT-GAA-…
  Recognize splice sites
    Intron-cAGt-Exon-gGTgag-Intron
  Model the duration of regions
    Introns tend to be much longer than exons, in mammals
    Exons are biased to have a given minimum length
  Use cross-species comparison
    Gene structure is conserved in mammals
    Exons are more similar (~85%) than introns

Approaches to gene finding
  Homology: BLAST, Procrustes.
  Ab initio: Genscan, Genie, GeneID.
  Hybrids: GenomeScan, GenieEST, Twinscan, SGP, ROSETTA, CEM, TBLASTX, SLAM.
Annotating features of biological importance is relatively straightforward for organisms with compact genomes such as bacteria and yeast, because exons tend to be large and introns small or non-existent. In large genomes such as those of mammals, the coding signal is scattered in a vast sea of non-coding noise.
The crudest, yet often most reliable, approach has been to look for exons manually by integrating and filtering various sources of homology information. This relies on the fact that exons and regulatory regions tend to be more strongly conserved by evolution than random genomic sequence. The tool most frequently used is BLAST. Procrustes is a program that attempts to integrate BLAST-based gene finding with ab initio methods.
Ab initio approaches use only information about the input sequence itself to identify likely splice sites and to detect differences in sequence composition between coding and non-coding DNA. One of the best of this sort is GENSCAN, which uses a hidden Markov model to scan large genomic sequences.
Then there are various forms of hybrids. Here we mean software that takes extra information from a second sequence or from a database. GenomeScan and GenieEST use protein and EST databases, respectively, to confirm potential exons. Twinscan and SGP both build on a single-species algorithm, take two sequences as input, and use the second to boost the exon probability where the alignment is good. ROSETTA is the first cross-species gene finder; it uses an anchor-based approach to determine homologous exons and then a heuristic scoring scheme for the final prediction. CEM is similar to ROSETTA. TBLASTX is a variant of BLAST that takes a nucleotide query and a nucleotide database as input, but translates both the query and the database into peptides in all reading frames before searching for matches. SLAM uses a generalized pair HMM, that is, a merging of the Genscan HMM and a pair HMM of the kind usually used for alignment.

1. Exploit the regular gene structure
  5’ Exon 1, Intron 1, Exon 2, Intron 2, Exon 3 3’
  Start codon: ATG
  Stop codon: TAG/TGA/TAA
  Splice sites at the exon-intron boundaries
Predicting genes means giving coordinates for the exon boundaries. The first kind of information that prediction algorithms use is the regular structure of a gene. The coding region of every gene starts with an ATG codon, and then exons alternate with introns; at the exon-intron boundaries, the splice sites, there are short words that are approximately preserved.

Next Exon: Frame 0 Next Exon: Frame 1

2. Recognize “coding bias”
  Each exon can be in one of three frames
    ag-gattacagattacagattaca-gtaag, read in Frame 0, Frame 1, or Frame 2
  Frame of next exon depends on how many nucleotides are left over from previous exon
  Codons “tag”, “tga”, and “taa” are STOP
    No STOP codon appears in-frame, until end of gene
    A stretch with no in-frame STOP is called an open reading frame (ORF)
  Different codons appear with different frequencies: coding bias
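
The frame bookkeeping above can be sketched as follows. This is one possible convention, not necessarily the one used on the slide: `carry` is assumed to be the number of bases of an unfinished codon carried into an exon, and `offset` the position at which the first complete codon starts. Both function names are illustrative.

```python
STOPS = {"TAG", "TGA", "TAA"}

def next_carry(carry, exon_length):
    """Bases of an unfinished codon left over at the end of an exon.

    `carry` is the number of bases (0, 1 or 2) carried in from the
    previous exon; the result is what the next exon has to complete.
    """
    return (carry + exon_length) % 3

def has_in_frame_stop(seq, offset=0):
    """True if a stop codon occurs when reading codons from `offset`."""
    seq = seq.upper()
    return any(seq[i:i + 3] in STOPS for i in range(offset, len(seq) - 2, 3))

exon = "gattacagattacagattaca"
# An exon is an open reading frame in a given frame if no stop is read there.
print([has_in_frame_stop(exon, f) for f in range(3)])  # [False, False, False]
print(next_carry(0, 22), next_carry(1, 22))            # 1 2
```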

2. Recognize “coding bias”
  Amino acid      SLC   DNA codons
  Isoleucine      I     ATT, ATC, ATA
  Leucine         L     CTT, CTC, CTA, CTG, TTA, TTG
  Valine          V     GTT, GTC, GTA, GTG
  Phenylalanine   F     TTT, TTC
  Methionine      M     ATG
  Cysteine        C     TGT, TGC
  Alanine         A     GCT, GCC, GCA, GCG
  Glycine         G     GGT, GGC, GGA, GGG
  Proline         P     CCT, CCC, CCA, CCG
  Threonine       T     ACT, ACC, ACA, ACG
  Serine          S     TCT, TCC, TCA, TCG, AGT, AGC
  Tyrosine        Y     TAT, TAC
  Tryptophan      W     TGG
  Glutamine       Q     CAA, CAG
  Asparagine      N     AAT, AAC
  Histidine       H     CAT, CAC
  Glutamic acid   E     GAA, GAG
  Aspartic acid   D     GAT, GAC
  Lysine          K     AAA, AAG
  Arginine        R     CGT, CGC, CGA, CGG, AGA, AGG
  Stop codons     Stop  TAA, TAG, TGA
  Can map 61 non-stop codons to frequencies & take log-odds ratios
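
One way to read "map 61 non-stop codons to frequencies & take log-odds ratios" is sketched below. The training sequence is invented, the background is assumed to be uniform over the 61 sense codons, and a pseudocount is added so that unseen codons do not give a zero frequency; none of these choices comes from the slides.

```python
import math
from itertools import product

STOPS = {"TAA", "TAG", "TGA"}
SENSE_CODONS = ["".join(c) for c in product("ACGT", repeat=3)
                if "".join(c) not in STOPS]        # the 61 non-stop codons

def codon_frequencies(seqs, pseudocount=1.0):
    """Relative frequency of each sense codon in in-frame sequences."""
    counts = {c: pseudocount for c in SENSE_CODONS}
    for seq in seqs:
        seq = seq.upper()
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in counts:                    # skips stop codons
                counts[codon] += 1
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def log_odds_score(seq, coding, background):
    """Sum of log2(coding / background) over the in-frame codons of seq."""
    seq = seq.upper()
    score = 0.0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in coding:
            score += math.log2(coding[codon] / background[codon])
    return score

coding = codon_frequencies(["GAAGAGGAAGAC"])       # toy training data
background = {c: 1.0 / len(SENSE_CODONS) for c in SENSE_CODONS}
print(log_odds_score("GAAGAG", coding, background) > 0)   # True
print(log_odds_score("TTTTTT", coding, background) < 0)   # True
```

A positive score means the codons look more like the training coding sequence than like the background; a real gene finder would estimate both tables from much larger annotated data.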

atg caggtg ggtgag cagatg ggtgag cagttg ggtgag caggcc ggtgag tga

Biology of Splicing (http://genes.mit.edu/chris/)

3. Recognize splice sites
  How much info? Donor: 7.9 bits. Acceptor: 9.4 bits (Stephens & Schneider, 1996)
  Find branch sites
  (http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
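
The "bits" quoted above measure how far each position of an aligned site departs from a uniform base distribution. A minimal sketch, assuming the usual definition of 2 minus the Shannon entropy per position and ignoring the small-sample correction used in sequence-logo work, is shown here. The four donor sites are made up, so the printed total is not expected to match the 7.9 bits on the slide.

```python
import math

def position_information(column):
    """Information in bits of one alignment column: 2 - Shannon entropy."""
    counts = [column.count(b) for b in "ACGT"]
    total = sum(counts)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts if n > 0)
    return 2.0 - entropy

def site_information(aligned_sites):
    """Total information of equal-length aligned sites, summed over positions."""
    return sum(position_information("".join(col)) for col in zip(*aligned_sites))

# Invented donor-like sites, for illustration only.
donors = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTGAGT"]
print(round(site_information(donors), 2))
```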

3. Recognize splice sites: donor site, nucleotide percentage (%) by position, 5’ to 3’

3. Recognize splice sites
  WMM: weight matrix model = PSSM (Staden 1984)
  WAM: weight array model = 1st order Markov (Zhang & Marr 1993)
  MDD: maximal dependence decomposition (Burge & Karlin 1997)
    Decision-tree algorithm to take pairwise dependencies into account
    For each position i, calculate $S_i = \sum_{j \neq i} \chi^2(C_i, X_j)$
    Choose i* such that $S_{i^*}$ is maximal and partition into two subsets, until
      no significant dependencies are left, or
      there are not enough sequences in the subset
    Train separate WMM models for each subset
  MDD tree:
    All donor splice sites
      not G5
      G5
        G5, not G-1
        G5 G-1
          G5 G-1, not A2
          G5 G-1 A2
            G5 G-1 A2, not U6
            G5 G-1 A2 U6
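
The simplest of the three models, the WMM, treats positions as independent. The sketch below trains per-position base probabilities from aligned sites and scores a candidate by log-odds against a uniform background. The training sites, the pseudocount and the function names are assumptions made for illustration; WAM and MDD are not implemented here.

```python
import math

def train_wmm(sites, pseudocount=1.0):
    """Per-position base probabilities from equal-length aligned sites."""
    model = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        model.append({b: n / total for b, n in counts.items()})
    return model

def wmm_log_odds(model, candidate, background=0.25):
    """Log2-odds of the candidate under the WMM versus a uniform background."""
    return sum(math.log2(model[i][b] / background)
               for i, b in enumerate(candidate))

# Invented donor-like training sites, for illustration only.
donors = ["AAGGTAAGT", "CAGGTGAGT", "AAGGTAAGA", "GAGGTGAGT"]
wmm = train_wmm(donors)
print(wmm_log_odds(wmm, "AAGGTAAGT") > wmm_log_odds(wmm, "TTTCCCTTT"))  # True
```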

4. Model the duration of regions
The algorithm has to be generalized because in a standard HMM each step outputs one base, which leads to state durations that follow a geometric distribution. If we look at empirical data, such as these plots, we see that intron lengths follow the geometric distribution fairly well, but for exons it would be a pretty bad model. So in our state space the exons are generalized states, which choose a length from a general distribution and output the entire exon in one step, while the intron and intergene states still output one base at a time and follow the geometric distribution.
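
The geometric duration mentioned above follows directly from a state that loops back to itself with a fixed probability. In the sketch below, `p_stay` is a hypothetical self-transition probability chosen only for illustration.

```python
def geometric_duration(p_stay, d):
    """P(a state lasts exactly d steps) given self-transition prob. p_stay."""
    return p_stay ** (d - 1) * (1.0 - p_stay)

p_stay = 0.99                       # hypothetical value
mean_length = 1.0 / (1.0 - p_stay)  # expected duration of the state
print(round(mean_length))           # 100

# The probability only ever decreases with d, so the most likely length is 1.
probs = [geometric_duration(p_stay, d) for d in range(1, 6)]
print(all(a > b for a, b in zip(probs, probs[1:])))  # True
```

Because the most probable length is always a single base, this shape cannot reproduce a length distribution that peaks at some typical exon size, which is the motivation for the generalized exon states.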

Hidden Markov Models for Gene Finding
  States: First Exon State, Intron State, Intergene State
  GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
And that’s what we want to do when predicting genes. A Markov chain is a random process where the next step depends only on where you are at the moment. For the dice, which die to roll next depended only on which one you had just rolled and not on the history. A hidden Markov model simply means that you don’t observe the state sequence, the sequence of rolled dice, directly, but something that depends on it, and so the state sequence is hidden from you. In the gene finding context you observe the DNA sequence, and the state sequence you want to determine consists of exon, intron and intergene states. That is, you want to determine for each DNA base which state it belongs to, and thus predict the exon boundaries.
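
Determining the most probable hidden state for each base is what the Viterbi algorithm does. The sketch below is a plain log-space Viterbi for a standard HMM, not the generalized model used by real gene finders. All probabilities in the three-state toy model are invented for illustration and must be non-zero for the logarithms to be defined.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for obs (log-space Viterbi)."""
    log = math.log
    score = [{s: log(start_p[s]) + log(emit_p[s][obs[0]]) for s in states}]
    back = []                                   # back[t-1][s] = best predecessor
    for t in range(1, len(obs)):
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda r: score[-1][r] + log(trans_p[r][s]))
            row[s] = score[-1][prev] + log(trans_p[prev][s]) + log(emit_p[s][obs[t]])
            ptr[s] = prev
        score.append(row)
        back.append(ptr)
    state = max(states, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):                  # trace back from the last base
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Toy model with invented parameters.
states = ("intergene", "exon", "intron")
start_p = {"intergene": 0.8, "exon": 0.1, "intron": 0.1}
trans_p = {
    "intergene": {"intergene": 0.90, "exon": 0.09, "intron": 0.01},
    "exon":      {"intergene": 0.05, "exon": 0.90, "intron": 0.05},
    "intron":    {"intergene": 0.01, "exon": 0.09, "intron": 0.90},
}
emit_p = {
    "intergene": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "exon":      {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "intron":    {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
}
dna = "GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA"
path = viterbi(dna, states, start_p, trans_p, emit_p)
symbol = {"intergene": "-", "exon": "E", "intron": "i"}
print(dna)
print("".join(symbol[s] for s in path))   # one state label per base
```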

Duration HMM for Gene Finding
  Duration modeling
    Introns: regular HMM states, geometric duration
    Exons: special duration model
    $V_{E_{0,0}}(i) = \max_{d=1 \ldots D} \left\{ \Pr[\mathrm{duration}(E_{0,0}) = d] \times a_{\mathrm{Intron}_0, E_{0,0}} \times \prod_{j=i-d+1}^{i} e_{E_{0,0}}(x_j) \right\}$
    where i is an admissible exon-ending state, D is restricted by the longest ORF
  GENSCAN: Chris Burge and Sam Karlin, 1997
    Best performing de novo gene finder
    HMM with duration modeling for Exon states
This is the state space of the generalized HMM used by, for instance, Genscan and Genie, which perform single-species gene finding. The hidden states are the exons in red and the introns and intergene in green. The reason we have so many states is that the sequence is translated into protein in triplets, so there are three different ways to translate the same sequence, or three different reading frames. The end product has to be a sequence whose length is divisible by three, and if one exon ends in the middle of a codon, that codon has to be finished at the beginning of the next exon. Thus in each exon we would have to remember the number of extra bases in the previous exon, that is, two states ago, which violates the Markov assumption that the transition to the next state depends only on the current state. If we instead have one intron state for each reading frame, we carry that information with us, so to speak, and preserve the Markov property.
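
The maximization over exon lengths d in the formula above can be sketched in log space as follows. Only the terms that appear on the slide are computed; a complete decoder would also combine each candidate with the best score of the preceding intron state at position i-d. The function name, the argument layout and the toy numbers are assumptions for illustration.

```python
import math

def exon_step(x, i, log_duration, log_a, log_emit, max_d):
    """Best (log-score, d) for an exon ending at 0-based position i.

    Maximizes log P(duration = d) + log a + sum of log e(x_j) for
    j = i-d+1 .. i, over the durations d present in log_duration.
    """
    best = (float("-inf"), None)
    emit_sum = 0.0
    for d in range(1, min(max_d, i + 1) + 1):
        emit_sum += log_emit[x[i - d + 1]]      # extend the exon one base left
        if d in log_duration:
            candidate = log_duration[d] + log_a + emit_sum
            if candidate > best[0]:
                best = (candidate, d)
    return best

# Toy inputs: uniform emissions and two allowed exon lengths (invented).
x = "ACGTAC"
log_emit = {b: math.log(0.25) for b in "ACGT"}
log_duration = {3: math.log(0.5), 6: math.log(0.5)}
score, d = exon_step(x, i=5, log_duration=log_duration, log_a=0.0,
                     log_emit=log_emit, max_d=6)
print(d)  # 3
```

Accumulating the emission sum while d grows avoids recomputing the product for every candidate length, which matters because D can be as long as the longest ORF.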

HMM-based Gene Finders
  GENSCAN (Burge 1997)
    Big jump in accuracy of de novo gene finding
    Currently, one of the best
    HMM with duration modeling for Exon states
  FGENESH (Solovyev 1997)
    Currently one of the best
  HMMgene (Krogh 1997)
  GENIE (Kulp 1996)
  GENMARK (Borodovsky & McIninch 1993)
  VEIL (Henderson, Salzberg, & Fasman 1997)

Better way to do it: negative binomial
  EasyGene: prokaryotic gene finder (Larsen TS, Krogh A)
    Negative binomial with n = 3
  Non-parametric duration density is too expensive
  Better to use some alternative density, such as negative binomial
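
A negative binomial duration arises when n identical states, each with a geometric duration, are visited in series. The sketch below assumes that construction and uses a made-up self-transition probability; with n = 1 it reduces to the geometric case.

```python
from math import comb

def negative_binomial_duration(d, n, p_stay):
    """P(total duration = d) for n serial states with self-loop prob. p_stay."""
    if d < n:
        return 0.0                       # each state lasts at least one step
    return comb(d - 1, n - 1) * (1.0 - p_stay) ** n * p_stay ** (d - n)

p_stay = 0.5                             # hypothetical value
print([negative_binomial_duration(d, 3, p_stay) for d in (3, 4, 5)])
# [0.125, 0.1875, 0.1875]
```

Unlike the geometric distribution, the probability rises before it falls, so the model can express a preferred length while still needing only ordinary HMM states.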

GENSCAN’s hidden weapon
  C+G content is correlated with:
    Gene content (+)
    Mean exon length (+)
    Mean intron length (-)
  These quantities affect parameters of model
  Solution: train parameters of model in four different C+G content ranges!
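
Selecting one of four parameter sets by C+G content could look like the sketch below. The three bin edges are placeholders chosen for illustration and are not taken from the slides; the function names are likewise hypothetical.

```python
def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

# Placeholder bin edges (assumption): three edges define four ranges.
BIN_EDGES = (0.43, 0.51, 0.57)

def parameter_set(seq, edges=BIN_EDGES):
    """Index (0 to 3) of the C+G range, and so of the parameter set to use."""
    gc = gc_fraction(seq)
    return sum(gc >= edge for edge in edges)

print(parameter_set("ATATATAT"))  # 0
print(parameter_set("ATGC"))      # 1
print(parameter_set("GGCCGGCC"))  # 3
```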

Evaluation of Accuracy
                         Actual coding   Actual non-coding
  Predicted coding            TP               FP
  Predicted non-coding        FN               TN
  Sensitivity (Sn)
    Exon level: fraction of actual exons whose boundaries are predicted exactly
    Nucleotide level: fraction of coding nucleotides that are predicted as coding
  Specificity (Sp)
    Exon level: fraction of the predicted exons that are exactly correct
    Nucleotide level: fraction of the predicted coding nucleotides that are coding
  Correlation Coefficient (CC)
    Combined measure of Sensitivity & Specificity
    Range: -1 (always wrong) to +1 (always right)
  (Slide by NF Samatova)
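
The nucleotide-level measures above can be computed from the four counts as sketched here. Note that Sp is used in the sense defined on the slide, the fraction of predictions that are correct, and CC is assumed to be the standard correlation coefficient between predicted and actual labels. The counts in the example are invented.

```python
import math

def accuracy(tp, fp, tn, fn):
    """Return (Sn, Sp, CC) from nucleotide-level confusion counts."""
    sn = tp / (tp + fn)                  # coding bases that were predicted
    sp = tp / (tp + fp)                  # predicted bases that are coding
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom
    return sn, sp, cc

print(accuracy(tp=40, fp=10, tn=40, fn=10))  # (0.8, 0.8, 0.6)
```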

Results of GENSCAN
  On the initial test dataset (Burset & Guigo)
    80% exact exon detection
    10% partial exons
    10% wrong exons
  In general
    HMMs have been best in de novo prediction
    In practice they overpredict human genes by ~2x