CSE182-L12 Gene Finding.

Silly Quiz Who are these people, and what is the occasion?

Gene Features. [Figure: gene structure showing transcription start, 5’ UTR, translation start (ATG), exons, introns with donor and acceptor splice sites, and 3’ UTR.]

DNA Signals. Coding versus non-coding. Splice signals. Translation start (ATG). [Figure: gene structure showing transcription start, 5’ UTR, translation start, exons, introns with donor and acceptor splice sites, and 3’ UTR.]

PWMs. Example donor sites, positions -3 -2 -1 +1 +2 +3 +4 +5 +6: AAGGTGAGT, CCGGTAAGT, GAGGTGAGG, TAGGTAAGG. The splice signal has a fixed length. Each position is generated independently according to its own distribution. The figure shows data from more than 1200 donor sites.
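As a minimal sketch of the idea on this slide, the four example donor sites can be turned into a position weight matrix and used to score a candidate site. The pseudocount, the uniform background of 0.25, and the function names `build_pwm` and `log_odds` are assumptions made for illustration; a real matrix would be estimated from the full set of donor sites.

```python
import math
from collections import Counter

# The four donor sites shown on the slide (positions -3..+6).
DONORS = ["AAGGTGAGT", "CCGGTAAGT", "GAGGTGAGG", "TAGGTAAGG"]

def build_pwm(sites, pseudocount=1.0):
    """Estimate one independent base distribution per position."""
    pwm = []
    total = len(sites) + 4 * pseudocount
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        pwm.append({b: (counts[b] + pseudocount) / total for b in "ACGT"})
    return pwm

def log_odds(pwm, candidate, background=0.25):
    """Sum of per-position log2 ratios against a uniform background (assumed)."""
    return sum(math.log2(pwm[i][b] / background) for i, b in enumerate(candidate))

pwm = build_pwm(DONORS)
print(round(log_odds(pwm, "AAGGTGAGT"), 2), round(log_odds(pwm, "CCCCCCCCC"), 2))
```

Because the positions are treated as independent, the score is a plain sum over positions, which is exactly the limitation that MDD addresses.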

MDD (maximal dependence decomposition). PWMs do not capture correlations between positions. Many position pairs in the donor signal are correlated.

MDD method. Choose the position i that has the highest correlation score. Split the sequences into two sets: those that have the consensus at position i, and the rest. Recurse until <Terminating conditions>.
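The recursion can be sketched as follows. This is a simplified illustration and not the published procedure: the correlation score is assumed to be a sum of chi-square statistics between "consensus at i?" and the base at every other position, and the terminating conditions (`min_size`, `max_depth`, no remaining dependence) are placeholders chosen for the example.

```python
from collections import Counter

BASES = "ACGT"

def consensus(seqs, i):
    """Most frequent base at position i."""
    return Counter(s[i] for s in seqs).most_common(1)[0][0]

def chi_square(seqs, i, j):
    """Chi-square statistic for the 2x4 table: (consensus at i?) vs (base at j)."""
    cons = consensus(seqs, i)
    table = {(r, b): 0 for r in (True, False) for b in BASES}
    for s in seqs:
        table[(s[i] == cons, s[j])] += 1
    n = len(seqs)
    stat = 0.0
    for r in (True, False):
        row = sum(table[(r, b)] for b in BASES)
        for b in BASES:
            col = table[(True, b)] + table[(False, b)]
            expected = row * col / n
            if expected > 0:
                stat += (table[(r, b)] - expected) ** 2 / expected
    return stat

def mdd(seqs, positions, min_size=4, depth=0, max_depth=3):
    """Recursively split on the position with the highest summed chi-square."""
    if len(seqs) < min_size or depth >= max_depth or len(positions) < 2:
        return {"leaf": seqs}
    scores = {i: sum(chi_square(seqs, i, j) for j in positions if j != i)
              for i in positions}
    best = max(scores, key=scores.get)
    if scores[best] < 1e-9:  # no dependence left to exploit
        return {"leaf": seqs}
    cons = consensus(seqs, best)
    rest = [p for p in positions if p != best]
    match = [s for s in seqs if s[best] == cons]
    other = [s for s in seqs if s[best] != cons]
    return {"split": best, "consensus": cons,
            "match": mdd(match, rest, min_size, depth + 1, max_depth),
            "other": mdd(other, rest, min_size, depth + 1, max_depth)}

# Toy data: positions 0 and 1 are perfectly correlated, position 2 is independent.
toy = ["AAG", "AAT"] * 3 + ["CCG", "CCT"] * 2
tree = mdd(toy, [0, 1, 2])
print(tree["split"], len(tree["match"]["leaf"]), len(tree["other"]["leaf"]))
```

In a full model each leaf would get its own PWM, so that the base distributions depend on the path taken through the tree.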

MDD for Donor sites

Gene prediction: Summary. Various signals distinguish coding regions from non-coding regions. HMMs are a reasonable model for gene structures, and provide a uniform method for combining various signals. Further improvement may come from improved signal detection.
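To make the HMM point concrete, here is a toy two-state model decoded with the Viterbi algorithm. Everything numeric is an assumption made for illustration: the states are only "N" (non-coding) and "C" (coding), and the emission probabilities simply make the coding state GC-rich. Real gene-finding HMMs have many more states (exons by frame, introns, splice signals) and richer content statistics.

```python
import math

STATES = ("N", "C")  # non-coding, coding
START = {"N": 0.5, "C": 0.5}
TRANS = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
# Hypothetical emissions: coding state favours G/C, non-coding favours A/T.
EMIT = {"N": {"A": 0.35, "T": 0.35, "C": 0.15, "G": 0.15},
        "C": {"A": 0.15, "T": 0.15, "C": 0.35, "G": 0.35}}

def viterbi(seq):
    """Return the most probable state path for seq, computed in log space."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []
    for x in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][x])
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):  # trace back from the last position
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi("AT" * 5 + "GC" * 5 + "AT" * 5))
```

The transition probabilities play the role of a length prior: a state switch is only worth its cost when enough emissions favour the other state.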

How many genes do we have? Nature Science

Alternative splicing

Comparative methods. Gene prediction is harder with alternative splicing. One approach is to use comparative methods to detect genes. Given a similar mRNA or protein (from another species, perhaps), can you find the best parse of a genomic sequence that matches that target sequence? Yes, with a variant of the alignment algorithms that penalizes introns separately from other gaps.
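A minimal sketch of such an alignment variant is shown below. It is not the algorithm of any particular tool: the scoring values, the function name `spliced_alignment_score`, and the choice to charge a single flat penalty per intron regardless of length are all assumptions made for illustration. Splice-site signals (GT...AG) are ignored here.

```python
def spliced_alignment_score(mrna, genome, match=1, mismatch=-1,
                            gap=-2, intron_penalty=-3):
    """Best score aligning all of mrna to genome; genomic flanks are free.

    M[i][j]: mrna[:i] consumed, genome position j-1 lies in an exon.
    I[i][j]: mrna[:i] consumed, genome position j-1 lies in an intron.
    An intron costs intron_penalty once, however long it is; ordinary
    gaps cost `gap` per base.
    """
    neg = float("-inf")
    n, m = len(mrna), len(genome)
    M = [[neg] * (m + 1) for _ in range(n + 1)]
    I = [[neg] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        M[0][j] = 0  # skipping leading genomic sequence is free
    for i in range(1, n + 1):
        M[i][0] = M[i - 1][0] + gap
        for j in range(1, m + 1):
            s = match if mrna[i - 1] == genome[j - 1] else mismatch
            M[i][j] = max(max(M[i - 1][j - 1], I[i - 1][j - 1]) + s,  # (mis)match
                          M[i - 1][j] + gap,                           # gap in genome
                          M[i][j - 1] + gap)                           # gap in mRNA
            I[i][j] = max(M[i][j - 1] + intron_penalty,  # open an intron
                          I[i][j - 1])                   # extend it for free
    return max(M[n])  # trailing genomic sequence is free

exons = "ACGTTGCA"
genomic = "ACGT" + "GTAAAAAAAAAG" + "TGCA"  # two exons around a toy intron
print(spliced_alignment_score(exons, genomic))
```

With ordinary gap penalties the 12-base intron would cost 24; charging it once is what lets the two exons align cleanly.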

Comparative gene finding tools. Genscan/Genie. Procrustes/Sim4: mRNA versus genomic. Genewise: proteins versus genomic. CEM: genomic versus genomic. Twinscan: combines comparative and de novo approaches.

Databases RefSeq and other databases maintain sequences of full-length transcripts. We can query using sequence.

De novo gene prediction: Summary. Various signals distinguish coding regions from non-coding regions. HMMs are a reasonable model for gene structures, and provide a uniform method for combining various signals. Further improvement may come from improved signal detection.

Comparative gene finding tools. Procrustes/Sim4: mRNA versus genomic. Genewise: proteins versus genomic. CEM: genomic versus genomic. Twinscan: combines comparative and de novo approaches.

Course. Sequence comparison (BLAST and other tools). Protein motifs: profiles, regular expressions, HMMs. Protein sequence identification via mass spectrometry. Discovering protein-coding genes: gene-finding HMMs, DNA signals (splice signals).

Genome Assembly

DNA Sequencing. DNA is double-stranded. The strands are separated, and a polymerase is used to copy the second strand. Special bases terminate this process early.

A break at T is shown here. Measuring the fragment lengths using electrophoresis gives us the position of each T. The same can be done with every nucleotide. Color coding can help separate the different nucleotides.
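The read-out logic can be sketched in a few lines: each terminating base yields a ladder of fragment lengths, and merging the four ladders by length recovers the sequence. The function names and the example strand are hypothetical, and the sketch assumes ideal, error-free length measurements.

```python
def termination_lengths(strand, base):
    """Lengths (1-based) of the fragments that terminate at each `base`."""
    return [i + 1 for i, b in enumerate(strand) if b == base]

def read_ladders(ladders):
    """Rebuild the strand by ordering every fragment length across all bases."""
    calls = sorted((length, base)
                   for base, lengths in ladders.items()
                   for length in lengths)
    return "".join(base for _, base in calls)

strand = "GATTACAT"  # hypothetical copied strand
ladders = {b: termination_lengths(strand, b) for b in "ACGT"}
print(ladders["T"], read_ladders(ladders))
```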

Automated detectors ‘read’ the terminating bases. The signal decays after 1000 bases.

Sequencing Genomes: Clone by Clone. Clones are constructed to span the entire length of the genome. These clones are ordered and oriented correctly (mapping). Each clone is sequenced individually.

Shotgun Sequencing. Shotgun sequencing of clones was considered viable. However, researchers in 1999 proposed shotgunning the entire genome.

Library. Insert the sequence fragments into vectors and introduce them into bacteria. As the bacteria multiply, you get many copies of the same clone.

Sequencing

Questions. Algorithmic: How do you put the genome back together from the pieces? This will be discussed in the next lecture. Statistical: How many pieces do you need to sequence, etc.? The answer to the statistical questions had already been given, in the context of mapping, by Lander and Waterman.

Lander-Waterman Statistics. [Figure: an island of overlapping clones; L = clone length, G = genome length.]

LW statistics: questions. As the coverage c = NL/G increases (N clones of length L on a genome of length G), more and more areas of the genome are likely to be covered. Ideally, you want to see 1 island. Q1: What is the expected number of islands? Ans: N exp(-c). As N grows, the number increases at first, and then gradually decreases.
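A quick numeric check of the rise-and-fall behaviour, taking c = NL/G and the slide's answer N exp(-c) at face value (that is, with no minimum overlap required). The genome and clone sizes below are arbitrary values chosen for illustration.

```python
import math

def expected_islands(n_clones, clone_len, genome_len):
    """N * exp(-c) with coverage c = N * L / G."""
    c = n_clones * clone_len / genome_len
    return n_clones * math.exp(-c)

G, L = 1_000_000, 1_000  # hypothetical genome and clone lengths
for n in (100, 500, 1_000, 2_000, 5_000, 10_000):
    print(f"N={n:>6}  c={n * L / G:>5.1f}  islands={expected_islands(n, L, G):8.2f}")
```

The maximum sits at c = 1 (here N = G/L = 1000), since the derivative of N exp(-NL/G) with respect to N vanishes there; beyond that, new clones mostly merge existing islands.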

Analysis: Expected Number of Islands. Let $X_i = 1$ if an island ends at position $i$, and $X_i = 0$ otherwise. Number of islands $= \sum_i X_i$. Expected number of islands $= E(\sum_i X_i) = \sum_i E(X_i)$.

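Prob. of an island ending at i. $E(X_i) = \Pr(\text{island ends at position } i) = \Pr(\text{a clone began at position } i-L+1 \text{ and no clone began in the next } L-T \text{ positions})$, where $T$ is the minimum overlap needed to join two clones. If each position starts a clone independently with probability $N/G$, this is $(N/G)(1-N/G)^{L-T} \approx (N/G)\,e^{-c(1-T/L)}$ with $c = NL/G$; summing over the $G$ positions gives $N e^{-c(1-T/L)}$, which is $N e^{-c}$ when $T = 0$.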
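The event described above can be checked with a small simulation: drop N clone start positions uniformly at random, count an island break whenever no clone starts within the next L - T positions, and compare the average with N exp(-c). The parameter values, the seed, and the uniform-start model are assumptions made for illustration; edge effects at the genome ends are ignored.

```python
import math
import random

def count_islands(starts, clone_len, min_overlap=0):
    """An island ends when the next clone starts more than L - T positions later."""
    starts = sorted(starts)
    islands = 1
    for prev, cur in zip(starts, starts[1:]):
        if cur - prev > clone_len - min_overlap:
            islands += 1
    return islands

def simulate(n_clones, clone_len, genome_len, trials=200, seed=0):
    """Average island count over random placements of clone start positions."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        starts = [rng.randrange(genome_len) for _ in range(n_clones)]
        total += count_islands(starts, clone_len)
    return total / trials

N, L, G = 300, 1_000, 100_000  # hypothetical values, coverage c = 3
observed = simulate(N, L, G)
predicted = N * math.exp(-N * L / G)
print(f"simulated {observed:.2f} vs predicted {predicted:.2f}")
```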

LW statistics. Pr[island contains exactly j clones]? Consider an island that has already begun. With probability $e^{-c}$, it will never be continued. Therefore Pr[island contains exactly j clones] $= (1-e^{-c})^{j-1} e^{-c}$, and the expected # of j-clone islands $= N e^{-c} \cdot (1-e^{-c})^{j-1} e^{-c}$.

Expected # of clones in an island Why?
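One way to answer the "Why?" is sketched below, assuming the model from the previous slide: after each clone the island stops with probability exp(-c), so the number of clones in an island is geometric and its mean is 1/exp(-c) = exp(c). The same value follows from dividing the N clones by the N exp(-c) expected islands. The function names and the truncation point `max_j` are choices made for this illustration.

```python
import math

def prob_j_clones(j, c):
    """Geometric model: j-1 continuations, then a stop with probability exp(-c)."""
    p_stop = math.exp(-c)
    return (1 - p_stop) ** (j - 1) * p_stop

def expected_clones_per_island(c, max_j=10_000):
    """Mean of the geometric distribution, summed numerically.

    max_j truncates the sum; it is ample for small c but would need to
    grow for large coverage, where islands contain very many clones.
    """
    return sum(j * prob_j_clones(j, c) for j in range(1, max_j + 1))

c = 2.0
print(expected_clones_per_island(c), math.exp(c))
```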

Expected length of an island