

1. Bacterial genomes: genes tightly packed, no introns

HOW TO FIND GENES WITHIN A DNA SEQUENCE?

Scan for ORFs (open reading frames) (Fig. 5.2)
- check all 6 reading frames (3 on each strand)
- look for a significant distance between a potential start codon and the next in-frame stop codon (e.g. 100 codons)
- but when examining short sequences, the start codon (or stop codon) might be located further upstream (or downstream), outside the stretch being examined

Use computer programs to search for ORFs (query: 3 kb sequence)

Potential problems?
- if initiation codon other than ATG (relatively rare)
- if overlapping genes (rare)
- if gene contains intron(s)
- if deviation from standard genetic code (can change default)
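The ORF scan described above can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm of any particular ORF-finding program: the function names and the `min_codons` parameter are hypothetical, and the sketch assumes ATG-only initiation, the standard stop codons, and ORFs that lie entirely within the query (exactly the simplifications listed under "Potential problems").

```python
# Minimal six-frame ORF scan (assumptions: ATG start, standard stops,
# complete ORFs only). Coordinates are 0-based on the strand being scanned.
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=100):
    """Return (strand, frame, start, end, n_codons) for each ATG...stop ORF."""
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first ATG after the previous stop
                elif start is not None and codon in STOPS:
                    n_codons = (i - start) // 3  # stop codon not counted
                    if n_codons >= min_codons:
                        orfs.append((strand, frame, start, i + 3, n_codons))
                    start = None
    return orfs


# Toy example with a lowered threshold; real scans use ~100 codons.
toy = "ATG" + "GCT" * 5 + "TAA"
print(find_orfs(toy, min_codons=3))
```

Lowering `min_codons` finds more short ORFs but also more chance ORFs, which is why a threshold of about 100 codons is commonly used.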

2. Eukaryotic genomes (such as human): genes usually far apart, long introns & short exons (Fig. 5.4)

Would an ORF scan work here?

Can also use algorithms to look for:

1. Exon-intron boundaries
- "GT-AG" rule, but consensus sequences very short

2. Regulatory motifs
- upstream promoters, downstream polyA addition signals...
- but consensus sequences usually very short

3. Codon bias patterns
- synonymous codons are not all used equally
- patterns differ among organisms
Table 5.1, Brown 1st ed. (see Fig. 5.5)

See Fig which shows results from various bioinformatics tools used to analyze 15 kb of human genome
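As an illustration of point 3, codon bias can be measured by counting how often each synonymous codon is used in a coding sequence. The sketch below is a hedged example: `codon_usage` is a hypothetical helper, it assumes the standard genetic code, and the input sequence is invented for illustration only.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def codon_usage(cds):
    """Fraction of each codon among the synonymous codons seen in `cds`."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)  # skip e.g. N
    per_amino = Counter()
    for codon, n in counts.items():
        per_amino[CODON_TABLE[codon]] += n
    return {c: n / per_amino[CODON_TABLE[c]] for c, n in counts.items()}


# Invented example: four leucine codons, three of them CTG.
print(codon_usage("CTGCTGCTGCTA"))
```

Comparing such tables between a candidate ORF and known genes of the same organism is one way gene-finding programs can score codon bias.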

4. Homologous sequences in databank

BLAST searches (Basic Local Alignment Search Tool)
- search programs that look for similarity between your sequence of interest (protein or DNA) and entries in global data banks

BLASTP - search at protein level
BLASTN - search at nucleotide level
BLASTX - search nt sequence against protein databases (automatic 6-reading frame conceptual translation)
tBLASTN - protein query vs. conceptual translation of DNA database

Query = yeast mitochondrial ribosomal protein L8 (238 aa)
[BLAST output: fungal hits, bacterial hits]

E-values: statistical measure of the likelihood that sequences with this degree of similarity occur randomly
- i.e. reflects the number of hits with at least this score expected by chance in a database of this size
- the lower the E-value, the more significant the hit

Nomenclature may differ among organisms
- called L17 in Streptococcus but L8 in yeast
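To show how an E-value depends on score and search space, here is a small sketch based on the standard Karlin-Altschul relationship E = m x n x 2^(-S'), where S' is the bit score, m the query length and n the total database length. The function name and the database size used below are hypothetical, and the sketch ignores the length corrections that BLAST applies in practice.

```python
def expected_hits(bit_score, query_len, db_len):
    """Expected number of chance hits with at least this bit score.

    Simplified Karlin-Altschul form: E = m * n * 2**(-S').
    No edge-effect (effective length) correction is applied.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# 238-aa query against a hypothetical database of 1e9 residues.
for score in (30, 50, 100):
    print(score, expected_hits(score, 238, 1e9))
```

The same score gives a larger E-value in a larger database, so E-values from searches against different databases are not directly comparable.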

What if this search was done at nucleotide (instead of protein) level?

Query = yeast mitochondrial ribosomal protein L8 gene (including promoter & UTRs)
- only got "hits" with other yeast entries, in this case

Homologous genes from divergent organisms typically show greater similarity at amino acid level than at nt level
- degeneracy of genetic code
- codon bias among organisms
- probability of a specific stretch of nucleotides occurring by random chance ("spurious hits") is higher than for the same length of amino acids
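The effect of code degeneracy can be seen in a toy example: two invented DNA sequences that differ mostly at third codon positions translate to the same peptide. The sequences and helper names below are hypothetical, and the sketch assumes the standard genetic code and ungapped sequences of equal length.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Conceptual translation of frame 1 (standard code)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def percent_identity(a, b):
    """Percent identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length (no gaps)")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


seq_a = "ATGGCTACTTGGCTT"  # invented example sequences
seq_b = "ATGGCAACGTGGTTA"
print(percent_identity(seq_a, seq_b))                        # nt level
print(percent_identity(translate(seq_a), translate(seq_b)))  # aa level
```

Here the two sequences share about 73% nt identity but 100% aa identity, which is why protein-level searches often detect divergent homologs that nucleotide searches miss.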

To illustrate the power of amino acid level searches, text shows 2 sequences with 76% nt identity ... but only 28% aa identity (Fig)

But it's a rather artificial example...
- because if 2 DNA stretches of 300 bp or so (normal default length in ORF Finder) showed 76% nt identity, it's very improbable that such similarity occurred by chance
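A rough calculation supports this point. The sketch below computes the binomial probability that two unrelated 300-bp stretches match at 76% or more of positions. It is only an approximation under stated assumptions: a fixed ungapped alignment, independent positions, and equal base frequencies (chance match probability 0.25 per position). The function name is hypothetical.

```python
from math import comb


def prob_at_least(matches, length, p_match=0.25):
    """Binomial tail: P(at least `matches` identical positions by chance)."""
    return sum(
        comb(length, k) * p_match ** k * (1 - p_match) ** (length - k)
        for k in range(matches, length + 1)
    )


# 76% identity over 300 bp = 228 matching positions.
print(prob_at_least(228, 300))
```

Under these assumptions the probability is vanishingly small, so 76% nt identity over 300 bp points to homology rather than chance.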

HOMOLOGOUS GENES (share common evolutionary origin)

1. Orthologous - homologous genes in different organisms
(e.g. a given globin gene, such as β-globin, from mouse and human)

2. Paralogous - homologous genes in same organism
(e.g. multi-gene family members, α-globin and β-globin from mouse)

Two genes are either evolutionarily related or they are not...
so instead of "...% homologous", use "...% identity" (p.145)

ARE TWO SEQUENCES HOMOLOGOUS OR INDEPENDENT IN ORIGIN?

Factors to consider:

1. Length of sequence
- short sequences more likely to occur by chance

2. Base composition
- highly biased sequences (e.g. if only AT) more likely to occur by chance
- "low complexity regions"

3. Similarity at amino acid level (if protein-coding region)
- high % identity is strong argument for homology
- usually implies common protein function
- nt changes tend to be such that they have minimal effect on aa sequence
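Point 2 can be made concrete: the chance that two random positions match is the sum of the squared base frequencies, so biased composition raises the background level of similarity. The helper below is a hypothetical illustration that assumes independent positions and the same composition in both sequences.

```python
def chance_match_probability(base_counts):
    """Probability that two random positions carry the same base.

    `base_counts` maps each base to its count (or frequency); the result
    is the sum of squared frequencies.
    """
    total = sum(base_counts.values())
    return sum((n / total) ** 2 for n in base_counts.values())


print(chance_match_probability({"A": 25, "C": 25, "G": 25, "T": 25}))  # unbiased
print(chance_match_probability({"A": 50, "T": 50}))                    # AT only
```

With equal base frequencies the chance match rate is 0.25 per position, but for an AT-only region it doubles to 0.5, which is one reason low complexity regions produce spurious hits.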

Comparison of homologous regions from multiple genomes

MultiPipMaker program (percent identity plot)
Human chr 7 (1.8 Mbp region)
- score of % nt sequence similarity (blocks compared vs. reference sequence)
- gives overview of sequence relationships for genomic region shared among organisms
- "Numbered boxes correspond to exons"

Thomas Nature 424:788, 2003

EXPERIMENTAL TECHNIQUES TO FIND GENES

1. Zoo blot (Southern) analysis (Fig)
- to determine presence/absence of gene among different organisms
- find regions homologous to DNA from other organisms

Heterologous hybridization
- use conditions of "reduced stringency" (e.g. lower temp) so that duplex hybrids with some mismatches are stable

Interpretation of data shown in figure?

Northern blot analysis (Fig)
- to identify expressed regions of genomes (detect transcripts)
- lanes: Brain, Kidney, Heart, Lung, Liver; hybridized with gene X probe
- (but note that many identical copies of that particular mRNA are present on blot)

Probe: tagged DNA (e.g. PCR product, restriction fragment, cDNA clone...) in denatured form, or oligomer, or antisense (synthetic) RNA...

In situ hybridization (Strachan & Read Fig)
- to determine cellular location
- 35S-labeled myosin antisense probe hybridizing to heart ventricle in 13-day embryonic mouse

Some protein genes are constitutively expressed
- "housekeeping gene" products needed at all times

... whereas others are differentially expressed
- during development
- in specific tissue type
- in response to environmental cues

Only a subset of genes are expressed at a given time, and mRNA levels can vary greatly among genes

~10,000 - 15,000 different mRNAs present in "typical" mammalian cell type under given condition (may be ~20,000 different proteins present)

Aside: RNA-sequencing studies suggest ~8000 genes ubiquitously expressed in human tissues (Ramskold PLoS 2009) (higher than predicted from microarray analysis, to be discussed in Topic 7)

EXPERIMENTAL TECHNIQUES TO FIND CODING REGIONS WITHIN GENES

1. Sequencing of cDNA (or EST) clones... & compare to genomic sequences to determine positions of introns

ESTs - short sequences obtained by sequence analysis of cDNA clones (Fig. 3.36)

mRNA: 5' cap ... AAAAAAAAAn 3'
- e.g. for primer can use mixture of "anchored" oligo(dT)s with A, C or G in the 3' position

... but if low abundance mRNA, may not be in bank
... or cDNA may not be full-length
If so, which end would you expect to be missing?
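The cDNA-versus-genomic comparison can be sketched as follows. This is a deliberately naive illustration, not how spliced-alignment programs work: it assumes the cDNA matches the genomic sequence exactly (no sequencing errors or polymorphisms), that exons appear in order on the same strand, and that each exon is longer than the hypothetical `seed` length. The sequences are invented.

```python
def map_cdna_to_genome(cdna, genomic, seed=8):
    """Return exon blocks as (start, end) genomic coordinates (0-based, half-open).

    Naive exact-match walk: extend each match as far as possible, then
    look downstream for the next `seed` bases of the cDNA.
    """
    exons = []
    c = 0                              # position in cDNA
    g = genomic.find(cdna[:seed])      # position in genomic sequence
    while c < len(cdna) and g != -1:
        start = g
        while c < len(cdna) and g < len(genomic) and cdna[c] == genomic[g]:
            c += 1
            g += 1
        exons.append((start, g))
        if c < len(cdna):
            g = genomic.find(cdna[c:c + seed], g)  # jump across the intron
    if c < len(cdna):
        raise ValueError("cDNA could not be fully placed on genomic sequence")
    return exons


# Invented example: two exons separated by a GT...AG intron.
exon1, intron, exon2 = "ATGGCCAAGCTT", "GTAAGTCCCCCCCCTTTTAG", "CGATCCGATTGA"
genomic = "TTTT" + exon1 + intron + exon2 + "AAAA"
exons = map_cdna_to_genome(exon1 + exon2, genomic)
introns = [(a[1], b[0]) for a, b in zip(exons, exons[1:])]
print(exons, introns)
```

If the first base of an exon happens to match the first base of the preceding intron, this walk shifts the boundary by a base or more; in practice such ambiguity at splice junctions is resolved using the GT-AG rule.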

Human phosphatidylinositol glycan gene (chromosome 18)
- additional info from RNA level data

~60% of RefSeq genes could be extended at 5' and 3' ends (based on additional EST data = UTRs)
Nusbaum Nature 437:551, 2005 (Fig S2)

RefSeq: NCBI's curated reference sequence collection, i.e. gene data agreed upon as the reference by the community

RACE - rapid amplification of cDNA ends (Fig)
- "specialized" RT-PCR strategy
- to obtain sequence info corresponding to termini of mRNAs
- where NNN... might be restriction site (e.g. to aid in cloning RACE product)

5' RACE - mapping 5' end of mRNA
- useful in locating position of promoter
- promoter immediately upstream of transcription start site

How would you carry out 3' RACE (to determine exact position of 3' end of mRNA)?