Analysis of single sequences

Toolboxes EMBOSS: many portals. Biology Workbench. ExPASy proteomics tools. U. Mass. Med. School Biotools. Many, many more…

Before we start - VecScreen When you get a DNA sequence from the sequencer, make sure it is really the sequence you think it is. If you don’t, you may spend a lot of time analysing the wrong sequence! Possible problems: contamination! Work clean. Always: vector contamination.

Vector contamination Failure to recognize foreign segments in a sequence can: –Lead to erroneous conclusions about the biological significance of the sequence –Waste time and effort in analysis of contaminated sequence –Delay the release of the sequence in a public database –Pollute public databases with contaminated sequence

Reminder: Cloning procedure The DNA of interest is cloned into a vector. The resultant DNA may (probably does) contain sections from the vector.

VecScreen VecScreen is a system for quickly identifying segments of a nucleic acid sequence that may be of vector origin. NCBI developed VecScreen to combat the problem of vector contamination in public sequence databases.

VecScreen

EMBOSS European Molecular Biology Open Software Suite. Built for command-line use. Many EMBOSS portals, servers and mirrors are available. Each program has its own help file. One server: Examples of a few EMBOSS programs:

Briefly – What is PCR The polymerase chain reaction (PCR) is a technique to amplify a single copy of a piece of DNA.

Briefly – What is PCR The number of copies of the target DNA increases exponentially. After 35 cycles: 2^36 ≈ 68 billion copies.
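The doubling arithmetic can be checked with a short sketch. The function name pcr_copies and the efficiency parameter are illustrative assumptions, not part of the slides. The 2^36 figure is reproduced here by assuming that single strands are counted, starting from one double-stranded template (2 strands); real reactions plateau well before this ideal number.

```python
def pcr_copies(cycles, start_copies=1, efficiency=1.0):
    """Idealised copy number after `cycles` rounds of PCR.

    efficiency=1.0 means perfect doubling in every cycle; lower
    values model incomplete amplification (an illustrative
    parameter, not something stated on the slide).
    """
    return start_copies * (1 + efficiency) ** cycles


# Assumption: one double-stranded template counts as 2 single strands,
# which gives 2 * 2**35 = 2**36 after 35 perfect cycles.
strands = pcr_copies(35, start_copies=2)
print(f"{strands:,.0f} strands after 35 cycles")

# With 90% efficiency per cycle the yield is far lower.
print(f"{pcr_copies(35, start_copies=2, efficiency=0.9):,.0f}")
```

The gap between the two printed numbers shows how sensitive exponential amplification is to per-cycle efficiency.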

Primer design Primer length: the optimal length is bp. Primer melting temperature (Tm): the temperature at which one half of the DNA duplex will dissociate. Tm of °C produce best results. GC content. Primer secondary structures. Repeats.
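Some of the criteria above can be checked by hand for a candidate primer. The sketch below computes GC content, a rough Tm using the Wallace rule (Tm = 2(A+T) + 4(G+C), a common rule of thumb for short oligos), and the longest single-base run. This is not how primer3 estimates Tm, and the primer sequence and function names are hypothetical.

```python
def gc_content(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    gc = primer.count("G") + primer.count("C")
    return 100.0 * gc / len(primer)


def wallace_tm(primer):
    """Rough melting temperature in degrees C (Wallace rule).

    A rule-of-thumb approximation for short oligos only; primer
    design programs use more elaborate thermodynamic models.
    """
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def longest_run(primer):
    """Length of the longest stretch of one repeated base."""
    best = run = 1
    for prev, cur in zip(primer, primer[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best


primer = "AGCGTTCAGGATCCATGACA"  # hypothetical 20-mer
print(len(primer), gc_content(primer), wallace_tm(primer), longest_run(primer))
```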

Primer design Avoid Template secondary structure. Avoid Cross homology: –Commonly, primers are BLASTed to test the specificity.

primer3 is a program from the Whitehead Institute, written by Steve Rozen and Helen J. Skaletsky, for finding primers and oligonucleotide probes. One interface to primer3 is eprimer3, an EMBOSS program. Primer3Plus is a nicer interface to primer3, from Biotools (U. Mass. Med. School). We will use it.

Gene prediction Identifying stretches of sequence, usually genomic DNA, that are biologically functional. This especially includes protein-coding genes. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced. Annotation

Gene prediction Prokaryotes: –The sequence coding for a protein occurs as one contiguous open reading frame (ORF), typically many hundreds or thousands of bp long. Eukaryotes: –Genes are flagged by signals such as CpG islands and polyadenylation sites. –ORF detection is difficult to use because of splicing.

Gene prediction Finding new protein-coding genes is one of the most important goals of eukaryotic genome sequencing projects. Genomic organization of novel eukaryotic genomes is diverse and ab initio gene finding tools are rarely suitable for efficacious gene hunting in DNA sequences of a new genome. Methods based on cDNA and expressed sequence tag (EST) mapping to genomic DNA or those using alignments to closely related genomes rely either on existence of abundant cDNA and EST data and/or availability on reference genomes. (Alexandre Lomsadze et al., 2005)

SixPack “Display a DNA sequence with 6-frame translation and ORFs” Set “Minimum size of ORFs” to 300, to obtain only meaningful ORFs (Proteins are usually longer than 100 aa). Set “ORF start with an M?” to “Yes” to obtain only ORFs that begin with a Methionine.
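The two settings above (a minimum ORF size and a required initial Methionine) can be illustrated with a minimal six-frame scan. This is a sketch of the idea only, not sixpack's actual algorithm: it reports only ORFs that end at a stop codon, uses 0-based coordinates on the strand being read, and the function and parameter names are made up for the example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=300, require_atg=True):
    """Scan all six frames for ORFs of at least min_len nucleotides.

    Returns tuples (strand, frame, start, end, length_nt). start and
    end are 0-based on the strand being read; end is the position of
    the stop codon, which is not included in the length.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Without the ATG requirement an ORF may open at the frame start.
            orf_start = None if require_atg else offset
            for i in range(offset, len(s) - 2, 3):
                codon = s[i:i + 3]
                if codon in STOPS:
                    if orf_start is not None and i - orf_start >= min_len:
                        orfs.append((strand, offset + 1, orf_start, i, i - orf_start))
                    orf_start = None if require_atg else i + 3
                elif require_atg and orf_start is None and codon == "ATG":
                    orf_start = i
    return orfs


toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "G"  # hypothetical toy sequence
print(find_orfs(toy, min_len=9))
print(find_orfs(toy))  # default threshold of 300 nt filters it out
```

Raising min_len trades sensitivity for specificity: short random ORFs disappear, but so do genuinely short proteins.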

SixPack The 1st section in the results page lists all the ORFs discovered.
>NM_ _1_ORF1 Translation of NM_ in frame 1, ORF 1, threshold 500, 42aa
GIKRLLEGQFCYRAFTWPVEITSMQTTVRDFEEDSYLSLLVS
>NM_ _1_ORF2 Translation of NM_ in frame 1, ORF 2, threshold 500, 909aa
MDFISSLIVGCAQVLCESMNMAERRGHKTDLRQAITDLETAIGDLKAIRDDLTLRIQQDG
LEGRSCSNRAREWLSAVQVTETKTALLLVRFRRREQRTRMRRRYLSCFGCADYKLCKKVS
AILKSIGELRERSEAIKTDGGSIQVTCREIPIKSVVGNTTMMEQVLEFLSEEEERGIIGV
YGPGGVGKTTLMQSINNELITKGHQYDVLIWVQMSREFGECTIQQAVGARLGLSWDEKET
GENRALKIYRALRQKRFLLLLDDVWEEIDLEKTGVPRPDRENKCKVMFTTRSIALCNNMG

SixPack The 2nd section shows a map of where the ORFs are in the actual sequence.

EMBOSS / plotorf Plots the ORFs found by sixpack:

ORF finder

Problems with ORF finding ORF finding can detect only 85% of genes. Short proteins. More than one long ORF. Alternative start codons (the true start is not always the one furthest from the stop codon).

Possible solutions Searching the databases for similar proteins. The existence of such a protein indicates that this is a true gene. Gene prediction tools: –GeneMark –Many more (e.g. see the CBCB website)

GeneMark The suggested method of parallelization of gene prediction with the model parameters estimation follows the path of the iterative Viterbi training. Tests on well-studied eukaryotic genomes have shown that the new method performs comparably or better than conventional methods… Thus, a self-training algorithm that had been assumed feasible only for prokaryotic genomes has now been developed for ab initio eukaryotic gene identification. (Alexandre Lomsadze et al., 2005)

GeneMark Sample output: exon prediction for PIP1B (remember the Gene entry?)
Columns: Gene #, Exon #, Strand, Exon Type, Exon Range, Exon Length, Start/End Frame
Exon types listed: Initial, Internal, Internal, Terminal
# protein sequence of predicted genes
>gene_1|GeneMark.hmm|286_aa
MEGKEEDVRVGANKFPERQPIGTSAQSDKDYKEPPPAPLFEPGELASWSFWRAGIAEFIA
TFLFLYITVLTVMGVKRSPNMCASVGIQGIAWAFGGMIFALVYCTAGISGGHINPAVTFG
LFLARKLSLTRAVYYIVMQCLGAICGAGVVKGFQPKQYQALGGGANTIAHGYTKGSGLGA
EIIGTFVLVYTVFSATDAKRNARDSHVPILAPLPIGFAVFLVHLATIPITGTGINPARSL
GAAIIFNKDNAWDDHWVFWVGPFIGAALAALYHVIVIRAIPFKSRS

extractseq Usually one would use sequence-editing software such as BioEdit. extractseq is one editing tool available from EMBOSS. Many more options are available on the command line (see the manual).
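The basic operation, pulling one or more regions out of a sequence and joining them, can be sketched in a few lines. This illustrates the idea rather than extractseq itself; the function name and the choice of 1-based inclusive coordinates are assumptions for the example.

```python
def extract_regions(seq, regions):
    """Concatenate the given regions of seq.

    regions is a list of (start, end) pairs using 1-based, inclusive
    coordinates (an assumption chosen to match how sequence positions
    are usually quoted by biologists).
    """
    parts = []
    for start, end in regions:
        if not 1 <= start <= end <= len(seq):
            raise ValueError(f"region {start}-{end} is outside the sequence")
        parts.append(seq[start - 1:end])
    return "".join(parts)


seq = "ACGTACGTAC"  # hypothetical sequence
print(extract_regions(seq, [(1, 3), (8, 10)]))
```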

BioEdit

Multiple sequence alignment using EMBOSS seqret generates a multiple sequence file. emma aligns the sequences. prettyplot generates a graphical alignment. Usually, one uses better tools for this. We’ll see them later on in the course.

Restriction maps Represent the locations in a DNA sequence cut by restriction enzymes. Are used, for example, in identifying whether DNA in a test-tube is the same as its putative sequence. Can be used in cloning to design inserts for plasmids.

ReMap Displays a sequence with restriction sites, translation etc. Useful in identification of single nucleotide polymorphisms (SNPs). If a SNP changes a restriction site, it will cause that RE to cut the DNA differently compared with the wild type. Other RE programs: redata, restrict …
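The SNP idea can be illustrated with a small sketch that locates recognition sites for three common enzymes and compares a wild-type sequence with a mutant. The recognition sequences (EcoRI GAATTC, BamHI GGATCC, HindIII AAGCTT) are standard; the sequences and function names are hypothetical, and the reported positions are the start of the recognition site rather than the exact cut position.

```python
SITES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def find_sites(seq, sites=SITES):
    """Map each enzyme name to the 1-based start positions of its site.

    All three recognition sequences are palindromic, so scanning one
    strand is enough for this sketch.
    """
    seq = seq.upper()
    hits = {}
    for name, motif in sites.items():
        positions = []
        pos = seq.find(motif)
        while pos != -1:
            positions.append(pos + 1)
            pos = seq.find(motif, pos + 1)
        hits[name] = positions
    return hits


def changed_enzymes(wild_type, mutant):
    """Enzymes whose site positions differ between the two sequences."""
    wt, mut = find_sites(wild_type), find_sites(mutant)
    return [name for name in wt if wt[name] != mut[name]]


wild_type = "TTGAATTCAAGGATCCTT"  # hypothetical fragment
mutant = "TTGAGTTCAAGGATCCTT"     # A->G SNP inside the EcoRI site
print(find_sites(wild_type))
print(changed_enzymes(wild_type, mutant))
```

An enzyme listed by changed_enzymes would give a different fragment pattern on a gel for the two alleles.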

PepStats PEPSTATS of AAU from 1 to 1020
Molecular weight =
Residues = 1020
Average Residue Weight =
Charge = 7.5
Isoelectric Point =
A280 Molar Extinction Coefficient =
A280 Extinction Coefficient 1mg/ml = 0.88
Improbability of expression in inclusion bodies =
Residue Number Mole% DayhoffStat
A = Ala B = Asx … Y = Tyr Z = Glx
Property Residues Number Mole%
Tiny (A+C+G+S+T)
Small (A+B+C+D+G+N+P+S+T+V)
Aliphatic (A+I+L+V)
Aromatic (F+H+W+Y)
Non-polar (A+C+F+G+I+L+M+P+V+W+Y)
Polar (D+E+H+K+N+Q+R+S+T+Z)
Charged (B+D+E+H+K+R+Z)
Basic (H+K+R)
Acidic (B+D+E+Z)
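The property table in this output is a simple tally: each residue is counted in every group it belongs to, and the count is expressed as a percentage of the sequence length. A minimal sketch using the group definitions listed above follows; the function name and test peptide are hypothetical, and the other pepstats fields (molecular weight, isoelectric point and so on) are not reproduced.

```python
# Residue groups as listed in the PepStats output above.
GROUPS = {
    "Tiny": "ACGST",
    "Small": "ABCDGNPSTV",
    "Aliphatic": "AILV",
    "Aromatic": "FHWY",
    "Non-polar": "ACFGILMPVWY",
    "Polar": "DEHKNQRSTZ",
    "Charged": "BDEHKRZ",
    "Basic": "HKR",
    "Acidic": "BDEZ",
}


def property_stats(protein):
    """Return {property: (count, mole_percent)} for a protein string."""
    protein = protein.upper()
    total = len(protein)
    stats = {}
    for name, members in GROUPS.items():
        count = sum(1 for aa in protein if aa in members)
        stats[name] = (count, 100.0 * count / total)
    return stats


for name, (count, percent) in property_stats("MEGKEEDVRV").items():
    print(f"{name:<10} {count:>3} {percent:6.1f}")
```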

PepInfo Property plots: Tiny, Small, Aliphatic, Aromatic, Non-polar, Polar, Charged, Basic, Acidic

PepInfo Protein with transmembrane sections

PepInfo Protein without transmembrane sections

TMHMM Very good tool for identifying transmembrane segments.

Conclusion This is only the tip of the iceberg of what can be done with a sequence. If you start working with sequences, you will have to decide which tools suit you best. It has a lot to do with personal preference and something to do with algorithm accuracy.