Bioinformatics Lecture 4 BCH 550 Arjumand Warsy. Retrieving DNA Sequences.



Introduction
Protein sequences are comparatively simple: they fall in a narrow size range (about 300 amino acids, plus or minus 200, apart from a few giant proteins), have clearly defined boundaries, and carry specific functional attributes. Proteins from microbes and from higher eukaryotes (animals and plants) have roughly the same properties.
The corresponding gene (DNA) sequences become more varied and complex in higher animals. Human genes range in size from a few thousand bp to several hundred thousand bp, and not all DNA codes for protein. Several types of DNA sequence are involved in defining a gene:
–Regulatory regions (usually preceding the coding region)
–Untranslated regions that precede and follow the coding region
–The protein-coding region
In eukaryotes (yeast, plants, animals), the protein-coding region is divided into a variable number of exons interspersed with introns. As a consequence, working with DNA sequences is always trickier than working with protein sequences.

Going from protein sequences to DNA sequences
In databases, the correspondence between protein and DNA sequences is not one-to-one. Many different, even non-overlapping, DNA sequences can be linked to the same protein or gene name:
–The primary transcript: generated by copying the DNA sequence of a gene from beginning to end (exons and introns included).
–The mature transcript: the mRNA, generated from the primary transcript by discarding the introns.
–The strict protein-coding region: the open reading frame, or ORF.
–Numerous types of partial sequences.
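The relationship between the primary transcript, the mature transcript, and the ORF can be sketched in a few lines of Python. The gene sequence and exon coordinates below are invented for illustration, and the sequences stay in the DNA alphabet (T rather than U) for simplicity; this is a toy model, not a gene-prediction method.

```python
# Toy gene, invented for illustration: exon 1, one intron, exon 2.
EXON_1 = "GGCATGGCC"      # short 5' untranslated stretch + start of coding region
INTRON = "GTAAGTTTCAG"    # begins with GT and ends with AG, like many real introns
EXON_2 = "TTCTGAACC"      # end of coding region + short 3' untranslated stretch
GENE = EXON_1 + INTRON + EXON_2

# Exon coordinates as 0-based, end-exclusive (start, end) pairs (hypothetical).
EXONS = [(0, 9), (20, 29)]

STOP_CODONS = {"TAA", "TAG", "TGA"}


def splice(primary, exons):
    """Build the mature transcript by keeping the exons and discarding introns."""
    return "".join(primary[start:end] for start, end in exons)


def find_orf(sequence):
    """Return the first ATG-to-stop reading frame, or None if there is none."""
    start = sequence.find("ATG")
    if start == -1:
        return None
    for i in range(start, len(sequence) - 2, 3):
        if sequence[i:i + 3] in STOP_CODONS:
            return sequence[start:i + 3]
    return None


primary_transcript = GENE                       # copied from beginning to end
mature_transcript = splice(primary_transcript, EXONS)
orf = find_orf(mature_transcript)

print("primary:", primary_transcript)
print("mature: ", mature_transcript)
print("ORF:    ", orf)
```

In this toy case the reading frame is only complete after splicing, which is one reason the three kinds of sequence are stored and labelled differently in databases.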

Given a protein sequence, how can the DNA sequence encompassing its coding region be retrieved? Retrieving the DNA sequence relevant to a protein.

What is GenBank?
GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research, 2008 Jan;36(Database issue):D25-30). As of August 2009, there are approximately 106,533,156,756 bases in 108,431,692 sequence records in the traditional GenBank divisions and 148,165,117,763 bases in 48,443,067 sequence records in the WGS division. The complete release notes for the current version of GenBank are available on the NCBI ftp site. A new release is made every two months. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI. These three organizations exchange data on a daily basis.

Nucleic Acids Res. 2008 Jan;36(Database issue):D25-30. Epub 2007 Dec 11.
GenBank.
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL.
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
GenBank (R) is a comprehensive database that contains publicly available nucleotide sequences for a large number of named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects. Most submissions are made using the web-based BankIt or standalone Sequin programs, and accession numbers are assigned by GenBank staff upon receipt. Daily data exchange with the European Molecular Biology Laboratory Nucleotide Sequence Database in Europe and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through NCBI's retrieval system, Entrez, which integrates data from the major DNA and protein sequence databases along with taxonomy, genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the GenBank database are available by FTP. To access GenBank and its related retrieval and analysis services, begin at the NCBI Homepage.

How to retrieve the nucleotide sequence
1. Point the browser to the database site.
2. To access the E. coli dUTPase entry quickly, simply enter the accession number (P06968) in the Search window at the top of the page and then click the Search button.
3. Scroll down to GeneID and click the number of the gene.

P06968

Using GenBank for retrieving nucleotide sequence
Search Nucleotide for X01714. A GenBank entry consists of four parts:
–The locus name (ECDUT), an arbitrary identifier, followed by a short definition line and a unique accession number (X01714).
–The Reference section, which lists the article(s) relevant to the sequence determination.
–The Features section, which lists the definitions and exact ranges of the multiple types of elements that have been recognized in the sequence.
–The Sequence section, which rounds out the GenBank entry; the nucleotides are listed between the ORIGIN keyword and the final // that signals the very end of the entry. Numbering is provided to help relate the location of the dUTPase ORF ( ) to the actual nucleotide sequence.
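The four-part layout can be made concrete with a small parsing sketch in Python. The record below is a made-up miniature (locus TOYSEQ, accession TOY00001), not the real ECDUT/X01714 entry, and the parser handles only the simplest case: top-level keywords starting in the first column and a plain start..end CDS location. Real GenBank files have many more cases, so a maintained parser is the better choice for real work.

```python
# Hypothetical miniature record in GenBank flat-file style (not a real entry).
RECORD = """\
LOCUS       TOYSEQ                    24 bp    DNA     linear   BCT 01-JAN-2000
DEFINITION  Hypothetical example record for illustration only.
ACCESSION   TOY00001
REFERENCE   1  (bases 1 to 24)
  AUTHORS   Doe,J.
FEATURES             Location/Qualifiers
     CDS             4..15
ORIGIN
        1 ggcatggcct tctgaaccgg ttaa
//
"""


def parse_genbank(text):
    """Group the lines of one record under their top-level keyword."""
    sections = {}
    current = None
    for line in text.splitlines():
        if line.startswith("//"):            # end-of-entry marker
            break
        if line and not line[0].isspace():   # keyword in the first column
            current = line.split()[0]
            rest = line[len(current):].strip()
            sections.setdefault(current, []).append(rest)
        elif current is not None:            # indented continuation line
            sections[current].append(line.strip())
    return sections


def extract_sequence(sections):
    """Join the ORIGIN lines, dropping the position numbers and spaces."""
    letters = (ch for line in sections.get("ORIGIN", []) for ch in line if ch.isalpha())
    return "".join(letters).upper()


def simple_cds_range(sections):
    """Return (start, end), 1-based inclusive, for a plain 'CDS a..b' feature."""
    for line in sections.get("FEATURES", []):
        parts = line.split()
        if len(parts) == 2 and parts[0] == "CDS" and ".." in parts[1]:
            start, end = parts[1].split("..")
            return int(start), int(end)
    return None


sections = parse_genbank(RECORD)
sequence = extract_sequence(sections)
cds_start, cds_end = simple_cds_range(sections)
cds = sequence[cds_start - 1:cds_end]   # convert 1-based inclusive to a Python slice

print(sections["ACCESSION"][0], len(sequence), cds)
```

The slice on the last lines shows what the numbering in the Sequence section is for: it lets a feature range from the Features section be mapped onto the actual nucleotides.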

Search for the nucleotide sequence of the human (Homo sapiens) G-6-PD gene

How to save?
1. Scroll back to the top of the page for the ECDUT/X01714 entry. Refer to Figure 2-20 for what your screen should look like.
2. Choose FASTA from the Display drop-down menu, as shown in the figure.
3. Transform the content of this window into plain text by choosing Text from the drop-down menu located on the far right of the menu bar.
4. Save the FASTA sequence by using the following protocol:
a. In the Edit menu of your Web browser, click Select All and then click Copy.
b. Open a default Word document and, in the Edit menu of Word, click Paste. Then select a Courier font (8 or 10).
c. Finally, save your document as dUTPaseDNA.txt by choosing the Save as type option text only (*.txt).
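The same save can also be scripted instead of done by copy and paste. The sketch below is an alternative to the browser protocol, not part of the lecture: it assumes NCBI's E-utilities efetch service is reachable and that the nuccore database with rettype=fasta returns the entry as plain FASTA text. The function names are hypothetical helpers.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession, db="nuccore"):
    """Build the efetch URL that asks for one record as FASTA text."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def fetch_fasta(accession):
    """Download the FASTA text for an accession (needs network access)."""
    with urlopen(efetch_fasta_url(accession)) as response:
        return response.read().decode("utf-8")


def save_text(text, path):
    """Write the text to a plain-text file, as step 4c does by hand."""
    with open(path, "w", encoding="ascii") as handle:
        handle.write(text)


# Example use (commented out because it contacts NCBI):
# save_text(fetch_fasta("X01714"), "dUTPaseDNA.txt")
print(efetch_fasta_url("X01714"))
```

Writing the file directly as plain text sidesteps the word-processor step, whose only purpose in the manual protocol is to end up with an unformatted .txt file.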

Using BLAST to Compare My Protein Sequence to Other Protein Sequences

BLAST
BLAST (short for Basic Local Alignment Search Tool) is a sequence-comparison tool that tells which of the other known proteins have a sequence similar to our sequence. This information can be used for a variety of purposes, including:
–the prediction of protein function, 3-D structure, and domain organization
–the identification of homologues (similar proteins) in other organisms.
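To give a feel for what "local alignment" means, here is a minimal Python sketch of Smith-Waterman scoring, the exact dynamic-programming method that BLAST approximates with faster heuristics. It is not BLAST's algorithm. The scoring values (match +2, mismatch -1, gap -2) are arbitrary assumptions chosen for readability; protein searches normally use a substitution matrix instead.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (Smith-Waterman, linear gaps)."""
    previous = [0] * (len(b) + 1)   # scores for the previous row of the table
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                        # start a fresh local alignment here
                previous[j - 1] + pair,   # align a[i-1] with b[j-1]
                previous[j] + gap,        # gap in b
                current[j - 1] + gap,     # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


# The shared stretch ACGT is found even though the flanking letters differ.
print(local_alignment_score("GGACGTGG", "CCACGTCC"))
```

The zero in the max call is what makes the alignment local: a poorly matching region is simply dropped rather than dragging the score down, so a short conserved stretch inside two otherwise unrelated sequences still scores well.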

How to use BLAST
1. Point your favorite Internet browser to the BLAST home page, probably the most frequented bioinformatics Web page in the world.
2. Click the Protein-Protein BLAST (blastp) link in the top right. A Query screen appears. At this point, you need a FASTA-formatted protein sequence.
3. Open the file that contains your dUTPase FASTA-formatted protein sequence. This is the file that you created on your PC by using the steps shown earlier in the “Retrieving a list of related protein sequences” section of this chapter; otherwise, get the sequence again.
4. Using your browser’s Edit menu, copy and paste ONE of the protein sequences (with its definition line) into the BLAST Search window.
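Step 4 asks for exactly one sequence together with its definition line. When the saved file holds several sequences, picking one out can be sketched in Python as follows. The two records are invented placeholders, not real dUTPase sequences, and the helper names are hypothetical.

```python
# Hypothetical multi-sequence FASTA text; headers and residues are invented.
FASTA_TEXT = """\
>toy_protein_1 first placeholder sequence
MAAAKLLVGG
TTRR
>toy_protein_2 second placeholder sequence
MSSSPLLV
"""


def read_fasta(text):
    """Split FASTA text into (definition line, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def to_fasta(header, sequence, width=60):
    """Format one record as FASTA, wrapping the sequence at a fixed width."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


records = read_fasta(FASTA_TEXT)
query = to_fasta(*records[0])   # one sequence, with its definition line
print(query)
```

The printed text is what would be pasted into the search window: a single ">" line followed by the residues.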

Old appearance

Getting the sequence again

100% identity

62% identity