Computer Storage of Sequences


Computer Storage of Sequences (Chapter 2 of Bioinformatics: Sequence and Genome Analysis By David W. Mount) CSE730: Seminar on “Information Retrieval of Biomedical Text and Data”

Outline Storing DNA/Protein sequences into computer files or databases. Related information placed in the database along with the sequence in a number of sequence data formats. Online public access Databases for sequence retrieval.

Nucleotide Sequence Nomenclature Committee of the International Union of Biochemistry
Code  Nucleic Acid(s)
A     Adenine
C     Cytosine
G     Guanine
T     Thymine
U     Uracil
M     A or C (amino)
R     A or G (purine)
W     A or T (weak)
S     C or G (strong)
Y     C or T (pyrimidine)
K     G or T (keto)
V     A or C or G
H     A or C or T
D     A or G or T
B     C or G or T
N     A or G or C or T (any)
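The code table above maps naturally onto a lookup table. The following is a minimal sketch, not part of the original slides; the names `IUPAC_NUCLEOTIDES`, `matches`, and `is_valid_sequence` are hypothetical, and treating U as equivalent to T when matching is an assumption made for illustration.

```python
# IUB nucleotide codes from the table above, mapped to the bases they stand for.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def matches(code: str, base: str) -> bool:
    """Return True if a concrete base is allowed by a (possibly ambiguous) code."""
    base = base.upper()
    if base == "U":  # assumption: treat RNA uracil like thymine when matching
        base = "T"
    allowed = IUPAC_NUCLEOTIDES[code.upper()].replace("U", "T")
    return base in allowed


def is_valid_sequence(seq: str) -> bool:
    """Return True if every character of seq is a code from the table."""
    return all(ch in IUPAC_NUCLEOTIDES for ch in seq.upper())
```

For example, `matches("R", "G")` is true because R stands for a purine (A or G), while `matches("Y", "A")` is false.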

Protein Sequence
Code  Amino acid
A     Alanine
B     Aspartic acid or asparagine
C     Cysteine
D     Aspartic acid
E     Glutamic acid
F     Phenylalanine
G     Glycine
H     Histidine
I     Isoleucine
K     Lysine
L     Leucine
M     Methionine
N     Asparagine
P     Proline
Q     Glutamine
R     Arginine
S     Serine
T     Threonine
V     Valine
W     Tryptophan
X     Unknown
Y     Tyrosine
Z     Glutamic acid or glutamine
Adapted from IUPAC-IUB (1969, 1972, 1983)

Sequence Formats A sequence is stored as ASCII text (i.e. a string of A, G, C, T…) along with annotations. Different sequence analysis programs recognize different sequence formats. A sequence format includes accessory information (gene names, source organism, investigator name, references) and the actual sequence.

Sequence Formats (continued) FASTA, GenBank flat file format, PIR/CODATA format, EMBL sequence entry format, Intelligenetics sequence entry format, GCG (Genetics Computer Group) sequence entry format, ASN.1, XML.
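Of the formats listed, FASTA is the simplest: a header line starting with ">" followed by one or more lines of sequence. The sketch below is an illustration rather than anything from the slides; `parse_fasta` is a hypothetical helper and the two records are made-up example data.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header (the text after '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence may span several lines
    return records


# Made-up example data, not real database entries.
example = """>seq1 hypothetical example record
ACGTACGT
TTGA
>seq2
GGCC
"""
records = parse_fasta(example.splitlines())
```

Here the annotation is limited to the header line; the richer formats in the list carry the accessory information in separate labelled fields.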

Databases NCBI: GenBank at the National Center for Biotechnology Information (NCBI), National Library of Medicine, Bethesda, MD. NBRF: Protein Information Resource (PIR) database at the National Biomedical Research Foundation in Washington, DC.

Databases (continued) SwissProt: the SwissProt protein sequence database at ISREC, Swiss Institute for Experimental Cancer Research. EMBL: European Molecular Biology Laboratory (EMBL) Outstation at Hinxton, England. DDBJ: DNA DataBank of Japan (DDBJ) at Mishima, Japan.

Databases on Internet
NCBI       http://www.ncbi.nlm.nih.gov
PIR        http://www-nbrf.georgetown.edu/pirwww
SwissProt  http://www.expasy.ch/cgi-bin/sprot-search-de
EMBL       http://www.ebi.ac.uk/embl/index.html
DDBJ       http://www.ddbj.nig.ac.jp/

NCBI National resource for molecular biology information. Maintains comprehensive databases for a variety of biotechnology-related information. Develops and manages access to a range of databases and software for the scientific and medical communities.

NCBI : Integrated Databases Literature Databases: PubMed, PubMed Central, OMIM, PROW, Bookshelf.

NCBI : Integrated Databases (continued) Nucleotide Databases: GenBank, EST Database, GSS Database, SNPs Database, RefSeq, STS Database.

NCBI : Integrated Databases (continued) Entrez Databases: PubMed, Protein Sequence Database, Nucleotide Sequence Database, Taxonomy, OMIM.

GenBank GenBank is the NIH genetic sequence database. Annotated collection of all publicly available DNA sequences. GenBank is a part of an international collaboration of sequence databases along with EMBL and DDBJ.

GenBank DNA Sequence Format A DNA sequence in GenBank is formatted into distinct attributes as follows:
Locus: locus name, sequence length, division, date
Definition: description of entry
Accession: unique accession number
Version: version of sequence
Keywords: keywords for cross referencing

GenBank DNA Sequence Format (continued)
Source: source organism of DNA
Organism: description of organism
References: authors, title, journal, Medline, etc.
Features: information about sequence
Base count: number of bases in sequence
Origin: the sequence data begin on the line following Origin
GenBank sample
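The field layout described on these two slides can be read with a few lines of code. This is a rough sketch that assumes the usual flat-file conventions (top-level keywords start in the first column, continuation lines are indented, and "//" ends the entry). `parse_genbank` is a hypothetical helper, the entry is invented for illustration, and two-word keywords such as BASE COUNT are not handled.

```python
import re
from typing import Dict


def parse_genbank(text: str) -> Dict[str, str]:
    """Collect top-level fields and the sequence from one flat-file entry."""
    fields: Dict[str, str] = {}
    sequence = []
    current = None
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            break
        if in_origin:
            # Sequence lines hold a position number and blocks of bases.
            sequence.append(re.sub(r"[^A-Za-z]", "", line))
        elif line[:1].strip():  # a keyword starts in the first column
            key, _, value = line.partition(" ")
            current = key
            fields[current] = value.strip()
            in_origin = key == "ORIGIN"
        elif current is not None and line.strip():
            # Indented continuation or sub-keyword: keep it with its parent field.
            fields[current] += " " + line.strip()
    fields["SEQUENCE"] = "".join(sequence)
    return fields


# Invented entry for illustration only; not a real GenBank record.
sample = """
LOCUS       HYPO0001                  12 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical example entry, not a real GenBank record.
ACCESSION   HYPO0001
SOURCE      synthetic construct
  ORGANISM  synthetic construct
ORIGIN
        1 acgtacgtac gt
//
"""
entry = parse_genbank(sample)
```

Production code would normally rely on an established parser rather than this sketch, since the Features table has its own nested structure.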

NCBI : Tools Tools for Data Retrieval and submission Text Term Searching Sequence Similarity Searching Taxonomy Searching Sequence Submission

NCBI : ENTREZ Entrez is a search and retrieval system that integrates information from databases at NCBI. These databases include nucleotide sequences, protein sequences, macromolecular structures, whole genomes, and MEDLINE (through PubMed), among others.
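Entrez can also be queried programmatically. The slides do not cover this, so the sketch below is an assumption on our part: it uses NCBI's E-utilities endpoints (`esearch.fcgi` and `efetch.fcgi`) and only builds the request URLs without contacting the server. The helper names and the accession `HYPO0001` are hypothetical placeholders.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (assumed here; not given in the slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build a text-term search URL for one Entrez database."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def efetch_url(db: str, record_id: str, rettype: str = "fasta") -> str:
    """Build a retrieval URL for one record in the requested format."""
    query = urlencode(
        {"db": db, "id": record_id, "rettype": rettype, "retmode": "text"}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


search = esearch_url("pubmed", "BLAST")
fetch = efetch_url("nuccore", "HYPO0001")  # placeholder accession
```

Opening such a URL with any HTTP client returns the search hits or the record text; NCBI's usage guidelines on request rates apply.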

NCBI : BLAST BLAST: Basic Local Alignment Search Tool. It is a set of similarity search programs designed to explore available sequence databases. It uses a heuristic algorithm that can detect relationships among sequences that share only isolated regions of similarity. Q-BLAST: a queuing system for BLAST that allows users to retrieve results at their convenience and format their results.

NCBI : BLAST (continued) Access to BLAST service Web-BLAST Standalone BLAST Network BLAST BLAST URL API

NCBI : BLAST (continued) BLAST Programs
blastp: compares an amino acid query sequence against a protein sequence database
blastn: compares a nucleotide query sequence against a nucleotide sequence database
blastx: compares a nucleotide query sequence against a protein sequence database
tblastn: compares a protein query sequence against a nucleotide sequence database
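The choice among the four programs listed above depends only on the type of the query and the type of the database, which can be written as a small lookup. This is an illustrative sketch; `BLAST_PROGRAMS` and `choose_blast_program` are hypothetical names, and only the programs named on the slide are covered.

```python
# (query type, database type) -> program, following the list above.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast_program(query_type: str, database_type: str) -> str:
    """Return the BLAST program name for a query/database combination."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError(
            f"no program listed for {query_type!r} query "
            f"against {database_type!r} database"
        ) from None
```

For instance, a DNA query searched against a protein database calls for `choose_blast_program("nucleotide", "protein")`, which returns `"blastx"`.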

NBRF :PIR Protein Information Resource 3 Major Databases: PSD (Protein Sequence Database) iProClass PIR-NREF (Nonredundant REFerence protein database)

PIR: PSD The PIR, in collaboration with MIPS and JIPID, produces and distributes the PIR-International Protein Sequence Database (PSD), a comprehensive and expertly annotated protein sequence database. The primary sources of PSD data are sequences from GenBank/EMBL/DDBJ translations, published literature, and direct submission to PIR-International.

PIR: PSD (continued) The PIR-PSD data is available in XML format and NBRF, PIR/CODATA formats. The sequence file is available in FASTA format. Also available at PIR UNIX FTP server. Address: ftp://ftp.pir.georgetown.edu/pir_databases/psd/

CODATA format CODATA format has approximately the same information as a GenBank or EMBL sequence file, but is slightly differently formatted and has different field names. Also called PIR format, used by NBRF. CODATA Sample

PIR: iProClass The iProClass database provides comprehensive descriptions of all proteins and serves as a framework for data integration in a distributed networking environment. Very user-friendly description.

PIR: NREF (Non-redundant REFerence protein database) Comprehensive: Containing all sequences from PIR-PSD, Swiss-Prot, TrEMBL, RefSeq, GenPept, and updated bi-weekly. Non-Redundant: Clustered by sequence identity and taxonomy at the species level. Source Attribution: Containing protein IDs and names from associated databases (with hypertext links), in addition to protein sequence, taxonomy, and bibliography. The current version (July 2002) consists of more than 809,000 non-redundant PIR-PSD, SwissProt and TrEMBL proteins organized with more than 36,200 PIR superfamilies, 145,340 families, and links to over 50 molecular biology databases.
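The idea of a non-redundant set with source attribution can be sketched as a grouping step. The example below assumes that "sequence identity" means exact string identity and that entries arrive as simple tuples; both are simplifications for illustration and are not PIR's actual procedure. The function name and all entries are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (source database, protein ID, species, sequence)
Entry = Tuple[str, str, str, str]


def cluster_nonredundant(entries: Iterable[Entry]) -> Dict[Tuple[str, str], List[str]]:
    """Group entries with the same species and identical sequence.

    Each cluster keeps the list of source IDs, mirroring the idea of
    source attribution described above.
    """
    clusters: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for source_db, protein_id, species, sequence in entries:
        key = (species, sequence.upper())
        clusters[key].append(f"{source_db}:{protein_id}")
    return dict(clusters)


# Made-up entries for illustration only.
entries = [
    ("PIR-PSD", "A0001", "Homo sapiens", "MKV"),
    ("SwissProt", "P0001", "Homo sapiens", "mkv"),
    ("TrEMBL", "Q0001", "Mus musculus", "MKV"),
]
clusters = cluster_nonredundant(entries)
```

The two human entries collapse into one cluster with both IDs attached, while the mouse entry stays separate because clustering is done at the species level.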

Swiss-Prot Swiss-Prot is a protein knowledgebase established in 1986. It is maintained collaboratively by the Department of Medical Biochemistry of the University of Geneva (now the Swiss Institute of Bioinformatics) and the EMBL Data Library. Swiss-Prot Sequence Entry Example

Sequence Format Conversion READSEQ: sequence format conversion program. http://bimas.dcrt.nih.gov/molbio/readseq/ Can convert to/from: ASN.1, FASTA, CODATA, GCG, EMBL format, GenBank format, and many other formats.
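Conceptually, a converter reads an entry into a neutral in-memory form and then writes it out in the target format. The sketch below shows only the writing half for FASTA; it is not READSEQ's implementation, and the function name `to_fasta` and the 60-character line width are assumptions made for illustration.

```python
def to_fasta(identifier: str, sequence: str, description: str = "", width: int = 60) -> str:
    """Format one record as FASTA text with fixed-width sequence lines."""
    header = f">{identifier} {description}".rstrip()
    # Split the sequence into chunks of at most `width` characters.
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + chunks) + "\n"


# Hypothetical record, wrapped at 4 characters to keep the output short.
text = to_fasta("id1", "ACGTACGTAC", width=4)
```

Pairing a writer like this with a reader for another format gives a one-way conversion; annotation fields that FASTA cannot hold are lost along the way.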


Thank You  Presented by: Hemal Patel & Jeetal Shah