Using Web-Services: NCBI E-Utilities, online BLAST BCHB524 2015 Lecture 19 By Edwards & Li Slides: https://goo.gl/OWjUMl.



Outline
- NCBI E-Utilities …from a script, via the internet
- NCBI BLAST …from a script, via the internet
- Exercises

NCBI Entrez
- Powerful web portal for NCBI's online databases (38 currently)
- Nucleotide, Protein, PubMed, Gene, Structure, Taxonomy, OMIM, etc.

NCBI Entrez
- We can do a lot using a web browser:
  - Look up a specific record: nucleotide, protein, mRNA, EST, PubMed, structure, …
  - Search for matches to a gene or disease name
  - Download sequence and other data associated with a nucleotide or protein
- Sometimes we need to automate the process.
- Use Entrez to select and return the items of interest, rather than download, parse, and select.

NCBI E-Utilities
- Used to automate the use of Entrez capabilities.
- Google: Entrez Programming Utilities
- See also Chapter 9 of the BioPython tutorial.

Play nice with the Entrez resources!
- No more than 3 URL requests per second.
- At most 100 requests during the day (biopython).
- Limit large jobs to either weekends or between 9:00 PM and 5:00 AM.
- Supply your email address and your tool name.
- Use Entrez history for large requests.
- …otherwise you or your computer could be banned!
- BioPython automates many of the requirements...
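The request-rate rule above can be sketched as a small throttle. This is only an illustration using the standard library: the class name `RateLimiter` and its arguments are hypothetical, and it spaces calls evenly, which is stricter than the stated limit of 3 requests per second.

```python
import time


class RateLimiter:
    """Space out calls so that at most max_per_second happen each second."""

    def __init__(self, max_per_second=3, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock    # injectable, so the class can be tested without waiting
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until min_interval has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


limiter = RateLimiter(max_per_second=3)
for uid in ["uid1", "uid2", "uid3", "uid4"]:
    limiter.wait()    # call this before every E-utility request
    print("would request", uid)
```

Calling `limiter.wait()` before each request keeps a script under the limit even when it talks to the E-utilities without BioPython.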

E-utilities contains 9 tools.
- EInfo (database statistics)
- ESearch (text searches)
- EPost (UID uploads)
- ESummary (document summary downloads)
- EFetch (data record downloads)
- ELink (Entrez links)
- EGQuery (global query)
- ESpell (spelling suggestions)
- ECitMatch (batch citation searching in PubMed)

Entrez Core Engine: EGQuery, ESearch, and ESummary
- EGQuery: egquery.fcgi?term=query
- ESearch: esearch.fcgi?db=database&term=query
- ESummary: esummary.fcgi?db=database&id=uid1,uid2,uid3,...
- Root URL:

Entrez Databases: EInfo, EFetch, and ELink
- EInfo: einfo.fcgi?db=database
- EFetch: efetch.fcgi?db=database&id=uid1,uid2,uid3&rettype=report_type&retmode=data_mode
- ELink: elink.fcgi?dbfrom=initial_database&db=target_database&id=uid1,uid2,uid3
- Root URL:
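The URL patterns above can be assembled with the standard library. This is a minimal sketch: the helper name `eutils_url` is hypothetical, and the base address is an assumption (the public E-utilities address), since the slides leave the root URL blank.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities base address (not given in the slides).
EUTILS_ROOT = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def eutils_url(tool, **params):
    """Return the URL for one E-utility call, e.g. tool='esearch'."""
    # safe="," keeps comma-separated UID lists readable in the query string
    return EUTILS_ROOT + tool + ".fcgi?" + urlencode(params, safe=",")


print(eutils_url("einfo", db="pubmed"))
print(eutils_url("esearch", db="pubmed", term="BRCA1"))
print(eutils_url("efetch", db="nucleotide", id="uid1,uid2,uid3",
                 rettype="gb", retmode="text"))
```

`urlencode` also escapes spaces and brackets, so a query such as `Cypripedioideae[Orgn] AND matK[Gene]` can be passed as `term` without manual quoting.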

Entrez History Server: EPost
- EPost: epost.fcgi?db=database&id=uid1,uid2,uid3,...
- Use history example: esummary.fcgi?db=database&WebEnv=webenv&query_key=key
- Root URL:
- Parameters: 1. &db = database; 2. &query_key = query key; 3. &WebEnv = web environment

Entrez system identifiers

Entrez Database   UID name    E-utility DB Name
PubMed            PMID        pubmed
PubMed Central    PMCID       pmc
Protein           GI number   protein

NCBI E-Utilities
- No need to use Python or BioPython.
- Can form URLs and parse XML directly.
- E-Info, PubMed Info, More
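Parsing the XML directly might look like the following sketch, which uses only the standard library. The sample document imitates the layout of an ESearch response, but its count and ids are made up for illustration, and `parse_esearch` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Illustrative ESearch-style response; the count and ids are made up.
SAMPLE_XML = """<eSearchResult>
  <Count>2</Count>
  <RetMax>2</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>11111111</Id>
    <Id>22222222</Id>
  </IdList>
</eSearchResult>"""


def parse_esearch(xml_text):
    """Return (total hit count, list of returned ids) from ESearch XML."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))
    ids = [elem.text for elem in root.findall("./IdList/Id")]
    return count, ids


count, ids = parse_esearch(SAMPLE_XML)
print(count, ids)
```

In a real script the XML text would come from an HTTP request to an esearch.fcgi URL rather than from a string constant.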

BioPython and Entrez E-Utilities
As you might expect, BioPython provides some nice tools to simplify this process.

    from Bio import Entrez
    Entrez.email = "your.name@example.org"  # placeholder: use your own address

    # List all Entrez databases
    handle = Entrez.einfo()
    result = Entrez.read(handle)
    print(result["DbList"])

    # Details for one database
    handle = Entrez.einfo(db='pubmed')
    result = Entrez.read(handle, validate=False)
    print(result["DbInfo"]["Description"])
    print(result["DbInfo"]["Count"])
    print(result["DbInfo"].keys())

BioPython and Entrez E-Utilities
- "Thin" wrapper around E-Utilities web services
- Use E-Utilities argument names: db for database name, for example
- Use Entrez.read to make a simple dictionary from the XML results.
- Could also parse XML directly (ElementTree), or get results in GenBank format (for sequence)
- Use result.keys() to "discover" structure of returned results.
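The "discover the structure" idea can be taken one step further with a small recursive outline. This is a sketch that assumes the parsed result behaves like nested dicts and lists; `describe` is a hypothetical helper, and the sample value is a made-up stand-in for an Entrez.read result.

```python
def describe(obj, indent=0, max_items=3):
    """Return lines outlining the keys and value types of a nested result."""
    pad = "  " * indent
    lines = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f"{pad}{key}: {type(value).__name__}")
            lines.extend(describe(value, indent + 1, max_items))
    elif isinstance(obj, (list, tuple)):
        # Only look at the first few items of long lists
        for i, value in enumerate(obj[:max_items]):
            lines.append(f"{pad}[{i}]: {type(value).__name__}")
            lines.extend(describe(value, indent + 1, max_items))
    return lines


# Made-up stand-in for a parsed EInfo result
result = {"DbInfo": {"DbName": "pubmed", "Count": "0",
                     "FieldList": [{"Name": "ALL"}]}}
print("\n".join(describe(result)))
```

The outline shows at a glance which keys hold strings, lists, or further dictionaries, which is the same information repeated calls to result.keys() would reveal one level at a time.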

E-Utilities Web-Services
- E-Info: Discover database names and fields
- E-Search: Search within a particular database; returns "primary ids"
- E-Fetch: Download database entries by primary ids
- Others: E-Link, E-Post, E-Summary, E-GQuery

Using ESearch
- By default only get back some of the ids: use retmax to get back more…
- Meaning of returned id is database specific…

    from Bio import Entrez
    Entrez.email = "your.name@example.org"  # placeholder: use your own address

    # Search PubMed by keyword
    handle = Entrez.esearch(db="pubmed", term="BRCA1")
    result = Entrez.read(handle)
    print(result["Count"])
    print(result["IdList"])

    # Search nucleotide records using field-restricted terms
    handle = Entrez.esearch(db="nucleotide",
                            term="Cypripedioideae[Orgn] AND matK[Gene]")
    result = Entrez.read(handle)
    print(result["Count"])
    print(result["IdList"])

Using EFetch

    from Bio import Entrez, SeqIO
    Entrez.email = "your.name@example.org"  # placeholder: use your own address

    # Fetch a single record and print the raw GenBank text
    handle = Entrez.efetch(db="nucleotide", id=" ", rettype="gb", retmode="text")
    print(handle.read())

    # Search, then fetch all matching records by id
    handle = Entrez.esearch(db="nucleotide",
                            term="Cypripedioideae[Orgn] AND matK[Gene]")
    result = Entrez.read(handle)
    idlist = ','.join(result["IdList"])
    handle = Entrez.efetch(db="nucleotide", id=idlist, rettype="gb", retmode="text")
    for r in SeqIO.parse(handle, "genbank"):
        print(r.id, r.description)

ESearch and EFetch together
- Entrez provides a more efficient way to combine ESearch and EFetch.
- After esearch, Entrez already knows the ids you want!
- Sending the ids back with efetch makes Entrez work much harder.
- Use the history mechanism to "remind" Entrez that it already knows the ids.
- Access large result sets in "chunks".
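The chunking arithmetic can be isolated from the network calls. This is a minimal sketch: `history_chunks` is a hypothetical helper, the WebEnv value is a placeholder, and the dictionary keys are chosen on the assumption that they will be passed as keyword arguments to BioPython's Entrez.efetch.

```python
def history_chunks(count, webenv, query_key, chunk_size=100):
    """Yield one dict of efetch arguments per chunk of a stored result set."""
    for start in range(0, count, chunk_size):
        yield {
            "webenv": webenv,
            "query_key": query_key,
            "retstart": start,
            # the last chunk may be smaller than chunk_size
            "retmax": min(chunk_size, count - start),
        }


# Placeholder session values; real ones come from an esearch with usehistory="y"
for params in history_chunks(count=250, webenv="WEBENV_PLACEHOLDER", query_key="1"):
    print(params["retstart"], params["retmax"])
```

Each yielded dictionary describes one request, so a result set of any size is retrieved in pieces of at most `chunk_size` records without resending the ids.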

ESearch and EFetch using esearch history

    from Bio import Entrez, SeqIO
    Entrez.email = "your.name@example.org"  # placeholder: use your own address

    # Run the search and ask Entrez to remember the result set
    handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae[Orgn]",
                            usehistory="y")
    result = Entrez.read(handle)
    handle.close()

    count = int(result["Count"])
    session_cookie = result["WebEnv"]
    query_key = result["QueryKey"]
    print(count, session_cookie, query_key)

    # Get the results in chunks of 100
    chunk_size = 100
    for chunk_start in range(0, count, chunk_size):
        handle = Entrez.efetch(db="nucleotide", rettype="gb", retmode="text",
                               retstart=chunk_start, retmax=chunk_size,
                               webenv=session_cookie, query_key=query_key)
        for r in SeqIO.parse(handle, "genbank"):
            print(r.id, r.description)
        handle.close()

NCBI Blast
- NCBI provides a very powerful BLAST search service on the web.
- We can access this infrastructure as a web service.
- BioPython makes this easy!
- Ch. 7.1 in Tutorial

NCBI Blast
- Lots of parameters…
- Essentially mirrors blast options
- You need to know how to use blast first!

Help on function qblast in module Bio.Blast.NCBIWWW:

qblast(program, database, sequence, ...)
    Do a BLAST search using the QBLAST server at NCBI.

    Supports all parameters of the qblast API for Put and Get.
    Some useful parameters:
    program        blastn, blastp, blastx, tblastn, or tblastx (lower case)
    database       Which database to search against (e.g. "nr").
    sequence       The sequence to search.
    ncbi_gi        TRUE/FALSE whether to give 'gi' identifier.
    descriptions   Number of descriptions to show.  Def 500.
    alignments     Number of alignments to show.  Def 500.
    expect         An expect value cutoff.  Def
    matrix_name    Specify an alt. matrix (PAM30, PAM70, BLOSUM80, BLOSUM45).
    filter         "none" turns off filtering.  Default no filtering
    format_type    "HTML", "Text", "ASN.1", or "XML".  Def. "XML".
    entrez_query   Entrez query to limit Blast search
    hitlist_size   Number of hits to return. Default 50
    megablast      TRUE/FALSE whether to use MEga BLAST algorithm (blastn only)
    service        plain, psi, phi, rpsblast, megablast (lower case)

    This function does no checking of the validity of the parameters
    and passes the values to the server as is.  More help is available at:

NCBI Blast
- Required parameters: Blast program, Blast database, Sequence
- Returns XML format results, by default.
- Save results to a file, for parsing…

    import os.path
    from Bio.Blast import NCBIWWW

    # Only run the (slow) remote search if there is no saved copy
    if not os.path.exists("blastn-nr xml"):
        result_handle = NCBIWWW.qblast("blastn", "nr", " ")
        blast_results = result_handle.read()
        result_handle.close()
        save_file = open("blastn-nr xml", "w")
        save_file.write(blast_results)
        save_file.close()

    # Do something with the blast results in blastn-nr xml

NCBI Blast Parsing
- Results need to be parsed in order to be useful…

    from Bio.Blast import NCBIXML

    result_handle = open("blastn-nr xml")
    for blast_result in NCBIXML.parse(result_handle):
        for desc in blast_result.descriptions:
            # Keep only significant hits
            if desc.e < 1e-5:
                print('****Alignment****')
                print('sequence:', desc.title)
                print('e value:', desc.e)

Exercises
Putative Human – Mouse BRCA1 Orthologs
1. Write a program using NCBI's E-Utilities to retrieve the ids of RefSeq human BRCA1 proteins from NCBI. Use the query: "Homo sapiens"[Organism] AND BRCA1[Gene Name] AND REFSEQ
2. Extend your program to search these protein ids (one at a time) vs RefSeq proteins (refseq_protein) using the NCBI blast web-service.
3. Further extend your program to filter the results for significance (E-value < 1.0e-5) and to extract mouse sequences (match "Mus musculus" in the description).