Wrapping up our last topic: You and your (DNA) parasites Events like these, happening over and over again, have led to…

(apologies—missing the citation, now lost) Bottom line: Roughly half of your (and my) genome is the fossil wreckage of genomic parasites. We know this (in part) from sequence alignments. ~45% Wrapping up our last topic: You and your (DNA) parasites

So far, we’ve talked about DNA, RNA and protein sequences How to compare sequences to decide if they are related Having databases full of sequences and comparing them rapidly (BLAST) In fact, many such databases exist, so today we’ll start with a brief tour of some of the biological data on the web.

Just some of the resources available for bioinformatics Think of these as the raw data for new discoveries

Just some of the resources available for bioinformatics Think of these as the raw data for new discoveries >75K protein-protein interactions >1,300 biochemical processes and reactions, described in detail OMIM = the most important resource for human genetic disease Medline has >22 million research articles, many with complete text online GEO has ~900K experiments, each measuring 1000’s of mRNA or protein abundances

Live demo OMIM, Reactome, Human Protein Atlas

It’s nice to know that all of this exists, but ideally, you’d like to be able to so something constructive with the data. That means getting the data inside your own programs. All of these databases let you download data in big batches, but this isn’t always the case, so…. We saw one way to do this in AppSoma. Here’s another.

Let’s empower your Python scripts to grab data from the web. We’ll use Python library/module = an optional, specialized set of Python methods This particular Python module is called urllib2. urllib2 is: A collection of programs/tools to let you to surf the web from inside your programs. Much more powerful than the simple tasks we’ll do with it. More details: http://docs.python.org/2/library/urllib2.htmlhttp://docs.python.org/2/library/urllib2.html

The basic idea: We first set up a “request” by opening a connection to the URL. We then save the response in a variable and print it. If it can’t connect to the site, it’ll print out a helpful error message instead of the page. You can more or less use the commands in a cookbook fashion….

import urllib2# include the urllib2 module url = "http://www.utexas.edu/" try:# this 'try' statement tells Python that we might expect an error. request = urllib2.urlopen(url)# setup a request page = request.read()# save the response print page# show the result to the user except urllib2.HTTPError:# handle a page not found error print "Could not find page." For example:  Run this…

>>> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> Home | The University of Texas at Austin <!— …and so on…  We just captured the UT web page and printed it out (minus the images)…

That was a static web page. Let’s try one that requires some sort of action, for example by entering a document id or an id code for a sequence. Many web pages pass this information along in the web URL itself…

Here’s a complete Python program to retrieve a single entry from Medline: import urllib2 pmid = 11237011 # Insert the pmid where the {} are in the following URL: url = "http://www.ncbi.nlm.nih.gov/pubmed/{0}?report=medline&format=text".format(pmid) try:# there might be an error! request = urllib2.urlopen(url) page = request.read() print page except urllib2.HTTPError:# handle page not found error print "Could not connect to Medline!"

If you run that program, you should get back… the Medline entry for the human genome sequence paper >>> PMID- 11237011 OWN - NLM STAT- MEDLINE DA - 20010309 DCOM- 20010322 LR - 20061115 IS - 0028-0836 (Print) IS - 0028-0836 (Linking) VI - 409 IP - 6822 DP - 2001 Feb 15 TI - Initial sequencing and analysis of the human genome. PG - 860-921 AB - The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence. FAU - Lander, E S AU - Lander ES AD - Whitehead Institute for Biomedical Research, Center for Genome Research, Cambridge, Massachusetts 02142, USA. lander@genome.wi.mit.edu [and so on]

>>> PMID- 11237011 OWN - NLM STAT- MEDLINE DA - 20010309 DCOM- 20010322 LR - 20061115 IS - 0028-0836 (Print) IS - 0028-0836 (Linking) VI - 409 IP - 6822 DP - 2001 Feb 15 TI - Initial sequencing and analysis of the human genome. PG - 860-921 AB - The human genome holds an extraordinary trove of information about human development, physiology, medicine and evolution. Here we report the results of an international collaboration to produce and make freely available a draft sequence of the human genome. We also present an initial analysis of the data, describing some of the insights that can be gleaned from the sequence. FAU - Lander, E S AU - Lander ES AD - Whitehead Institute for Biomedical Research, Center for Genome Research, Cambridge, Massachusetts 02142, USA. lander@genome.wi.mit.edu [and so on] If you run that program, you should get back… We just printed it. We could have saved it or extracted data from it. For example…

Here’s our Python program again to retrieve a single entry from Medline. How would we modify this to count the authors? import urllib2 pmid = 11237011 # Insert the pmid where the {} are in the following URL: url = "http://www.ncbi.nlm.nih.gov/pubmed/{0}?report=medline&format=text".format(pmid) try:# there might be an error! request = urllib2.urlopen(url) page = request.read() print page except urllib2.HTTPError:# handle page not found error print "Could not connect to Medline!"

Here’s our Python program again to retrieve a single entry from Medline. How would we modify this to count the authors? import urllib2 pmid = 11237011 # Insert the pmid where the {} are in the following URL: url = "http://www.ncbi.nlm.nih.gov/pubmed/{0}?report=medline&format=text".format(pmid) try:# there might be an error! request = urllib2.urlopen(url) page = request.read() print page.count("AU - ") except urllib2.HTTPError:# handle page not found error print "Could not connect to Medline!" Medline begins author lines with "AU - ", so…  Run this, & get … >>> 255 So, there were 255 authors on one (of the two) human genome papers

Queries to Medline or any other NCBI database, including GenBank, are described at: http://www.ncbi.nlm.nih.gov/books/NBK3862/ http://www.ncbi.nlm.nih.gov/books/NBK3862/ You can often figure out the form of the URL just by looking something up in a database, then noting the address of the web page with the data. This very simple approach could easily be the basis for: a home-made web browser a program to consult biological databases in real time a program to map the internet, etc. Of course, with this kind of power available to you, the imagination reels...

Wrapping up our last topic: You and your (DNA) parasites Events like these, happening over and over again, have led to…

Similar presentations

Presentation on theme: "Wrapping up our last topic: You and your (DNA) parasites Events like these, happening over and over again, have led to…"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Wrapping up our last topic: You and your (DNA) parasites Events like these, happening over and over again, have led to…

Similar presentations

Presentation on theme: "Wrapping up our last topic: You and your (DNA) parasites Events like these, happening over and over again, have led to…"— Presentation transcript:

Similar presentations

About project

Feedback