Python for Informatics: Exploring Information

Slides:

Advertisements

Similar presentations

Introduction to Computing Using Python CSC Winter 2013 Week 8: WWW and Search  World Wide Web  Python Modules for WWW  Web Crawling  Thursday:

Advertisements

Muhammad Taimoor Khan

© 2010, Robert K. Moniot Chapter 1 Introduction to Computers and the Internet 1.

IST 535 Week 1 Class Orientation / Review of Web Basics.

Layer 7- Application Layer

Introduction to Web Database Processing

The World Wide Web and the Internet Dr Jim Briggs 1WUCM1.

IST 221 Internet Concepts and Applications Internet, WWW and HTML 1.

© 2004, Robert K. Moniot Chapter 1 Introduction to Computers and the Internet.

The Internet 8th Edition Tutorial 1 Browser Basics.

HTTP Overview Vijayan Sugumaran School of Business Administration Oakland University.

© 2007 Cisco Systems, Inc. All rights reserved.Cisco Public 1 Version 4.0 Application Layer Functionality and Protocols Network Fundamentals – Chapter.

WEB DESIGN SOME FOUNDATIONS. SO WHAT IS THIS INTERNET.

INTRODUCTION TO WEB DATABASE PROGRAMMING

Human-Computer Interface Course 5. ISPs and Internet connection.

Web Architecture Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial- ShareAlike.

Copyright © cs-tutorial.com. Introduction to Web Development In 1990 and 1991,Tim Berners-Lee created the World Wide Web at the European Laboratory for.

Networking Part 2 Charles Severance Review... Layered Network Model A layered approach allows the problem of implementing a network to be broken into.

Chapter 1: Introduction to Web

 TCP/IP is the communication protocol for the Internet  TCP/IP defines how electronic devices should be connected to the Internet, and how data should.

FTP (File Transfer Protocol) & Telnet

2013Dr. Ali Rodan 1 Handout 1 Fundamentals of the Internet.

Simple Web Services. Internet Basics The Internet is based on a communication protocol named TCP (Transmission Control Protocol) TCP allows programs running.

CP476 Internet Computing Lecture 5 : HTTP, WWW and URL 1 Lecture 5. WWW, HTTP and URL Objective: to review the concepts of WWW to understand how HTTP works.

TCP/IP Protocol Suite 1 Chapter 22 Upon completion you will be able to: World Wide Web: HTTP Understand the components of a browser and a server Understand.

Lecture#2 on Internet and World Wide Web. Internet Applications Electronic Mail ( ) Electronic Mail ( ) Domain mail server collects incoming mail.

© 2007 Cisco Systems, Inc. All rights reserved.Cisco Public ITE PC v4.0 Chapter 1 1 Network Services Networking for Home and Small Businesses – Chapter.

How Web Servers and the Internet Work by by: Marshall Brainby: Marshall Brain

XHTML Introductory1 Linking and Publishing Basic Web Pages Chapter 3.

Network Services Networking for Home & Small Business.

HOW WEB SERVER WORKS? By- PUSHPENDU MONDAL RAJAT CHAUHAN RAHUL YADAV RANJIT MEENA RAHUL TYAGI.

Chapter 1: The Internet and the WWW CIS 275—Web Application Development for Business I.

Okay, here’s a scenario… You’re sitting at a computer…. Type in www. yourcompany.com As soon as you click on search your browser will ask your Operation.

Web Engineering we define Web Engineering as follows: 1) Web Engineering is the application of systematic and proven approaches (concepts, methods, techniques,

MySQL and PHP Internet and WWW. Computer Basics A Single Computer.

Kingdom of Saudi Arabia Ministry of Higher Education Al-Imam Muhammad Ibn Saud Islamic University College of Computer and Information Sciences Chapter.

1 Welcome to CSC 301 Web Programming Charles Frank.

Application Layer Khondaker Abdullah-Al-Mamun Lecturer, CSE Instructor, CNAP AUST.

HTTP - Forms - AppEngine Jim Eng

ECEN “Internet Protocols and Modeling”, Spring 2012 Course Materials: Papers, Reference Texts: Bertsekas/Gallager, Stuber, Stallings, etc Class.

WWW: an Internet application Bill Chu. © Bei-Tseng Chu Aug 2000 WWW Web and HTTP WWW web is an interconnected information servers each server maintains.

Introduction to Dynamic Web Content Dr. Charles Severance

1 WWW. 2 World Wide Web Major application protocol used on the Internet Simple interface Two concepts –Point –Click.

Protocols COM211 Communications and Networks CDA College Olga Pelekanou

IS-907 Java EE World Wide Web - Overview. World Wide Web - History Tim Berners-Lee, CERN, 1990 Enable researchers to share information: Remote Access.

2: Application Layer 1 Chapter 2: Application layer r 2.1 Principles of network applications  app architectures  app requirements r 2.2 Web and HTTP.

Web Technologies Lecture 1 The Internet and HTTP.

Module: Software Engineering of Web Applications Chapter 2: Technologies 1.

Internet Applications (Cont’d) Basic Internet Applications – World Wide Web (WWW) Browser Architecture Static Documents Dynamic Documents Active Documents.

Website Design, Development and Maintenance ONLY TAKE DOWN NOTES ON INDICATED SLIDES.

Protocols Monil Adhikari. Agenda Introduction Port Numbers Non Secure Protocols FTP HTTP Telnet POP3, SMTP Secure Protocols HTTPS.

JavaScript and Ajax (Internet Background) Week 1 Web site:

Data Communications and Computer Networks Chapter 2 CS 3830 Lecture 7 Omar Meqdadi Department of Computer Science and Software Engineering University of.

The Internet What is the Internet? The Internet is a lot of computers over the whole world connected together so that they can share information. It.

Web Design Terminology Unit 2 STEM. 1. Accessibility – a web page or site that address the users limitations or disabilities 2. Active server page (ASP)

Simple Web Services. Internet Basics The Internet is based on a communication protocol named TCP (Transmission Control Protocol) TCP allows programs running.

Search Engine and Optimization 1. Introduction to Web Search Engines 2.

Python: Programming the Google Search (Crawling) Damian Gordon.

Introduction to Dynamic Web Content Dr. Charles Severance

IST 201 Chapter 11 Lecture 2. Ports Used by TCP & UDP Keep track of different types of transmissions crossing the network simultaneously. Combination.

1 Chapter 1 INTRODUCTION TO WEB. 2 Objectives In this chapter, you will: Become familiar with the architecture of the World Wide Web Learn about communication.

Introduction to Dynamic Web Content

CISC103 Web Development Basics: Web site:

JavaScript and Ajax (Internet Background)

E-commerce | WWW World Wide Web - Concepts

E-commerce | WWW World Wide Web - Concepts

CISC103 Web Development Basics: Web site:

Introduction to Dynamic Web Content

Introduction to Dynamic Web Content

Presentation transcript:

Python for Informatics: Exploring Information Networked Programs Chapter 12 Python for Informatics: Exploring Information www.py4inf.com

Copyright 2009- Charles Severance, Jim Eng Phone call as protocol, not automobiles Unless otherwise noted, the content of this course material is licensed under a Creative Commons Attribution 3.0 License. http://creativecommons.org/licenses/by/3.0/. Copyright 2009- Charles Severance, Jim Eng

Client Server Internet Wikipedia

Internet Request HTTP Python JavaScript GET Data Store HTML Response memcache AJAX Templates CSS socket POST

Network Architecture....

Transport Control Protocol (TCP) Built on top of IP (Internet Protocol) Assumes IP might lose some data - stores and retransmits data if it seems to be lost Handles “flow control” using a transmit window Provides a nice reliable pipe Source: http://en.wikipedia.org/wiki/Internet_Protocol_Suite

http://en.wikipedia.org/wiki/Tin_can_telephone http://www.flickr.com/photos/kitcowan/2103850699/

TCP Connections / Sockets "In computer networking, an Internet socket or network socket is an endpoint of a bidirectional inter-process communication flow across an Internet Protocol-based computer network, such as the Internet." Process Process Internet Socket http://en.wikipedia.org/wiki/Internet_socket

TCP Port Numbers A port is an application-specific or process-specific software communications endpoint It allows multiple networked applications to coexist on the same server. There is a list of well-known TCP port numbers http://en.wikipedia.org/wiki/TCP_and_UDP_port

Clipart: http://www.clker.com/search/networksym/1 www.umich.edu Incoming E-Mail 25 blah blah blah blah Login 23 74.208.28.177 80 Web Server 443 109 Personal Mail Box Please connect me to the web server (port 80) on http://www.dr-chuck.com 110 Clipart: http://www.clker.com/search/networksym/1

Common TCP Ports Telnet (23) - Login IMAP (143/220/993) - Mail Retrieval SSH (22) - Secure Login POP (109/110) - Mail Retrieval HTTP (80) HTTPS (443) - Secure DNS (53) - Domain Name SMTP (25) (Mail) FTP (21) - File Transfer http://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers

Sometimes we see the port number in the URL if the web server is running on a "non-standard" port.

Sockets in Python Python has built-in support for TCP Sockets import socket mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) mysock.connect( ('www.py4inf.com', 80) ) Host Port http://docs.python.org/library/socket.html

http://xkcd.com/353/

Application Protocol Since TCP (and Python) gives us a reliable socket, what to we want to do with the socket? What problem do we want to solve? Application Protocols Mail World Wide Web Source: http://en.wikipedia.org/wiki/Internet_Protocol_Suite

HTTP - Hypertext Transport Protocol The dominant Application Layer Protocol on the Internet Invented for the Web - to Retrieve HTML, Images, Documents etc Extended to be data in addition to documents - RSS, Web Services, etc.. Basic Concept - Make a Connection - Request a document - Retrieve the Document - Close the Connection http://en.wikipedia.org/wiki/Http

HTTP The HyperText Transport Protocol is the set of rules to allow browsers to retrieve web documents from servers over the Internet

What is a Protocol? A set of rules that all parties follow for so we can predict each other's behavior And not bump into each other On two-way roads in USA, drive on the right hand side of the road On two-way roads in the UK drive on the left hand side of the road

http://www.dr-chuck.com/page1.htm protocol host document http://www.youtube.com/watch?v=x2GylLq59rI Robert Cailliau CERN 1:17 - 2:19

Getting Data From The Server Each the user clicks on an anchor tag with an href= value to switch to a new page, the browser makes a connection to the web server and issues a “GET” request - to GET the content of the page at the specified URL The server returns the HTML document to the Browser which formats and displays the document to the user.

Making an HTTP request Connect to the server like www.dr-chuck.com a "hand shake" Request a document (or the default document) GET http://www.dr-chuck.com/page1.htm GET http://www.mlive.com/ann-arbor/ GET http://www.facebook.com

Browser

Web Server 80 Browser

Web Server 80 GET http://www.dr-chuck.com/page2.htm Browser

Web Server <h1>The Second Page</h1><p>If you like, you can switch back to the <a href="page1.htm">First Page</a>.</p> 80 GET http://www.dr-chuck.com/page2.htm Browser

Web Server <h1>The Second Page</h1><p>If you like, you can switch back to the <a href="page1.htm">First Page</a>.</p> 80 GET http://www.dr-chuck.com/page2.htm Browser

Source: http://tools.ietf.org/html/rfc791 Internet Standards The standards for all of the Internet protocols (inner workings) are developed by an organization Internet Engineering Task Force (IETF) www.ietf.org Standards are called “RFCs” - “Request for Comments” Source: http://tools.ietf.org/html/rfc791

http://www.w3.org/Protocols/rfc2616/rfc2616.txt

Making an HTTP request Connect to the server like www.dr-chuck.com a "hand shake" Request a document (or the default document) GET http://www.dr-chuck.com/page1.htm GET http://www.mlive.com/ann-arbor/ GET http://www.facebook.com

Port 80 is the non-encrypted HTTP port “Hacking” HTTP Web Server HTTP Request HTTP Response $ telnet www.dr-chuck.com 80 Trying 74.208.28.177... Connected to www.dr-chuck.com. Escape character is '^]'. GET http://www.dr-chuck.com/page1.htm <h1>The First Page</h1> <p>If you like, you can switch to the <a href="http://www.dr-chuck.com/page2.htm">Second Page</a>. </p> Browser Port 80 is the non-encrypted HTTP port

Accurate Hacking in the Movies Matrix Reloaded Bourne Ultimatum Die Hard 4 ... http://nmap.org/movies.html http://www.youtube.com/watch?v=Zy5_gYu_isg

$ telnet www.dr-chuck.com 80 Trying 74.208.28.177... Connected to www.dr-chuck.com.Escape character is '^]'. GET http://www.dr-chuck.com/page1.htm <h1>The First Page</h1> <p>If you like, you can switch to the <a href="http://www.dr-chuck.com/page2.htm">Second Page</a>.</p> Connection closed by foreign host.

Hmmm - This looks kind of Complex.. Lots of GET commands

si-csev-mbp:tex csev$ telnet www.umich.edu 80 Trying 141.211.144.190... Connected to www.umich.edu.Escape character is '^]'. GET / <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head><title>University of Michigan</title><meta name="description" content="University of Michigan is one of the top universities of the world, a diverse public institution of higher learning, fostering excellence in research. U-M provides outstanding undergraduate, graduate and professional education, serving the local, regional, national and international communities." />

... <link rel="alternate stylesheet" type="text/css" href="/CSS/accessible.css" media="screen" title="accessible" /><link rel="stylesheet" href="/CSS/print.css" media="print,projection" /><link rel="stylesheet" href="/CSS/other.css" media="handheld,tty,tv,braille,embossed,speech,aural" />... <dl><dt><a href="http://ns.umich.edu/htdocs/releases/story.php?id=8077"> <img src="/Images/electric-brain.jpg" width="114" height="77" alt="Top News Story" /></a><span class="verbose">:</span></dt><dd><a href="http://ns.umich.edu/htdocs/releases/story.php?id=8077">Scientists harness the power of electricity in the brain</a></dd></dl> As the browser reads the document, it finds other URLs that must be retreived to produce the document.

The big picture... <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"> <head> <title>University of Michigan</title> .... @import "/CSS/graphical.css"/**/; p.text strong, .verbose, .verbose p, .verbose h2{text-indent:-876em;position:absolute} p.text strong a{text-decoration:none} p.text em{font-weight:bold;font-style:normal} div.alert{background:#eee;border:1px solid red;padding:.5em;margin:0 25%} a img{border:none} .hot br, .quick br, dl.feature2 img{display:none} div#main label, legend{font-weight:bold} ...

Firebug reveals the detail... If you haven't already installed the Firebug FireFox extenstion you need it now It can help explore the HTTP request-response cycle Some simple-looking pages involve lots of requests: HTML page(s) Image files CSS Style Sheets Javascript files

Lets Write a Web Browser!

An HTTP Request in Python import socket mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) mysock.connect(('www.py4inf.com', 80)) mysock.send('GET http://www.py4inf.com/code/romeo.txt HTTP/1.0\n\n') while True: data = mysock.recv(512) if ( len(data) < 1 ) : break print data mysock.close()

HTTP/1.1 200 OK Date: Sun, 14 Mar 2010 23:52:41 GMT Server: Apache Last-Modified: Tue, 29 Dec 2009 01:31:22 GMT ETag: "143c1b33-a7-4b395bea" Accept-Ranges: bytes Content-Length: 167 Connection: close Content-Type: text/plain But soft what light through yonder window breaks It is the east and Juliet is the sun Arise fair sun and kill the envious moon Who is already sick and pale with grief HTTP Header while True: data = mysock.recv(512) if ( len(data) < 1 ) : break print data HTTP Body

Making HTTP Easier With urllib

Using urllib in Python Since HTTP is so common, we have a library that does all the socket work for us and makes web pages look like a file import urllib fhand = urllib.urlopen('http://www.py4inf.com/code/romeo.txt') for line in fhand: print line.strip() http://docs.python.org/library/urllib.html urllib1.py

But soft what light through yonder window breaks import urllib fhand = urllib.urlopen('http://www.py4inf.com/code/romeo.txt') for line in fhand: print line.strip() But soft what light through yonder window breaks It is the east and Juliet is the sun Arise fair sun and kill the envious moon Who is already sick and pale with grief http://docs.python.org/library/urllib.html urllib1.py

Like a file... urlwords.py import urllib fhand = urllib.urlopen('http://www.py4inf.com/code/romeo.txt') counts = dict() for line in fhand: words = line.split() for word in words: counts[word] = counts.get(word,0) + 1 print counts urlwords.py

Reading Web Pages urllib1.py import urllib fhand = urllib.urlopen('http://www.dr-chuck.com/page1.htm') for line in fhand: print line.strip() <h1>The First Page</h1><p>If you like, you can switch to the<a href="http://www.dr-chuck.com/page2.htm">Second Page</a>.</p> urllib1.py

Going from one page to another... import urllib fhand = urllib.urlopen('http://www.dr-chuck.com/page1.htm') for line in fhand: print line.strip() <h1>The First Page</h1><p>If you like, you can switch to the<a href="http://www.dr-chuck.com/page2.htm">Second Page</a>.</p>

Google import urllib fhand = urllib.urlopen('http://www.dr-chuck.com/page1.htm') for line in fhand: print line.strip()

Parsing HTML (a.k.a Web Scraping)

What is Web Scraping? When a program or script pretends to be a browser and retrieves web pages, looks at those web pages, extracts information and then looks at more web pages. Search engines scrape web pages - we call this “spidering the web” or “web crawling” http://en.wikipedia.org/wiki/Web_scraping http://en.wikipedia.org/wiki/Web_crawler

Server GET HTML GET HTML

Why Scrape? Pull data - particularly social data - who links to who? Get your own data back out of some system that has no “export capability” Monitor a site for new information Spider the web to make a database for a search engine

Scraping Web Pages There is some controversy about web page scraping and some sites are a bit snippy about it. Google: facebook scraping block Republishing copyrighted information is not allowed Violating terms of service is not allowed

http://www.facebook.com/terms.php

The Easy Way - Beautiful Soup You could do string searches the hard way Or use the free software called BeautifulSoup from www.crummy.com http://www.crummy.com/software/BeautifulSoup/ Place the BeautifulSoup.py file in the same folder as your Python code...

urllinks.py import urllib from BeautifulSoup import * url = raw_input('Enter - ') html = urllib.urlopen(url).read() soup = BeautifulSoup(html) # Retrieve a list of the anchor tags # Each tag is like a dictionary of HTML attributes tags = soup('a') for tag in tags: print tag.get('href', None) urllinks.py

<h1>The First Page</h1><p>If you like, you can switch to the<a href="http://www.dr-chuck.com/page2.htm">Second Page</a>.</p> html = urllib.urlopen(url).read() soup = BeautifulSoup(html) tags = soup('a') for tag in tags: print tag.get('href', None) python urllinks.py Enter - http://www.dr-chuck.com/page1.htm http://www.dr-chuck.com/page2.htm

Summary The TCP/IP gives us pipes / sockets between applications We designed application protocols to make use of these pipes HyperText Transport Protocol (HTTP) is a simple yet powerful protocol Python has good support for sockets, HTTP, and HTML parsing