Introduction to Computing Using Python: Data Storage and Processing. Topics: Databases and SQL; Python Database Programming; List comprehension and MapReduce.


1 Introduction to Computing Using Python
Data Storage and Processing
- Databases and SQL
- Python Database Programming
- List comprehension and MapReduce
- Parallel Computing

2 Data storage
The data collected by a web crawler can be stored in a text file.
[Figure: five linked web pages and the word counts found on each]
  one.html:   Beijing × 3, Paris × 5, Chicago × 5
  two.html:   Bogota × 3, Beijing × 2, Paris × 1
  three.html: Chicago × 3, Beijing × 6
  four.html:  Chicago × 3, Paris × 2, Nairobi × 1
  five.html:  Nairobi × 7, Bogota × 2

3 Data storage
  URL word count
  http://reed.cs.depaul.edu/lperkovic/one.html Paris 5
  http://reed.cs.depaul.edu/lperkovic/one.html Beijing 3
  http://reed.cs.depaul.edu/lperkovic/one.html Chicago 5
  URL link
  http://reed.cs.depaul.edu/lperkovic/one.html http://reed.cs.depaul.edu/lperkovic/two.html
  http://reed.cs.depaul.edu/lperkovic/one.html http://reed.cs.depaul.edu/lperkovic/three.html
  URL word count
  http://reed.cs.depaul.edu/lperkovic/two.html Bogota 3
  http://reed.cs.depaul.edu/lperkovic/two.html Paris 1
  http://reed.cs.depaul.edu/lperkovic/two.html Beijing 2
  URL link
  http://reed.cs.depaul.edu/lperkovic/two.html http://reed.cs.depaul.edu/lperkovic/four.html
  URL word count
  http://reed.cs.depaul.edu/lperkovic/four.html Paris 2
  ...

4 Data storage
A search engine app may then need to access this file to answer queries such as:
  1. In which web pages does word X appear?
  2. What is the ranking of web pages containing word X, based on the number of occurrences of word X in the page?
  3. How many pages contain word X?
  4. What pages have a hyperlink to page Y?
  5. What is the total number of occurrences of word ‘Paris’ across all web pages?
  6. How many outgoing links does each visited page have?
  7. How many incoming links does each visited page have?
  8. What pages have a link to a page containing word X?
  9. What page containing word X has the most incoming links?
A text file is not ideal for this...

5 Data storage
The data collected by a web crawler can be stored in a text file...
[Figure: the same five pages and word counts as on slide 2]

6 Database files
The data collected by a web crawler can be stored in a text file... or in a database file.

Table Hyperlinks (the row four.html -> five.html is garbled in the transcript; it is restored here because later slides count 8 × 13 = 104 cross-join records and list four.html as linking to five.html)
  Url         Link
  one.html    two.html
  one.html    three.html
  two.html    four.html
  three.html  four.html
  four.html   five.html
  five.html   one.html
  five.html   two.html
  five.html   four.html

Table Keywords
  Url         Word     Freq
  one.html    Beijing  3
  one.html    Paris    5
  one.html    Chicago  5
  two.html    Bogota   3
  two.html    Beijing  2
  two.html    Paris    1
  three.html  Chicago  3
  three.html  Beijing  6
  four.html   Chicago  3
  four.html   Paris    2
  four.html   Nairobi  5
  five.html   Nairobi  7
  five.html   Bogota   2

7 Database files
- A database file consists of one or more tables.
- Each table has a name and consists of rows and columns.
- Each column has a name and contains data of a specific type.
- Each row is a database record.
(Tables Hyperlinks and Keywords as on slide 6.)

8 Database files
Database files are not read from or written to directly. Instead, “read/write” commands are sent to a special type of server program called a database engine that manages the database. The database engine accesses the database file on the user’s behalf. The commands accepted by database engines are statements written in the Structured Query Language (SQL).

9 SQL SELECT FROM statement
SQL statement SELECT is used to make queries into a database.
  SELECT Link FROM Hyperlinks
Result table (one record per row of Hyperlinks, duplicates included):
  Link
  two.html
  three.html
  four.html
  four.html
  five.html
  one.html
  two.html
  four.html

10 SQL SELECT FROM statement
SQL statement SELECT is used to make queries into a database.
  SELECT Url, Word FROM Keywords
Result table:
  Url         Word
  one.html    Beijing
  one.html    Paris
  one.html    Chicago
  two.html    Bogota
  two.html    Beijing
  two.html    Paris
  three.html  Chicago
  three.html  Beijing
  four.html   Chicago
  four.html   Paris
  four.html   Nairobi
  five.html   Nairobi
  five.html   Bogota

11 SQL SELECT FROM statement
SQL statement SELECT is used to make queries into a database.
  SELECT * FROM Hyperlinks
The wildcard * selects every column, so the result table is a copy of table Hyperlinks (columns Url and Link).

12 SQL DISTINCT keyword
SQL keyword DISTINCT removes duplicate records in the result table.
  SELECT DISTINCT Link FROM Hyperlinks
Result table:
  Link
  two.html
  three.html
  four.html
  five.html
  one.html

13 SQL WHERE clause
SQL clause WHERE is used to select only those records that satisfy a condition.
“In which pages does word X appear?”
  SELECT Url FROM Keywords WHERE Word = 'Paris'
Result table:
  Url
  one.html
  two.html
  four.html

14 SQL WHERE clause
SQL clause WHERE is used to select only those records that satisfy a condition.
  SELECT Column(s) FROM Table WHERE Column operator value
  SELECT Column(s) FROM Table WHERE Column BETWEEN value1 AND value2

  Operator  Explanation
  =         Equal
  <>        Not equal
  >         Greater than
  <         Less than
  >=        Greater than or equal
  <=        Less than or equal
  BETWEEN   Within an inclusive range

15 SQL keyword DESC
SQL keyword DESC is used with the ORDER BY clause to order the records in the result table in descending order.
“What is the ranking of web pages containing word X, based on the number of occurrences of word X in the page?”
  SELECT Url, Freq FROM Keywords WHERE Word = 'Paris' ORDER BY Freq DESC
Result table (frequencies taken from table Keywords):
  Url        Freq
  one.html   5
  four.html  2
  two.html   1
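The WHERE, BETWEEN, and ORDER BY queries above can be tried without a database file. The following is a minimal sketch, assuming an in-memory SQLite database (the ':memory:' name) filled with the Keywords records shown on the slides; it uses the sqlite3 module that the presentation introduces later, and the variable names are illustrative only.

```python
import sqlite3

# Assumption: an in-memory database standing in for the crawler's database file.
con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE Keywords (Url text, Word text, Freq int)')
rows = [('one.html', 'Beijing', 3), ('one.html', 'Paris', 5),
        ('one.html', 'Chicago', 5), ('two.html', 'Bogota', 3),
        ('two.html', 'Beijing', 2), ('two.html', 'Paris', 1),
        ('three.html', 'Chicago', 3), ('three.html', 'Beijing', 6),
        ('four.html', 'Chicago', 3), ('four.html', 'Paris', 2),
        ('four.html', 'Nairobi', 5), ('five.html', 'Nairobi', 7),
        ('five.html', 'Bogota', 2)]
cur.executemany('INSERT INTO Keywords VALUES (?, ?, ?)', rows)

# Rank the pages containing 'Paris' by descending frequency.
cur.execute("SELECT Url, Freq FROM Keywords "
            "WHERE Word = 'Paris' ORDER BY Freq DESC")
ranking = cur.fetchall()
print(ranking)

# BETWEEN is inclusive at both ends.
cur.execute('SELECT Url, Word FROM Keywords '
            'WHERE Freq BETWEEN 6 AND 7 ORDER BY Freq')
high = cur.fetchall()
print(high)
con.close()
```

Without ORDER BY, SQL does not promise any particular record order, which is why both queries state one explicitly.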

16 Exercise
(Tables Hyperlinks and Keywords as on slide 6.)
Write an SQL query that returns:
1. The URL of every page that has a link to web page four.html
  SELECT DISTINCT Url FROM Hyperlinks WHERE Link = 'four.html'

17 Exercise
Write an SQL query that returns:
2. The URL of every page that has an incoming link from page four.html
  SELECT DISTINCT Link FROM Hyperlinks WHERE Url = 'four.html'

18 Exercise
Write an SQL query that returns:
3. The URL and word for every word that appears exactly three times in the web page associated with the URL
  SELECT Url, Word FROM Keywords WHERE Freq = 3

19 Exercise
Write an SQL query that returns:
4. The URL, word, and frequency for every word that appears between 3 and 5 times, inclusive, in the web page associated with the URL
  SELECT * FROM Keywords WHERE Freq BETWEEN 3 AND 5

20 SQL built-in functions
SQL includes built-in math functions such as COUNT() and SUM().
“How many pages contain the word Paris?”
  SELECT COUNT(*) FROM Keywords WHERE Word = 'Paris'
Result: 3

21 SQL built-in functions
SQL includes built-in math functions such as COUNT() and SUM().
  SELECT SUM(Freq) FROM Keywords WHERE Word = 'Paris'
Result: 8

22 SQL GROUP BY clause
SQL clause GROUP BY groups the records of a table that have the same value in a column.
“How many outgoing links does each web page have?”
  SELECT Url, COUNT(*) FROM Hyperlinks GROUP BY Url
Result table:
  Url         COUNT(*)
  one.html    2
  two.html    1
  three.html  1
  four.html   1
  five.html   3

23 Exercise
(Tables Hyperlinks and Keywords as on slide 6.)
Write an SQL query that returns:
1. The number of words, including duplicates, that page two.html contains
  SELECT SUM(Freq) FROM Keywords WHERE Url = 'two.html'

24 Exercise
Write an SQL query that returns:
2. The number of distinct words page two.html contains
  SELECT COUNT(*) FROM Keywords WHERE Url = 'two.html'

25 Exercise
Write an SQL query that returns:
3. The number of words, including duplicates, that each web page has
  SELECT Url, SUM(Freq) FROM Keywords GROUP BY Url

26 Exercise
Write an SQL query that returns:
4. The number of incoming links each web page has
  SELECT Link, COUNT(*) FROM Hyperlinks GROUP BY Link
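The two GROUP BY queries on table Hyperlinks (outgoing and incoming link counts) can be checked with a short script. This is a sketch under two assumptions: the table lives in an in-memory SQLite database, and it holds eight links including four.html -> five.html, which matches the GROUP BY result table shown on slide 22.

```python
import sqlite3

con = sqlite3.connect(':memory:')   # assumption: throwaway in-memory database
cur = con.cursor()
cur.execute('CREATE TABLE Hyperlinks (Url text, Link text)')
# Assumed contents of Hyperlinks: eight (source page, target page) records.
links = [('one.html', 'two.html'), ('one.html', 'three.html'),
         ('two.html', 'four.html'), ('three.html', 'four.html'),
         ('four.html', 'five.html'), ('five.html', 'one.html'),
         ('five.html', 'two.html'), ('five.html', 'four.html')]
cur.executemany('INSERT INTO Hyperlinks VALUES (?, ?)', links)

# Grouping by the source column counts outgoing links per page.
cur.execute('SELECT Url, COUNT(*) FROM Hyperlinks GROUP BY Url')
outgoing = dict(cur.fetchall())

# Grouping by the target column counts incoming links per page.
cur.execute('SELECT Link, COUNT(*) FROM Hyperlinks GROUP BY Link')
incoming = dict(cur.fetchall())
con.close()

print(outgoing)
print(incoming)
```

Each result record is a (page, count) tuple, so dict() turns the result into a lookup table keyed by page.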

27 SQL queries involving multiple tables
“What web pages have a link to a page containing word ‘Bogota’?”
(Tables Hyperlinks and Keywords as on slide 6.)
This question requires a lookup of both tables:
- Look up Keywords to find the set S of URLs of pages containing word ‘Bogota’.
- Then look up Hyperlinks to find the URLs of pages with links to pages in S.

28 SQL queries involving multiple tables
The SELECT statement can be used on multiple tables.
  SELECT * FROM Hyperlinks, Keywords

29 SQL queries involving multiple tables
The SELECT statement can be used on multiple tables.
  SELECT * FROM Hyperlinks, Keywords
The result table is the cross join of tables Hyperlinks and Keywords. It has five named columns, corresponding to the two columns of table Hyperlinks and the three columns of table Keywords. It has 104 records, each a combination of a record in Hyperlinks and a record in Keywords.
Result table:
  (Hyperlinks)           (Keywords)
  Url        Link        Url        Word     Freq
  one.html   two.html    one.html   Beijing  3
  one.html   two.html    one.html   Paris    5
  one.html   two.html    one.html   Chicago  5
  one.html   two.html    two.html   Bogota   3
  ...
  five.html  four.html   four.html  Nairobi  5
  five.html  four.html   five.html  Nairobi  7
  five.html  four.html   five.html  Bogota   2

30 SQL queries involving multiple tables
The SELECT statement can be used on multiple tables.
  SELECT * FROM Hyperlinks, Keywords WHERE Hyperlinks.Link = Keywords.Url
(The transcript prints the condition as Hyperlinks.Url = Keywords.Url, but the result table on the next slide pairs each link with the keywords of its target page, so the condition is given here as Hyperlinks.Link = Keywords.Url.)

31 SQL queries involving multiple tables
The SELECT statement can be used on multiple tables.
  SELECT * FROM Hyperlinks, Keywords WHERE Hyperlinks.Link = Keywords.Url
Result table:
  (Hyperlinks)            (Keywords)
  Url        Link         Url         Word     Freq
  one.html   two.html     two.html    Bogota   3
  one.html   two.html     two.html    Beijing  2
  one.html   two.html     two.html    Paris    1
  one.html   three.html   three.html  Chicago  3
  ...
  five.html  four.html    four.html   Paris    2
  five.html  four.html    four.html   Nairobi  5

32 SQL queries involving multiple tables
“What web pages have a link to a page containing word ‘Bogota’?”
  SELECT * FROM Hyperlinks, Keywords WHERE Keywords.Word = 'Bogota' AND Hyperlinks.Link = Keywords.Url

33 SQL queries involving multiple tables
  SELECT * FROM Hyperlinks, Keywords WHERE Keywords.Word = 'Bogota' AND Hyperlinks.Link = Keywords.Url
Result table:
  (Hyperlinks)           (Keywords)
  Url        Link        Url        Word    Freq
  one.html   two.html    two.html   Bogota  3
  four.html  five.html   five.html  Bogota  2
  five.html  two.html    two.html   Bogota  3

34 SQL queries involving multiple tables
  SELECT Hyperlinks.Url FROM Hyperlinks, Keywords WHERE Keywords.Word = 'Bogota' AND Hyperlinks.Link = Keywords.Url
Result table:
  Url
  one.html
  four.html
  five.html
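The two-table queries above can be reproduced as follows. This is a sketch rather than the presentation's own code: it assumes an in-memory SQLite database, the Keywords records from the slides, and a Hyperlinks table of eight links (including four.html -> five.html), which is what the 104-record cross join implies.

```python
import sqlite3

# Assumed table contents, copied from the tables on the slides.
links = [('one.html', 'two.html'), ('one.html', 'three.html'),
         ('two.html', 'four.html'), ('three.html', 'four.html'),
         ('four.html', 'five.html'), ('five.html', 'one.html'),
         ('five.html', 'two.html'), ('five.html', 'four.html')]
keywords = [('one.html', 'Beijing', 3), ('one.html', 'Paris', 5),
            ('one.html', 'Chicago', 5), ('two.html', 'Bogota', 3),
            ('two.html', 'Beijing', 2), ('two.html', 'Paris', 1),
            ('three.html', 'Chicago', 3), ('three.html', 'Beijing', 6),
            ('four.html', 'Chicago', 3), ('four.html', 'Paris', 2),
            ('four.html', 'Nairobi', 5), ('five.html', 'Nairobi', 7),
            ('five.html', 'Bogota', 2)]

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE Hyperlinks (Url text, Link text)')
cur.execute('CREATE TABLE Keywords (Url text, Word text, Freq int)')
cur.executemany('INSERT INTO Hyperlinks VALUES (?, ?)', links)
cur.executemany('INSERT INTO Keywords VALUES (?, ?, ?)', keywords)

# Cross join: every Hyperlinks record paired with every Keywords record.
cur.execute('SELECT COUNT(*) FROM Hyperlinks, Keywords')
cross_size = cur.fetchone()[0]

# Keep only pairs where the link target is a page containing 'Bogota'.
cur.execute("SELECT Hyperlinks.Url FROM Hyperlinks, Keywords "
            "WHERE Keywords.Word = 'Bogota' "
            "AND Hyperlinks.Link = Keywords.Url")
pages = sorted(row[0] for row in cur.fetchall())
con.close()

print(cross_size, pages)
```

The column names are qualified with the table name (Hyperlinks.Url, Keywords.Url) because both tables have a column called Url.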

35 SQL CREATE TABLE statement
SQL statement CREATE TABLE is used to create a table in a database file.
  CREATE TABLE Keywords ( Url text, Word text, Freq int )
Result: an empty table Keywords with columns Url, Word, and Freq.

36 SQL CREATE TABLE statement
SQL statement CREATE TABLE is used to create a table in a database file.
  CREATE TABLE TableName ( Column1 dataType1, Column2 dataType2, ... )
Result: an empty table TableName with columns Column1, Column2, ...

  SQL Type  Python Type  Explanation
  INTEGER   int          Holds integer values
  REAL      float        Holds floating-point values
  TEXT      str          Holds string values, delimited with quotes
  BLOB      bytes        Holds sequence of bytes

37 SQL INSERT statement
SQL statement INSERT is used to add a record to a table.
  INSERT INTO Keywords VALUES ('one.html', 'Beijing', 3)
Table Keywords before: empty. Table Keywords after:
  Url       Word     Freq
  one.html  Beijing  3

38 SQL UPDATE statement
SQL statement UPDATE is used to modify a record in a table.
  UPDATE Keywords SET Freq = 4 WHERE Url = 'two.html' AND Word = 'Bogota'
Effect on table Keywords: the record (two.html, Bogota, 3) becomes (two.html, Bogota, 4); all other records are unchanged.
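As a small end-to-end check of CREATE TABLE, INSERT, and UPDATE, the statements can be run against a scratch database. The sketch below assumes an in-memory SQLite database and only two records; it also shows the Python types that come back for text and int columns.

```python
import sqlite3

con = sqlite3.connect(':memory:')   # assumption: scratch in-memory database
cur = con.cursor()
cur.execute('CREATE TABLE Keywords (Url text, Word text, Freq int)')
cur.execute("INSERT INTO Keywords VALUES ('two.html', 'Bogota', 3)")
cur.execute("INSERT INTO Keywords VALUES ('two.html', 'Paris', 1)")

# Only the record matching both conditions is modified.
cur.execute("UPDATE Keywords SET Freq = 4 "
            "WHERE Url = 'two.html' AND Word = 'Bogota'")
changed = cur.rowcount   # number of records the UPDATE modified

cur.execute('SELECT Url, Word, Freq FROM Keywords ORDER BY Word')
records = cur.fetchall()
# Python types of the values in the first record.
types = [type(value).__name__ for value in records[0]]
con.close()

print(changed, records, types)
```

An UPDATE without a WHERE clause would modify every record in the table, so the condition is what limits the change to a single row.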

39 Standard Library module sqlite3
The Python Standard Library includes module sqlite3, which provides an API for accessing database files. It is an interface to a library of functions that accesses the database files directly.
  >>> import sqlite3
  >>> con = sqlite3.connect('web.db')
sqlite3 function connect() takes as input the name of a database file and returns an object of type Connection, a type defined in module sqlite3. The Connection object con is associated with database file web.db. If database file web.db does not exist in the current working directory, a new database file web.db is created.

40 Standard Library module sqlite3
  >>> import sqlite3
  >>> con = sqlite3.connect('web.db')
  >>> cur = con.cursor()
Connection method cursor() returns an object of type Cursor, another type defined in module sqlite3. Cursor objects are responsible for executing SQL statements.

41 Standard Library module sqlite3
  >>> import sqlite3
  >>> con = sqlite3.connect('web.db')
  >>> cur = con.cursor()
  >>> cur.execute("CREATE TABLE Keywords (Url text, Word text, Freq int)")
  >>> cur.execute("INSERT INTO Keywords VALUES ('one.html', 'Beijing', 3)")
The Cursor class supports method execute(), which takes an SQL statement as a string and executes it. The values in the INSERT statement above are hardcoded.

42 Parameter substitution
In general, the values used in an SQL statement will not be hardcoded in the program but will come from Python variables.
  >>> cur.execute("INSERT INTO Keywords VALUES ('one.html', 'Beijing', 3)")
  >>> url, word, freq = 'one.html', 'Paris', 5
  >>>

43 Parameter substitution
Parameter substitution is the technique used to construct SQL statements that make use of Python variable values; it is similar to string formatting. Each ? placeholder is filled with the corresponding item of the tuple passed as the second argument.
  >>> cur.execute("INSERT INTO Keywords VALUES ('one.html', 'Beijing', 3)")
  >>> url, word, freq = 'one.html', 'Paris', 5
  >>> cur.execute("INSERT INTO Keywords VALUES (?, ?, ?)", (url, word, freq))

44 Parameter substitution
  >>> cur.execute("INSERT INTO Keywords VALUES ('one.html', 'Beijing', 3)")
  >>> url, word, freq = 'one.html', 'Paris', 5
  >>> cur.execute("INSERT INTO Keywords VALUES (?, ?, ?)", (url, word, freq))
  >>> record = ('one.html','Chicago', 5)
  >>> cur.execute("INSERT INTO Keywords VALUES (?, ?, ?)", record)
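Two related forms of parameter substitution are worth knowing alongside the ? placeholders: Cursor method executemany(), which runs one statement for each tuple in a sequence, and named placeholders filled from a dictionary. The sketch below assumes an in-memory database; the page six.html and the word O'Hare are made up for illustration and do not appear in the slides.

```python
import sqlite3

con = sqlite3.connect(':memory:')   # assumption: scratch in-memory database
cur = con.cursor()
cur.execute('CREATE TABLE Keywords (Url text, Word text, Freq int)')

# executemany() executes the statement once per tuple in the list.
records = [('one.html', 'Beijing', 3), ('one.html', 'Paris', 5)]
cur.executemany('INSERT INTO Keywords VALUES (?, ?, ?)', records)

# Named placeholders (:name) take their values from a dictionary.
cur.execute('INSERT INTO Keywords VALUES (:url, :word, :freq)',
            {'url': 'one.html', 'word': 'Chicago', 'freq': 5})

# Hypothetical record: a value containing a quote is passed as data,
# so it cannot break or alter the SQL statement.
cur.execute('INSERT INTO Keywords VALUES (?, ?, ?)',
            ('six.html', "O'Hare", 1))

cur.execute('SELECT Word FROM Keywords ORDER BY Word')
stored = [row[0] for row in cur.fetchall()]
con.close()
print(stored)
```

This is the practical difference from building the statement with string formatting: with placeholders the values never become part of the SQL text, so quotes in the data need no escaping.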

45 Parameter substitution
Changes to a database file are not written to the database file immediately; they are only recorded temporarily, in memory. In order to ensure that the changes are written to the database file, the commit() method must be called on the Connection object. A database file should be closed just like any other file.
  >>> cur.execute("INSERT INTO Keywords VALUES ('one.html', 'Beijing', 3)")
  >>> url, word, freq = 'one.html', 'Paris', 5
  >>> cur.execute("INSERT INTO Keywords VALUES (?, ?, ?)", (url, word, freq))
  >>> record = ('one.html','Chicago', 5)
  >>> cur.execute("INSERT INTO Keywords VALUES (?, ?, ?)", record)
  >>> con.commit()
  >>> con.close()
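The effect of commit() can be observed by closing a connection with one change committed and one not, then reopening the file. This is a sketch assuming the default transaction handling of the sqlite3 module and a temporary directory for the database file; con.execute() is a Connection shortcut that creates a cursor and calls its execute() method.

```python
import os
import sqlite3
import tempfile

# Assumption: a throwaway database file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'web.db')

con = sqlite3.connect(path)
con.execute('CREATE TABLE Keywords (Url text, Word text, Freq int)')
con.execute("INSERT INTO Keywords VALUES ('one.html', 'Beijing', 3)")
con.commit()   # the first record is now written to the file
con.execute("INSERT INTO Keywords VALUES ('one.html', 'Paris', 5)")
con.close()    # closed without commit(): the second record is discarded

# Reopen the file and count the records that were actually saved.
con = sqlite3.connect(path)
count = con.execute('SELECT COUNT(*) FROM Keywords').fetchone()[0]
con.close()
print(count)
```

Only the committed record survives, which is why the slide calls commit() before close().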

46 Querying a database
The result of a query is stored in the Cursor object. To obtain the result as a list of tuple objects, Cursor method fetchall() is used.
  >>> import sqlite3
  >>> con = sqlite3.connect('links.db')
  >>> cur = con.cursor()
  >>> cur.execute('SELECT * FROM Keywords')
  >>> cur.fetchall()
  [('one.html', 'Beijing', 3), ('one.html', 'Paris', 5), ('one.html', 'Chicago', 5), ('two.html', 'Bogota', 5), ('two.html', 'Beijing', 2), ('two.html', 'Paris', 1), ('three.html', 'Chicago', 3), ('three.html', 'Beijing', 6), ('four.html', 'Chicago', 3), ('four.html', 'Paris', 2), ('four.html', 'Nairobi', 5), ('five.html', 'Nairobi', 7), ('five.html', 'Bogota', 2)]
  >>>

47 Querying a database
An alternative is to iterate over the Cursor object.
  >>> cur.execute('SELECT * FROM Keywords')
  >>> for record in cur:
          print(record)

  ('one.html', 'Beijing', 3)
  ('one.html', 'Paris', 5)
  ('one.html', 'Chicago', 5)
  ('two.html', 'Bogota', 5)
  ('two.html', 'Beijing', 2)
  ('two.html', 'Paris', 1)
  ('three.html', 'Chicago', 3)
  ('three.html', 'Beijing', 6)
  ('four.html', 'Chicago', 3)
  ('four.html', 'Paris', 2)
  ('four.html', 'Nairobi', 5)
  ('five.html', 'Nairobi', 7)
  ('five.html', 'Bogota', 2)
  >>>
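Besides fetchall() and iteration, the Cursor class also offers fetchone() and fetchmany() for consuming a result a few records at a time. The following sketch assumes an in-memory database holding four of the Keywords records; each call continues where the previous one stopped.

```python
import sqlite3

con = sqlite3.connect(':memory:')   # assumption: scratch in-memory database
cur = con.cursor()
cur.execute('CREATE TABLE Keywords (Url text, Word text, Freq int)')
cur.executemany('INSERT INTO Keywords VALUES (?, ?, ?)',
                [('one.html', 'Beijing', 3), ('one.html', 'Paris', 5),
                 ('one.html', 'Chicago', 5), ('two.html', 'Bogota', 3)])

cur.execute('SELECT Word, Freq FROM Keywords ORDER BY Freq DESC, Word')
first = cur.fetchone()        # the next record as a tuple
next_two = cur.fetchmany(2)   # a list of up to 2 further records
rest = cur.fetchall()         # everything not fetched yet
done = cur.fetchone()         # None once the result is exhausted
con.close()

print(first, next_two, rest, done)
```

Fetching in pieces avoids building one large list when a query returns many records.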

48 Querying a database
Parameter substitution is again used whenever Python variable values are needed in the SQL statement.
  >>> word = 'Paris'
  >>> cur.execute('SELECT Url FROM Keywords WHERE Word = ?', (word,))
  >>> cur.fetchall()
  [('one.html',), ('two.html',), ('four.html',)]
  >>> word, n = 'Beijing', 2
  >>> cur.execute("SELECT * FROM Keywords WHERE Word = ? AND Freq > ?", (word, n))
  >>> cur.fetchall()
  [('one.html', 'Beijing', 3), ('three.html', 'Beijing', 6)]
  >>>

49 List comprehension
Suppose we want to construct a list from an “old” list by modifying each “old” list item in the same way.
  lines:    ['First Line\n', 'Second\n', '\n', 'and Fourth.\n']
  newlines: ['First Line', 'Second', '', 'and Fourth.']
Method 1: accumulator pattern
  >>> lines
  ['First Line\n', 'Second\n', '\n', 'and Fourth.\n']
  >>> newlines = []
  >>> for i in range(len(lines)):
          newlines.append(lines[i][:-1])

  >>> newlines
  ['First Line', 'Second', '', 'and Fourth.']
Method 2: list comprehension
  >>> newlines = [line[:-1] for line in lines]
  >>> newlines
  ['First Line', 'Second', '', 'and Fourth.']

50 Introduction to Computing Using Python List comprehension

The syntax of the list comprehension statement:

[<expression> for <item> in <sequence>]

More generally:

[<expression> for <item> in <sequence> if <condition>]

Examples:

>>> [line[:-1] for line in lines if line != '\n']
['First Line', 'Second', 'and Fourth.']
>>> [i for i in range(0, 20, 2)]
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
>>> [len(word) for word in ['hawk', 'hen', 'hog', 'hyena']]
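To make the general form concrete, here is a short sketch that writes the filtered comprehension next to the loop it abbreviates. The variable names stripped, acc and lengths are chosen for illustration only.

```python
lines = ['First Line\n', 'Second\n', '\n', 'and Fourth.\n']

# List comprehension with a condition: skip blank lines, strip the newline.
stripped = [line[:-1] for line in lines if line != '\n']

# The same computation written with the accumulator pattern.
acc = []
for line in lines:
    if line != '\n':
        acc.append(line[:-1])

# A comprehension whose expression is a function call.
lengths = [len(word) for word in ['hawk', 'hen', 'hog', 'hyena']]

print(stripped)
print(acc)
print(lengths)
```

The condition, when present, decides which items are processed at all; the expression decides what is stored for each item that passes.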

51 Introduction to Computing Using Python MapReduce

Suppose we would like to compute the frequency of every word in a list. So, for list

>>> words = ['two', 'three', 'one', 'three', 'three', 'five', 'one', 'five']

the result would be

[('one', 2), ('five', 2), ('two', 1), ('three', 3)]

We have done this before using a dictionary and the accumulator loop pattern. We will now solve this problem using MapReduce.
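For comparison with the MapReduce solution, the dictionary-and-accumulator approach mentioned above might look like the following sketch. The function name frequency is an assumption; the slides do not show this version.

```python
def frequency(words):
    'returns a dictionary mapping each word to its number of occurrences'
    counts = {}
    for word in words:
        # start at 0 the first time a word is seen, then add 1
        counts[word] = counts.get(word, 0) + 1
    return counts

words = ['two', 'three', 'one', 'three', 'three', 'five', 'one', 'five']
print(list(frequency(words).items()))
```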

52 Introduction to Computing Using Python MapReduce

input list: 'two', 'three', 'one', 'three', 'three', 'five', 'one', 'five'

Map step

intermediate1: [('two', 1)], [('three', 1)], [('one', 1)], [('three', 1)], [('three', 1)], [('five', 1)], [('one', 1)], [('five', 1)]

Partition step

intermediate2: ('two', [1]), ('three', [1,1,1]), ('one', [1,1]), ('five', [1,1])

Reduce step

output list: ('two', 1), ('three', 3), ('one', 2), ('five', 2)

53 Introduction to Computing Using Python MapReduce

Each of the three steps (Map, Partition, Reduce) corresponds to one statement:

>>> words = ['two', 'three', 'one', 'three', 'three', 'five', 'one', 'five']
>>> intermediate1 = [occurrence(word) for word in words]
>>> intermediate2 = partition(intermediate1)
>>> [occurrenceCount(x) for x in intermediate2]
[('one', 2), ('five', 2), ('two', 1), ('three', 3)]

ch11.py

def occurrence(word):
    'returns list containing tuple (word, 1)'
    return [(word, 1)]

def occurrenceCount(keyVal):
    '''takes tuple keyVal = (key, lst) as input
       and returns (key, sum(lst))'''
    return (keyVal[0], sum(keyVal[1]))

def partition(intermediate1):
    pass    # to do

54 Introduction to Computing Using Python MapReduce

Function partition() turns

intermediate1: [('two', 1)], [('three', 1)], [('one', 1)], [('three', 1)], [('one', 1)], [('five', 1)], ...

into

intermediate2: ('two', [1]), ('three', [1,1,1]), ('one', [1,1]), ('five', [1,1])

ch11.py

def partition(intermediate1):
    dct = {}
    # for every list lst of intermediate1
    for lst in intermediate1:
        # for every (key, value) pair in list lst
        for key, value in lst:
            if key in dct:
                dct[key].append(value)
            else:
                dct[key] = [value]
    # return container of (key, values) tuples
    return dct.items()    # return intermediate2
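The Partition step can be checked in isolation. The sketch below repeats partition() so that it runs on its own and feeds it the output of the Map step for the word list used earlier. Note that dct.items() returns a dictionary view rather than a list, so the sketch wraps it in list() before printing.

```python
def partition(intermediate1):
    'groups the values of all (key, value) pairs by key'
    dct = {}
    for lst in intermediate1:
        for key, value in lst:
            if key in dct:
                dct[key].append(value)
            else:
                dct[key] = [value]
    return dct.items()

words = ['two', 'three', 'one', 'three', 'three', 'five', 'one', 'five']

# Map step: one single-item list [(word, 1)] per input word
intermediate1 = [[(word, 1)] for word in words]

# Partition step: one (word, [1, 1, ...]) pair per distinct word
intermediate2 = list(partition(intermediate1))
print(intermediate2)
```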

55 Introduction to Computing Using Python MapReduce abstracted

The MapReduce framework applies to a range of problems and therefore should be abstracted:

ch11.py

def partition(intermediate1):
    pass    # implementation here

class SeqMapReduce(object):
    'a sequential MapReduce implementation'
    def __init__(self, mapper, reducer):
        'functions mapper and reducer are problem specific'
        self.mapper = mapper
        self.reducer = reducer

    def process(self, data):
        'runs MapReduce on data with mapper and reducer functions'
        intermediate1 = [self.mapper(x) for x in data]    # Map
        intermediate2 = partition(intermediate1)
        return [self.reducer(x) for x in intermediate2]   # Reduce

>>> words = ['two', 'three', 'one', 'three', 'three', 'five', 'one', 'five']
>>> smr = SeqMapReduce(occurrence, occurrenceCount)
>>> smr.process(words)
[('one', 2), ('five', 2), ('two', 1), ('three', 3)]
>>> numbers = [2,3,4,3,2,3,5,4,3,5,1]
>>> smr.process(numbers)
[(1, 1), (2, 2), (3, 4), (4, 2), (5, 2)]
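To illustrate why the abstraction is useful, the sketch below reuses the same class for a different problem: grouping words by their length. The mapper by_length and the reducer unique_sorted are hypothetical functions written for this example, and partition() is repeated (in a slightly shorter form using setdefault) so that the block runs on its own.

```python
def partition(intermediate1):
    'groups the values of all (key, value) pairs by key'
    dct = {}
    for lst in intermediate1:
        for key, value in lst:
            dct.setdefault(key, []).append(value)
    return dct.items()

class SeqMapReduce(object):
    'a sequential MapReduce implementation, as on the slide'
    def __init__(self, mapper, reducer):
        self.mapper = mapper
        self.reducer = reducer

    def process(self, data):
        intermediate1 = [self.mapper(x) for x in data]    # Map
        intermediate2 = partition(intermediate1)          # Partition
        return [self.reducer(x) for x in intermediate2]   # Reduce

def by_length(word):
    'hypothetical mapper: the key is the word length, the value is the word'
    return [(len(word), word)]

def unique_sorted(keyVal):
    'hypothetical reducer: removes duplicate words and sorts the rest'
    return (keyVal[0], sorted(set(keyVal[1])))

words = ['two', 'three', 'one', 'three', 'three', 'five', 'one', 'five']
groups = SeqMapReduce(by_length, unique_sorted).process(words)
print(sorted(groups))
```

Only the mapper and reducer changed; the Map, Partition and Reduce machinery is untouched.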

56 Introduction to Computing Using Python Inverted index problem

Given several text files, we want to know which words appear in which file.

a.txt: Paris: Miami, Miami Tokyo, Miami
b.txt: Tokyo Quito... Tokyo. Quito
c.txt: Paris, Quito. Cairo, Paris, Quito.

A solution to the problem could be represented as a mapping that maps each word to the list of files containing it. This mapping is called an inverted index:

[('Paris', ['a.txt', 'c.txt']), ('Miami', ['a.txt']), ('Cairo', ['c.txt']), ('Quito', ['b.txt', 'c.txt']), ('Tokyo', ['a.txt', 'b.txt'])]

To apply MapReduce, we need to define the mapper and reducer functions.

57 Introduction to Computing Using Python Inverted index problem

a.txt: Paris: Miami, Miami Tokyo, Miami
b.txt: Tokyo Quito... Tokyo. Quito
c.txt: Paris, Quito. Cairo, Paris, Quito.

input list: a.txt, b.txt, c.txt

intermediate1: [(Tokyo, a.txt), (Paris, a.txt), (Miami, a.txt)], [(Tokyo, b.txt), (Quito, b.txt)], [(Paris, c.txt), (Quito, c.txt), (Cairo, c.txt)]

intermediate2: (Tokyo, [a.txt, b.txt]), (Paris, [a.txt, c.txt]), (Miami, [a.txt]), (Quito, [b.txt, c.txt]), (Cairo, [c.txt])

output list: (...)

58 Introduction to Computing Using Python MapReduce

Mapper

from string import punctuation

def getWordsFromFile(file):
    'returns set of items (word, file) for every word in file'
    with open(file) as infile:
        content = infile.read()

    # replace punctuation with blanks
    transTable = str.maketrans(punctuation, ' '*len(punctuation))
    content = content.translate(transTable)

    # construct set of items (word, file) with no duplicates
    res = set()
    for word in content.split():
        res.add((word, file))
    return res    # an item of intermediate1

Reducer

intermediate2 is actually the desired list, so the reducer just copies its items to the output list.

def getWordIndex(keyVal):
    'returns input value'
    return keyVal
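The inverted index for the three sample files can be reproduced without touching the file system. This is a sketch under one assumption: the file contents are kept in a dictionary named DOCS, and the hypothetical mapper words_from_doc reads from it instead of opening a file.

```python
from string import punctuation

# Assumption: file contents held in memory instead of on disk.
DOCS = {'a.txt': 'Paris: Miami, Miami Tokyo, Miami',
        'b.txt': 'Tokyo Quito... Tokyo. Quito',
        'c.txt': 'Paris, Quito. Cairo, Paris, Quito.'}

def words_from_doc(name):
    'hypothetical mapper: returns set of (word, name) pairs for document name'
    table = str.maketrans(punctuation, ' ' * len(punctuation))
    content = DOCS[name].translate(table)    # punctuation becomes blanks
    return {(word, name) for word in content.split()}

def partition(intermediate1):
    'groups the values of all (key, value) pairs by key'
    dct = {}
    for items in intermediate1:
        for key, value in items:
            dct.setdefault(key, []).append(value)
    return dct.items()

intermediate1 = [words_from_doc(name) for name in sorted(DOCS)]    # Map
# The reducer is the identity; the file lists are sorted only for display.
index = {word: sorted(files) for word, files in partition(intermediate1)}
print(index)
```

Using a set in the mapper is what keeps a file from being listed twice for a word such as Miami, which occurs three times in a.txt.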

59 Introduction to Computing Using Python Module multiprocessing

Standard Library module multiprocessing includes tools that make it possible to execute Python programs in parallel on multi-core machines.

How many processor cores does a given computer have? Let’s check:

>>> from multiprocessing import cpu_count
>>> cpu_count()
8

So 8 cores (your computer may have more or fewer).

Class Pool from module multiprocessing can be used to split a problem and execute its pieces in parallel (i.e. at the same time) on separate cores. A Pool object represents a pool of one or more processes, each of which is capable of executing code independently on a processor core.

Note: process != core

60 Introduction to Computing Using Python Class Pool in module multiprocessing

Class Pool from module multiprocessing can be used to split a problem and execute its pieces in parallel. A Pool object represents a pool of one or more processes, each of which is capable of executing code independently on an available processor core.

parallel.py

from multiprocessing import Pool

# the guard keeps worker processes from re-running the main program
# on platforms that start them by importing this file
if __name__ == '__main__':
    animals = ['hawk', 'hen', 'hog', 'hyena']
    pool = Pool(2)                  # create pool of 2 processes
    res = pool.map(len, animals)    # apply len() to every animals item
    print(res)                      # print the list of string lengths

Execute this program from an OS shell (not the Python interpreter shell):

> python parallel.py
[4, 3, 3, 5]

61 Introduction to Computing Using Python Class Pool in module multiprocessing

In program parallel.py, the statement

pool.map(len, animals)

and the statement

[len(x) for x in animals]

do the same thing: they construct a list by applying len() to every item of list animals. It is how they do it that is different:

pool.map(len, animals) is executed by 2 processes
[len(x) for x in animals] is executed by 1 process

62 Introduction to Computing Using Python Class Pool in module multiprocessing

Let’s verify that different processes are handling different list items. Every process has a unique id.

parallel2.py

from multiprocessing import Pool
from os import getpid

def length(word):
    'returns length of string word'
    # print the id of the process executing the function
    print('Process {} handling {}'.format(getpid(), word))
    return len(word)

# main program
if __name__ == '__main__':
    pool = Pool(2)
    res = pool.map(length, ['hawk', 'hen', 'hog', 'hyena'])
    print(res)

> python parallel2.py
Process 5129 handling hawk
Process 5130 handling hen
Process 5129 handling hog
Process 5130 handling hyena
[4, 3, 3, 5]
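A variation on the same experiment returns the process id together with the result instead of printing it, so the main program can report which worker processes took part. This is only a sketch: the function main and the tuple return value are choices made for this example, and the process ids and the way items are divided between workers will differ from run to run. It should be saved to a file and run from an OS shell.

```python
from multiprocessing import Pool
from os import getpid

def length(word):
    'returns (id of the process executing the call, length of word)'
    return (getpid(), len(word))

def main():
    # the with statement shuts the pool down once the block is left
    with Pool(2) as pool:
        results = pool.map(length, ['hawk', 'hen', 'hog', 'hyena'])
    # map() returns results in input order, whichever worker produced them
    print([n for _, n in results])
    print('worker process ids:', sorted({pid for pid, _ in results}))

if __name__ == '__main__':
    main()
```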

63 Introduction to Computing Using Python Parallel speedup

The benefit of using a pool of independent processes is that they can be scheduled by the CPU scheduler to execute in parallel on separate cores. This should result in a faster program running time, i.e. a parallel speedup.

To showcase this, let’s consider a computationally intensive problem from number theory: compare the distribution of prime numbers in several ranges of integers. We count the number of prime numbers in several equal-size ranges of 100,000 large integers.

primeDensity.py

from os import getpid

def countPrimes(start):
    'returns the number of primes in range [start, start+rng)'
    rng = 100000
    formatStr = 'process {} processing range [{}, {})'
    print(formatStr.format(getpid(), start, start+rng))
    # count the numbers i in range [start, start+rng) that are prime;
    # function isprime() is not shown on the slide
    return sum([1 for i in range(start, start+rng) if isprime(i)])
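Function isprime() is used by countPrimes() but does not appear on the slides. One possible implementation, given here purely as an assumption, is trial division by odd numbers up to the square root of n. The sketch also includes a quiet variant count_primes with the range size as a parameter, so that it can be tried on small ranges.

```python
def isprime(n):
    'assumed implementation: trial division by odd numbers up to sqrt(n)'
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2          # 2 is the only even prime
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True

def count_primes(start, rng=100000):
    'returns the number of primes in range [start, start+rng)'
    return sum(1 for i in range(start, start + rng) if isprime(i))

print(count_primes(0, 100))
```

Trial division is slow for large n, which is the point here: it makes each range expensive enough that running several ranges in parallel can pay off.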

64 Introduction to Computing Using Python Parallel speedup

If the Pool contains only 1 process

primeDensity.py

from multiprocessing import Pool
from time import time

def countPrimes(start):
    pass    # not shown

if __name__ == '__main__':
    p = Pool(1)
    # starts is a list of left boundaries of integer ranges
    starts = [12345678, 23456789, 34567890, 45678901,
              56789012, 67890123, 78901234, 89012345]
    t1 = time()    # start time
    print(p.map(countPrimes, starts))
    t2 = time()    # end time
    p.close()
    print('Time taken: {} seconds.'.format(t2-t1))

> python primeDensity.py
process 4176 processing range [12345678, 12445678)
process 4176 processing range [23456789, 23556789)
process 4176 processing range [34567890, 34667890)
process 4176 processing range [45678901, 45778901)
process 4176 processing range [56789012, 56889012)
process 4176 processing range [67890123, 67990123)
process 4176 processing range [78901234, 79001234)
process 4176 processing range [89012345, 89112345)
[6185, 5900, 5700, 5697, 5551, 5572, 5462, 5469]
Time taken: 47.84 seconds.

65 Introduction to Computing Using Python Parallel speedup

If the Pool contains 2 processes

primeDensity.py

def countPrimes(start):
    pass    # not shown

if __name__ == '__main__':
    p = Pool(2)
    # starts is a list of left boundaries of integer ranges
    starts = [12345678, 23456789, 34567890, 45678901,
              56789012, 67890123, 78901234, 89012345]
    t1 = time()    # start time
    print(p.map(countPrimes, starts))
    t2 = time()    # end time
    p.close()
    print('Time taken: {} seconds.'.format(t2-t1))

Time taken: 24.60 seconds.

Using 2 processes on 2 cores instead of 1 process on 1 core decreased the running time from 47.84 to 24.60 seconds.

Speedup = sequential time / parallel time = 47.84/24.60 ≈ 1.94

66 Introduction to Computing Using Python Parallel speedup

If the Pool contains 4 processes

primeDensity.py

def countPrimes(start):
    pass    # not shown

if __name__ == '__main__':
    p = Pool(4)
    # starts is a list of left boundaries of integer ranges
    starts = [12345678, 23456789, 34567890, 45678901,
              56789012, 67890123, 78901234, 89012345]
    t1 = time()    # start time
    print(p.map(countPrimes, starts))
    t2 = time()    # end time
    p.close()
    print('Time taken: {} seconds.'.format(t2-t1))

Time taken: 16.78 seconds.

Speedup = 47.84/16.78 ≈ 2.85

67 Introduction to Computing Using Python Parallel speedup

If the Pool contains 8 processes

primeDensity.py

def countPrimes(start):
    pass    # not shown

if __name__ == '__main__':
    p = Pool(8)
    # starts is a list of left boundaries of integer ranges
    starts = [12345678, 23456789, 34567890, 45678901,
              56789012, 67890123, 78901234, 89012345]
    t1 = time()    # start time
    print(p.map(countPrimes, starts))
    t2 = time()    # end time
    p.close()
    print('Time taken: {} seconds.'.format(t2-t1))

Time taken: 14.29 seconds.

Speedup = 47.84/14.29 ≈ 3.35
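The speedup figures can be recomputed from the four measured running times. The sketch below also reports efficiency (speedup divided by the number of processes), a quantity the slides do not mention and which is included here only as an additional, commonly used measure.

```python
# measured running times in seconds, keyed by number of processes in the Pool
times = {1: 47.84, 2: 24.60, 4: 16.78, 8: 14.29}

sequential = times[1]
# speedup = sequential time / parallel time
speedup = {p: sequential / t for p, t in times.items()}
# efficiency = speedup per process (not from the slides)
efficiency = {p: speedup[p] / p for p in times}

for p in sorted(times):
    print('{} process(es): speedup {:.2f}, efficiency {:.2f}'.format(
        p, speedup[p], efficiency[p]))
```

The numbers show that speedup grows more slowly than the number of processes: going from 4 to 8 processes raises it only from about 2.85 to about 3.35.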

68 Introduction to Computing Using Python MapReduce in parallel

MapReduce reimplemented using a pool of processes and method map()

ch12.py

from multiprocessing import Pool

class MapReduce(object):
    'a parallel implementation of MapReduce'

    def __init__(self, mapper, reducer, numProcs = None):
        'initializes map and reduce functions and process pool'
        self.mapper = mapper
        self.reducer = reducer
        self.pool = Pool(numProcs)

    def process(self, data):
        'runs MapReduce on sequence data'
        intermediate1 = self.pool.map(self.mapper, data)      # Map
        intermediate2 = partition(intermediate1)
        return self.pool.map(self.reducer, intermediate2)     # Reduce

69 Introduction to Computing Using Python The name cross-checking problem

Tens of thousands of previously classified documents have just been posted on the web. You want to find out which documents mention a particular person, and you want to do that for every person named in one or more documents.

Assume that people’s names are capitalized, which helps you narrow down the words that can be proper names.

The precise problem is then: given a list of URLs (of the documents), obtain a list of pairs (proper, urlList) in which proper is a capitalized word in any document and urlList is a list of URLs of documents containing proper.

In order to use MapReduce, we need to define the map and reduce functions.

70 Introduction to Computing Using Python The name cross-checking problem

The map function takes a URL as input and returns a list of tuples (word, URL) for every word that is capitalized in the document identified by the URL.

crosscheck.py

from urllib.request import urlopen
from re import findall

def getProperFromURL(url):
    '''returns list of items (word, url) for every capitalized word
       in the document identified by url'''
    content = urlopen(url).read().decode()
    pattern = r"[A-Z][A-Za-z'\-]*"    # RE for capitalized words
    # collect all capitalized words and remove duplicates
    propers = set(findall(pattern, content))

    res = []
    for word in propers:    # for every capitalized word
        # create pair (word, url) and append to res
        res.append((word, url))
    return res
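The regular expression can be tried on a string before pointing it at a real document. In the sketch below, the sample sentence, the placeholder name doc1 and the helper proper_pairs are all made up for illustration; no network access is involved.

```python
from re import findall

# a capital letter followed by any number of letters, apostrophes or hyphens
PATTERN = r"[A-Z][A-Za-z'\-]*"

def proper_pairs(text, url):
    'hypothetical helper: returns sorted (word, url) pairs for capitalized words'
    propers = set(findall(PATTERN, text))    # the set removes duplicates
    return sorted((word, url) for word in propers)

# made-up sample text
text = "Mr. Pickwick met Sam Weller; later O'Brien and Jean-Luc arrived. Sam left."
for pair in proper_pairs(text, 'doc1'):
    print(pair)
```

The pattern matches every capitalized word, not only names, so titles such as Mr and the first word of a sentence are collected as well; the approach narrows down the candidates rather than identifying names exactly.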

71 Introduction to Computing Using Python The name cross-checking problem

The partition function will, for every capitalized word, collect all tuples (word, url) in every list in intermediate1 to construct list intermediate2 containing pairs (word, [url1, url2,...]).

Since intermediate2 contains the desired result (mapping of capitalized words to urls), the reducer function just returns its input.

crosscheck.py

def getWordIndex(keyVal):
    'returns input value'
    return keyVal

72 Introduction to Computing Using Python The name cross-checking problem

Let’s compare the sequential and parallel implementations of MapReduce by cross-checking the proper names in eight Charles Dickens novels:

crosscheck.py

from time import time

if __name__ == '__main__':
    urls = [    # URLs of eight Charles Dickens novels
        'http://www.gutenberg.org/cache/epub/2701/pg2701.txt',
        'http://www.gutenberg.org/cache/epub/1400/pg1400.txt',
        'http://www.gutenberg.org/cache/epub/46/pg46.txt',
        'http://www.gutenberg.org/cache/epub/730/pg730.txt',
        'http://www.gutenberg.org/cache/epub/766/pg766.txt',
        'http://www.gutenberg.org/cache/epub/1023/pg1023.txt',
        'http://www.gutenberg.org/cache/epub/580/pg580.txt',
        'http://www.gutenberg.org/cache/epub/786/pg786.txt']

    t1 = time()    # sequential start time
    SeqMapReduce(getProperFromURL, getWordIndex).process(urls)
    t2 = time()    # sequential stop time, parallel start time
    MapReduce(getProperFromURL, getWordIndex, 4).process(urls)
    t3 = time()    # parallel stop time

    print('Sequential: {:5.2f} seconds.'.format(t2-t1))
    print('Parallel:   {:5.2f} seconds.'.format(t3-t2))

> python properNames.py
Sequential: 19.89 seconds.
Parallel:   14.81 seconds.

