Python: Programming the Google Search

Slides:



Advertisements
Similar presentations
Dictionaries: Keeping track of pairs
Advertisements

1 2/14/05CS120 The Information Era Searching the Web Don’t we already know how to do this?
 How many pages does it search?  How does it access all those pages?  How does it give us an answer so quickly?  How does it give us such accurate.
Assignment: Improving search rank – search engine optimization Read the following post carefully.
1 Advanced Searching Use Query Languages. Use more than one search engine. –Or metasearches like at Start with simple searches. Add.
Python and Web Programming
Search engines fdm 20c introduction to digital media lecture warren sack / film & digital media department / university of california, santa.
IDK0040 Võrgurakendused I Building a site: Publicising Deniss Kumlander.
1 Spidering the Web in Python CSC 161: The Art of Programming Prof. Henry Kautz 11/23/2009.
Bill's Amazing Content Rotator jQuery Content Rotator.
Data Collections: Dictionaries CSC 161: The Art of Programming Prof. Henry Kautz 11/4/2009.
The Anatomy of a Large-Scale Hypertextual Web Search Engine Presented By: Sibin G. Peter Instructor: Dr. R.M.Verma.
WHAT IS A SEARCH ENGINE A search engine is not a physical engine, instead its an electronic code or a software programme that searches and indexes millions.
ITIS 1210 Introduction to Web-Based Information Systems Chapter 27 How Internet Searching Works.
CSCI-235 Micro-Computer in Science Internet Search.
Copyright ©: SAMSUNG & Samsung Hope for Youth. All rights reserved Tutorials The internet: Web searches Suitable for: Beginner Improver.
Understanding Search Engines
1 CSC 221: Computer Programming I Spring 2008 ArrayLists and arrays  example: letter frequencies  autoboxing/unboxing  ArrayLists vs. arrays  example:
Do's and don'ts to improve your site's ranking … Presentation by:
HTML - Examples Be sure to check speaker notes for additional information!
INTRODUCTION TO HTML5 Semantic Layout in HTML5.  The new semantic layout in HTML5 refers to a new class of elements that is designed to help you understand.
Searching Damian Gordon. Google PageRank Damian Gordon.
Fall Week 4 CSCI-141 Scott C. Johnson.  Computers can process text as well as numbers ◦ Example: a news agency might want to find all the articles.
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
Web Search Algorithms By Matt Richard and Kyle Krueger.
Search Engine Marketing SEM = Search Engine Marketing SEO = Search Engine Optimization optimizing (altering/changing) your page in order to get a higher.
Search Strategies for the World Wide Web. Search Engines  SearchEngineWatch.com will periodically rank the most effective search engines.  Some Search.
Python: Arrays Damian Gordon. Arrays In Python arrays are sometimes called “lists” or “tuple” but we’ll stick to the more commonly used term “array”.
More about Strings. String Formatting  So far we have used comma separators to print messages  This is fine until our messages become quite complex:
Unit 1—Computer Basics Lesson 3 The Internet and Research.
The World Wide Web. What is the worldwide web? The content of the worldwide web is held on individual pages which are gathered together to form websites.
Introduction to Computing Using Python Dictionaries: Keeping track of pairs  Class dict  Class tuple.
Internet Search Operators Richard Goldman January 26, 2000.
Search Engine Optimization Presented By:- ARKA Softwares Effective! Affordable! Time Groove
Python: Revision Damian Gordon. Python: Structured Programming Damian Gordon.
Lists/Dictionaries. What we are covering Data structure basics Lists Dictionaries Json.
Python: Programming the Google Search (Crawling) Damian Gordon.
Python: Exception Handling Damian Gordon. Exception Handling When an error occurs in a program that causes the program to crash, we call that an “exception”
Search Engines and Search techniques
IGCSE 4 Cambridge Data types and arrays Computer Science Section 2
Python – May 18 Quiz Relatives of the list: Tuple Dictionary Set
Arrays: Checkboxes and Textareas
Containers and Lists CIS 40 – Introduction to Programming in Python
Hashing CSE 2011 Winter July 2018.
Dictionaries, File operations
HTML Vocabulary.
IST256 : Applications Programming for Information Systems
Introduction to Python
Strings.
Introduction to Python
Winter 2018 CISC101 12/1/2018 CISC101 Reminders
Guide to Programming with Python
Lesson Objectives Aims You should know about: – Web Technologies
CH 9.2 : Hash Tables Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and.
Basic Inheritance Damian Gordon.
Dictionaries 1/17/2019 7:55 AM Hash Tables   4
CH 9.2 : Hash Tables Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and.
EBSCOhost Page Composer
IST256 : Applications Programming for Information Systems
HTML Links.
Anatomy of a Search Search The Index:
CSC 221: Computer Programming I Fall 2009
Dictionaries: Keeping track of pairs
Python: Sorting – Selection Sort
Python Strings.
Understanding Search Engines
Hash Maps Implementation and Applications
Migrating to Object-Oriented Programs
MWDL Website Redesign.
Web Programming and Design
Presentation transcript:

Python: Programming the Google Search Damian Gordon Ref: Donaldson, 2013, Python: Visual QuickStart Guide, Chap.11

How Google works How does Google search work?

How Google works We’ll recall that we discussed the Google search engine before. Google search sends programs called “spiders” to explore the web and find webpages, and take copies of those pages.

How Google works Then it scans those pages and identifies words and phrases in each page, and stores a count of how many times each word appears in a page, this will help determine where each page is ranked in the search order.

How Google works Another factor in the ranking of pages is the type of pages that link to each page, if a page is considered to be of good quality if well-know pages (like government websites, or newspaper sites) link to it:

How Google works

Handling Strings

“ ”

Handling Strings s = “A long time ago in a galaxy far, far away ...” s.split() ['a', 'long', 'time', 'ago', 'in', 'a', 'galaxy', 'far,', 'far', 'away', '...']

Handling Strings s = “A long time ago in a galaxy far, far away ...” s.split() ['a', 'long', 'time', 'ago', 'in', 'a', 'galaxy', 'far,', 'far', 'away', '...']

Handling Strings s = “A long time ago in a galaxy far, far away ...” s.split() >>> [‘A', 'long', 'time', 'ago', 'in', 'a', 'galaxy', 'far,', 'far', 'away', '...']

Handling Strings t = s.split() print(t) >>> [‘A', 'long', 'time', 'ago', 'in', 'a', 'galaxy', 'far,', 'far', 'away', '...']

Handling Strings t = s.split() print(t) >>> [‘A', 'long', 'time', 'ago', 'in', 'a', 'galaxy', 'far,', 'far', 'away', '...']

Handling Strings len(t) >>> 11

Handling Strings s = “A long time ago in a galaxy far, far away ...” s2 = “a long time ago in a galaxy far far away” Pre-processing

Handling Strings s = “A long time ago in a galaxy far, far away ...” s2 = “a long time ago in a galaxy far far away” Pre-processing

What preprocessing features does Python provide?

Handling Strings s3 = "DARTH VADER: Obi-Wan never told you what happened to your father. LUKE: He told me enough. He told me you killed him. DARTH VADER: No. I am your father."

Handling Strings s3.lower() >>> 'darth vader: obi-wan never told you what happened to your father. luke: he told me enough. he told me you killed him. darth vader: no. i am your father.'

Handling Strings s3 = "DARTH VADER: Obi-Wan never told you what happened to your father. LUKE: He told me enough. He told me you killed him. DARTH VADER: No. I am your father."

Handling Strings s3.replace('DARTH VADER', ‘Darth') >>> ‘Darth: Obi-Wan never told you what happened to your father. LUKE: He told me enough. He told me you killed him. Darth: No. I am your father.'

Handling Strings s2 = “a long time ago in a galaxy far far away” t2 = s.split() print(t2) >>> [‘a', 'long', 'time', 'ago', 'in', 'a', 'galaxy', 'far', 'far', 'away']

Handling Strings Remember sets from school?

Handling Strings Remember sets from school? A key rule of a set is that there can be no repeated elements in a set (no value can be duplicated).

Handling Strings Remember sets from school? A key rule of a set is that there can be no repeated elements in a set (no value can be duplicated). Python provides a set() function.

Handling Strings print(t2) >>> [‘a', 'long', 'time', 'ago', 'in', 'a', 'galaxy', 'far', 'far', 'away'] u2 =set(t2) Print(u2) {'away', 'long', 'far', 'ago', 'galaxy', 'in', 'a', 'time'}

Handling Strings print(t2) >>> [‘a', 'long', 'time', 'ago', 'in', 'a', 'galaxy', 'far', 'far', 'away'] u2 =set(t2) Print(u2) >>> {'away', 'long', 'far', 'ago', 'galaxy', 'in', 'a', 'time'}

Handling Strings len(u2) >>> 8

String Preprocessing # PROGRAM String Preprocessing keep = {'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', ' ', '-', "'"} ################################### Continued 

 Continued String Preprocessing def Normalise(s): result = '' for x in s.lower(): # DO if x in keep: # THEN result = result + x # Add current char to result # ELSE # Do not add current char to result # ENDIF; # ENDFOR; return result # END Normalise. Continued 

 Continued String Preprocessing ######### MAIN PRORGAM ######### Quote = "A long time ago in a galaxy far, far away ..." NewQuote = Normalise(Quote) print(Quote) print(NewQuote) # END.

String Preprocessing A long time ago in a galaxy far, far away ...  Continued String Preprocessing ######### MAIN PRORGAM ######### Quote = "A long time ago in a galaxy far, far away ..." NewQuote = Normalise(Quote) print(Quote) print(NewQuote) # END. A long time ago in a galaxy far, far away ... a long time ago in a galaxy far far away

Handling Strings in Files script = open('StarWarsscript.txt','r').read() len(script) 326,359 script.count("\n") 8,150 len(script.split()) 33,101

Handling Strings in Files script = open('StarWarsscript.txt','r').read() len(script) >>> 326,359 script.count("\n") 8,150 len(script.split()) 33,101

Handling Strings in Files script = open('StarWarsscript.txt','r').read() len(script) >>> 326,359 script.count("\n") 8,150 len(script.split()) 33,101 Characters

Handling Strings in Files script = open('StarWarsscript.txt','r').read() len(script) >>> 326,359 script.count("\n") >>> 8,150 len(script.split()) 33,101 Characters

Handling Strings in Files script = open('StarWarsscript.txt','r').read() len(script) >>> 326,359 script.count("\n") >>> 8,150 len(script.split()) 33,101 Characters Lines

Handling Strings in Files script = open('StarWarsscript.txt','r').read() len(script) >>> 326,359 script.count("\n") >>> 8,150 len(script.split()) >>> 33,101 Characters Lines

Handling Strings in Files script = open('StarWarsscript.txt','r').read() len(script) >>> 326,359 script.count("\n") >>> 8,150 len(script.split()) >>> 33,101 Characters Lines Words

The Dictionary Type

The Dictionary Type Python has a built-in data structure called a “dictionary”. It’s like an array, except instead of being indexed by numbers, it can be indexed by any type.

The Dictionary Type Array Dictionary 31 1 41 2 59 26 3 53 4 5 59 6 66 31 1 41 2 59 26 3 53 4 5 59 6 66 Array 31 ‘a’ 41 ‘b’ 59 ‘c’ 26 ‘d’ 53 ‘e’ 59 ‘f’ 66 ‘g’ Dictionary

The Dictionary Type Title Title = {'the':1, 'force':2, 'awakens':3} 1

The Dictionary Type Title Title.keys() >>> dict_keys(['the', 'force', 'awakens']) ‘the’ 1 ‘force’ 2 3 ‘awakens’ Title

The Dictionary Type Title 1 2 3 Title['Star Wars'] = 0 ‘Star Wars’ print(Title) >>> {'Star Wars': 0, 'the': 1, 'force': 2, 'awakens': 3} ‘Star Wars’ ‘the’ 1 ‘force’ 2 3 ‘awakens’ Title

The Dictionary Type Let’s write a program to count the frequency of each of the words in a string, using a dictionary.

The Dictionary Type def FreqDict(s): NewString = Normalise(s) words = NewString.split() Dict = {} for wordindex in words: if wordindex in Dict: Dict[wordindex] = Dict[wordindex] + 1 else: Dict[wordindex] = 1 # ENDIF; # ENDFOR; return Dict # END FreqDict.

The Dictionary Type Preprocessing def FreqDict(s): NewString = Normalise(s) words = NewString.split() Dict = {} for wordindex in words: if wordindex in Dict: Dict[wordindex] = Dict[wordindex] + 1 else: Dict[wordindex] = 1 # ENDIF; # ENDFOR; return Dict # END FreqDict.

Create an empty dictionary The Dictionary Type Preprocessing def FreqDict(s): NewString = Normalise(s) words = NewString.split() Dict = {} for wordindex in words: if wordindex in Dict: Dict[wordindex] = Dict[wordindex] + 1 else: Dict[wordindex] = 1 # ENDIF; # ENDFOR; return Dict # END FreqDict. Create an empty dictionary

The Dictionary Type Preprocessing def FreqDict(s): NewString = Normalise(s) words = NewString.split() Dict = {} for wordindex in words: if wordindex in Dict: Dict[wordindex] = Dict[wordindex] + 1 else: Dict[wordindex] = 1 # ENDIF; # ENDFOR; return Dict # END FreqDict. Create an empty dictionary If that word is already found

The Dictionary Type A long time ago in a galaxy far, far away ... Preprocessing def FreqDict(s): NewString = Normalise(s) words = NewString.split() Dict = {} for wordindex in words: if wordindex in Dict: Dict[wordindex] = Dict[wordindex] + 1 else: Dict[wordindex] = 1 # ENDIF; # ENDFOR; return Dict # END FreqDict. Create an empty dictionary If that word is already found A long time ago in a galaxy far, far away ... {'ago': 1, 'time': 1, 'galaxy': 1, 'away': 1, 'long': 1, 'a': 2, 'far': 2, 'in': 1}

The Dictionary Type The only problem with the dictionary type is that it is indexed by words, not the values, so if we wanted to print out the top 20 most frequently occurring words in a string, that’d be a bit tricky. But we have a simple way of fixing that.

The Dictionary Type We can convert the dictionary into an array: DictArray = [ ] for k in Dict: # DO pair = (Dict[k], k) DictArray.append(pair) # ENDFOR;

etc.