Download presentation
Presentation is loading. Please wait.
Published byMelvin Bradley Modified over 9 years ago
1
ITIS 1210 Introduction to Web-Based Information Systems Internet Research Two How Search Engines Rank Pages & Constructing Complex Searches
2
How do Search Engines Crawl? Gathering data from the Web is like browsing: 1.Visit a page. 2.Record all the words on the page 3.Choose a link you haven’t seen/recorded 4.Click on the link. Repeat 8 billion times.
3
Crawling the Web One person with a Web browser, following one link per second. How long does it take to browse the surface Web (8 billion pages)? 8 billion seconds = 133 million minutes = 2 million hours = 93 thousand days = 256 years
4
Crawling the Web How many people would it take to crawl the surface Web in a week? If each person follows one link per second (with no sleep): One week = six hundred thousand seconds Six hundred thousand / eight billion = thirteen thousand
5
Challenges: Remembering where you’ve been Remembering where you haven’t been Storing all the data
6
A (small) Server Farm
7
The Deep Web Not all pages get crawled: Private pages on Intranets (company networks) Pages that people don’t want crawled Dynamic content pages (from databases) Dynamic content pages make the size of the Internet infinite!
8
Dynamic Content Example zillow.com Won’t be indexed
9
Identifying High Quality Web Pages Google has ranked billions of Web pages by "quality". You enter your search terms: UNC Charlotte HCI UNC Charlotte HCI Google finds the highest quality page associated with these search terms.
10
Google Pagerank Pretend you're surfing the Web randomly. To move from page to page you could: 1) type in an address (www.sis.uncc.edu) 1) type in an address (www.sis.uncc.edu)www. includes using a bookmark includes using a bookmark OR OR 2) follow a link. 2) follow a link. Pagerank measures how likely you are to reach a particular page through random surfing (either 1 or 2). The main idea is that links to your page from important web pages indicate that your page is important.
11
Computing Pagerank (what’s the probability of getting to this page?) Q A, B, C,... L(A), L(B), L(C),... = Web page = Pages pointing to Q = number of links on each page Pagerank of Q: R(Q) = (1-d) + d · (R(A)/L(A) + R(B)/L(B) +...) d represents the relative chance of following a link to page Q and 1-d represents the relative chance of going directly to page Q (via typing in the address or using a bookmark): Usually these are: d = 0.9 (1-d) = 0.1
12
Computing Pagerank Pretend the Web has only four pages: W X Y Z W X Y ZLinks: W X Y W Y Z Z W W X Y W Y Z Z W L(W)=1 L(X)=0 L(Y)=2 L(Z)=1 L(W)=1 L(X)=0 L(Y)=2 L(Z)=1 Which page has the highest “quality”?
13
Computing Pagerank Links: W X Y W Y Z Z W L(W)=1 L(X)=0 L(Y)=2 L(Z)=1 R(W) = (1-d) + d * (R(Y)/L(Y) + R(Z)/L(Z)) = 0.1 + 0.9 * (R(Y)/2 + R(Z)/1)) R(X) = 0.1 + 0.9 * R(W) R(Y) = 0.1 R(Z) = 0.1 + 0.9 * (R(Y)/2) Now, solve for: R(W), R(X), R(Y), R(Z)
14
Computing Values for R(W), R(X), R(Y) and R(Z) We could use algebra to find the values, in the same way we could solve for x and y in: x = 1 + 2x + y y = 2 + x + 3y
15
Algebraic Solution w = R(W) x = R(X) y = R(Y) z = R(Z) w = 0.1 + 0.45 y + 0.9z x = 0.1 + 0.9w y = 0.1 z = 0.1 + 0.45y y = 0.1 z = 0.145 w = 0.2775 x = 0.34795 But solving for eight billion variables is hard. Instead, we'll use fixed point iteration.
16
Solution by Fixed-Point Iteration Apply equations to compute new estimates: new R(W) = 0.1 + 0.9 * (R(Y)/2 + R(Z)) = 0.1 + 0.9 * (1.0/2 + 1.0) = 1.45 new R(X) = 0.1 + 0.9 *R(W) = 0.1 + 0.9 *1.0 = 1.0 new R(Y) = 0.1 new R(Z) = 0.1 + 0.9 * (R(Y)/2) = 0.1 + 0.9 * (1.0/2) = 0.55 Start with initial estimates of PageRank for each page: R(W) = 1.0 R(X) = 1.0 R(Y) = 1.0 R(Z) = 1.0
17
Solution by Fixed-Point Iteration Start with updated estimates: R(W) = 1.45 R(X) = 1.0 R(Y) = 0.1 R(Z) = 0.55 Apply equations to compute new estimates: new R(W) = 0.1 + 0.9 * (R(Y)/2 + R(Z)) = 0.1 + 0.9 * (0.1/2 + 0.55) = 0.64 new R(X) = 0.1 + 0.9 *R(W) = 0.1 + 0.9 *1.45 = 1.405 new R(Y) = 0.1 new R(Z) = 0.1 + 0.9 * (R(Y)/2) = 0.1 + 0.9 * (0.1/2) = 0.145
18
Solution by Iteration iteration R(W) R(X) R(Y) R(Z) 0 1.00000 1.00000 1.00000 1.00000 1 1.45000 1.00000 0.10000 0.55000 2 0.64000 1.40500 0.10000 0.14500 Compute new estimates from the old until the estimates stop changing. Note that this is the same answer as the traditional algebraic approach, but this way scales better. 3 0.27550 0.67600 0.10000 0.14500 4 0.27550 0.34795 0.10000 0.14500 5 0.27550 0.34795 0.10000 0.14500............
19
Final Pageranks highestpage XR(X) = 0.34795.page WR(W) = 0.2755..page ZR(Z) = 0.14500 lowestpage YR(Y) = 0.10000
20
Final Pageranks Y W X Z 2 1 10 0.10000 0.34795 0.27550 0.14500
21
How does Google Use Pagerank? You enter search terms, such as “UNC Charlotte HCI” Google finds all the pages that have all those words on them Of all those pages, Google will list the ones with the highest page rank first, but… …other ‘magic ingredients’ are used by Google: trade secrets of their algorithms.
22
Introduction Basic queries are somewhat limited One or two keywords Simple relationships Limited syntax Complex queries provide more power Keywords & phrase can be connected to form more complex relationships Search filters can be employed to limit results
23
Understanding Boolean Operators
24
Syntax Rules for combining simple words to form complex sentences Search engine syntax implemented by applying Boolean logic George Boole 1815-1864
25
Understanding Boolean Operators
26
Boolean logic Keywords act as nouns Boolean operators act as conjunctions They define the connections between keywords Illustrated with Venn diagrams John Venn 1834-1923
27
Understanding Boolean Operators W W W All web pages containing the word cats
28
Understanding Boolean Operators W W W All web pages containing the word dogs
29
Understanding Boolean Operators W W W All web pages containing the words cats and dogs Intersection of the two sets Searches containing both words
30
Understanding Boolean Operators W W W All web pages containing the words cats or dogs Searches containing either word Union of the two sets
31
Understanding Boolean Operators W W W All web pages containing the words cats and not dogs Exclusion of the dogs set Searches containing one word but not the other
32
Understanding Boolean Operators W W W All web pages containing the words dogs and not cats Exclusion of the cats set Searches containing one word but not the other
33
Understanding Boolean Operators Boolean operators AND OR NOT Instruct the engine on how to combine keywords to produce results Always use capital letters to avoid confusion with and, or, not as keywords
34
Understanding Boolean Operators AND All these keywords must be on the Web page OR These keywords may or may not be on the Web page At least one of them must be NOT None of these keywords can be on the Web page
35
Understanding Boolean Operators Default operator Some engines have a default Boolean operator Usually AND Might be OR Some engines may search for multiple words as phrases
36
Understanding Boolean Operators Boolean operators may be Allowed on main page Confined to Advanced search pages Some engines use symbols instead + for AND - for NOT No space between sign and word: +solar +energy -windmill
37
Narrowing Searches with AND AND Limits results Forces inclusion of a stop word Indicates that all keywords must be found on Web page Adding more ANDed keywords limits search more Results should be more relevant because the keyword list has expanded
38
Narrowing Searches with AND Example: “solar energy association” AND Portland W W W Solar energy association Portland
39
Narrowing Searches with AND Example: Henry +I same as “Henry I” W W W Henry I
40
Expanding Searches with OR OR expands results Useful if you didn’t get enough returns from your first search The more keywords you add, the more results you should get Every page returned must have at least one of the keywords on it Good to use when you have synonyms
41
Expanding Searches with OR Example: oregon OR northwest W W W oregon northwest
42
Restricting Queries with AND NOT AND NOT excludes the keyword that follows NOT Limits your search Produces fewer results Useful if first search returns irrelevant results Use AND NOT to get rid of those results
43
Restricting Queries with AND NOT Equivalent forms: cats AND NOT dogs cats AND-NOT dogs cats NOT dogs cats –dogs
44
Restricting Queries with AND NOT Example: “solar energy association” AND portland AND NOT maine Solar energy association portland maine
45
Multiple Boolean Operators Boolean operators allow you to focus a search Any logical combination of operators is allowed If it makes sense when spoken like a sentence it’s probably OK to use Order of operations is usually left to right Use parentheses to organize terms
46
Multiple Boolean Operators Bad example: constitution +american OR “united states” constitution american “united states”
47
Multiple Boolean Operators Good example: constitution +(american OR “united states”) constitution american “united states”
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.