ITIS 1210 Introduction to Web-Based Information Systems Internet Research Two How Search Engines Rank Pages & Constructing Complex Searches.



How do Search Engines Crawl?  Gathering data from the Web is like browsing: 1. Visit a page. 2. Record all the words on the page. 3. Choose a link you haven’t seen/recorded. 4. Click on the link. Repeat 8 billion times.
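The four crawl steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TOY_WEB`, the page names, and the `crawl` function are hypothetical, and a real crawler would fetch pages over the network instead of reading a dictionary.

```python
from collections import deque

# Hypothetical toy "Web": each page maps to (text on the page, outgoing links).
TOY_WEB = {
    "a.html": ("cats and dogs", ["b.html", "c.html"]),
    "b.html": ("solar energy", ["a.html"]),
    "c.html": ("cats only", ["b.html", "d.html"]),
    "d.html": ("dogs only", []),
}

def crawl(start, web):
    """Visit pages breadth-first, recording words and following unseen links."""
    index = {}                  # page -> list of words recorded on that page
    seen = {start}              # pages already visited or queued
    frontier = deque([start])   # pages still waiting to be visited
    while frontier:
        page = frontier.popleft()           # 1. visit a page
        text, links = web[page]
        index[page] = text.split()          # 2. record all the words on it
        for link in links:                  # 3. pick links not seen before
            if link not in seen and link in web:
                seen.add(link)
                frontier.append(link)       # 4. follow the link later
    return index

index = crawl("a.html", TOY_WEB)
print(sorted(index))
```

The `seen` set is what "remembering where you've been" looks like in code, and `frontier` is "remembering where you haven't been".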

Crawling the Web  One person with a Web browser, following one link per second.  How long does it take to browse the surface Web (8 billion pages)? 8 billion seconds = 133 million minutes = 2.2 million hours = 93 thousand days = about 254 years

Crawling the Web  How many people would it take to crawl the surface Web in a week? If each person follows one link per second (with no sleep): One week = about six hundred thousand seconds. Eight billion / six hundred thousand = about thirteen thousand people
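The crawl-time arithmetic can be checked with a short calculation. This sketch assumes the slides' figures of 8 billion pages and one link per second; the variable names are made up for illustration.

```python
PAGES = 8_000_000_000                   # surface Web size used in the slides
SECONDS_PER_WEEK = 7 * 24 * 60 * 60     # 604,800 seconds
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # ignoring leap years

# One person, one page per second: total seconds converted to years.
years_alone = PAGES / SECONDS_PER_YEAR

# Crawlers needed so that everyone together finishes in one week.
people_for_one_week = PAGES / SECONDS_PER_WEEK

print(f"one person alone: about {years_alone:.0f} years")
print(f"finish in a week: about {people_for_one_week:,.0f} people")
```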

Challenges:  Remembering where you’ve been  Remembering where you haven’t been  Storing all the data

A (small) Server Farm

The Deep Web  Not all pages get crawled:  Private pages on Intranets (company networks)  Pages that people don’t want crawled  Dynamic content pages (from databases)  Dynamic content pages make the size of the Web effectively infinite!

Dynamic Content Example  zillow.com  Won’t be indexed

Identifying High Quality Web Pages  Google has ranked billions of Web pages by "quality".  You enter your search terms: UNC Charlotte HCI  Google finds the highest quality page associated with these search terms.

Google Pagerank Pretend you're surfing the Web randomly. To move from page to page you could: 1) type in an address (this includes using a bookmark) OR 2) follow a link. Pagerank measures how likely you are to reach a particular page through random surfing (either 1 or 2). The main idea is that links to your page from important web pages indicate that your page is important.

Computing Pagerank (what’s the probability of getting to this page?)
Q = Web page
A, B, C, ... = pages pointing to Q
L(A), L(B), L(C), ... = number of links on each of those pages
Pagerank of Q: R(Q) = (1-d) + d · (R(A)/L(A) + R(B)/L(B) + ...)
d represents the relative chance of following a link to page Q and 1-d represents the relative chance of going directly to page Q (via typing in the address or using a bookmark). Usually these are: d = 0.9, (1-d) = 0.1
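As a rough sketch, the Pagerank formula above translates directly into a small function. The name `pagerank_update` and the example numbers are assumptions made for illustration; only the formula and d = 0.9 come from the slide.

```python
def pagerank_update(incoming, d=0.9):
    """Return R(Q) given the pages that point to Q.

    incoming: list of (R(P), L(P)) pairs, one per page P linking to Q,
              where L(P) is the number of links on page P.
    """
    return (1 - d) + d * sum(rank / links for rank, links in incoming)

# Hypothetical example: Q is linked from A (rank 1.0, 2 links on A)
# and from B (rank 1.0, 1 link on B).
r_q = pagerank_update([(1.0, 2), (1.0, 1)])
print(round(r_q, 2))

# A page with no incoming links keeps only the (1 - d) share.
r_none = pagerank_update([])
print(round(r_none, 2))
```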

Computing Pagerank Pretend the Web has only four pages: W, X, Y, Z. Links: W → X, Y → W, Y → Z, Z → W. L(W)=1 L(X)=0 L(Y)=2 L(Z)=1 Which page has the highest “quality”?

Computing Pagerank Links: W → X, Y → W, Y → Z, Z → W. L(W)=1 L(X)=0 L(Y)=2 L(Z)=1
R(W) = (1-d) + d * (R(Y)/L(Y) + R(Z)/L(Z)) = 0.1 + 0.9 * (R(Y)/2 + R(Z)/1)
R(X) = 0.1 + 0.9 * R(W)
R(Y) = 0.1
R(Z) = 0.1 + 0.9 * (R(Y)/2)
Now, solve for: R(W), R(X), R(Y), R(Z)

Computing Values for R(W), R(X), R(Y) and R(Z) We could use algebra to find the values, in the same way we could solve for x and y in: x = 1 + 2x + y y = 2 + x + 3y

Algebraic Solution
w = R(W), x = R(X), y = R(Y), z = R(Z)
w = 0.1 + 0.45y + 0.9z
x = 0.1 + 0.9w
y = 0.1
z = 0.1 + 0.45y
Solving: y = 0.1, z = 0.145, w = 0.2755, x = 0.34795
But solving for eight billion variables is hard. Instead, we'll use fixed point iteration.
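For the four-page example, the algebraic route can be followed by plain substitution. This sketch assumes d = 0.9 and the link structure given earlier (W → X, Y → W, Y → Z, Z → W). Substitution works here only because no page links back into a cycle; in general the equations would have to be solved simultaneously.

```python
d = 0.9

# Y has no incoming links, so only the (1 - d) term remains.
y = 1 - d
# Z is linked only from Y, and Y has 2 links.
z = (1 - d) + d * (y / 2)
# W is linked from Y (2 links) and from Z (1 link).
w = (1 - d) + d * (y / 2 + z / 1)
# X is linked only from W, and W has 1 link.
x = (1 - d) + d * (w / 1)

for name, value in [("W", w), ("X", x), ("Y", y), ("Z", z)]:
    print(f"R({name}) = {value:.5f}")
```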

Solution by Fixed-Point Iteration Start with initial estimates of PageRank for each page: R(W) = 1.0 R(X) = 1.0 R(Y) = 1.0 R(Z) = 1.0
Apply equations to compute new estimates:
new R(W) = 0.1 + 0.9 * (R(Y)/2 + R(Z)) = 0.1 + 0.9 * (1.0/2 + 1.0) = 1.45
new R(X) = 0.1 + 0.9 * R(W) = 0.1 + 0.9 * 1.0 = 1.0
new R(Y) = 0.1
new R(Z) = 0.1 + 0.9 * (R(Y)/2) = 0.1 + 0.9 * (1.0/2) = 0.55

Solution by Fixed-Point Iteration Start with updated estimates: R(W) = 1.45 R(X) = 1.0 R(Y) = 0.1 R(Z) = 0.55
Apply equations to compute new estimates:
new R(W) = 0.1 + 0.9 * (R(Y)/2 + R(Z)) = 0.1 + 0.9 * (0.1/2 + 0.55) = 0.64
new R(X) = 0.1 + 0.9 * R(W) = 0.1 + 0.9 * 1.45 = 1.405
new R(Y) = 0.1
new R(Z) = 0.1 + 0.9 * (R(Y)/2) = 0.1 + 0.9 * (0.1/2) = 0.145

Solution by Iteration [Table: iteration number with the estimates R(W), R(X), R(Y), R(Z) at each step] Compute new estimates from the old until the estimates stop changing. Note that this gives the same answer as the traditional algebraic approach, but it scales better.
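A minimal sketch of the fixed-point iteration for the four-page example, assuming d = 0.9, starting estimates of 1.0, and that every page is updated from the previous round's estimates. The function name, the tolerance, and the iteration cap are illustrative choices, not part of the slides.

```python
# Link structure from the four-page example: page -> pages it links to.
LINKS = {"W": ["X"], "X": [], "Y": ["W", "Z"], "Z": ["W"]}

def iterate_pagerank(links, d=0.9, tol=1e-9, max_iter=100):
    """Repeat the Pagerank update until the estimates stop changing."""
    ranks = {page: 1.0 for page in links}     # initial estimates
    history = [dict(ranks)]
    for _ in range(max_iter):
        new = {}
        for q in links:
            # Sum R(P)/L(P) over every page P that links to Q,
            # always using the estimates from the previous round.
            total = sum(ranks[p] / len(out)
                        for p, out in links.items() if q in out)
            new[q] = (1 - d) + d * total
        history.append(new)
        converged = max(abs(new[p] - ranks[p]) for p in links) < tol
        ranks = new
        if converged:
            break
    return ranks, history

ranks, history = iterate_pagerank(LINKS)
for i, row in enumerate(history):
    print(i, " ".join(f"R({p})={row[p]:.5f}" for p in "WXYZ"))

# Pages ordered from highest to lowest estimate.
ordering = sorted(ranks, key=ranks.get, reverse=True)
print(ordering)
```

Each round touches every link once, so the work grows with the number of links instead of requiring a huge system of equations to be solved in one step.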

Final Pageranks
highest: page X, R(X) = 0.34795
page W, R(W) = 0.2755
page Z, R(Z) = 0.145
lowest: page Y, R(Y) = 0.1

Final Pageranks [Diagram: link graph of pages Y, W, X, Z]

How does Google Use Pagerank?  You enter search terms, such as “UNC Charlotte HCI”  Google finds all the pages that have all those words on them  Of all those pages, Google will list the ones with the highest Pagerank first, but…  …Google also uses other ‘magic ingredients’: the trade secrets of its algorithms.
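The two steps described above (keep the pages containing every search term, then order them by Pagerank) can be sketched as follows. The page names, word sets, and rank values in `PAGES` are invented for illustration, and the sketch deliberately leaves out the undisclosed "magic ingredients".

```python
# Hypothetical mini-index: page -> words on the page and its Pagerank.
PAGES = {
    "hci.example/lab":  {"words": {"unc", "charlotte", "hci", "lab"}, "rank": 0.28},
    "news.example/unc": {"words": {"unc", "charlotte", "news"}, "rank": 0.35},
    "dept.example/hci": {"words": {"unc", "charlotte", "hci", "courses"}, "rank": 0.15},
}

def search(query, pages):
    """Return pages containing every query word, highest Pagerank first."""
    terms = set(query.lower().split())
    hits = [name for name, page in pages.items() if terms <= page["words"]]
    return sorted(hits, key=lambda name: pages[name]["rank"], reverse=True)

results = search("UNC Charlotte HCI", PAGES)
print(results)
```

Note that the page with the highest rank overall is left out because it lacks one of the search terms: matching comes first, ranking second.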

Introduction  Basic queries are somewhat limited  One or two keywords  Simple relationships  Limited syntax  Complex queries provide more power  Keywords and phrases can be connected to form more complex relationships  Search filters can be employed to limit results

Understanding Boolean Operators  Syntax  Rules for combining simple words to form complex sentences  Search engine syntax implemented by applying Boolean logic  George Boole

Understanding Boolean Operators  Boolean logic  Keywords act as nouns  Boolean operators act as conjunctions  They define the connections between keywords  Illustrated with Venn diagrams  John Venn

Understanding Boolean Operators [Venn diagram: WWW] All web pages containing the word cats

Understanding Boolean Operators [Venn diagram: WWW] All web pages containing the word dogs

Understanding Boolean Operators [Venn diagram: WWW] All web pages containing the words cats and dogs. Intersection of the two sets. Searches containing both words.

Understanding Boolean Operators [Venn diagram: WWW] All web pages containing the words cats or dogs. Union of the two sets. Searches containing either word.

Understanding Boolean Operators [Venn diagram: WWW] All web pages containing the words cats and not dogs. Exclusion of the dogs set. Searches containing one word but not the other.

Understanding Boolean Operators [Venn diagram: WWW] All web pages containing the words dogs and not cats. Exclusion of the cats set. Searches containing one word but not the other.

Understanding Boolean Operators  Boolean operators  AND  OR  NOT  Instruct the engine on how to combine keywords to produce results  Always use capital letters to avoid confusion with and, or, not as keywords

Understanding Boolean Operators  AND  All these keywords must be on the Web page  OR  These keywords may or may not be on the Web page  At least one of them must be  NOT  None of these keywords can be on the Web page
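The three operators map directly onto set operations, which is what the Venn diagrams depict. In this sketch the page names are made up; `cats` and `dogs` stand for the sets of pages containing each word.

```python
# Hypothetical sets of pages containing each keyword.
cats = {"page1", "page2", "page3"}
dogs = {"page3", "page4"}

cats_and_dogs = cats & dogs    # AND: intersection, both words present
cats_or_dogs = cats | dogs     # OR: union, at least one word present
cats_not_dogs = cats - dogs    # AND NOT: cats pages with dogs pages removed
dogs_not_cats = dogs - cats    # AND NOT the other way round

print(sorted(cats_and_dogs))
print(sorted(cats_or_dogs))
print(sorted(cats_not_dogs))
print(sorted(dogs_not_cats))
```

AND can only shrink the result set and OR can only grow it, which is why the slides describe AND as narrowing a search and OR as expanding it.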

Understanding Boolean Operators  Default operator  Some engines have a default Boolean operator  Usually AND  Might be OR  Some engines may search for multiple words as phrases

Understanding Boolean Operators  Boolean operators may be  Allowed on main page  Confined to Advanced search pages  Some engines use symbols instead  + for AND  - for NOT  No space between sign and word:  +solar +energy -windmill
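As an illustration of the symbol syntax, the sketch below splits a query such as `+solar +energy -windmill` into required and excluded words and tests a page against them. The function names are hypothetical, and real engines differ in how they treat unsigned words; here they are simply collected and not enforced.

```python
def parse_query(query):
    """Split a query into (required, excluded, plain) keyword lists."""
    required, excluded, plain = [], [], []
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])     # + means the word must appear
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])     # - means the word must not appear
        else:
            plain.append(token)            # no sign: left to the engine default
    return required, excluded, plain

def matches(page_words, required, excluded):
    """True if the page has every required word and no excluded word."""
    words = set(page_words)
    return all(w in words for w in required) and not any(w in words for w in excluded)

required, excluded, plain = parse_query("+solar +energy -windmill")
print(required, excluded, plain)
print(matches(["solar", "energy", "panels"], required, excluded))
print(matches(["solar", "energy", "windmill"], required, excluded))
```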

Narrowing Searches with AND  AND  Limits results  Forces inclusion of a stop word  Indicates that all keywords must be found on Web page  Adding more ANDed keywords limits search more  Results should be more relevant because the keyword list has expanded

Narrowing Searches with AND  Example:  “solar energy association” AND Portland [Venn diagram: WWW, “solar energy association”, Portland]

Narrowing Searches with AND  Example:  Henry +I same as “Henry I” [Venn diagram: WWW, Henry, I]

Expanding Searches with OR  OR expands results  Useful if you didn’t get enough returns from your first search  The more keywords you add, the more results you should get  Every page returned must have at least one of the keywords on it  Good to use when you have synonyms

Expanding Searches with OR  Example:  oregon OR northwest [Venn diagram: WWW, oregon, northwest]

Restricting Queries with AND NOT  AND NOT excludes the keyword that follows NOT  Limits your search  Produces fewer results  Useful if first search returns irrelevant results  Use AND NOT to get rid of those results

Restricting Queries with AND NOT  Equivalent forms:  cats AND NOT dogs  cats AND-NOT dogs  cats NOT dogs  cats -dogs

Restricting Queries with AND NOT  Example:  “solar energy association” AND portland AND NOT maine [Venn diagram: “solar energy association”, portland, maine]

Multiple Boolean Operators  Boolean operators allow you to focus a search  Any logical combination of operators is allowed  If it makes sense when spoken like a sentence it’s probably OK to use  Order of operations is usually left to right  Use parentheses to organize terms

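Multiple Boolean Operators  Bad example:  constitution +american OR “united states” [Venn diagram: constitution, american, “united states”] Read left to right, this groups as (constitution +american) OR “united states”, so pages that mention only “united states” also match.

Multiple Boolean Operators  Good example:  constitution +(american OR “united states”) [Venn diagram: constitution, american, “united states”] The parentheses make constitution required alongside either of the other two terms.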
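The difference between the bad and good examples can be seen with sets. This sketch assumes the engine evaluates left to right and treats `+` like AND; the page names are invented for illustration.

```python
# Hypothetical sets of pages containing each term.
constitution = {"p1", "p2", "p5"}
american = {"p1", "p3"}
united_states = {"p2", "p4"}

# Bad example, read left to right: (constitution AND american) OR "united states"
left_to_right = (constitution & american) | united_states

# Good example: constitution AND (american OR "united states")
grouped = constitution & (american | united_states)

print(sorted(left_to_right))
print(sorted(grouped))
```

In the ungrouped form, page `p4` slips in even though it never mentions the constitution; the parentheses remove it.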