LIS618 lecture 5 Thomas Krichel 2004-03-07

structure
Google “theory”, mainly page rank
Google query language
Google special services and features
–Images
–Groups
–ODP

literature
Brin and Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, google.html
Calishain and Dornfest, “Google Hacks”, O'Reilly, 2003
Schneider et al., “How to Do Everything with Google”, McGraw-Hill Osborne, 2004
Google web site

web information retrieval
We can think of the web as a pile of documents called pages.
Some pages are hard to index
–PDF documents (some search engines index them, some don't)
–Pictures
–Sound files
But a majority of pages are written in HTML
–easy to index
–have a loose structure

Google uses the structure of HTML
Google finds the title of the page, i.e. the contents of the <title> element.
Google analyzes headings and large font sizes and gives priority weight to terms found there.
Most importantly, Google uses the link structure of the web to find important pages.

classic IR and the web
In classic information retrieval, every document has the same importance. Documents differ only in their relevance to a query.
In classic information retrieval, a document d is relevant if the query terms appear relatively more frequently in d than in other documents.
But if a web page contains the words “Bill Clinton sucks” and a picture, it is not a good hit for “Bill Clinton”.
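
The frequency-based notion of relevance described above can be sketched with tf-idf weighting. The slide does not name a specific scheme, so tf-idf, the function name `tf_idf_scores` and the three toy documents are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def tf_idf_scores(query, documents):
    """Score each document by summing tf-idf weights of the query terms."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Term frequency: share of this document made up of the term.
            tf = counts[term] / len(tokens) if tokens else 0.0
            # Document frequency: number of documents containing the term.
            df = sum(1 for other in tokenized if term in other)
            # Inverse document frequency: terms that are rare elsewhere weigh more.
            idf = math.log(n_docs / df) if df else 0.0
            score += tf * idf
        scores.append(score)
    return scores


# Hypothetical toy collection, not taken from the lecture.
docs = [
    "bill clinton was the 42nd president of the united states",
    "the united states has fifty states",
    "the weather is fine today",
]
scores = tf_idf_scores("bill clinton", docs)
best = scores.index(max(scores))
print(best, scores)
```

Every document counts the same here; only term frequencies decide the ranking. That is exactly the limitation the slide points out for the web.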

Google finds important pages
The idea is that the documents on the web have different degrees of importance.
Google will show the most important pages first.
The reasoning is that more important pages are likely to be more relevant to any query than non-important pages.

Google's monkey
Imagine that the web has P pages. Each page has its own address (URL).
Imagine a monkey who sits at a terminal. He follows links at random, but on rare occasions he gets bored and types in the address of a random page out of those P.
Will the monkey visit all pages with equal probability?

page rank
The Google page rank of a page is the probability that Google's monkey will visit the page.
–The monkey will come frequently to pages that have a lot of links to them.
–Once he is on such an important page, he will likely go on to one of the pages that it links to.
The structure of all the links on the entire web reveals the importance of the page.
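
To make the monkey concrete, here is a minimal simulation sketch. The four-page web `toy_web`, the boredom probability of 0.25, the uniform jump when bored, and the treatment of pages without outgoing links are all assumptions made for illustration.

```python
import random


def simulate_monkey(links, bored_probability, steps, seed=0):
    """Estimate visit probabilities by letting a random surfer walk the web."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    pages = sorted(links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        # Assumption: a page without outgoing links also triggers a random jump.
        if rng.random() < bored_probability or not links[current]:
            current = rng.choice(pages)           # bored: type in a random address
        else:
            current = rng.choice(links[current])  # follow a random outgoing link
    return {page: count / steps for page, count in visits.items()}


# Hypothetical toy web: each page maps to the pages it links to.
toy_web = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": ["A"]}
frequencies = simulate_monkey(toy_web, bored_probability=0.25, steps=200_000)
for page, share in sorted(frequencies.items()):
    print(page, round(share, 3))
```

Page A receives links from two pages and is visited more often than D, which receives only half of C's outgoing links. The visit shares are estimates and vary slightly with the seed and the number of steps.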

many page ranks
There is an infinite number of ways to calculate the page rank, depending on
–how likely the monkey is to get bored.
–the probability with which the bored monkey visits each page.
Potentially, there is a page rank for each user of the web.
Google tries to observe users and may be associating personal page ranks with them.

notation
Assume that the monkey gets bored with probability d.
If bored, it will visit page p with probability π_p.
For any page p, let o_p be the number of its outgoing links.
Let l(p',p) be the number of links from page p' to page p.

page rank formula
The page rank of a page p is
r_p = π_p d + (1-d) ∑_{p'} l(p',p) r_p' / o_p'
where the sum runs over all pages p'.
In words, it is the likelihood that the monkey, if bored, goes to page p, plus, when he is not bored, the likelihood that he gets there from another page p'.
The likelihood of getting there from p' is the likelihood of being at p', times the number of links from p' to p, divided by the number of outgoing links on p'.

example
Let there be a web of four pages A, B, C, D.
A links to B.
B links to C.
C links to A and D.
D links to A.
Let the probability to get bored be ¼ and let there be a ¼ chance to move to any page when bored.

page ranks
The following system calculates the ranks
r_A = ¼ · ¼ + ¾ (r_C / 2 + r_D)
r_B = ¼ · ¼ + ¾ r_A
r_C = ¼ · ¼ + ¾ r_B
r_D = ¼ · ¼ + ¾ r_C / 2
Since solving such a system is fairly complicated for a large web, Google uses an iterative approximation to calculate the ranks.
Note that the sum of all ranks is 1.
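
The iterative approximation can be sketched as repeated substitution into the page rank formula until the ranks stop changing. This is a minimal sketch for the four-page example, not Google's implementation; the function name `page_rank`, the stopping tolerance and the uniform starting ranks are assumptions.

```python
def page_rank(links, d, pi, tolerance=1e-12, max_iterations=1000):
    """Iterate r_p = pi_p * d + (1 - d) * sum over p' of l(p', p) * r_p' / o_p'."""
    pages = list(links)
    ranks = {p: 1.0 / len(pages) for p in pages}  # assumed starting point
    for _ in range(max_iterations):
        new_ranks = {}
        for p in pages:
            # links[q].count(p) is l(q, p); len(links[q]) is o_q.
            inflow = sum(
                links[q].count(p) * ranks[q] / len(links[q])
                for q in pages
                if links[q]
            )
            new_ranks[p] = pi[p] * d + (1 - d) * inflow
        change = max(abs(new_ranks[p] - ranks[p]) for p in pages)
        ranks = new_ranks
        if change < tolerance:
            break
    return ranks


# The four-page web of the example: A -> B, B -> C, C -> A and D, D -> A.
links = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": ["A"]}
pi = {p: 0.25 for p in links}  # a 1/4 chance for each page when bored
ranks = page_rank(links, d=0.25, pi=pi)
for page, rank in sorted(ranks.items()):
    print(page, round(rank, 4))
print("sum:", round(sum(ranks.values()), 6))
```

Each pass shrinks the remaining error by a factor of about 1-d, so with d = ¼ the ranks settle after a few dozen passes. Solving the system by hand gives r_A = 371/1292, which the iteration reproduces.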

interfaces
The simple interface has command-driven features that make it more advanced than the advanced interface.
The advanced interface is a form interface to the query language available on the simple interface.
The toolbar, which works with Microsoft Internet Explorer, is the one to get! It has a whole host of features, explore them.

customizing google
These settings are available through the "preferences" link on the Google home page.
–preferences for finding pages in a certain language, e.g. set the language to German and search for “Krichel”
–preferences for the language of the interface
–SafeSearch, i.e. automatic exclusion of explicit erotic material
–number of results per page
–open a new window when accessing web pages?

composing the query
You should phrase your query the way the answer would be worded on a web site.
–To find the age of N.N., search for “N.N. born”.
–To find the word not likely there in, say “visa application procedure documentation”
For queries that are obvious, like “thomas krichel”, use the “I’m feeling lucky” button.

query language I
default Boolean AND between terms
case insensitive
terms can be ORed with "OR" or "|"
adjacent terms have to be put in double quotes
Boolean NOT can be expressed with the exclusion operator "-"
–Example: “krichel -thomas”

query language II
* is a wildcard for any word
+ is the inclusion operator
“+stopword” requires the presence of the stop word stopword.
But the list of stop words has not been published. In fact it varies from query to query.
There is a limit of 10 words, but a * does not count towards the limit

query language III
~ is the similarity operator. It searches for synonyms.
Synonyms are gleaned from the web, not from a thesaurus. This seems to include common spellings as well.
–example: “~car shop” vs “car shop”
" " is used for phrase searching and puts pages where the query terms appear next to each other first.
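
The operators on the query language slides can be combined mechanically. The helper names below (`phrase`, `exclude`, `similar`, `any_of`, `query`) are hypothetical and only assemble query strings; they do not contact Google.

```python
def phrase(words):
    """Wrap adjacent terms in double quotes for phrase searching."""
    return '"' + words + '"'


def exclude(term):
    """Prefix a term with the exclusion operator."""
    return "-" + term


def similar(term):
    """Prefix a term with the similarity operator."""
    return "~" + term


def any_of(*terms):
    """Join alternatives with OR; terms outside the group stay ANDed."""
    return "(" + " OR ".join(terms) + ")"


def query(*parts):
    """Join the parts with spaces, which Google treats as Boolean AND."""
    return " ".join(parts)


q1 = query("krichel", exclude("thomas"))
q2 = query(similar("car"), "shop")
q3 = query(phrase("classical music"), any_of("inurl:mailman", "inurl:listserv"))
print(q1)
print(q2)
print(q3)
```

The parts are kept in the order given, since word order in the query can influence which pages come first.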

query treatment
Google prefers pages that have the search terms
–in close proximity
–in the same order as in the query
Repeating a query term once adds weight to it; repeating it twice has no further effect.

spell checking
Google makes spelling suggestions. They are based on the usage of query terms, not on a dictionary.
–example: “untied stats” suggests “united states”
–example: “beurocratic”
But note that these suggestions depend on your interface language.

phone numbers
The following are recognized as white pages phone number lookups (optional components in [])
–business name, city, state
–business name, zip code
–first name (or initial) last name, [city], [state]
–first name (or initial) last name, zip code
–first name (or initial) last name, area code
This works for the English interface, US only.

maps
The following are recognized as map queries
–city, [state]
–street address, zip code
–street address, city, [state], [zip code]
Again, US and English interface only

math I
You can enter math in plain English words
–example: “two times two”
–example: “half of eleven”
–example: “five megabytes in bytes”
–example: “ten gallons in liters”
Google also knows the standard operators “+”, “-”, “*”, “/”, “^” or “**”, “% of”

advanced math
“!” factorial
“choose” combination without replacement
“sqrt” square root
“log” logarithm (base 10)
“ln” logarithm (base e)
“lg” logarithm (base 2)
“exp” e to the power of
“mod” modulo (remainder)
“sin”, “cos”, “tan”
“csc”, “sec”, “ctn”
“arcsin”, “arccos”, “arctan”
“arccsc”, “arcsec”, “arcctn”
“sinh”, “cosh”, “tanh”
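
To cross-check what the calculator keywords mean, the sketch below evaluates a few of them with Python's standard `math` module. The sample arguments are arbitrary choices for illustration, and the keys of `results` only mimic how such a query might be typed.

```python
import math

# Each key shows a calculator-style expression; each value is the Python equivalent.
results = {
    "5!": math.factorial(5),          # factorial
    "5 choose 2": math.comb(5, 2),    # combination without replacement
    "sqrt(16)": math.sqrt(16),        # square root
    "log(1000)": math.log10(1000),    # logarithm, base 10
    "ln(e)": math.log(math.e),        # logarithm, base e
    "lg(8)": math.log2(8),            # logarithm, base 2
    "exp(0)": math.exp(0),            # e to the power of
    "17 mod 5": 17 % 5,               # modulo (remainder)
}
for expression, value in results.items():
    print(expression, "=", value)
```

Note that Python's own `math.log` is the natural logarithm, whereas “log” in the table above is base 10.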

special syntax I
intitle: find in the HTML <title> only
–example “intitle:lis618”
–example “intitle:"Thomas Krichel"”
intext: find in text only. This will exclude occurrences of the search term in anchor or title data.
–example: “intext:"miserable failure"” will not bring up George W. Bush's official biography, as “miserable failure” does, or did at one time.

special syntax II
inanchor: This option requests pages to which another page links using the anchor text given in the query.
–example: “inanchor:"list of my courses"” finds my courses page because it has a link with that text from my homepage.
link: apparently returns pages that link to a specific page, but I have no good example

special syntax III
cache: pages that are in the Google cache, useful if the query result has nothing to do with the query terms
–cache:openlib.org/home/krichel will show the cached version of the page
–cache:wotan.liu.edu/home/krichel is not there, I screwed up
If you add further terms, they will be highlighted.

special syntax IV
inurl: find in URL only, can use star as a wildcard
–example “inurl:list”
–example “inurl:*.openlib.org”
site: domain of page
–example “site:liu.edu”
–breaks down if a path is included
–cannot be used on its own, only with other query expressions

special syntax V
daterange: limits the search to pages indexed within a range of dates.
When the crawler visits a page, changed pages are reindexed; unchanged pages are not.
Dates are expressed in the Julian period, i.e. the number of days after noon (12:00 UTC) on January 1, 4713 BC of the Julian calendar. This date system is used by astronomers.
Find a converter with the Google search “julian converter site:nasa.gov”
example: “daterange: krichel”
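
Instead of looking for a converter, the Julian day number can be computed locally. This sketch uses the fact that Python's `date.toordinal()` counts days from 0001-01-01, which is a fixed offset away from the Julian day count. The `daterange:start-end` form is an assumption here, since the slide's own example has lost its numbers; the helper names and sample dates are hypothetical.

```python
from datetime import date

# Julian day number of a date = its proleptic Gregorian ordinal + this offset.
# Check: 2000-01-01 has ordinal 730120 and Julian day number 2451545.
JDN_OFFSET = 1721425


def julian_day_number(day):
    """Return the Julian day number that begins at noon UTC on the given date."""
    return day.toordinal() + JDN_OFFSET


def daterange_query(start, end, terms):
    """Build a query string in the assumed form daterange:start-end terms."""
    return "daterange:%d-%d %s" % (
        julian_day_number(start),
        julian_day_number(end),
        terms,
    )


jdn_2000 = julian_day_number(date(2000, 1, 1))
example = daterange_query(date(2004, 1, 1), date(2004, 3, 7), "krichel")
print(jdn_2000)
print(example)
```

Whole day numbers are enough here because the operator works on whole days.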

special syntax VI
filetype: is described in the official documentation as finding files of a certain type. In fact it finds files with names ending in the conventional extension for the type.
–Adobe Portable Document Format (pdf) and PostScript (ps)
–Lotus (wk1, wk2, wk3, wk4, wk5, wki, wks, wku)
–Lotus WordPro (lwp)
–MacWrite (mw)
–Microsoft Excel (xls), PowerPoint (ppt), Word (doc), Works (wks, wps, wdb), Write (wri)
–Rich Text Format (rtf)
–Text (ans, txt)

special syntax VII
info: shows information about a page. The argument to info: must be an existing page that is in the Google index.
–example: “info:openlib.org/home/krichel”
related: shows pages that Google thinks are related to the page.
–example: “related:openlib.org/home/ila”

mixing special syntax expressions
The link: syntax does not mix with others.
Other bad ideas:
–"site:openlib.org -inurl:openlib"
–"site:edu site:com"
Things that work well
–intitle:search
–intitle:biology inurl:help

examples
George Bush site:nytimes.com
"Copyright * The New York Times" "George Bush"
intitle:"directory * * trees"
Botany intitle:"directory of" site:edu
"powered by blogger" or site:blogspot.com
"classical music" (inurl:mailman | inurl:listserv)
google special syntax -site:google.com

phonebook: special syntax
A location seems to be required, e.g. “phonebook: long island university ny”
no
–wildcards
–exclusions
–OR
there is also
–rphonebook for residential
–bphonebook for businesses

stocks on google stocks: ticker will look up a ticker symbol ticker at you can find ticker symbols there ticker symbols are useful to find financial information about publicly traded companies. example: “stocks:msft”

google images special syntax
intitle: searches for images with a given string in the file name
–example: “intitle:novosibirsk”
inurl: searches for images in pages that have a certain URL
–example: “inurl:liu.edu”
site: restricts the search to a certain site. It should be combined with a search term.
–example: “site:liu.edu koenig”

Thank you for your attention!