Advanced Googolgy Based on a presentation by Patrick Douglas Crispen California State University, Long Beach

1 Advanced Googolgy Based on a presentation by Patrick Douglas Crispen California State University, Long Beach http://www.youtube.com/watch?v=JFyNzrxQCus http://netsquirrel.com/crispen/about_crispen.html

2 Our Goals Learn how Google really works. Discover some Google secrets no one ever tells you. Play around with some of Google’s advanced search operators. Find out where to get more Google-related help and information. DO ALL OF THIS IN ENGLISH!

3 Part One: How Google REALLY Works Or, at least, how I think Google really works.

4 One Word of Warning For obvious reasons, the folks at Google would rather the Wizard of Oz stay behind the curtain, so to speak. So, what you are about to see on the next few slides are just plain guesses on my part. And, my guesses are probably completely wrong! But they’re pretty. And that’s all that matters.

5 How Google Works - Phrases When you search for multiple keywords, Google first searches for all of your keywords as a phrase. I think. So, if your keywords are disney fantasyland pirates, any pages on which those words appear as a phrase receive a score of X. Image source: Google Source: Google Hacks, p. 21

6 How Google Works - Adjacency Google then measures the adjacency between your keywords and gives those pages a score of Y. What does this mean in English? Well … Image source: Google Source: Google Hacks, p. 21

7 How Adjacency Works A page that says “My favorite Disney attraction, outside of Fantasyland, is Pirates of the Caribbean” will receive a higher adjacency score than a page that says “Walt Disney was both a genius and a taskmaster. The team at WDI spent many sleepless nights designing Fantasyland. But nothing could compare to the amount of Imagineering work required to create Pirates of the Caribbean.”
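One way to picture adjacency is to measure the shortest run of words that contains every keyword. The sketch below does just that for the two example pages. It is only an illustration of the idea on this slide: Google's real scoring is not public, and the names tokenize and smallest_span are made up for this example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())

def smallest_span(text, keywords):
    """Length in words of the shortest window containing every keyword.

    Returns None if any keyword is missing from the text.
    """
    wanted = {k.lower() for k in keywords}
    last_seen = {}  # keyword -> most recent position
    best = None
    for i, word in enumerate(tokenize(text)):
        if word in wanted:
            last_seen[word] = i
            if len(last_seen) == len(wanted):
                span = i - min(last_seen.values()) + 1
                if best is None or span < best:
                    best = span
    return best

keywords = ["disney", "fantasyland", "pirates"]
close = ("My favorite Disney attraction, outside of Fantasyland, "
         "is Pirates of the Caribbean")
far = ("Walt Disney was both a genius and a taskmaster. The team at WDI "
       "spent many sleepless nights designing Fantasyland. But nothing could "
       "compare to the amount of Imagineering work required to create "
       "Pirates of the Caribbean.")

print(smallest_span(close, keywords))  # 7
print(smallest_span(far, keywords))    # a much wider window
```

A smaller span means the keywords sit closer together, so the first page would earn the higher adjacency score.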

8 How Google Works - Weights Then, Google measures the number of times your keywords appear on the page (the keywords’ “weights”) and gives those pages a score of Z. A page that has the word disney four times, fantasyland three times, and pirates seven times would receive a higher weights score than a page that only has those words once. Source: Google Hacks, p. 21
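The weights idea boils down to counting. Here is a minimal sketch, assuming the weights score is simply the total number of keyword occurrences on the page. That is my simplification, not a documented Google formula, and keyword_weights and weight_score are hypothetical names.

```python
import re
from collections import Counter

def keyword_weights(text, keywords):
    """Count how many times each keyword appears in the text."""
    counts = Counter(re.findall(r"[a-z0-9']+", text.lower()))
    return {k: counts[k.lower()] for k in keywords}

def weight_score(text, keywords):
    """Assumed score: the sum of the individual keyword counts."""
    return sum(keyword_weights(text, keywords).values())

keywords = ["disney", "fantasyland", "pirates"]
busy_page = "disney " * 4 + "fantasyland " * 3 + "pirates " * 7
sparse_page = "disney fantasyland pirates"

print(keyword_weights(busy_page, keywords))
print(weight_score(busy_page, keywords), weight_score(sparse_page, keywords))
```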

9 Putting it All Together Google takes –The phrase hits (the Xs), –The adjacency hits (the Ys), –The weights hits (the Zs), and –About 100 other secret variables Throws out everything but the top 2,000 Multiplies each remaining page’s individual score by its “PageRank” And, finally, displays the top 1,000 in order.

10 PageRank? There is a premise in higher education that the importance of a research paper can be judged by the number of citations the paper has from other research papers. Google simply applies this premise to the Web: the importance of a Web page can be judged by the number of hyperlinks pointing to it from other pages. Or, to put it mathematically [brace yourself – the next slide contains the intimidating-looking equation I warned you about] … Source: Google Hacks, p. 294

11 The PageRank Algorithm PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) Where PR(A) is the PageRank of page A, T1 ... Tn are the pages that link to page A, PR(T1) is the PageRank of page T1, C(T1) is the number of outgoing links from the page T1, and d is a damping factor in the range of 0 < d < 1, usually set to 0.85 Source: Google Hacks, p. 295

12 You Can Start Breathing Again I promise there are no more equations in this presentation. I just wanted to show you that the PageRank of a Web page is built from the PageRanks of all the pages linking to it, each divided by the number of outgoing links on that linking page. –A page with a lot of (incoming) links to it is deemed to be more important than a page with only a few links to it. –A link from a page with few (outgoing) links counts for more than a link from a page with links to lots of other pages. Source: Google Hacks, p. 295
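For anyone who does want to see the numbers move, here is a small sketch that applies the published PageRank formula, PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), over and over to a toy three-page web. The link graph, the starting value of 1.0, and the fixed 50 iterations are assumptions chosen for illustration. The sketch also assumes every page has at least one outgoing link.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate the PageRank formula on a small link graph.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each linking page shares its PageRank among its outgoing links.
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Toy web: A links to B and C, B links to C, C links back to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C comes out on top because it collects links from both A and B, while B only gets half of A's vote.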

13 Part One: In Summary Google first searches for your keywords as a phrase and gives those hits a score of X. Google then searches for keyword adjacency and gives those hits a score of Y. Google then looks for keyword weights and gives those hits a score of Z. Google combines the Xs, the Ys, the Zs, and a whole bunch of unknown variables, and then weeds out all but the top 2,000 scores. Finally, Google takes the top 2,000 scores, multiplies each by their respective PageRank, and displays the top 1,000. I think.
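The whole guess from Part One can be written down in a few lines. This sketch assumes the Xs, Ys, and Zs are simply added together, and it leaves out the hundred or so secret variables entirely. The field names, the rank_pages function, and the sample numbers are all hypothetical; nothing here is Google's actual code.

```python
def rank_pages(pages, keep=2000, show=1000):
    """Order pages the way the summary slide guesses Google does.

    pages is a list of dicts with 'url', 'x', 'y', 'z', and 'pagerank'.
    """
    def base_score(page):
        # Assumption: the phrase, adjacency, and weights scores are summed.
        return page["x"] + page["y"] + page["z"]

    # Step 1: keep only the best base scores.
    shortlist = sorted(pages, key=base_score, reverse=True)[:keep]
    # Step 2: multiply each survivor's score by its PageRank and re-sort.
    final = sorted(shortlist,
                   key=lambda p: base_score(p) * p["pagerank"],
                   reverse=True)
    return [p["url"] for p in final[:show]]

pages = [
    {"url": "a", "x": 3, "y": 2, "z": 1, "pagerank": 1},
    {"url": "b", "x": 1, "y": 1, "z": 1, "pagerank": 5},
    {"url": "c", "x": 0, "y": 0, "z": 1, "pagerank": 100},
]
print(rank_pages(pages, keep=2, show=2))  # ['b', 'a']
```

Notice that page c never makes it to the PageRank step, no matter how popular it is, because its keyword scores were too low to survive the first cut.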

14 Part Two: More Stuff No One Tells You Google’s shocking secrets revealed!

15 Many keywords and phrases You can search for several keywords. To search for phrases, just put your phrase in quotes. For example, disney fantasyland “pirates of the caribbean” –This would show you all the pages in Google’s index that contain the word disney AND the word fantasyland AND the phrase pirates of the caribbean (without the quotes) –And then pages that include 2 (or 1) of these By the way, while this search is technically perfect, my choice of keywords contains a (deliberate) factual mistake. Can you spot it? Source: http://www.google.com/help/refinesearch.html

16 Arr, She Blows! Pirates of the Caribbean isn’t in Fantasyland, it’s in Adventureland in Orlando and Paris, and New Orleans Square in Anaheim. So searching for disney AND fantasyland AND “pirates of the caribbean” probably isn’t a good idea. YOU have to select sensible search terms! Image source: http://www.balgavy.at/

17 How Insensitive! Google is not case sensitive. So, the following searches all yield exactly the same results: –disney fantasyland pirates –Disney Fantasyland Pirates –DISNEY FANTASYLAND PIRATES –DiSnEy FaNtAsYlAnD pIrAtEs Source: http://www.google.com/help/basics.html

18 Google’s 10 Word Limit Google didn’t accept more than 10 keywords at a time. Any keyword past 10 was ignored. Why search for a long list of words? Baroni & Bernardini (2004) wanted to find new Italian medical terms. They had a list of 100+ known medical terms, used Google to find docs with some of these, and then examined the new words in these docs.

19 Searching for more than 10 How could you get around this limit? –Stopwords –Wildcards –BootCat and Google API

20 Stop Words To enhance the speed and relevancy of your Web search, Google routinely and automatically ignores common words and characters known as “stop words.” Source: http://www.google.com/press/guide/reviewguide_7.html

21 Stop _ _ Name _ Love This is certainly not a canonical list, but here are 28 stop words I know about. a, about, an, and, are, as, at, be, by, from, how, i, in, is, it, of, on, or, that, the, this, to, we, what, when, where, which, with You can force Google to search for a stop word by putting a + in front of it, for example pirates +of +the caribbean Compare knowledge of management with knowledge +of management Source: 10/23/02 post by Bill Todd to news:google.public.support.general
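To see what the stop list does to a query, here is a minimal sketch that drops the 28 stop words above unless they carry a leading +. The function name effective_terms is made up, and this only models the behavior described on the slide, not Google's actual query parser.

```python
STOP_WORDS = {
    "a", "about", "an", "and", "are", "as", "at", "be", "by", "from",
    "how", "i", "in", "is", "it", "of", "on", "or", "that", "the",
    "this", "to", "we", "what", "when", "where", "which", "with",
}

def effective_terms(query):
    """Return the terms that would actually be searched for."""
    terms = []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            terms.append(token[1:])   # a + forces the word to be kept
        elif token not in STOP_WORDS:
            terms.append(token)
    return terms

print(effective_terms("pirates of the caribbean"))
# ['pirates', 'caribbean']
print(effective_terms("pirates +of +the caribbean"))
# ['pirates', 'of', 'the', 'caribbean']
```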

22 Dealing with the 10 Word Limit Omit the stop words in your search terms and you’ll probably never run into the 10 word limit. 2 other ways around the limit: use wildcards; use BootCat and Google API Image source: http://www.alloyd.com/

23 Google and Wildcards Google doesn’t support stemming. Rather, Google offers full-word wildcards. For example, if you search Google for it’s +a * world, Google shows you all of the pages in its database that contain the phrase “it’s a small world” … and “it’s a nano world” … and “it’s a Linux world” … and so on. Source: Google Hacks, p. 37

24 it’s +a * world The + before a is required because it is a stop word and would otherwise be ignored. Most of the hits are phrases because that’s what Google looks for first. Oh, and I defy you to get that song out of your head! Image source: http://themeparksource.com/

25 Wildcards and the Word Limit Remember when I said that one way to get around the 10 word limit was to use wildcards? Google doesn’t count wildcards toward the limit. For example, Google thinks that though * mountains divide * * oceans * wide it's * small world after all is exactly 10 words long. Source: Google Hacks, p. 19
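The counting rule is easy to check for yourself. This sketch counts every token except a bare * and trims a query to the limit, assuming (as the slide says) that wildcards are free and anything past the tenth real word is ignored. The function names are hypothetical.

```python
def counted_words(query):
    """Tokens that count toward the limit: everything except a bare *."""
    return [token for token in query.split() if token != "*"]

def apply_word_limit(query, limit=10):
    """Keep wildcards, but stop at the first word past the limit."""
    kept, counted = [], 0
    for token in query.split():
        if token != "*":
            if counted == limit:
                break
            counted += 1
        kept.append(token)
    return " ".join(kept)

query = ("though * mountains divide * * oceans * wide "
         "it's * small world after all")
print(len(query.split()))         # 15 tokens in total
print(len(counted_words(query)))  # 10 words that count
```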

26 BootCat and Google API How can I find documents with (some of) a LONG list of keywords? E.g. a list of medical terms: docs with (some of) these may include NEW medical terms I didn’t know. BootCat (Baroni & Bernardini 2004) generates random sub-lists (up to 10 keywords from the list), calls Google API to find docs, and downloads them to collect a CORPUS (Web-as-Corpus research).
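The random sub-list step can be sketched in a few lines. This is not BootCat's actual code, just an illustration of the idea: draw random keyword subsets no larger than the 10 word limit and turn each into a query string. The function name random_queries and the placeholder terms are assumptions, and the sketch stops short of calling any search API.

```python
import random

def random_queries(seed_terms, how_many, max_terms=10, rng=None):
    """Build query strings from random subsets of a long keyword list."""
    rng = rng or random.Random()
    terms = list(seed_terms)
    size = min(max_terms, len(terms))
    # rng.sample picks `size` distinct terms for each query.
    return [" ".join(rng.sample(terms, size)) for _ in range(how_many)]

# Placeholder seed list standing in for 100+ real medical terms.
seed_terms = [f"term{i}" for i in range(100)]
queries = random_queries(seed_terms, how_many=5, rng=random.Random(0))
for q in queries:
    print(q)
```

Each query would then be sent to a search API, and the pages that come back would be downloaded to build the corpus.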

27 The Order of Your Keywords Matters When you conduct a search at Google, it searches for –Phrases, then –Adjacency, then –Weights. Because Google searches for phrases first, the order of your keywords matters. Image source: Google Source: Google Hacks, p. 20-22

28 For Example A search for disney fantasyland pirates yields the same number of hits as a search for fantasyland disney pirates, but the order of those hits – especially the first 10 – is noticeably different.

29 Part Two: In Summary You can search with several keywords. Capitalization does not matter. Google has a hard limit of 10 keywords PER QUERY. Google ignores a BUNCH of common words – STOPLIST. Google does support wildcard searches … sort of. BootCat takes a LONG list of keywords and sends subsets to Google API, to gather a CORPUS of matching web-pages. The order of your keywords matters.

30 Part Three: Advanced Search Operators Beyond plusses, minuses, ANDs, ORs, quotes, and *s

31 filetype: filetype: restricts your results to files ending in ".doc" (or .xls, .ppt, etc.), and shows you only files created with the corresponding program. There can be no space between filetype: and the file extension. The “dot” in the file extension – .doc – is optional. Source: http://www.google.com/help/faq_filetypes.html

32 filetype:extension pirates filetype:pdf pirates -filetype:pdf

33 intitle: intitle: restricts the results to documents containing a particular word in their titles. There can be no space between intitle: and the following word. You can also search for phrases. Just put your phrase in quotes. Source: http://www.google.com/help/operators.html

34 Title? Pirates of the Caribbean...

35 intitle:terms intitle:pirates intitle:"knowledge management"

36 A Quick Question What would happen if I searched for intitle:walt disney (without the quotes)? Google would look for every page with the word walt in its title AND the word disney somewhere in its body. Remember, the quotes are kind of important if you want to search for phrases using intitle:
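A toy parser makes the difference visible. This sketch uses Python's shlex to split a query the way a shell would, so a quoted phrase stays attached to its operator. It is only a model of the behavior described on this slide, the name parse_query is made up, and it will choke on queries containing a stray apostrophe.

```python
import shlex

def parse_query(query):
    """Split a query into intitle: terms and ordinary terms."""
    title_terms, other_terms = [], []
    for token in shlex.split(query):
        if token.startswith("intitle:"):
            title_terms.append(token[len("intitle:"):])
        else:
            other_terms.append(token)
    return title_terms, other_terms

print(parse_query('intitle:walt disney'))
# (['walt'], ['disney'])
print(parse_query('intitle:"walt disney"'))
# (['walt disney'], [])
```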

37 How Google Works - intitle If your search keywords are in a Title, this means the section and document are more likely to be “good”, i.e. on the right topic. So this is another factor in the Google score of a page: the score is boosted if the keywords are in a Title.

38 site: site: restricts the results to those websites in a domain. There can be no space between site: and the domain. Source: http://www.google.com/help/operators.html

39 site:domain pirates site:disney.com pirates site:uk Pirates site:cn pirates site:www.comp.leeds.ac.uk ramadan site:www.comp.leeds.ac.uk/nora/html
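If you build queries in a script, the no-space rule and the optional dot are easy to get right once and then forget about. Here is a small helper, offered as a sketch only: build_query is a hypothetical name, and it simply glues together the operator syntax shown on the last few slides.

```python
def build_query(keywords="", site=None, filetype=None, intitle=None):
    """Assemble a query string using filetype:, intitle:, and site:."""
    parts = [keywords] if keywords else []
    if filetype:
        # The leading dot is optional, so strip it for consistency.
        parts.append("filetype:" + filetype.lstrip("."))
    if intitle:
        # Phrases must be quoted or only the first word binds to intitle:.
        value = f'"{intitle}"' if " " in intitle else intitle
        parts.append("intitle:" + value)
    if site:
        parts.append("site:" + site)
    return " ".join(parts)

print(build_query("pirates", site="disney.com"))
# pirates site:disney.com
print(build_query("pirates", filetype=".pdf"))
# pirates filetype:pdf
print(build_query(intitle="knowledge management"))
# intitle:"knowledge management"
```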

40 Some more operators Query modifiers daterange: filetype: inanchor: intext: intitle: inurl: site: Alternative query types cache: link: related: info: Other information needs phonebook: stocks:

41 inanchor: inanchor: restricts the results to documents containing a particular word in an anchor. There can be no space between inanchor: and the following word. You can also search for phrases. Just put your phrase in quotes. Source: http://www.google.com/help/operators.html

42 Anchor? Pirates of the Caribbean...

43 inanchor:terms inanchor:pirates inanchor:"knowledge management"

44 How Google Works - inanchor If your search keywords are in an Anchor, this means the section and document are more likely to be “good”, i.e. on the right topic. So this is another factor in the Google score of a page: the score is boosted if the keywords are in an Anchor.

45 How Google Works – the page inanchor points to... If your search keywords are in an Anchor, this means the OTHER document it POINTS TO is more likely to be “good”, i.e. on the right topic. So this is another factor in the Google score of a page: the score of the WEB-PAGE POINTED TO is boosted if the keywords are in an Anchor.

46 More Google Services http://www.google.com/ And see “more”… e.g. http://code.google.com/ Open-source code, API (e.g. Leeds FYPs!) http://labs.google.com/ Latest research “graduates” include: http://scholar.google.com/ Find research papers (e.g. background survey) http://books.google.com/ Find quotes in books http://maps.google.com/ Maps, routefinder…

47 http://www.google.com/help/ Google Help Central Free guides and FAQs that tell you about Web searching in general and Google’s features in specific.

48 How Google Spends Its Time and Resources 70% Core: Search and Ads –Examples: Crawling, Ranking, AdWords, Toolbar, AdSense 20% Related: Extensions of Core Search –Examples: News, Froogle, GSA, Desktop, Local, Gmail and other communication projects 10% Exploratory –Examples: Picasa, Keyhole, Orkut Source: Google Factory Tour

49 Part 4: more about search

50 “I’m feeling Lucky” The “I’m Feeling Lucky” button takes you directly to the first web page Google returns for your query. You won’t see any other search results. Source: http://www.google.com/help/features.html#lucky

51 Google bombing “Google bombing” is an attempt to influence a certain page’s Google ranking. If enough people create web pages that use the same anchor text to point to the same web page [for example, if several hundred web pages linked the phrase “cow poly” to www.auburn.edu], you can force that page to become Google’s first hit. And “I’m Feeling Lucky” automatically takes you to that first hit.
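Here is a toy model of why this works, assuming a search engine that simply counts how many pages use a given anchor text to point at each target. Real ranking is far more involved, and the URLs, counts, and function names below are invented for illustration.

```python
from collections import Counter, defaultdict

def anchor_votes(links):
    """Tally targets per anchor text.

    links is an iterable of (source_url, anchor_text, target_url).
    """
    votes = defaultdict(Counter)
    for _source, anchor, target in links:
        votes[anchor.lower()][target] += 1
    return votes

def top_target(votes, phrase):
    """The page most often pointed to with this anchor text, if any."""
    counts = votes.get(phrase.lower())
    return counts.most_common(1)[0][0] if counts else None

# Several hundred made-up pages all link "cow poly" to the same site.
links = [(f"page{i}.example", "cow poly", "www.auburn.edu")
         for i in range(300)]
links.append(("lonely.example", "Cow Poly", "www.example.org"))

votes = anchor_votes(links)
print(top_target(votes, "cow poly"))  # www.auburn.edu
```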

52 Examples of Google bombs Three examples: –Failure –Great President –French Military Victories Is this Google’s fault? NO! –Google bombs AREN’T editorial statements by Google. –People are just “gaming” PageRank. –“Fixing” this would be a slippery slope.

53 Advanced Search options To the right of the search box are three links practically no one has ever noticed: –Advanced Search –Preferences –Language Tools

54 Advanced Search: Other Options You [should] already know how to do file format, occurrences, and domain searches from the Google homepage. Most of the rest are self-explanatory.

55 Usage Rights – No Filter Aren’t filtered by license = “show me everything.” This is a default Google search.

56 Usage Rights – Reuse Filter Allow some form of reuse = “show me stuff I can reuse with restrictions.” –You must attribute the work. –You cannot use the work for commercial purposes. See http://tinyurl.com/b245b for more information.

57 Usage Rights – Freely Modify Can be freely modified, adapted, or built upon = “stuff I can reuse with attribution.” See http://tinyurl.com/dtuu3 for more information.

58 Advanced Search v. Preferences Advanced Search = “search once using these settings.” Preferences = –“Change the way Google works for me from here on out.” –Changes every Google service you use, not just search.

59 Google Preferences When you change your Google preferences, Google writes a cookie to your hard drive. Your Google preferences are “permanent” until you: –Change your preferences. –Toss your cookies. In Internet Explorer: Tools > Options > Delete Cookies. In Mozilla/Firefox: Tools > Options > Privacy > Clear Cookies. –Go to http://www.google.com. The extra period at the end forces you to go to the English language version of Google. Source: Google Hacks 2nd Ed., p. 21

60 Interface Language Interface Language lets you change the default language used to display the interface of every Google page you visit. Change the Interface Language to Chinese (Traditional), save your preferences, and watch what happens…

63 Interface Language Limitations Notice the hits are still in English. –Google doesn’t translate the hits to your default language. Yet. The only thing that’s changed is the default language of Google’s interface.

64 Using Interface Language This is great for foreign language immersion. This is also a WONDERFUL practical joke to play on a friend or colleague. –“Hey, why is Google in LATIN!?” Remember, your Google preferences are “permanent” until you: –Change your preferences. –Toss your cookies. –Go to http://www.google.com.

65 Preferences: Search Language Sharoff (2006) used Google API to collect 100M word corpora of English, Chinese, French, German, Italian, Japanese, Spanish, Polish and Russian; see http://corpus.leeds.ac.uk/internet.html

66 Preferences: SafeSearch Filtering Google's SafeSearch Filtering screens for sites that contain explicit sexual content and deletes them from your search results. By default, it only filters explicit images. To filter both images and text, choose “Use strict filtering.” Source: http://www.google.com/help/customize.html#safe

67 More Preferences Number of Results and Results Window are self-explanatory. Remember, your Google preferences are “permanent” until you: –Change your preferences. –Toss your cookies.

68 Wait. There’s More!

69 Language Tools Like Advanced Search, Language Tools is a one-shot deal. Use Language Tools if –You don’t want to permanently change your Interface or Search languages. –You want to translate text.

70 Google Translate Using Language Tools, you can –Translate keyed in text from one language to another. –Translate a web page’s text from one language to another. Be looking for even more robust translation tools from Google in the not-too-distant future.

71 Part 4: In Summary The “I’m Feeling Lucky” button takes you directly to the first web page Google returns for your query. Advanced Search and Language Tools are one-shot deals. Preferences are permanent and global until you change them or delete your cookies.

72 Part 5: other searches When Google’s spiders report back, they send Google a complete copy of everything they find – HTML, text, images, etc. Google’s web search gets all of the attention – it consumes 70% of Google’s time and energy. But why not make the other stuff the spiders find searchable as well?

73 Google Image Search

74 Behind the Scenes “Hey, let’s take all these cached images and make them searchable.” Search on the TEXT that comes with images (ALT tags, …) – NOT “image recognition” Two ways to get to Google Image Search –images.google.com –Go to Google [or Google Groups, Google News, Froogle, or Google Local/Google Maps] and click on the “Images” link.

75 Using Google Image Search Search engine math works here as well. Search on nearby text – a “mini document” Check out Advanced Image Search [to the right of the search box] for special size, filetype, and coloration options. Beware of copyright! –Google cannot grant you any rights to use the images you find for any purpose other than viewing them on the web. –To reuse the images, contact the site owner and obtain the requisite permissions.

76 Google News

77 Behind the Scenes “Hey, let’s take news articles from 4,500 online news sites and make them searchable.” Two ways to get to Google News –news.google.com –Go to Google [or Google Groups, Google News, Froogle, or Google Local/Google Maps] and click on the “News” link.

78 How Google News Works Every 15 minutes, Google gathers stories from more than 4,500 English-language news sites. A computer program automatically arranges the stories by relevance and popularity. –Sound familiar? [*cough* PageRank *cough*] –There are no editors or human intervention. –Google’s algorithms run everything. And if you don’t want to browse the news, you can also search the news by keyword[s]. Source: http://news.google.com/intl/en_us/about_google_news.html

79 Froogle

80 Behind the Scenes “Hey, let’s take all these cached web pages on which merchants are trying to sell stuff and make those pages searchable.” Three ways to get to Froogle –froogle.com –http://www.google.com/products –Go to Google [or Google Groups, Google News, Froogle, or Google Local/Google Maps] and click on the “Froogle” link.

81 How Froogle Works Adapted from: http://froogle.google.com/froogle/tour/index.html?promo=help

82 The Hidden 20% [and 10% More]

85 googlesightseeing.com

86 Peter Norvig, Google Research Youtube video: Peter Norvig, Director of Research, Google Google Developers Day: Theorizing from Data Covers Google tools beyond keyword-search Statistical language models behind the tools Video is 37 minutes long, plus questions from the audience (skip these?) http://www.youtube.com/watch?v=nU8DcBF-qo4

87 Key topics to watch out for Probabilistic, statistical (Markov or N-gram) models More Data is better than good algorithms (Banko & Brill) Google N-gram Corpus: WWW freqs of words, word-pairs, 3-grams, 4-grams, 5-grams … New Google tools: Google Sets, Google Trends, … Question-Answering: how to extract facts from WWW Statistical Machine Translation, ML from aligned texts How many bits are needed to store probabilities? (4!!) To save space, truncate words – how many letters? (4!!) Google AI tools don’t try to replace human intelligence – they AUGMENT intelligence

88 Theorizing from Data http://www.youtube.com/watch?v=nU8DcBF-qo4

89 Our Goals Learn how Google really works. Discover some Google secrets no one ever tells you. Play around with some of Google’s advanced search operators. Find out where to get more Google-related help and information. DO ALL OF THIS IN ENGLISH!

