Published by Francis Lucas. Modified over 7 years ago.
1
CSCE 590 Web Scraping – Information Extraction II
Topics: Information Retrieval framework revisited
Readings: Scrapy User manual
March 16, 2017
2
Figure 23.1 Google as an IR engine
3
Figure 23.2 Architecture of an IR system
4
Slide from Speech and Language Processing -- Jurafsky and Martin
Google PageRank
5
Google PageRank continued
“PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages "important".”
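The voting idea in the quotation can be sketched as a power iteration over a link graph. This is a minimal sketch of the commonly published simplified PageRank formulation, not Google's production system; the damping factor of 0.85, the iteration count, the handling of pages without outgoing links, and the three-page graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    links maps each page to the list of pages it links to.
    damping and iterations are illustrative assumptions.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no outgoing links votes for everyone.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical graph: A and C both vote for B; B votes for A.
toy_links = {"A": ["B"], "B": ["A"], "C": ["B"]}
ranks = pagerank(toy_links)
```

In this toy graph B collects two votes and ends up ranked highest, and A outranks C because its single vote comes from the "important" page B, while C receives no votes at all.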
6
False or spoofed PageRank
This spoofing technique, also known as 302 Google Jacking, was a known failing or bug in the system. Any page's PageRank could have been spoofed to a higher or lower number of the webmaster's choice, and only Google has access to the real PageRank of the page. Spoofing is generally detected by running a Google search for a URL with questionable PageRank: the results will display the URL of an entirely different site (the one redirected to).
7
Wolfram Alpha
n-grams it was the best of times it was the worst of times
Who was the president in 1888?
What is pi?
graphs on 11 vertices
*Who was Grover Cleveland's vice-president?
8
Bing http://www.bing.com
10
Vector Space Model
Represents each document by the terms that occur within the collection, one vector component per term
Term weights
In a document, if
keyword1 occurs 4 times
keyword2 occurs 7 times
keyword3 occurs 0 times
…
keywordn occurs 3 times
then the vector representing the document is (4, 7, 0, …, 3)
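The keyword-count example on this slide can be sketched in a few lines. This is only an illustration with raw counts as term weights; the function name `term_vector`, the whitespace tokenization, and the three-term vocabulary are assumptions, not part of the slides.

```python
from collections import Counter


def term_vector(text, vocabulary):
    """Return raw term counts for text, in the order given by vocabulary."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur in the text.
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document mirroring the slide's counts.
vocabulary = ["keyword1", "keyword2", "keyword3"]
document = "keyword1 " * 4 + "keyword2 " * 7
vector = term_vector(document, vocabulary)  # [4, 7, 0]
```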
11
Sim(query, document)
Dot-product
Vector length
Term-by-document matrix
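The three ingredients named on this slide fit together as follows, assuming Sim is the standard cosine measure, sim(q, d) = (q · d) / (|q| |d|). The formula itself did not survive in this transcript, so treat that reading as an assumption; the matrix values and the query are invented for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def length(v):
    """Euclidean vector length."""
    return math.sqrt(dot(v, v))


def cosine_similarity(query, document):
    """Dot product normalized by both vector lengths; 0.0 for a zero vector."""
    denom = length(query) * length(document)
    return dot(query, document) / denom if denom else 0.0


# Hypothetical term-by-document matrix: one row per term, one column per document.
term_by_document = [
    [4, 0],
    [7, 1],
    [0, 5],
]
documents = [list(column) for column in zip(*term_by_document)]
query = [1, 1, 0]
scores = [cosine_similarity(query, d) for d in documents]
```

Normalizing by vector length keeps a long document from outscoring a short one merely because it repeats every term more often.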
12
Figure 23.3 Visualization of the vector model
13
Inverse Document Frequency
Term Weighting
Document importance
Inverse Document Frequency (Sparck Jones 1972)
“Assign higher weights to more discriminative words”
$\mathrm{idf}_i = \log(N / n_i)$
where N = total number of documents and n_i = number of documents that contain term i
Tf-idf weighting (term frequency × idf)
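A small sketch of tf-idf weighting under the definitions on this slide, with N as the number of documents and n_i as the number of documents containing term i. The natural logarithm, raw counts as term frequency, and the three toy documents are assumptions chosen for illustration.

```python
import math
from collections import Counter


def idf(term, documents):
    """log(N / n_i); returns 0.0 for a term that appears in no document."""
    n_i = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / n_i) if n_i else 0.0


def tf_idf(doc, documents):
    """Map each term in doc to its term frequency times its idf."""
    counts = Counter(doc)
    return {term: tf * idf(term, documents) for term, tf in counts.items()}


# Hypothetical tokenized collection.
docs = [
    "the web is vast".split(),
    "the crawler reads the web".split(),
    "the index ranks pages".split(),
]
weights = tf_idf(docs[1], docs)
```

Here "the" occurs in every document, so its idf is log(3/3) = 0 and it contributes nothing, while "crawler" occurs in only one document and receives the highest weight.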
14
Tf-idf weighted cosine
Also used in summarization (page 794)
Other topics:
Stemming
Stop-list
15
Figure 23.4 Rank-Specific P and R for list of docs
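Rank-specific precision and recall can be computed by walking down the ranked list and counting relevant documents seen so far. The figure itself is not reproduced in this transcript, so the relevance judgments below are hypothetical, as is the function name.

```python
def rank_specific_pr(ranked_relevance, total_relevant):
    """Return (rank, precision, recall) after each position in a ranked list.

    ranked_relevance: booleans, True where the document at that rank is relevant.
    total_relevant: number of relevant documents in the whole collection.
    """
    rows = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        hits += int(is_relevant)
        # Precision: relevant so far / returned so far.
        # Recall: relevant so far / all relevant documents.
        rows.append((rank, hits / rank, hits / total_relevant))
    return rows


# Hypothetical judgments for a five-document result list.
judgments = [True, False, True, True, False]
table = rank_specific_pr(judgments, total_relevant=4)
```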
16
Figure 23.5 Interpolated Precision
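Interpolated precision at a recall level is usually defined as the maximum precision observed at that recall level or any higher one. The sketch below applies that definition to hypothetical (precision, recall) pairs and evaluates it at eleven evenly spaced recall levels; the data and the choice of eleven levels are assumptions for illustration.

```python
def interpolated_precision(pr_points, recall_level):
    """Maximum precision at any point whose recall is >= recall_level."""
    candidates = [p for p, r in pr_points if r >= recall_level]
    return max(candidates) if candidates else 0.0


# Hypothetical (precision, recall) pairs measured down a ranked list.
points = [(1.0, 0.25), (0.5, 0.25), (2 / 3, 0.5), (0.75, 0.75), (0.6, 0.75)]

# Recall levels 0.0, 0.1, ..., 1.0.
eleven_point = [interpolated_precision(points, level / 10) for level in range(11)]
```

Taking the maximum smooths out the sawtooth shape of the raw precision-recall curve, which makes curves from different queries easier to average.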
17
Figure 23.6
18
Ways to improve User Queries
Relevance feedback
Query expansion
Thesaurus
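Thesaurus-based query expansion can be sketched as adding related terms to the user's query before retrieval. The thesaurus below is a hypothetical hand-written dictionary; a real system might derive one from a resource such as WordNet or from corpus statistics.

```python
def expand_query(query_terms, thesaurus):
    """Append thesaurus entries for each query term, without duplicates."""
    expanded = list(query_terms)
    for term in query_terms:
        for related in thesaurus.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded


# Hypothetical thesaurus for illustration only.
toy_thesaurus = {"car": ["automobile", "vehicle"], "cheap": ["inexpensive"]}
expanded = expand_query(["cheap", "car"], toy_thesaurus)
```

The expanded query keeps the original terms first, so a ranking function could still weight them more heavily than the added ones.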
19
Figure 23.7 Factoid Question Answering
Columns: Wolfram-Alpha, Bing, Google
Where is the Louvre? 10 10+map
What is the abbreviation for limited partnership? - 9
What are the names of Odin’s ravens?
What currency is used in China?
What kind of nuts are used in marzipan?
What is the official language of Algeria?
What is the telephone number of the University of South Carolina? 7 SCState
How many pounds are in a stone?
20
Figure 23.8 Architecture
21
Figure 23.9 Question Typology
22
Figure 23.9 continued
23
Figure 23.10
24
Figure 23.11
25
Figure 23.12
26
Figure 23.13 Example Summarization
27
Figure 23.14
28
Figure 23.15
29
Figure 23.16
30
Figure 23.17 Features used in Supervised Classifiers
31
Figure 23.18
32
Figure 23.19 Summarization Tagging
Subject (S), Object (O), Oblique (X)
33
Figure 23.20 Rewriting References
34
Figure Examples
35
Figure 23.22
36
Google translate
What is the number of graphs on 10 vertices?
Translation: English » Arabic
ما هو عدد من الرسوم البيانية في 10 القمم؟
Translation: Arabic » English
What is the number of charts in the 10 top?