Web Programming Week 14 Old Dominion University

1 Web Programming Week 14 Old Dominion University
Department of Computer Science CS 418/518 Fall 2008 Michael L. Nelson 11/24/08

2 Relational Data Model is a Special Case…
SELECT name, catches, yards, touchdowns
FROM VT_Boxscores, VT_Roster
WHERE game_id = "12" AND number = "4" AND year = "2006";

3 Unstructured Data is More Common…

4 Precision and Recall
Precision: “ratio of the number of relevant documents retrieved over the total number of documents retrieved” (p. 10)
  how much extra stuff did you get?
Recall: “ratio of relevant documents retrieved for a given query over the number of relevant documents for that query in the database” (p. 10)
  note: assumes a priori knowledge of the denominator!
  how much did you miss?
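The two ratios above can be written out directly. This is a minimal sketch; the function name `precision_recall` and the document ids are made up for illustration, and the set of relevant documents is assumed to be known in advance (the "a priori knowledge" the slide warns about).

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: set of document ids the search returned
    relevant:  set of document ids that are truly relevant (assumed known)
    """
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
retrieved = {1, 2, 3, 4}
relevant = {2, 3, 4, 5, 6, 7}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.75 0.5
```

Here one of the four results is "extra stuff" (precision 0.75) and three relevant documents were missed (recall 0.5).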

5 Precision and Recall
[Figure: precision versus recall, each axis running to 1; figure 1.2 in FBY]

6 Why Isn’t Recall Always 100%?
Virginia Agricultural and Mechanical College?
Virginia Agricultural and Mechanical College and Polytechnic Institute?
Virginia Polytechnic Institute?
Virginia Polytechnic Institute and State University?
Virginia Tech?
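A small sketch of why the name variants hurt recall. The four document texts are invented for illustration; the only assumption is that all four are about the same institution, so all four are relevant.

```python
# Hypothetical mini-collection: each document uses a different name
# for the same school, so every document is relevant to the query.
documents = {
    1: "a history of Virginia Agricultural and Mechanical College",
    2: "a history of Virginia Polytechnic Institute",
    3: "a history of Virginia Polytechnic Institute and State University",
    4: "a history of Virginia Tech",
}
relevant = set(documents)


def exact_phrase_search(phrase):
    """Return ids of documents containing the phrase (case-insensitive)."""
    phrase = phrase.lower()
    return {doc_id for doc_id, text in documents.items() if phrase in text.lower()}


found = exact_phrase_search("Virginia Tech")
recall = len(found & relevant) / len(relevant)
print(found, recall)  # {4} 0.25
```

The search is perfectly precise but finds only one of the four relevant documents, because the other three never use the queried name.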

7 LIKE & REGEXP
We can search rows with the “LIKE” (or “REGEXP”) operator
on tables of any real size, this will be s-l-o-w
there is a better way…
mysql> SELECT id, name FROM VT_Roster WHERE name LIKE 'Se%'
    -> AND year='2006';
| id | name          |
|  7 | Sean Glennon  |
| 70 | Sergio Render |
2 rows in set (0.00 sec)
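The "better way" is to build an index from words to rows once, instead of examining every row on every query. The sketch below contrasts the two approaches in plain Python. It is an illustration of the idea only, not how MySQL implements either operator; the row texts reuse the recap snippets from the INSERT slide, and the function names are made up.

```python
import re
from collections import defaultdict

# Row id -> title plus body snippet (taken from the recaps example).
rows = {
    1: "Hokies Blank UVa #17 Hokies ended the season",
    2: "Hokies Put Wake in Their Place Sean Glennon threw for",
    3: "Hokies Blank Kent State Virginia Tech overcame a sloppy",
}


def scan(term):
    """Pattern-style search: every row is examined on every query."""
    term = term.lower()
    return sorted(row_id for row_id, text in rows.items() if term in text.lower())


def build_index(rows):
    """Build an inverted index once: word -> set of row ids containing it."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(row_id)
    return index


index = build_index(rows)


def lookup(term):
    """Index-style search: one dictionary lookup, no row scan."""
    return sorted(index.get(term.lower(), set()))


print(scan("sloppy"), lookup("sloppy"))  # [3] [3]
```

Both return the same rows here, but `scan` does work proportional to the size of the table for each query, while `lookup` pays that cost once when the index is built.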

8 CREATE Table
mysql> CREATE TABLE recaps (
    -> id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
    -> title VARCHAR(200),
    -> body TEXT,
    -> FULLTEXT (title,body)
    -> );
Query OK, 0 rows affected (0.00 sec)
can only create FULLTEXT on CHAR, VARCHAR or TEXT columns
“title” and “body” still available as regular columns
if you want to search only on “title”, you need to create a separate index

9 INSERT
mysql> INSERT INTO recaps (title,body) VALUES
    -> ('Hokies Blank UVa', '#17 Hokies ended the season ...'),
    -> ('Hokies Put Wake in Their Place', 'Sean Glennon threw for ...'),
    -> ('Hokies Blank Kent State', 'Virginia Tech overcame a sloppy ...');
Query OK, 3 rows affected (0.00 sec)
Records: 3 Duplicates: 0 Warnings: 0

10 MATCH .. AGAINST
mysql> SELECT * FROM recaps
    -> WHERE MATCH (title,body) AGAINST ('sloppy');
| id | title                   | body                            |
|  3 | Hokies Blank Kent State | Virginia Tech overcame a sloppy |
1 row in set (0.00 sec)
mysql> SELECT * FROM recaps
    -> WHERE MATCH (title,body) AGAINST ('Hokies');
| id | title | body |
0 rows in set (0.00 sec)
why?!

11 Ranking
If you are not in Boolean mode and the word appears in > 50% of the rows, then the word is treated as a “stop word” and is not matched.
This makes sense for large collections (the word is not a good discriminator of records), but can lead to unexpected results for small collections.
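The 50% rule explains the empty result for 'Hokies'. The sketch below imitates just that rule in plain Python; it is not MySQL's implementation, and the function name and the `threshold` parameter are made up for illustration. The rows are the three recap snippets from the INSERT slide.

```python
import re

recaps = {
    1: "Hokies Blank UVa #17 Hokies ended the season",
    2: "Hokies Put Wake in Their Place Sean Glennon threw for",
    3: "Hokies Blank Kent State Virginia Tech overcame a sloppy",
}


def words(text):
    """Lowercased set of words in a row."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def natural_language_match(term, rows, threshold=0.5):
    """Return matching row ids, or nothing if the term is too common."""
    term = term.lower()
    matching = [row_id for row_id, text in rows.items() if term in words(text)]
    if len(matching) / len(rows) > threshold:
        return []  # term is in more than half the rows: treated as a stop word
    return matching


print(natural_language_match("sloppy", recaps))  # [3]
print(natural_language_match("Hokies", recaps))  # []
```

With only three rows, any word that appears in two of them is already over the threshold, which is why small test collections behave so surprisingly.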

12 Stopwords
Stopwords exist in stoplists or negative dictionaries
Idea: remove words with low semantic content
  index should only have “important stuff”
What not to index is domain dependent, but often includes:
  “small” words: a, and, the, but, of, an, very, etc.
NASA ADS example:
MySQL full-text index:

13 Stopwords
Punctuation, numbers often stripped or treated as stopwords
  precision suffers on searches for: NASA TM-3389, F-15, X.500, .NET, Tree::Suffix
MySQL also treats words < 4 characters as stopwords
  too bad for: “Liu”, “CFD”, “Ada”, etc.
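To see what survives such filtering, here is a sketch of a generic indexer that drops punctuation, digits, stopwords, and short words. It is a deliberately crude tokenizer written for illustration, not MySQL's parser; the stoplist is the short one from the previous slide and `MIN_WORD_LEN` mirrors the "< 4 characters" rule.

```python
import re

STOPWORDS = {"a", "and", "the", "but", "of", "an", "very"}
MIN_WORD_LEN = 4  # words shorter than this are not indexed


def index_terms(text):
    """Return the terms that would end up in the index for this text."""
    tokens = re.findall(r"[A-Za-z]+", text)  # punctuation and digits are dropped
    return [
        token.lower()
        for token in tokens
        if token.lower() not in STOPWORDS and len(token) >= MIN_WORD_LEN
    ]


print(index_terms("The F-15 and NASA TM-3389"))    # ['nasa']
print(index_terms("Liu wrote a CFD code in Ada"))  # ['wrote', 'code']
print(index_terms("Tree::Suffix for .NET"))        # ['tree', 'suffix']
```

Nothing in the index distinguishes "F-15" or "TM-3389" any more, so a search for either cannot be answered precisely, and "Liu", "CFD", and "Ada" are not searchable at all.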

14 Getting the Rank
mysql> SELECT id, MATCH (title,body) AGAINST ('Sewell')
    -> FROM recaps;
| id | MATCH (title,body) AGAINST ('Sewell') |
|  1 |                                       |
|  2 |                                       |
|  3 |                                       |
3 rows in set (0.00 sec)
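Selecting the MATCH expression returns a relevance score per row, with 0 for rows that do not match. The sketch below uses a textbook tf-idf weight to show the general shape of such a score. This is an assumption for illustration: it is not MySQL's actual ranking formula, and the numbers it produces will not equal MySQL's. The rows are the recap snippets from the INSERT slide.

```python
import math
import re

recaps = {
    1: "Hokies Blank UVa #17 Hokies ended the season",
    2: "Hokies Put Wake in Their Place Sean Glennon threw for",
    3: "Hokies Blank Kent State Virginia Tech overcame a sloppy",
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def scores(term, rows):
    """Return {row id: tf-idf score of term}, 0.0 where the term is absent."""
    term = term.lower()
    doc_freq = sum(1 for text in rows.values() if term in tokenize(text))
    # A term found in every row gets idf = log(1) = 0: it cannot discriminate.
    idf = math.log(len(rows) / doc_freq) if doc_freq else 0.0
    result = {}
    for row_id, text in rows.items():
        tokens = tokenize(text)
        tf = tokens.count(term) / len(tokens) if tokens else 0.0
        result[row_id] = tf * idf
    return result


ranks = scores("sloppy", recaps)
for row_id, score in ranks.items():
    print(row_id, round(score, 3))
```

Only row 3 gets a positive score for "sloppy". A word that appears in every row, such as "hokies", scores 0 everywhere, which is the same intuition as the 50% rule.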

15 Boolean Mode
mysql> SELECT * FROM recaps
    -> WHERE MATCH (title,body) AGAINST ('+Hokies' IN BOOLEAN MODE);
| id | title                   | body                            |
|  1 | Hokies Blank UVa        | #17 Hokies ended the season     |
|  2 | Hokies Put Wake in ...  | Sean Glennon threw for          |
|  3 | Hokies Blank Kent State | Virginia Tech overcame a sloppy |
3 rows in set (0.00 sec)
Does not use the 50% threshold
Does still apply stopwords and the minimum word length
Operator list:
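A sketch of how a boolean query could be evaluated, covering only three cases: a leading `+` (word must be present), a leading `-` (word must be absent), and a bare word (optional). It is a simplified illustration, not MySQL's evaluator; it ignores the other operators, stopwords, and the minimum word length, and the function name is made up.

```python
import re

recaps = {
    1: "Hokies Blank UVa #17 Hokies ended the season",
    2: "Hokies Put Wake in Their Place Sean Glennon threw for",
    3: "Hokies Blank Kent State Virginia Tech overcame a sloppy",
}


def words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def boolean_match(query, rows):
    """Return ids of rows satisfying the +required / -excluded / optional terms."""
    required, excluded, optional = set(), set(), set()
    for token in query.lower().split():
        if token.startswith("+"):
            required.add(token[1:])
        elif token.startswith("-"):
            excluded.add(token[1:])
        else:
            optional.add(token)

    matches = []
    for row_id, text in rows.items():
        row_words = words(text)
        if not required <= row_words:
            continue  # a required word is missing
        if excluded & row_words:
            continue  # an excluded word is present
        if not required and not (optional & row_words):
            continue  # nothing required: at least one optional word must hit
        matches.append(row_id)
    return matches


print(boolean_match("+hokies", recaps))          # [1, 2, 3]
print(boolean_match("+hokies -sloppy", recaps))  # [1, 2]
print(boolean_match("sloppy glennon", recaps))   # [2, 3]
```

Note that "+hokies" returns every row: there is no frequency threshold in this mode, only set membership tests.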

16 Blind Query Expansion (AKA Automatic Relevance Feedback)
How does one keep up with Virginia Tech’s multiple names / nicknames?
  Hokies, Fighting Gobblers, VPI, VPI&SU, Va Tech, VT
Idea: run the query with the requested terms, then take the results and re-run the query with the most relevant terms from the initial results
mysql> SELECT * FROM recaps
    -> WHERE MATCH (title,body) AGAINST ('Virginia Tech');
| id | title                   | body                            |
|  3 | Hokies Blank Kent State | Virginia Tech overcame a sloppy |
1 row in set (0.00 sec)
in this example, pretend “Virginia Tech” did not appear in game recaps 1 & 2 and that “Hokies” appears in > 50% of rows
mysql> SELECT * FROM recaps
    -> WHERE MATCH (title,body) AGAINST ('Virginia Tech' WITH QUERY EXPANSION);
| id | title                   | body                            |
|  1 | Hokies Blank UVa        | #17 Hokies ended the season     |
|  2 | Hokies Put Wake in ...  | Sean Glennon threw for          |
|  3 | Hokies Blank Kent State | Virginia Tech overcame a sloppy |
3 rows in set (0.00 sec)
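The two-pass idea can be sketched in a few lines. This is an illustration under simplifying assumptions, not MySQL's algorithm: "most relevant terms" is approximated by the most frequent words in the first-pass results, the 50% threshold and stopwords are ignored, and the names `search_with_expansion` and `extra_terms` are made up. The rows are the recap snippets from the INSERT slide.

```python
import re
from collections import Counter

recaps = {
    1: "Hokies Blank UVa #17 Hokies ended the season",
    2: "Hokies Put Wake in Their Place Sean Glennon threw for",
    3: "Hokies Blank Kent State Virginia Tech overcame a sloppy",
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def search(terms, rows):
    """Return ids of rows containing at least one of the terms."""
    wanted = set(terms)
    return [row_id for row_id, text in rows.items() if wanted & set(tokenize(text))]


def search_with_expansion(query, rows, extra_terms=5):
    """Pass 1: run the query. Pass 2: re-run it with terms from pass-1 results."""
    terms = tokenize(query)
    first_pass = search(terms, rows)

    # Count the words that co-occur with the query in the first-pass rows.
    counts = Counter()
    for row_id in first_pass:
        counts.update(t for t in tokenize(rows[row_id]) if t not in terms)

    expanded = terms + [term for term, _ in counts.most_common(extra_terms)]
    return search(expanded, rows)


print(search(tokenize("Virginia Tech"), recaps))        # [3]
print(search_with_expansion("Virginia Tech", recaps))   # [1, 2, 3]
```

The first pass finds only recap 3; "hokies" is picked up from that row and pulls in recaps 1 and 2 on the second pass. The cost is precision: expansion terms such as "blank" or "state" can just as easily drag in unrelated rows.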

17 For More Information…
MySQL documentation:
Chapter 12/13 “Building a Content Management System”
CS 751/851 “Introduction to Digital Libraries”
  esp. “Information Retrieval Concepts” lecture
Is MySQL the right tool for your job?
MySQL examples in this lecture based on those found at dev.mysql.com
content snippets taken from

