Web Programming Week 14 Old Dominion University

Web Programming Week 14 Old Dominion University Department of Computer Science CS 418/518 Fall 2007 Michael L. Nelson <mln@cs.odu.edu> 11/26/07

Relational Data Model is a Special Case…
    SELECT name, catches, yards, touchdowns
    FROM VT_Boxscores, VT_Roster
    WHERE game_id = '12' AND number = '4' AND year = '2006';

Unstructured Data is More Common…

Precision and Recall
Precision: “ratio of the number of relevant documents retrieved over the total number of documents retrieved” (p. 10)
    how much extra stuff did you get?
Recall: “ratio of relevant documents retrieved for a given query over the number of relevant documents for that query in the database” (p. 10)
    note: assumes a priori knowledge of the denominator!
    how much did you miss?
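The two ratios can be made concrete with a small calculation. This is a minimal sketch, not part of the original slides: the function name and the example id lists are made up for illustration, and it assumes the full set of relevant documents is known in advance, which is exactly the a priori knowledge the recall definition requires.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids the system returned
    relevant:  ids that are actually relevant (assumed known in advance)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 rows returned, 2 of them relevant; 5 relevant rows exist.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(p, r)  # 0.5 0.4
```

Here half of what came back was "extra stuff" (precision 0.5), and three of the five relevant rows were missed (recall 0.4).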

Precision and Recall
[Figure: precision versus recall, each axis running to 1; figure 1.2 in FBY]

LIKE & REGEXP
We can search rows with the “LIKE” (or “REGEXP”) operator
    http://dev.mysql.com/doc/refman/5.0/en/pattern-matching.html
For tables of any significant size, this will be s-l-o-w
There is a better way…
    mysql> SELECT id, name FROM VT_Roster WHERE name LIKE 'Se%'
        -> AND year='2006';
    +----+---------------+
    | id | name          |
    +----+---------------+
    |  7 | Sean Glennon  |
    | 70 | Sergio Render |
    +----+---------------+
    2 rows in set (0.00 sec)
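To see why scanning every row is slow compared with an index, here is a rough sketch contrasting the two approaches. It is an illustration only: the `rows` data, the function names, and the whole-word matching are simplifying assumptions, and it does not reproduce how MySQL stores or searches its indexes.

```python
from collections import defaultdict

# Hypothetical rows standing in for a table with a text column.
rows = {
    1: "Hokies Blank UVa",
    2: "Hokies Put Wake in Their Place",
    3: "Hokies Blank Kent State",
}


def scan(rows, word):
    """Scan-style search: inspect every row on every query."""
    word = word.lower()
    return sorted(rid for rid, text in rows.items() if word in text.lower().split())


def build_index(rows):
    """Build a word -> set of row ids mapping once, up front."""
    index = defaultdict(set)
    for rid, text in rows.items():
        for word in text.lower().split():
            index[word].add(rid)
    return index


def lookup(index, word):
    """Index-style search: one dictionary lookup per query word."""
    return sorted(index.get(word.lower(), set()))


index = build_index(rows)
print(scan(rows, "blank"), lookup(index, "blank"))  # [1, 3] [1, 3]
```

Both calls return the same rows, but `scan` does work proportional to the table size on every query, while `lookup` pays that cost once when the index is built.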

CREATE TABLE
    mysql> CREATE TABLE recaps (
        -> id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
        -> title VARCHAR(200),
        -> body TEXT,
        -> FULLTEXT (title,body)
        -> );
    Query OK, 0 rows affected (0.00 sec)
Can only create FULLTEXT on CHAR, VARCHAR or TEXT columns
“title” and “body” still available as regular columns
If you want to search only on “title”, you need to create a separate index

INSERT
    mysql> INSERT INTO recaps (title,body) VALUES
        -> ('Hokies Blank UVa', '#17 Hokies ended the season ...'),
        -> ('Hokies Put Wake in Their Place', 'Sean Glennon threw for ...'),
        -> ('Hokies Blank Kent State', 'Virginia Tech overcame a sloppy ...');
    Query OK, 3 rows affected (0.00 sec)
    Records: 3 Duplicates: 0 Warnings: 0

MATCH .. AGAINST
    mysql> SELECT * FROM recaps
        -> WHERE MATCH (title,body) AGAINST ('sloppy');
    +----+-------------------------+-------------------------------------+
    | id | title                   | body                                |
    +----+-------------------------+-------------------------------------+
    |  3 | Hokies Blank Kent State | Virginia Tech overcame a sloppy ... |
    +----+-------------------------+-------------------------------------+
    1 row in set (0.00 sec)
    mysql> SELECT * FROM recaps
        -> WHERE MATCH (title,body) AGAINST ('Hokies');
    +----+-------------------------+-------------------------------------+
    | id | title                   | body                                |
    +----+-------------------------+-------------------------------------+
    0 rows in set (0.00 sec)
why?!

Ranking
If you are not in Boolean mode and the word appears in > 50% of the rows, then the word is considered a “stop word” and is not matched
This makes sense for large collections (the word is not a good discriminator of records), but can lead to unexpected results for small collections
Example: “Hokies” would be considered a stopword at techsideline.com (because it appears in every game recap), but would not be a stopword at sports.yahoo.com/ncaaf (because they cover all of college football)
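The 50% rule is easy to simulate. The sketch below is an assumption-laden illustration rather than MySQL's actual implementation: `natural_language_match` is a made-up name, matching is whole-word and case-insensitive, and the rows are the three recaps from the earlier slides with their elided bodies.

```python
def natural_language_match(docs, word):
    """Return ids of docs containing `word`, unless the word is too common.

    Mimics the rule described above: a word found in more than half of
    the rows is treated as a stopword, so the query matches nothing.
    """
    word = word.lower()
    hits = [doc_id for doc_id, text in docs.items() if word in text.lower().split()]
    if len(hits) > len(docs) / 2:
        return []  # not a good discriminator in this collection
    return hits


# Title and body of each recap, joined into one string per row.
recaps = {
    1: "Hokies Blank UVa #17 Hokies ended the season",
    2: "Hokies Put Wake in Their Place Sean Glennon threw for",
    3: "Hokies Blank Kent State Virginia Tech overcame a sloppy",
}

print(natural_language_match(recaps, "sloppy"))  # [3]
print(natural_language_match(recaps, "Hokies"))  # []
```

With only three rows, any word that appears in two of them is already over the threshold, which is why small collections give surprising results.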

Stopwords
Stopwords exist in stoplists or negative dictionaries
Idea: remove low semantic content
    index should only have “important stuff”
What not to index is domain dependent, but often includes:
    “small” words: a, and, the, but, of, an, very, etc.
    NASA ADS example: http://adsabs.harvard.edu/abs_doc/stopwords.html
    MySQL full-text index: http://dev.mysql.com/doc/refman/5.0/en/fulltext-stopwords.html

Stopwords
Punctuation, numbers often stripped or treated as stopwords
    precision suffers on searches for: NASA TM-3389, F-15, X.500, .NET, Tree::Suffix
By default, MySQL also skips words shorter than 4 characters, as if they were stopwords
    too bad for: “Liu”, “CFD”, “Ada”, etc.
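A simplified indexing pipeline shows how these rules combine. Everything here is illustrative: the stoplist is just the handful of “small” words named above, `index_terms` is a hypothetical helper, and the regular expression is a crude stand-in for whatever word-splitting rules a real engine applies.

```python
import re

# Tiny illustrative stoplist; real stoplists are much longer.
STOPWORDS = {"a", "and", "the", "but", "of", "an", "very"}
MIN_WORD_LEN = 4  # assumed minimum length for a word to be indexed


def index_terms(text):
    """Return the terms that survive a simplified indexing pipeline."""
    # Anything that is not a letter or digit splits words apart.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if len(w) >= MIN_WORD_LEN and w not in STOPWORDS]


print(index_terms("Liu and the CFD team ported X.500 code to Ada"))
# ['team', 'ported', 'code']
```

“Liu”, “CFD” and “Ada” vanish because of the length limit, and “X.500” is split on the period into pieces that are too short to keep, so none of them can be found by a later search.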

Getting the Rank
    mysql> SELECT id, MATCH (title,body) AGAINST ('Sewell')
        -> FROM recaps;
    +----+---------------------------------------+
    | id | MATCH (title,body) AGAINST ('Sewell') |
    +----+---------------------------------------+
    |  1 | 0.65545833110809                      |
    |  2 | 0                                     |
    |  3 | 0                                     |
    +----+---------------------------------------+
    3 rows in set (0.00 sec)

Boolean Mode
    mysql> SELECT * FROM recaps
        -> WHERE MATCH (title,body) AGAINST ('+Hokies' IN BOOLEAN MODE);
    +----+-------------------------+-------------------------------------+
    | id | title                   | body                                |
    +----+-------------------------+-------------------------------------+
    |  1 | Hokies Blank UVa        | #17 Hokies ended the season ...     |
    |  2 | Hokies Put Wake in ...  | Sean Glennon threw for ...          |
    |  3 | Hokies Blank Kent State | Virginia Tech overcame a sloppy ... |
    +----+-------------------------+-------------------------------------+
    3 rows in set (0.00 sec)
Does not use the 50% threshold
Does use stopwords, length limitation
Operator list: http://dev.mysql.com/doc/refman/5.0/en/fulltext-boolean.html
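The idea behind the `+` and `-` operators can be sketched in a few lines. This is a toy matcher under stated assumptions, not MySQL's parser: `boolean_match` is a hypothetical function, it understands only `+word`, `-word` and bare words, it matches whole words case-insensitively, and it ignores stopwords, the length limit and ranking.

```python
def boolean_match(text, query):
    """Match `text` against a query of +required, -excluded and optional words."""
    words = set(text.lower().split())
    terms = query.lower().split()
    required = [t[1:] for t in terms if t.startswith("+")]
    excluded = [t[1:] for t in terms if t.startswith("-")]
    optional = [t for t in terms if t[0] not in "+-"]
    if any(t in words for t in excluded):
        return False  # an excluded word is present
    if not all(t in words for t in required):
        return False  # a required word is missing
    # With no required terms, at least one optional term must be present.
    return bool(required) or any(t in words for t in optional)


titles = ["Hokies Blank UVa", "Hokies Put Wake in Their Place", "Hokies Blank Kent State"]
print([t for t in titles if boolean_match(t, "+hokies -uva")])
# ['Hokies Put Wake in Their Place', 'Hokies Blank Kent State']
```

Note that “Hokies” matches here even though it appears in every row, because nothing in this matcher applies a frequency threshold.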

Blind Query Expansion (AKA Automatic Relevance Feedback)
How does one keep up with Virginia Tech’s multiple names / nicknames?
    Hokies, Fighting Gobblers, VPI, VPI&SU, Va Tech, VT
Idea: run the query with the requested terms, then take the results and re-run the query with the most relevant terms from the initial results
    increases recall; decreases precision
    mysql> SELECT * FROM recaps
        -> WHERE MATCH (title,body) AGAINST ('Virginia Tech');
    +----+-------------------------+-------------------------------------+
    | id | title                   | body                                |
    +----+-------------------------+-------------------------------------+
    |  3 | Hokies Blank Kent State | Virginia Tech overcame a sloppy ... |
    +----+-------------------------+-------------------------------------+
    1 row in set (0.00 sec)
In this example, pretend “Virginia Tech” did not appear in the game recaps and that “Hokies” appears in > 50% of rows
    mysql> SELECT * FROM recaps
        -> WHERE MATCH (title,body) AGAINST ('Virginia Tech' WITH QUERY EXPANSION);
    +----+-------------------------+-------------------------------------+
    | id | title                   | body                                |
    +----+-------------------------+-------------------------------------+
    |  1 | Hokies Blank UVa        | #17 Hokies ended the season ...     |
    |  2 | Hokies Put Wake in ...  | Sean Glennon threw for ...          |
    |  3 | Hokies Blank Kent State | Virginia Tech overcame a sloppy ... |
    +----+-------------------------+-------------------------------------+
    3 rows in set (0.00 sec)
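The two-pass idea can be sketched as follows. This is a deliberately naive illustration: `search` and `expanded_search` are made-up names, the documents are invented, and "most relevant terms" is approximated by "most frequent words in the first-pass results", which is an assumption and not how MySQL chooses its expansion terms.

```python
from collections import Counter


def search(docs, terms):
    """Return ids of docs that contain any of the query terms."""
    wanted = {t.lower() for t in terms}
    return [d for d, text in docs.items() if wanted & set(text.lower().split())]


def expanded_search(docs, terms, extra_terms=1):
    """Two-pass search: re-run the query with frequent words from pass one."""
    original = {t.lower() for t in terms}
    first_pass = search(docs, terms)
    # Count words in the first-pass results, skipping the original terms.
    counts = Counter(
        w for d in first_pass for w in docs[d].lower().split() if w not in original
    )
    expansion = [w for w, _ in counts.most_common(extra_terms)]
    return search(docs, list(terms) + expansion)


# Hypothetical recaps; only the third mentions the school's formal name.
docs = {
    1: "Hokies blank UVa",
    2: "Hokies put Wake in their place",
    3: "Hokies blank Kent State as Virginia Tech Hokies overcome sloppy start",
}

print(search(docs, ["Virginia", "Tech"]))           # [3]
print(expanded_search(docs, ["Virginia", "Tech"]))  # [1, 2, 3]
```

The second call picks up “hokies” from the first-pass result and uses it to reach the other two recaps: recall goes up, and any unrelated row that happens to share the expansion term would come along too, which is where precision goes down.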

For More Information…
MySQL documentation: http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
Chapter 12/13 “Building a Content Management System”
CS 751/851 “Introduction to Digital Libraries”
    http://www.cs.odu.edu/~mln/teaching/
    esp. “Information Retrieval Concepts” lecture
Introduction to Information Retrieval textbook
    http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html
Is MySQL the right tool for your job?
    http://lucene.apache.org/
MySQL examples in this lecture based on those found at dev.mysql.com
Content snippets taken from www.techsideline.com