IMPLEMENTATION OF INFORMATION RETRIEVAL SYSTEMS VIA RDBMS.


Relational Database: Definitions
Relational database: a set of relations.
Relation: made up of 2 parts:
- Instance: a table, with rows and columns. #rows = cardinality, #fields = degree (arity).
- Schema: specifies the name of the relation, plus the name and type of each column. E.g., Students(sid: string, name: string, login: string, age: integer, gpa: real).
A relation can be thought of as a set of rows or tuples (i.e., all rows are distinct).

Example Instance of Students Relation
Cardinality = 3, degree = 5, all rows distinct.

Relational Query Languages
A major strength of the relational model: it supports simple, powerful querying of data.
Queries can be written intuitively, and the DBMS is responsible for efficient evaluation.

The SQL Query Language
Developed by IBM (System R) in the 1970s.
Need for a standard, since it is used by many vendors.
Standards:
- SQL-86
- SQL-89 (minor revision)
- SQL-92 (major revision, current standard)
- SQL-99 (major extensions)

The SQL Query Language
To find all 18-year-old students, we can write:
    SELECT *
    FROM Students S
    WHERE S.age = 18
To find just names and logins, replace the first line with:
    SELECT S.name, S.login

Querying Multiple Relations
    SELECT S.name, E.cid
    FROM Students S, Enrolled E
    WHERE S.sid = E.sid AND E.grade = 'A'

Creating Relations in SQL
The first statement creates the Students relation. Observe that the type (domain) of each field is specified, and enforced by the DBMS whenever tuples are added or modified.
    CREATE TABLE Students (sid CHAR(20), name CHAR(20), login CHAR(10), age INTEGER, gpa REAL)
As another example, the Enrolled table holds information about courses that students take.
    CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2))
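
The statements on the preceding slides can be tried out with a small sketch like the one below. It assumes Python's standard sqlite3 module and an in-memory database; the sample rows are invented for illustration. Note that SQLite applies type affinity rather than strict domain enforcement, so it is only an approximate stand-in for the DBMS behaviour described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Students (sid CHAR(20), name CHAR(20), login CHAR(10), age INTEGER, gpa REAL);
CREATE TABLE Enrolled (sid CHAR(20), cid CHAR(20), grade CHAR(2));
""")

# Invented sample rows (not from the slides).
conn.executemany("INSERT INTO Students VALUES (?, ?, ?, ?, ?)", [
    ("s1", "Ana", "ana@cs", 18, 3.4),
    ("s2", "Bo", "bo@ee", 19, 3.1),
    ("s3", "Cy", "cy@cs", 18, 3.8),
])
conn.executemany("INSERT INTO Enrolled VALUES (?, ?, ?)", [
    ("s1", "IR101", "A"),
    ("s2", "IR101", "B"),
    ("s3", "DB201", "A"),
])

# Names and logins of all 18-year-old students.
eighteen = conn.execute(
    "SELECT S.name, S.login FROM Students S WHERE S.age = 18 ORDER BY S.name"
).fetchall()

# Join of the two relations: who received an 'A' in which course.
a_grades = conn.execute(
    "SELECT S.name, E.cid FROM Students S, Enrolled E "
    "WHERE S.sid = E.sid AND E.grade = 'A' ORDER BY S.name"
).fetchall()

print(eighteen)
print(a_grades)
```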

Combining Separate Systems
Use an IR system and an RDBMS that are independent of each other.
Divide the query into two parts:
- Structured part for the RDBMS
- Unstructured (text) part for the IR system
Combine the results from the IR system and the RDBMS.
Good for letting each vendor develop its own system.
Bad for data integrity, recovery, portability, and performance.

User Defined Operators
Allow users to extend SQL by adding their own functions.
Some vendors used this approach (such as the IBM DB2 Text Extender).
Lynch and Stonebraker defined "user defined operators" to implement information retrieval in 1988.
    -- Retrieves documents that contain Term1, Term2, Term3
    SELECT Doc_Id
    FROM Doc
    WHERE SEARCH-TERM(Text, Term1, Term2, Term3)
    -- Retrieves documents that contain Term1, Term2, Term3
    -- within a window of 5 terms
    SELECT Doc_Id
    FROM Doc
    WHERE PROXIMITY(Text, 5, Term1, Term2, Term3)
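
As a rough illustration of the user-defined-operator idea, the sketch below registers a Python function with SQLite through sqlite3's create_function and then calls it from a WHERE clause. This is an analogy only: it is not the DB2 Text Extender interface and not the design proposed by Lynch and Stonebraker. The function name SEARCH_TERM (with an underscore, since a hyphen is not a valid identifier character here), its whole-word matching rule, and the sample documents are all assumptions.

```python
import sqlite3

def search_term(text, *terms):
    """Return 1 if every term occurs as a whitespace-separated word in text."""
    words = set(text.lower().split())
    return int(all(term.lower() in words for term in terms))

conn = sqlite3.connect(":memory:")
# narg = -1 lets the SQL function accept any number of arguments.
conn.create_function("SEARCH_TERM", -1, search_term)

conn.execute("CREATE TABLE Doc (Doc_Id INTEGER, Text TEXT)")
conn.executemany("INSERT INTO Doc VALUES (?, ?)", [
    (1, "vice president visits campus"),
    (2, "president of the club"),
    (3, "campus news"),
])

hits = [row[0] for row in conn.execute(
    "SELECT Doc_Id FROM Doc "
    "WHERE SEARCH_TERM(Text, 'vice', 'president') ORDER BY Doc_Id"
)]
print(hits)
```

The function scans the full text of every row, so a sketch like this gives the syntax of the approach without any of the indexing a real text extender would need.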

Non-First Normal Form Approaches
Capture the many-to-many relationships as sets via nested relations.
Hard to implement ad-hoc queries.
No standard yet.

Using RDBMS for IR
Benefits:
- Recovery
- Performance
- Data migration
- Concurrency control
- Access control mechanism
- Logical and physical data independence

Using RDBMS for IR
Example: a bibliography that includes both structured and unstructured information.
- DIRECTORY (name, institution): affiliation of the author
- AUTHOR (name, DocId): authorship information
- INDEX (name, DocId): terms that are used to index a document

Using RDBMS for IR
Preprocessing: SGML, a standard for defining parts of documents, can be used as a starting point.
Example document (the SGML tags are not preserved in this transcript; only the field values remain):
    WSJ
    How to make students suffer in IR Course
    03/23/87
    Sabanci, Turkey
    Crawler HW, Inverted Index, Querying

Using RDBMS for IR
Preprocessing: SGML, a standard for defining parts of documents, can be used as a starting point.
Use a parser together with a hash function to identify terms.
Use a STOP_TERM table for referencing stop words.
Produce three output tables:
- INDEX (DocId, Term, TermFrequency): models the inverted index
- DOC (DocId, DocName, PubDate, DateLine): document metadata
- TERM (Term, Idf): stores the weight (idf) of each term
    -- Construct the TERM table; N is the total number of documents
    INSERT INTO TERM
    SELECT Term, log(N / COUNT(*))
    FROM INDEX
    GROUP BY Term
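
The TERM construction above can be sketched with Python's sqlite3 module as follows. Several details are assumptions made for this sketch: INDEX is a reserved word in SQLite, so the table name is double-quoted; a natural-log function named ln is registered from Python because SQLite's built-in math functions depend on how the library was compiled; N is taken to be the number of distinct DocId values in INDEX; and the multiplication by 1.0 avoids integer division. The sample postings are invented.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.create_function("ln", 1, math.log)  # natural logarithm for the idf weight

conn.executescript("""
CREATE TABLE "INDEX" (DocId INTEGER, Term TEXT, TermFrequency INTEGER);
CREATE TABLE TERM (Term TEXT, Idf REAL);
""")

# Invented postings: one row per (document, term) pair.
conn.executemany('INSERT INTO "INDEX" VALUES (?, ?, ?)', [
    (1, "sabanci", 2), (1, "university", 1),
    (2, "university", 3),
    (3, "university", 1), (3, "crawler", 1),
    (4, "crawler", 2),
])

# N: total number of documents (assumed here to be those present in INDEX).
n_docs = conn.execute('SELECT COUNT(DISTINCT DocId) FROM "INDEX"').fetchone()[0]

# COUNT(*) per term is its document frequency, given one row per (doc, term).
conn.execute(
    'INSERT INTO TERM SELECT Term, ln(1.0 * ? / COUNT(*)) '
    'FROM "INDEX" GROUP BY Term',
    (n_docs,),
)

idf = dict(conn.execute("SELECT Term, Idf FROM TERM"))
print(idf)
```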

Using RDBMS for IR
An offset can be stored together with each term in order to answer proximity queries. For example, the terms "Vice" and "President" should occur next to each other in relevant documents.
    INDEX_PROX (DocId, Term, OffSet)
    -- Construct the INDEX table from INDEX_PROX by counting the occurrences of each term per document
    INSERT INTO INDEX
    SELECT DocId, Term, COUNT(*)
    FROM INDEX_PROX
    GROUP BY DocId, Term
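
A minimal sketch of deriving INDEX from INDEX_PROX, again assuming Python's sqlite3 module. The table name INDEX and the column name OffSet are double-quoted because both collide with SQLite keywords; the positional postings are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE INDEX_PROX (DocId INTEGER, Term TEXT, "OffSet" INTEGER);
CREATE TABLE "INDEX" (DocId INTEGER, Term TEXT, TermFrequency INTEGER);
""")

# One row per term occurrence: (document, term, position in the document).
conn.executemany("INSERT INTO INDEX_PROX VALUES (?, ?, ?)", [
    (1, "vice", 0), (1, "president", 1), (1, "visits", 2),
    (2, "president", 0), (2, "vice", 1), (2, "president", 2),
])

# Collapse occurrences into one row per (document, term) with its frequency.
conn.execute(
    'INSERT INTO "INDEX" '
    "SELECT DocId, Term, COUNT(*) FROM INDEX_PROX GROUP BY DocId, Term"
)

rows = conn.execute(
    'SELECT DocId, Term, TermFrequency FROM "INDEX" ORDER BY DocId, Term'
).fetchall()
print(rows)
```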

Using RDBMS for IR
A query can be modeled as a relation as well, for example when it is a long document:
    QUERY (Term, TermFreq)
Example: "Find all news documents written on 03/03/2005 about Sabanci University."
- Dates will be matched against the structured fields.
- Terms will be matched using the inverted index.
    SELECT d.DocId
    FROM DOC d, INDEX i
    WHERE i.Term IN ('Sabanci', 'University')
      AND d.PubDate = '03/03/2005'
      AND d.DocId = i.DocId
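
The combined structured and unstructured lookup can be sketched as below, assuming Python's sqlite3 module, dates stored as plain strings, and invented rows. The sketch adds DISTINCT so that a document matching both terms is listed once; that is a choice made here, not something the slide specifies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DOC (DocId INTEGER, DocName TEXT, PubDate TEXT, DateLine TEXT);
CREATE TABLE "INDEX" (DocId INTEGER, Term TEXT, TermFrequency INTEGER);
""")

# Invented metadata and postings.
conn.executemany("INSERT INTO DOC VALUES (?, ?, ?, ?)", [
    (1, "doc-a", "03/03/2005", "Sabanci, Turkey"),
    (2, "doc-b", "03/04/2005", "Sabanci, Turkey"),
    (3, "doc-c", "03/03/2005", "Sabanci, Turkey"),
])
conn.executemany('INSERT INTO "INDEX" VALUES (?, ?, ?)', [
    (1, "Sabanci", 1), (1, "University", 2),
    (2, "Sabanci", 1),
    (3, "Crawler", 1),
])

# Structured predicate on PubDate plus a term lookup in the inverted index.
matches = [row[0] for row in conn.execute(
    'SELECT DISTINCT d.DocId FROM DOC d, "INDEX" i '
    "WHERE i.Term IN ('Sabanci', 'University') "
    "AND d.PubDate = '03/03/2005' AND d.DocId = i.DocId "
    "ORDER BY d.DocId"
)]
print(matches)
```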

Using RDBMS for IR
Boolean queries: consist of terms combined with Boolean operators (AND, OR, and NOT).
For a single inputTerm, retrieve the texts of the documents that contain that term:
    SELECT d.Text
    FROM DOC d
    WHERE d.DocId IN (SELECT DISTINCT i.DocId
                      FROM INDEX i
                      WHERE i.Term = inputTerm)
Note that the text part of a document can be stored as a BLOB or CLOB (Binary or Character Large Object).

Using RDBMS for IR
Boolean queries that contain OR:
    SELECT DISTINCT i.DocId
    FROM INDEX i
    WHERE i.Term = inputTerm1 OR i.Term = inputTerm2 OR ... OR i.Term = inputTermN

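Using RDBMS for IR
Boolean queries that contain AND: a first attempt.
    SELECT DISTINCT i.DocId
    FROM INDEX i
    WHERE i.Term = inputTerm1 AND i.Term = inputTerm2 AND ... AND i.Term = inputTermN
Does this work?? No: each INDEX row holds a single term, so no row can equal two different input terms at once and the result is always empty.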

Using RDBMS for IR
Boolean queries that contain AND (the previous answer was wrong): join INDEX with itself, once per term.
    SELECT DISTINCT i1.DocId
    FROM INDEX i1, INDEX i2, INDEX i3, ..., INDEX in
    WHERE i1.Term = inputTerm1 AND i2.Term = inputTerm2 AND ... AND in.Term = inputTermN
      AND i1.DocId = i2.DocId AND i2.DocId = i3.DocId AND ... AND in-1.DocId = in.DocId
Alternatively, use INTERSECT over one single-term query per term.
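
The OR query, the naive AND attempt, the self-join, and the INTERSECT variant can be compared side by side for two terms. This is a sketch assuming Python's sqlite3 module, a double-quoted "INDEX" table name (INDEX is reserved in SQLite), and invented postings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "INDEX" (DocId INTEGER, Term TEXT, TermFrequency INTEGER)')
conn.executemany('INSERT INTO "INDEX" VALUES (?, ?, ?)', [
    (1, "sabanci", 2), (1, "university", 1),
    (2, "university", 3),
    (3, "sabanci", 1),
])
terms = ("sabanci", "university")

def doc_ids(sql, params):
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql, params)]

# OR: a document qualifies if any of its rows matches either term.
or_docs = doc_ids(
    'SELECT DISTINCT i.DocId FROM "INDEX" i '
    "WHERE i.Term = ? OR i.Term = ? ORDER BY i.DocId", terms)

# Naive AND: a single row cannot carry two different terms, so this is empty.
naive_and = doc_ids(
    'SELECT DISTINCT i.DocId FROM "INDEX" i '
    "WHERE i.Term = ? AND i.Term = ?", terms)

# Self-join AND: one copy of INDEX per term, tied together on DocId.
join_and = doc_ids(
    'SELECT DISTINCT i1.DocId FROM "INDEX" i1, "INDEX" i2 '
    "WHERE i1.Term = ? AND i2.Term = ? AND i1.DocId = i2.DocId", terms)

# INTERSECT AND: intersect the posting lists of the individual terms.
intersect_and = doc_ids(
    'SELECT DocId FROM "INDEX" WHERE Term = ? '
    'INTERSECT SELECT DocId FROM "INDEX" WHERE Term = ?', terms)

print(or_docs, naive_and, join_and, intersect_and)
```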

Using RDBMS for IR
Boolean queries that contain AND: commercial DBMSs are not able to process more than a fixed number of joins.
Solution: store the query terms in the QUERY relation and count the matches per document.
    SELECT i.DocId
    FROM INDEX i, QUERY q
    WHERE i.Term = q.Term
    GROUP BY i.DocId
    HAVING COUNT(i.Term) = (SELECT COUNT(*) FROM QUERY)
Works only when INDEX holds a single row per (document, term) pair, together with its frequency; no proximity information is recorded.

Using RDBMS for IR
Boolean queries that contain AND: commercial DBMSs are not able to process more than a fixed number of joins.
Solution for terms appearing more than once per document in the INDEX:
    SELECT i.DocId
    FROM INDEX i, QUERY q
    WHERE i.Term = q.Term
    GROUP BY i.DocId
    HAVING COUNT(DISTINCT i.Term) = (SELECT COUNT(*) FROM QUERY)
This is slower, since DISTINCT requires a sort for duplicate elimination.
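
The difference between the two HAVING clauses shows up as soon as a document has more than one row for the same term. The sketch below assumes Python's sqlite3 module, double-quoted "INDEX" and "QUERY" table names (both collide with SQLite keywords), and invented rows in which document 2 lists "university" twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "INDEX" (DocId INTEGER, Term TEXT);
CREATE TABLE "QUERY" (Term TEXT);
""")
conn.executemany('INSERT INTO "INDEX" VALUES (?, ?)', [
    (1, "sabanci"), (1, "university"),
    (2, "university"), (2, "university"),  # same term recorded twice
    (3, "sabanci"),
])
conn.executemany('INSERT INTO "QUERY" VALUES (?)', [("sabanci",), ("university",)])

AND_SQL = """
SELECT i.DocId
FROM "INDEX" i, "QUERY" q
WHERE i.Term = q.Term
GROUP BY i.DocId
HAVING {count_expr} = (SELECT COUNT(*) FROM "QUERY")
ORDER BY i.DocId
"""

# Plain COUNT is fooled by the repeated term in document 2.
count_hits = [r[0] for r in conn.execute(AND_SQL.format(count_expr="COUNT(i.Term)"))]

# COUNT(DISTINCT ...) counts each matched term once per document.
distinct_hits = [r[0] for r in conn.execute(
    AND_SQL.format(count_expr="COUNT(DISTINCT i.Term)"))]

print(count_hits, distinct_hits)
```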

Using RDBMS for IR
Boolean queries that contain AND: commercial DBMSs are not able to process more than a fixed number of joins.
Implementing TAND (Threshold AND) is also simple: keep the documents that match more than k of the query terms.
    SELECT i.DocId
    FROM INDEX i, QUERY q
    WHERE i.Term = q.Term
    GROUP BY i.DocId
    HAVING COUNT(DISTINCT i.Term) > k
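
A small sketch of TAND under the same assumptions as before (Python's sqlite3 module, double-quoted "INDEX" and "QUERY", invented rows). The helper name tand is hypothetical; it passes the threshold k as a query parameter and keeps the strict comparison used on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "INDEX" (DocId INTEGER, Term TEXT);
CREATE TABLE "QUERY" (Term TEXT);
""")
conn.executemany('INSERT INTO "INDEX" VALUES (?, ?)', [
    (1, "sabanci"), (1, "university"), (1, "crawler"),
    (2, "sabanci"), (2, "university"),
    (3, "crawler"),
])
conn.executemany('INSERT INTO "QUERY" VALUES (?)',
                 [("sabanci",), ("university",), ("crawler",)])

def tand(k):
    """Documents matching more than k distinct query terms."""
    return [row[0] for row in conn.execute(
        'SELECT i.DocId FROM "INDEX" i, "QUERY" q '
        "WHERE i.Term = q.Term "
        "GROUP BY i.DocId "
        "HAVING COUNT(DISTINCT i.Term) > ? "
        "ORDER BY i.DocId", (k,))]

print(tand(0), tand(1), tand(2))
```

With k = 0 the query degenerates to OR over the query terms, and with k one less than the number of query terms it behaves like the full AND.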

Using RDBMS for IR
Proximity queries: all query terms must occur within a window of a given width.
    SELECT a.DocId
    FROM INDEX_PROX a, INDEX_PROX b
    WHERE a.Term IN (SELECT q.Term FROM QUERY q)
      AND b.Term IN (SELECT q.Term FROM QUERY q)
      AND a.DocId = b.DocId
      AND (a.OffSet - b.OffSet) BETWEEN 0 AND (width - 1)
    GROUP BY a.DocId, b.DocId, a.Term, a.OffSet
    HAVING COUNT(DISTINCT b.Term) = (SELECT COUNT(*) FROM QUERY)
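
The proximity query can be exercised with the sketch below, assuming Python's sqlite3 module. "QUERY" and "OffSet" are double-quoted because they collide with SQLite keywords, width is passed as a parameter, and DISTINCT is added so that each qualifying document appears once; the helper name within_window and the postings are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE INDEX_PROX (DocId INTEGER, Term TEXT, "OffSet" INTEGER);
CREATE TABLE "QUERY" (Term TEXT);
""")
conn.executemany("INSERT INTO INDEX_PROX VALUES (?, ?, ?)", [
    (1, "vice", 0), (1, "president", 1),   # adjacent in document 1
    (2, "vice", 0), (2, "president", 9),   # far apart in document 2
])
conn.executemany('INSERT INTO "QUERY" VALUES (?)', [("vice",), ("president",)])

PROXIMITY_SQL = """
SELECT DISTINCT a.DocId
FROM INDEX_PROX a, INDEX_PROX b
WHERE a.Term IN (SELECT q.Term FROM "QUERY" q)
  AND b.Term IN (SELECT q.Term FROM "QUERY" q)
  AND a.DocId = b.DocId
  AND (a."OffSet" - b."OffSet") BETWEEN 0 AND (? - 1)
GROUP BY a.DocId, a.Term, a."OffSet"
HAVING COUNT(DISTINCT b.Term) = (SELECT COUNT(*) FROM "QUERY")
ORDER BY a.DocId
"""

def within_window(width):
    """Documents in which every query term falls inside one window."""
    return [row[0] for row in conn.execute(PROXIMITY_SQL, (width,))]

print(within_window(3), within_window(10))
```

Each group anchors a window at one occurrence a and looks back over the preceding width - 1 positions; the document qualifies if some anchor sees every query term.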

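Using RDBMS for IR
Calculating relevance: sum, over the matching terms, the product of the query weight (q.tf * t.idf) and the document weight (i.tf * t.idf), then rank by the total.
    SELECT i.DocId, SUM(q.tf * t.idf * i.tf * t.idf)
    FROM QUERY q, INDEX i, TERM t
    WHERE q.Term = t.Term AND i.Term = t.Term
    GROUP BY i.DocId
    ORDER BY 2 DESC
Here tf abbreviates the term-frequency columns (QUERY.TermFreq and INDEX.TermFrequency) and idf is TERM.Idf.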
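
A worked sketch of the ranking query, assuming Python's sqlite3 module and the column names from the schemas given earlier (TermFreq, TermFrequency, Idf). The idf values are set by hand to round numbers so the arithmetic is easy to follow; they are not computed from the sample postings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE "QUERY" (Term TEXT, TermFreq INTEGER);
CREATE TABLE "INDEX" (DocId INTEGER, Term TEXT, TermFrequency INTEGER);
CREATE TABLE TERM (Term TEXT, Idf REAL);
""")
conn.executemany('INSERT INTO "QUERY" VALUES (?, ?)',
                 [("sabanci", 1), ("university", 1)])
conn.executemany('INSERT INTO "INDEX" VALUES (?, ?, ?)', [
    (1, "sabanci", 1), (1, "university", 1),
    (2, "university", 3),
    (3, "sabanci", 2),
])
# Hand-picked idf weights (an assumption made to keep the sums simple).
conn.executemany("INSERT INTO TERM VALUES (?, ?)",
                 [("sabanci", 2.0), ("university", 1.0)])

# Score = sum over shared terms of (query tf * idf) * (document tf * idf).
ranked = conn.execute("""
SELECT i.DocId, SUM(q.TermFreq * t.Idf * i.TermFrequency * t.Idf)
FROM "QUERY" q, "INDEX" i, TERM t
WHERE q.Term = t.Term AND i.Term = t.Term
GROUP BY i.DocId
ORDER BY 2 DESC
""").fetchall()

print(ranked)
```

Document 3 scores 1 * 2 * 2 * 2 = 8, document 1 scores 4 + 1 = 5, and document 2 scores 1 * 1 * 3 * 1 = 3, so the rarer term dominates the ranking.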