CS520 Web Programming Full Text Search Chengyu Sun California State University, Los Angeles.


Search Text: web search; desktop search; applications (search posts in a bulletin board, search product descriptions at an online retailer, …)

Database Query: Find the posts regarding “SSHD login errors”.
select * from posts where content like '%SSHD login errors%';
Posts that match: “Here are the steps to take to fix the SSHD login errors: …” and “Please help! I got SSHD login errors!”

Problems with Database Queries: the LIKE query misses relevant posts that word the problem differently:
“There is a problem recently discovered regarding SSHD and login. The error message is usually …”
“Please help! I got an error when I tried to login through SSHD!”
“The solution for sshd/login errors: …”
And how about performance?
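The limitation can be reproduced without a database. The sketch below is only an illustration: the class and method names are hypothetical, and `String.contains` stands in for an exact, case-sensitive `LIKE '%…%'` match. It runs the slide's five sample posts through that check.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class; not part of the original slides.
final class LikeQueryDemo {
    // The five sample posts quoted on the two slides above.
    static final List<String> POSTS = List.of(
        "Here are the steps to take to fix the SSHD login errors: ...",
        "Please help! I got SSHD login errors!",
        "There is a problem recently discovered regarding SSHD and login. The error message is usually ...",
        "Please help! I got an error when I tried to login through SSHD!",
        "The solution for sshd/login errors: ...");

    // Stand-in for: content LIKE '%phrase%' (exact, case-sensitive substring match).
    static List<String> likeMatch(List<String> posts, String phrase) {
        return posts.stream()
                    .filter(p -> p.contains(phrase))
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Only the first two posts contain the exact phrase.
        likeMatch(POSTS, "SSHD login errors").forEach(System.out::println);
    }
}
```

Three of the five relevant posts are missed because they use different wording, word order, or punctuation, which is the gap that full text search addresses.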

Full Text Search (FTS): more formally known as Information Retrieval (IR); searches LARGE amounts of textual data.

Characteristics of FTS vs. Databases: “fuzzy” query processing; relevancy ranking.

Accuracy of FTS:
$$\text{Precision} = \frac{\text{\# of relevant documents retrieved}}{\text{\# of documents retrieved}}$$
$$\text{Recall} = \frac{\text{\# of relevant documents retrieved}}{\text{\# of relevant documents}}$$
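As a worked example of the two ratios, here is a minimal sketch that computes them from sets of document ids. The class name, method names, and sample ids are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; document ids are plain strings for simplicity.
final class RetrievalMetrics {
    // Precision = relevant documents retrieved / documents retrieved.
    static double precision(Set<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) return 0.0;
        return (double) intersectionSize(retrieved, relevant) / retrieved.size();
    }

    // Recall = relevant documents retrieved / relevant documents.
    static double recall(Set<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) return 0.0;
        return (double) intersectionSize(retrieved, relevant) / relevant.size();
    }

    private static int intersectionSize(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    public static void main(String[] args) {
        Set<String> retrieved = Set.of("d1", "d2", "d3", "d4");
        Set<String> relevant = Set.of("d1", "d2", "d5");
        System.out.println(precision(retrieved, relevant)); // 2 of 4 retrieved are relevant
        System.out.println(recall(retrieved, relevant));    // 2 of 3 relevant were retrieved
    }
}
```

Returning every document in the collection would push recall to 1 while precision collapses, so the two measures are usually read together.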

Journey of a Document: document → stripping non-textual data → tokenizing → removing stop words → stemming → indexing → index

Document. Original: The solution for sshd/login errors: … Text-only: The solution for sshd/login errors: …

Convert Different Document Types to Text: HTML with NekoHTML; PDF with PDFBox; MS Word with Apache POI. More at the Lucene FAQ: lucene/LuceneFAQ

Tokenizing [the] [solution] [for] [sshd] [login] [errors] …
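A minimal sketch of this step, assuming tokens are lower-cased runs of letters and digits (real analyzers apply more elaborate rules; the class name is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical tokenizer: lower-cases the text and splits on non-alphanumerics.
final class SimpleTokenizer {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Any run of characters that are neither letters nor digits is a separator.
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {   // split() can yield a leading empty string
                tokens.add(t);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [the, solution, for, sshd, login, errors]
        System.out.println(tokenize("The solution for sshd/login errors: ..."));
    }
}
```

Splitting on the slash is what lets “sshd/login” match a query for “SSHD login”.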

Chinese Text Example
Text: 今天天气不错。 (“The weather is nice today.”)
Unigram: [今] [天] [天] [气] [不] [错]
Bigram: [今天] [天天] [天气] [气不] [不错]
Grammar-based: [今天] [天气] [不错]
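The unigram and bigram rows can be produced mechanically with a sliding window over the Han characters. The sketch below assumes that approach (the class name is hypothetical); grammar-based segmentation needs a dictionary or language model and is not attempted here. The sample sentence is written with Unicode escapes so that the source compiles under any file encoding.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical n-gram tokenizer for CJK text.
final class CjkNgramTokenizer {
    static List<String> ngrams(String text, int n) {
        // Keep only Han ideographs; punctuation such as the ideographic full stop is dropped.
        int[] cps = text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
                .toArray();
        List<String> out = new ArrayList<>();
        // Slide a window of n code points across the text.
        for (int i = 0; i + n <= cps.length; i++) {
            out.add(new String(cps, i, n));
        }
        return out;
    }

    public static void main(String[] args) {
        // The slide's sample sentence (six ideographs followed by a full stop).
        String text = "\u4eca\u5929\u5929\u6c14\u4e0d\u9519\u3002";
        System.out.println(ngrams(text, 1)); // six single-character tokens
        System.out.println(ngrams(text, 2)); // five overlapping two-character tokens
    }
}
```

Bigrams need no dictionary, at the cost of indexing pairs that are not real words, such as the second and fourth bigrams on the slide.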

Stop Words: words that do not help in search and retrieval. Function words: a, an, and, the, of, for, … After stop word removal: [solution] [sshd] [login] [errors] … Problems with stop word removal?
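A minimal sketch of the filtering step, using only the function words listed on the slide as the stop list (production stop lists are longer; the class name is hypothetical):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical filter; STOP_WORDS holds just the slide's examples.
final class StopWordFilter {
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "the", "of", "for");

    // Keeps the tokens, in order, that are not on the stop list.
    static List<String> removeStopWords(List<String> tokens) {
        return tokens.stream()
                     .filter(t -> !STOP_WORDS.contains(t))
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [solution, sshd, login, errors]
        System.out.println(removeStopWords(
            List.of("the", "solution", "for", "sshd", "login", "errors")));
    }
}
```

One drawback is visible here too: a query made up entirely of stop words is filtered down to an empty list and can no longer match anything.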

Stemming: reduce a word to its stem or root form.
Examples: connection, connections, connected, connecting, connective → connect
[solution] [sshd] [login] [errors] … → [solve] [sshd] [login] [error] …
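To make the idea concrete, here is a toy suffix stripper. It is not the Porter algorithm or any production stemmer; the suffix list and the minimum stem length are assumptions chosen to cover the “connect” examples. It does not map “solution” to “solve”, which needs richer rules or a dictionary.

```java
// Hypothetical toy stemmer: strips the first matching suffix, longest first.
final class ToyStemmer {
    private static final String[] SUFFIXES = {"ions", "ion", "ive", "ing", "ed", "s"};
    private static final int MIN_STEM_LENGTH = 3;   // avoid reducing short words to nothing

    static String stem(String word) {
        for (String suffix : SUFFIXES) {
            int stemLength = word.length() - suffix.length();
            if (word.endsWith(suffix) && stemLength >= MIN_STEM_LENGTH) {
                return word.substring(0, stemLength);
            }
        }
        return word;   // no rule applies
    }

    public static void main(String[] args) {
        for (String w : new String[] {"connection", "connections", "connected",
                                      "connecting", "connective", "errors", "login"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

Because the same function is applied to documents and to queries, “errors” in a post and “error” in a query end up as the same index term.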

Inverted Index [Figure: keywords (cat, dog), buckets, documents, positions, # of words]
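One possible in-memory shape for such an index, sketched under the assumption that each keyword maps to the documents containing it and the token positions inside each document (class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical inverted index: term -> (document id -> positions of the term).
final class InvertedIndex {
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    // Records every token of a document together with its position.
    void add(int docId, List<String> tokens) {
        for (int pos = 0; pos < tokens.size(); pos++) {
            postings.computeIfAbsent(tokens.get(pos), k -> new TreeMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Returns the documents (and positions) for a term; empty if the term is unknown.
    Map<Integer, List<Integer>> lookup(String term) {
        return postings.getOrDefault(term, Map.of());
    }

    public static void main(String[] args) {
        InvertedIndex index = new InvertedIndex();
        index.add(1, List.of("cat", "dog", "cat"));
        index.add(2, List.of("dog"));
        System.out.println(index.lookup("cat")); // prints {1=[0, 2]}
        System.out.println(index.lookup("dog")); // prints {1=[1], 2=[0]}
    }
}
```

A lookup touches only the entry for the queried term, so its cost does not depend on scanning every document the way a LIKE query does.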

Query Processing: query → tokenizing → removing stop words → stemming → searching → ranking → results

Relevancy (Similarity): how well the document matches the query, e.g. weighted vector distance; how “important” the document is, e.g. based on ratings, citations, and links.
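The slide does not say which vector measure is meant. As one common choice, the sketch below uses cosine similarity over raw term counts, with the counts acting as the weights. This is a substituted technique for illustration, and all names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical scorer: cosine similarity between term-frequency vectors.
final class CosineSimilarity {
    // Counts how often each term occurs in a token list.
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokens) {
            tf.merge(t, 1, Integer::sum);
        }
        return tf;
    }

    // cos(a, b) = (a . b) / (|a| * |b|); 0 when either vector is empty.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0 || normB == 0) ? 0.0 : dot / (normA * normB);
    }

    private static double norm(Map<String, Integer> v) {
        double sum = 0;
        for (int x : v.values()) {
            sum += (double) x * x;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Map<String, Integer> query = termFrequencies(List.of("sshd", "login"));
        Map<String, Integer> doc = termFrequencies(List.of("solution", "sshd", "login", "error"));
        System.out.println(cosine(query, doc)); // 2 / (sqrt(2) * 2), about 0.707
    }
}
```

Documents can then be sorted by this score in descending order to produce a ranked result list.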

Example: Lucene Similarity Score (pi/core/org/apache/lucene/search/Similarity.html). Factors: # of times a term appears in a document; # of documents that contain the term; # of query terms found; length of a field; boost factor (field and/or document); …

FTS Implementations. Databases: MySQL (MyISAM tables only; InnoDB also supports full-text indexes since MySQL 5.6), PostgreSQL (since 8.3), Oracle, DB2, MS SQL Server, … Stand-alone IR libraries: Lucene, Egothor, Xapian, MG4J, …

FTS in PostgreSQL: built in since 8.3; provided by the tsearch/tsearch2 module before that. See eractive/textsearch.html

Text Search Configuration: specifies the options for transforming a document to a tsvector: tokenization, stop word removal, stemming, etc.
psql commands:
\dF
show default_text_search_config;
set default_text_search_config=english;
Change the default text search configuration in $DATA/postgresql.conf

Sample Schema
create table messages (
    id serial primary key,
    subject varchar(4092),
    content text,
    author varchar(255)
);

Basic Data Types and Functions. Data types: tsvector, tsquery. Functions: to_tsvector, to_tsquery, plainto_tsquery.

Query Syntax
plainto_tsquery (plain text): full text search
to_tsquery (operators & | ! and parentheses):
full & text & search
full & text | search
full & !text | search
(!full | text) & search

The Match Operator @@
tsvector @@ tsquery
tsquery @@ tsvector
text @@ tsquery, equivalent to to_tsvector(text) @@ tsquery
text @@ text, equivalent to to_tsvector(text) @@ plainto_tsquery(text)
Note that there is no tsquery @@ text.

Query Examples Find the messages that contain “computer programs” in the content Find the messages that contain “computer programs” in either the content or the subject
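The two searches could be written along the following lines. This is a sketch, not the slide's original code: the SQL is one plausible formulation against the sample `messages` table, the 'english' configuration is assumed, and the class, constant, and method names are hypothetical. Note that `plainto_tsquery` matches documents containing both stemmed words, not the exact phrase. Running `search` requires an open JDBC connection to a PostgreSQL database.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical JDBC helper for the sample "messages" table.
final class MessageSearch {
    // Messages whose content matches the search terms.
    static final String SEARCH_CONTENT =
        "select id from messages "
      + "where to_tsvector('english', content) @@ plainto_tsquery('english', ?)";

    // Messages whose subject or content matches the search terms.
    static final String SEARCH_SUBJECT_OR_CONTENT =
        "select id from messages, plainto_tsquery('english', ?) q "
      + "where to_tsvector('english', subject) @@ q "
      + "or to_tsvector('english', content) @@ q";

    // Runs one of the queries above and returns the ids of the matching messages.
    static List<Integer> search(Connection conn, String sql, String terms) throws SQLException {
        List<Integer> ids = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, terms);   // e.g. "computer programs"
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getInt("id"));
                }
            }
        }
        return ids;
    }
}
```

Passing the search terms as a bound parameter keeps user input out of the SQL text.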

Create an Index on Text Column(s): an expression (function) index. The language parameter is required in both index construction and query.
create index messages_content_index on messages using gin(to_tsvector('english', content));

Use a Separate Column for Text Search: create a tsvector column; use a trigger to update the column.

Create an Index on the tsvector Column: the language parameter is no longer required.
create index messages_tsv_index on messages using gin(tsv);

More Functions
setweight(tsvector, "char"), with weights A: 1.0, B: 0.4, C: 0.2, D: 0.1
ts_rank(tsvector, tsquery)
ts_headline(text, tsquery)

Function Examples: set the weight of subject to be “A” and the weight of content to be “D”; list the results by their relevancy scores and highlight the query terms in the results.

Using Native SQL in JPA
String sql = "select * from employees where id = ?";
entityManager.createNativeQuery(sql, Employee.class)
    .setParameter(1, employeeId)
    .getResultList();

Named Query in the entity class (annotations on Employee):
name="employee.findAll", query="select * from employees"
name="employee.findById", query="from Employee where id = :id"
public class Employee { … }
A named query can be JPQL or SQL.

Named Query in Hibernate Mapping File
<![CDATA[ select * from messages where tsv @@ plainto_tsquery(?) ]]>

Using Named Query in DAO
entityManager.createNamedQuery("employee.findAll", Employee.class)
    .getResultList();
entityManager.createNamedQuery("employee.findById", Employee.class)
    .setParameter("id", employeeId)
    .getSingleResult();

Example: Course Search in CSNS2. See Course; CourseDao and CourseDaoImpl; NamedQueries.hbm.xml; csns-create.sql.