Text processing Team 3 김승곤 박지웅 엄태건 최지헌 임유빈. Outline 1.Introduction 2.Text Processing 3.Index Techniques in Database 4.Index Techniques in Wireless Network.


Text processing Team 3 김승곤 박지웅 엄태건 최지헌 임유빈

Outline 1.Introduction 2.Text Processing 3.Index Techniques in Database 4.Index Techniques in Wireless Network 5.Text Processing Operations 6.Apache Lucene 7.Apache Solr 8.Demo 5/18 5/20

Text Processing Operations ●Text processing operations o Classification o Clustering o Part-of-speech tagging o Parsing o Sentiment analysis o Language modeling o Named entity recognition o etc. ●Why is indexing important to the above operations?

Classification ●Classification o Automatically classify items into correct classes o Supervised learning ●Text classification o Classify documents using text features o Used as a common approach to many text processing operations ●Examples o Spam filter o routing o Language identification o etc.

Classification ●Approaches o Probabilistic  e.g. Naive Bayes o Geometric  e.g. Support vector machine o Artificial neural network o Decision tree o etc.
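
The probabilistic approach named above can be illustrated with a small Naive Bayes text classifier. This is a minimal sketch, not the presenters' implementation: the training sentences, the `spam`/`ham` labels, and the function names are invented for illustration, and add-one smoothing is an assumed design choice.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (label, text) pairs. Returns the counts used at prediction time."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in labeled_docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log posterior for the given text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # log prior of the class
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # log likelihood with add-one (Laplace) smoothing
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data for a spam filter
training = [
    ("spam", "win money now"),
    ("spam", "free money offer"),
    ("ham", "meeting schedule for monday"),
    ("ham", "project meeting notes"),
]
model = train_naive_bayes(training)
print(predict(model, "free money"))
```

Word counts per class are exactly the kind of statistic that an index over documents makes cheap to collect.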

Clustering ●Unsupervised learning ●Group similar items based on similarity values ●Text clustering o Large volume o Sparse data o e.g. grouping documents that share the same topic

Clustering ●Approaches o Mainly statistical o Hierarchical o Partitional o … ●Examples o k-means o affinity propagation
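
As an illustration of the partitional approach, k-means can be sketched in a few lines. This is a simplified version under stated assumptions: points are plain tuples of floats (for text they would typically be term-weight vectors), the starting centroids are supplied by the caller, and the iteration count is fixed. The sample points are invented.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on tuples of floats; `centroids` are the starting centers."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical 2-D points forming two obvious groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```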

Language Modeling ●A method for representing language in a machine-comprehensible form ●Approaches o Probabilistic language models  Use the probability of a sequence of words o Neural language models, now widely used  Use a neural network to map language to numeric values
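
To make "probability of a sequence of words" concrete, here is a minimal sketch of a bigram language model with unsmoothed maximum-likelihood estimates. The three-sentence corpus, the `<s>` and `</s>` boundary markers, and the function names are assumptions made for illustration.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count contexts and bigrams over whitespace-tokenized, lowercased sentences."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs
    return contexts, bigrams

def sentence_probability(model, sentence):
    """P(sentence) as the product of P(word | previous word)."""
    contexts, bigrams = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        prob *= bigrams[(prev, word)] / contexts[prev]
    return prob

# Hypothetical toy corpus
corpus = ["the cat sat", "the cat ran", "the dog sat"]
model = train_bigram_model(corpus)
print(sentence_probability(model, "the cat sat"))
```

Without smoothing, any sentence containing an unseen bigram gets probability zero, which is one reason practical models add smoothing or use neural networks.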

POS Tagging ●Every word has a part-of-speech tag o noun, verb, adjective, adverb, … o e.g. What is the airspeed of an unladen swallow?  What/WP is/VBZ the/DT airspeed/NN of/IN an/DT unladen/JJ swallow/NN o e.g. 아버지가 방에 들어가신다 ("Father enters the room")  아버지/NNG 가/JKS 방/NNG 에/JKB 들어가/VV 시/EPH 다/EFN ●Approaches o Classifier, sequence model, rule-based, ... ●A partly easy problem o Many words are unambiguous o Even the simplest baseline reaches about 90% accuracy o State-of-the-art methods reach about 97%

Parsing ●Syntactic structure o Constituency (phrase structure) o Dependency ●Parsing resolves ambiguity in sentences ●Approaches o Pre-1990: hand-defined symbolic grammars o After that: statistical methods  due to the rise of annotated data (e.g. Penn Treebank)

Sentiment Analysis ●Detection of attitudes ●Types of sentiment analysis o Whether the attitude is positive/negative o Rank the attitude from 1 to 5 o Or more complex types

Sentiment Analysis ●Approaches o Classification o Regression o Using a lexicon (e.g. WordNet) ●Why sentiment analysis? o For companies, to know consumers’ opinions on a product o For politicians, to know people’s opinions on a candidate or an issue ●Also known as o Opinion extraction, opinion mining, sentiment mining, subjectivity analysis

Named Entity Recognition ●Important sub-task of information extraction ●Find and classify names in text ●Approaches o Sequence model o Lexicon o Classification

Why Index? ●Many operations are based on statistical approaches o They work over a large number of documents ●Retrieving documents by the words they contain is a very frequent task o The word is the common unit of many operations

References Bengio, Yoshua, et al. "A neural probabilistic language model." The Journal of Machine Learning Research 3 (2003):

Apache Lucene

Lucene Overview ●Lucene? o Open-source Java full-text search library o Makes it easy to add search functionality to an application or website o Does not care about the source of the data, its format, or even its language  as long as you can convert it to text ●Main capabilities o Creation, maintenance, and access of the Lucene inverted index

●Basic Process 1.Adds content to a full-text index 2.Performs queries on this index 3.Returns results ranked by a. The relevance to the query b. An arbitrary field i. e.g., Last modified date

How to Make Content Searchable ●Search engines generally: a. Extract tokens from content b. Optionally transform the tokens depending on needs  Stemming  Expand with synonyms (usually done at query time)  Remove tokens (stopwords)  Add metadata c. Store tokens and related metadata (position, etc.) in a data structure optimized for searching  Called an inverted index

Terms ●Inverted Index o Searches an index instead of searching the text directly o Converts a page-centric structure (page->words) into a keyword-centric data structure (word->pages)
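
The page->words to word->pages inversion can be sketched as follows. This is only an illustration of the data structure, not Lucene's on-disk format; the page ids, the sample texts, and the function names are invented.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """pages: dict of page id -> text. Returns a dict of word -> set of page ids."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

def search(index, word):
    """Look the word up in the index instead of scanning every page."""
    return sorted(index.get(word.lower(), set()))

# Hypothetical pages
pages = {
    1: "lunch time doubly so",
    2: "time is an illusion",
    3: "lunch is ready",
}
index = build_inverted_index(pages)
print(search(index, "time"))
```

A lookup touches only the entry for the queried word, so its cost does not grow with the total amount of text the way a full scan does.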

●Documents o The unit of search and index o An index consists of one or more Documents o Content can be from various sources  SQL/NoSQL database, a file system, websites o e.g.) Lucene index of a database table of users ● Each user = Lucene Document

Terms ●Fields o A Document consists of one or more Fields o Simply a name-value pair  e.g.) Title : Avengers

Terms ●Fields o Types  Keyword ●Not analyzed, but indexed and stored ●Original value should be preserved in its entirety ●e.g.) File system path, dates...  UnIndexed ●Neither analyzed nor indexed, but stored as is ●Needed for display with search results, but never searched directly ●e.g.) Database primary key...  UnStored ●Analyzed and indexed but not stored ●Large amount of text that doesn’t need to be retrieved in its original form ●e.g.) Bodies of web pages, any other type of text document  Text ●Analyzed and indexed ●If String, stored ●If the data is from a Reader, not stored

An example of Lucene Fields

Terms ●Attributes o Tokenized  Analyze the content, extracting Tokens and adding them to the inverted index o Stored  Keep the content in a storage data structure for use by the application

Lucene Architecture

Lucene Functionality 1.Language Analysis 2.Indexing 3.Querying 4.Ancillary Features The Core of Lucene

Language Analysis

●Overview o The process of converting raw text into indexable tokens o Analyzer = Tokenizer + TokenFilter classes  Lucene provides many Analyzers out-of-the-box ● StandardAnalyzer, WhitespaceAnalyzer, etc.  Tokenizer for chunking the input into Tokens  TokenFilter can further modify the Tokens o Easy to add your own o Done on both the content to be indexed and the query

Language Analysis ●Input o Contents (documents) to be indexed o Queries to be searched ●Output o Appropriate internal representation as needed Input Output

Language Analysis 1.Optional character filtering and normalization a. e.g.) removing diacritics 2.Tokenization a. “Time is an illusion. Lunch time doubly so.” ==> [“Time”, “is”, “an”, “illusion.”, “Lunch”, “time”, “doubly”, “so.”]

Language Analysis 3.Token filtering a. Stopword removal i. Remove words too common to be useful ii. e.g.) and, a, the, but, … b. Stemming i. Chop off the ends of words to map different forms of a word to a single form ii. e.g.) lazy, laziness -> lazi

Language Analysis 3.Token filtering c. Lemmatization  Remove inflectional endings only and return the base or dictionary form of a word (lemma)  e.g.) better, best -> good d. N-gram creation  For approximate matching  e.g.) “This is my car” ⇒ [“This”, “is”, “my”, “car”], [“This is”, “is my”, “my car”], [“This is my”, “is my car”]
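
The analysis steps above (tokenization, stopword removal, stemming, n-gram creation) can be chained as in the sketch below. It does not use Lucene's Analyzer classes: the stopword list is illustrative, and `crude_stem` is a toy suffix stripper written for this example, not the Porter algorithm.

```python
import re

STOPWORDS = {"and", "a", "an", "the", "but", "is"}  # illustrative list only

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def crude_stem(token):
    """Toy stemmer: strip a few suffixes, then map a final 'y' to 'i'."""
    for suffix in ("ness", "ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    if token.endswith("y") and len(token) > 2:
        return token[:-1] + "i"
    return token

def word_ngrams(tokens, n):
    """All runs of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def analyze(text):
    """Tokenize, drop stopwords, then stem."""
    return [crude_stem(t) for t in remove_stopwords(tokenize(text))]

print(analyze("Time is an illusion. Lunch time doubly so."))
print(word_ngrams(["This", "is", "my", "car"], 2))
```

The same `analyze` function would be applied to both indexed content and queries, so that both sides agree on the form of each term.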

Indexing

Indexing ●Indexing o Prepare / add text to Lucene o Optimized for searching ●Lucene indexing o Well-known inverted index representation o Also keeps adjacent non-inverted data on a per-document basis ●Key point o Lucene only indexes Strings  Convert whatever file format we have into something Lucene can use

Indexing with Lucene ●Overview o Fast: over 200 GB/hour o Incremental and “near-realtime” o Multi-threaded o Beyond full-text: numbers, dates, binary,... o Customize what is indexed (“analysis”) o Customize index format (“codecs”)

Indexing ●Document Model o A flat ordered list of fields with content o Fields have name, content data, float weight, and other attributes o Does not need to have a unique identifier

Indexing ●Store terms and documents in arrays

Indexing ●Insertions? o Insertion = write a new segment o Merge segments when there are too many of them o concatenate docs, merge terms, dicts and postings lists (merge sort!)

Indexing ●Deletions? o Deletion = turn a bit off o Ignore deleted documents when searching and merging (reclaims space) o Merge policies favor segments with many deletions
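
The segment behavior described on these slides (an insertion writes a new segment, a deletion flips a bit, a merge drops deleted documents) can be modeled with a small in-memory sketch. The class names and the `max_segments` threshold are hypothetical, and the sketch stores raw strings rather than postings lists.

```python
class Segment:
    """A write-once batch of documents plus a liveness bitmap."""
    def __init__(self, docs):
        self.docs = list(docs)           # never modified after creation
        self.live = [True] * len(docs)   # deletion only flips these bits

class SegmentedIndex:
    def __init__(self, max_segments=2):
        self.segments = []
        self.max_segments = max_segments  # hypothetical merge trigger

    def add_documents(self, docs):
        # Insertion = write a new segment
        self.segments.append(Segment(docs))
        if len(self.segments) > self.max_segments:
            self.merge()

    def delete(self, doc):
        # Deletion = turn a bit off; the document stays in place
        for segment in self.segments:
            for i, stored in enumerate(segment.docs):
                if stored == doc:
                    segment.live[i] = False

    def merge(self):
        # Merging copies only live documents, which reclaims space
        survivors = [d for s in self.segments for d, alive in zip(s.docs, s.live) if alive]
        self.segments = [Segment(survivors)]

    def search(self, word):
        # Deleted documents are ignored at search time
        return [d for s in self.segments for d, alive in zip(s.docs, s.live)
                if alive and word in d.split()]

index = SegmentedIndex(max_segments=2)
index.add_documents(["red apple"])
index.add_documents(["green apple"])
index.delete("red apple")
print(index.search("apple"))
```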

Indexing ●Updates require writing a new segment o Single-doc updates are costly, bulk updates preferred o Writes are sequential ●Segments are never modified in place o Filesystem-cache-friendly o Lock-free! ●Terms are deduplicated o Saves space for high-freq terms ●Docs are uniquely identified by an ord o Useful for cross-API communication o Lucene can use several indexes in a single query ●Terms are uniquely identified by an ord o Important for sorting: compare longs, not strings o Important for faceting

●Term vectors o Per-document inverted index o Useful for more-like-this (related search terms) o Sometimes used for highlighting

Indexing ●Numeric/binary doc values o Per doc and per field single numeric values o Useful for sorting and custom scoring o Norms are numeric doc values

Indexing ●Sorted (set) doc values o Ordinal-enabled per-doc and per-field values  Sorted: single-valued, useful for sorting  Sorted set: multi-valued, useful for faceting

Indexing ●Stored fields vs Doc values o Optimized for different access patterns  get many field values for a few docs: stored fields  get a few field values for many docs: doc values
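
The two access patterns can be pictured as a row-oriented layout versus a column-oriented layout. The sketch below is an analogy using plain Python lists and dicts, not Lucene's storage format; the sample documents and function names are invented.

```python
# Hypothetical documents
docs = [
    {"title": "Cafe A", "stars": 4, "year": 2010},
    {"title": "Diner B", "stars": 2, "year": 2011},
    {"title": "Bistro C", "stars": 5, "year": 2009},
]

# Stored fields: row-oriented, one complete record per document
stored_fields = docs

# Doc values: column-oriented, one array per field, addressed by document number
doc_values = {field: [d[field] for d in docs] for field in docs[0]}

def fetch_document(doc_id):
    """Many field values for a few docs: a single row read."""
    return stored_fields[doc_id]

def sort_by_field(field):
    """A few field values for many docs: a scan over one column."""
    return sorted(range(len(docs)), key=lambda doc_id: doc_values[field][doc_id])

print(fetch_document(1))
print(sort_by_field("stars"))
```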

Indexing ●Lucene APIs

Querying

●Lucene Query Parser converts strings into Java objects that can be used for searching ●Query objects can also be constructed programmatically ●Native support for many types of queries o Keyword o Phrase o Wildcard o Many more

Core Searching classes ●IndexSearcher ●Term o Basic unit for searching, consists of the field and the value of that field ●Query o TermQuery, BooleanQuery, PhraseQuery, PrefixQuery, PhrasePrefixQuery, RangeQuery, FilteredQuery, and SpanQuery ●Hits o Simple container of pointers to ranked search results

Types of Queries 1.TermQuery a. Useful for retrieving documents by a key b. When the expression consists of a single word 2.PrefixQuery a. Matches documents containing terms beginning with a specified string b. When it ends with an asterisk(*) in query expressions 3.RangeQuery a. Facilitates searches from a starting term through an ending term

Types of Queries 4.BooleanQuery o A container of Boolean clauses o A clause is a subquery that can be optional, required, or prohibited 5.PhraseQuery o An index contains positional information of terms o Uses this information to locate documents where terms are within a certain distance of one another 6.FuzzyQuery o Matches terms similar to a specified term
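
The idea of required, optional, and prohibited clauses can be sketched with set operations over postings. This is a simplified model that ignores scoring and is not Lucene's BooleanQuery implementation; the postings and the function name are invented, and treating a query with only prohibited clauses as matching nothing is an assumption of this sketch.

```python
def boolean_query(index, required=(), optional=(), prohibited=()):
    """index: dict of term -> set of doc ids. Returns the sorted matching doc ids."""
    if required:
        # Every required clause must match
        matches = set.intersection(*(set(index.get(t, ())) for t in required))
    elif optional:
        # With no required clause, at least one optional clause must match
        matches = set().union(*(index.get(t, ()) for t in optional))
    else:
        # Only prohibited clauses: nothing to select from in this sketch
        return []
    for term in prohibited:
        matches -= set(index.get(term, ()))
    return sorted(matches)

# Hypothetical postings
index = {"meal": {1, 2, 3}, "coffee": {2, 3}, "china": {3}}
print(boolean_query(index, required=["meal", "coffee"], prohibited=["china"]))
```

When required clauses are present, optional clauses would only influence ranking, which this sketch leaves out.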

Querying ●Support a variety of query options o Ability to filter, page, and sort results o Pseudo relevance feedback ●Over 50 different kinds of query representations ●Several query parsers ●A query parsing framework

Various Types of Queries

Analysis and Search Relevancy

Lucene Tutorial 1.Download Lucene 2.Write code a. Indexing side i. Write code to add Documents to the index b. Search side i. Write code to transform the user query into Lucene Query instances ii. Submit the Query to Lucene to search iii. Display results

Basic Application


Apache Solr

Relationship between Lucene & Solr ●Engine & Car o Lucene  A programmatic library which you can't use as-is o Solr  A complete application which you can use out-of-box

Solr Overview ●Solr? o Web application o Enterprise search platform built on Lucene o Highly reliable, scalable, fault tolerant ●Solr is not just an HTTP wrapper around Lucene o It adds many functionalities to Lucene o Some Solr features were implemented before they became available in Lucene

Solr vs. Lucene ●Solr uses Lucene library, but extends it o Data-driven schemaless mode o Faceted search and filtering o Geospatial search o Performance optimizations o Monitoring o Rich document parsing o and so on...

Solr Functionality ●Advanced full-text search ●Scalability & Fault tolerance ●Open interfaces ●Administration interfaces ●Easy monitoring ●Easy configuration ●Near real-time indexing ●Extensible plugins

Scaling ●On distributed systems, Solr provides... o High scalability o Fault tolerance ●Built on Apache Zookeeper o Coordinator for distributed systems o Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services

Open Interfaces ●REST-like API o Invoke diverse operations via HTTP requests ●XML, JSON, CSV, binary format o Put data with these formats o Receive data in these formats ●Easy integration with any language

Web Interfaces ●Provides administrative and monitoring features

Web Interfaces (cont.) ●Provides querying interfaces ●Provides various querying options

Solr Query ●Solr query supports… o keyword matching o wildcard matching o proximity matching o range search o assigning different weights on search conditions o function query

Demo

●RDBMS vs. text search platform o Does one size fit all? o Comparison of features and performance between the two systems o MySQL (RDBMS) vs. Solr (text search platform)

MySQL vs. Solr ●MySQL o RDBMS used for general purposes ●Solr o Search platform, targeting text retrieval only ●Questions o Will the performance difference between the two systems be significant? o Does Solr have other advantages over traditional DBMSs?

Settings ●Yelp review dataset o Dataset we used in course projects o Provided in JSON format o Over 1.5 million reviews

Importing Data ●There is no direct way to import JSON data into MySQL o We have to insert the records one by one, o or load them into the DB after converting them to a CSV file ●Solr can parse rich documents o XML, JSON, PDF, Word, etc. o Powered by Apache Tika

Importing Data ●In MySQL, … o Convert to CSV o And load

Importing Data ●In Solr… o Single line command

Test #1: Matching Single Term ●Retrieve documents that have the word ‘cuisine’ in their contents ●In SQL, o SELECT * FROM review WHERE text LIKE '%cuisine%'; ●In Solr query, o HTTP request to ‘/select’, with parameter o q=text:cuisine

Test #1: Matching Single Term ●Video clip: MySQL

Test #1: Matching Single Term Video clip: Solr

Test #2: Matching with Conditions ●Retrieve documents that satisfy the conditions below o contains ‘meal’ in text o contains ‘coffee’ in text o does not contain ‘china’ in text o star rating is over 3 o written before 2012 o sort by date, ascending order o retrieve up to 500 documents

Test #2: Matching with Conditions ●In MySQL, SELECT * FROM review WHERE text LIKE '%meal%' AND text LIKE '%coffee%' AND text NOT LIKE '%china%' AND stars > 3 AND date < ' :00:00.000' ORDER BY date ASC LIMIT 500; ●In Solr, HTTP request to ‘/select’, with parameter q=text:meal AND text:coffee AND -text:china AND stars:{3 TO *} AND date:{* TO T00:00:00Z} sort=date ASC rows=500
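
For reference, the Solr request above could be assembled from application code roughly as follows. This sketch only builds the URL and sends nothing. The host, port, and core name `reviews` are assumptions, and the date bound `2012-01-01T00:00:00Z` is a placeholder chosen to match "written before 2012", since the exact timestamp is not legible in the slide.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_select_url(base_url, query, sort=None, rows=None):
    """Return a URL for the '/select' handler with URL-encoded parameters."""
    params = {"q": query}
    if sort is not None:
        params["sort"] = sort
    if rows is not None:
        params["rows"] = rows
    return f"{base_url}/select?{urlencode(params)}"

# Assumed local Solr core; the date bound is a placeholder, not taken from the slide
BASE_URL = "http://localhost:8983/solr/reviews"
QUERY = ("text:meal AND text:coffee AND -text:china "
         "AND stars:{3 TO *} AND date:{* TO 2012-01-01T00:00:00Z}")

url = build_select_url(BASE_URL, QUERY, sort="date asc", rows=500)
print(url)
```

Letting `urlencode` handle the escaping avoids hand-encoding the spaces, braces, and colons in the query string.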

Test #2: Matching with Conditions ●Video clip: MySQL

Test #2: Matching with Conditions ●Video clip: Solr

Test #3: Proximity Search ●Proximity search o Matching term occurrences within a specified distance o e.g. ‘hotel’ and ‘california’ within distance 4  This hotel is located in California  Welcome to the Hotel California, such a lovely place
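
The matching rule can be illustrated with a small function that compares token positions. It is a simplification: it treats "distance" as the difference between token positions in either order, which is not identical to how Lucene computes slop. The function name and the third sample sentence are invented.

```python
import re

def within_distance(text, term_a, term_b, max_distance):
    """True if some occurrence of term_a is at most max_distance tokens from term_b."""
    tokens = re.findall(r"\w+", text.lower())
    positions_a = [i for i, t in enumerate(tokens) if t == term_a]
    positions_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(a - b) <= max_distance for a in positions_a for b in positions_b)

sentences = [
    "This hotel is located in California",
    "Welcome to the Hotel California, such a lovely place",
    "The hotel we booked last summer was far away from sunny California",
]
for sentence in sentences:
    print(within_distance(sentence, "hotel", "california", 4), sentence)
```

Storing term positions in the index is what lets a search engine answer this without rescanning the text.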

Test #3: Proximity Search ●In MySQL o Can we do it…? ●In Solr o text:"hotel california"~4 o or, {!surround} text:3w(hotel, california)

Test #4: Faceted Search ●Solr supports faceted search feature o This allows users to explore information by applying multiple filters o Dynamic clustering of search results into categories that let users drill into search results by any value in any field. ●Popular technique for commercial applications

Test #4: Faceted Search ●When to use? o I want to find a specific item o but it is hard to define what I want to find ●With faceted search, we can remove irrelevant candidates by applying filters

Test #4: Faceted Search ●Define filters o Star rating o Date written  From  To  By 3-month interval ●Query o GET request to ‘/select’, with parameters  q=*  facet=true  facet.field=stars  facet.date=date  f.date.facet.date.start= T00:00:00Z  f.date.facet.date.end= T00:00:00Z  f.date.facet.date.gap= +3MONTH

Test #4: Faceted Search

●q=* ●facet=true ●facet.field=stars ●facet.date=date ●f.date.facet.date.start= T00:00:00Z ●f.date.facet.date.end= T00:00:00Z ●f.date.facet.date.gap=+3MONTH ●fq=stars:4 ●fq=date:{ T00:00:00Z TO T00:00:00Z+3MONTH}
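
What the facet parameters compute can be mimicked in a few lines: count documents per field value, count them per 3-month bucket, then narrow the result set with a filter and count again. This is a conceptual sketch, not Solr's implementation. The sample reviews are invented, and the buckets are calendar quarters, which matches a +3MONTH gap only when the facet start date falls on a quarter boundary.

```python
from collections import Counter
from datetime import date

def facet_counts(docs, field):
    """Number of documents per distinct value of `field`."""
    return Counter(d[field] for d in docs)

def quarter_facets(docs, field="date"):
    """Number of documents per (year, quarter) bucket."""
    return Counter((d[field].year, (d[field].month - 1) // 3 + 1) for d in docs)

def drill_down(docs, **conditions):
    """Keep only documents whose fields equal the given values (like an fq filter)."""
    return [d for d in docs if all(d[k] == v for k, v in conditions.items())]

# Hypothetical reviews
reviews = [
    {"stars": 4, "date": date(2011, 1, 15)},
    {"stars": 4, "date": date(2011, 2, 3)},
    {"stars": 5, "date": date(2011, 5, 20)},
    {"stars": 2, "date": date(2011, 6, 1)},
]

print(facet_counts(reviews, "stars"))
print(quarter_facets(reviews))
print(quarter_facets(drill_down(reviews, stars=4)))
```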

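Test #4: Faceted Search ●In MySQL? o SELECT stars, COUNT(*) FROM review GROUP BY stars; o SELECT YEAR(date), (CASE WHEN MONTH(date) >= 1 AND MONTH(date) <= 3 THEN 1 WHEN MONTH(date) >= 4 AND MONTH(date) <= 6 THEN 2 WHEN MONTH(date) >= 7 AND MONTH(date) <= 9 THEN 3 WHEN MONTH(date) >= 10 AND MONTH(date) <= 12 THEN 4 END) AS period, COUNT(*) FROM review WHERE date >= ' :00:00' AND date < ' :00:00' GROUP BY YEAR(date), period; ●Long and messy!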

Test #5: Language Analysis ●For an efficient text retrieval, language analysis techniques are used o Stemming o Synonyms o Stopword removal o etc.

Test #5: Language Analysis ●In Solr, we can apply filters on index and query o Some filters are applied automatically by default, depending on the language o For English, the Porter stemmer is used by default  Of course, we can choose a different stemmer

Test #5: Language Analysis ●‘a nice hotel’, ‘hotel with niceness’ o Both give the same results, due to stemming and stopword removal

Test #5: Language Analysis ●Are these possible in MySQL? o Almost impossible with MySQL alone o Would have to be done at the application level, not the DB level

Results ●RDBMS vs. text search platform o Response time  Using its indices, the text search platform retrieved documents faster o Rich search functionality  The text search platform offers features such as proximity search and faceted search o Language analysis  The text search platform applies filters on index and query to match synonyms, inflected forms of terms, etc.

Conclusions ●For text retrieval, Solr outperforms MySQL o What if updates occur frequently? o What if we need to find documents by something other than words? o What if we need complex join operations? ●Does one size fit all? o An RDBMS is a good general-purpose choice, but there are systems built for specific domains ●We have to select a suitable system o If you are a database engineer who has to build a text retrieval system, a text retrieval engine might be a good choice
