Published by Dora Ferguson. Modified over 9 years ago.
1
Text processing Team 3 김승곤 박지웅 엄태건 최지헌 임유빈
2
Outline 1.Introduction 2.Text Processing 3.Index Techniques in Database 4.Index Techniques in Wireless Network 5.Text Processing Operations 6.Apache Lucene 7.Apache Solr 8.Demo 5/18 5/20
3
Text Processing Operations ●Text processing operations o Classification o Clustering o Part-of-speech tagging o Parsing o Sentiment analysis o Language modeling o Named entity recognition o etc. ●Why is indexing important to the above operations?
4
Classification ●Classification o Automatically classify items into correct classes o Supervised learning ●Text classification o Classify documents using text features o Used as a common approach to many text processing operations ●Examples o Spam filter o Email routing o Language identification o etc.
5
Classification ●Approaches o Probabilistic e.g. Naive Bayes o Geometric e.g. Support vector machine o Artificial neural network o Decision tree o etc.
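As an illustration of the probabilistic approach, here is a minimal sketch of a multinomial Naive Bayes text classifier with add-one smoothing. The training examples, labels, and function names are made up for illustration and are not from the presentation; a real spam filter would use far more data and better tokenization.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Count documents per class and word occurrences per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_docs[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab

def classify(text, model):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are skipped
                count = word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set.
examples = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting schedule today", "ham"),
    ("project meeting notes", "ham"),
]
model = train_naive_bayes(examples)
spam_guess = classify("money offer now", model)
ham_guess = classify("project meeting", model)
```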
6
Clustering ●Unsupervised learning ●Based on similarity values, make groups of similar items ●Text clustering o Large volume o Sparse data o e.g. grouping documents that share the same topic Image from http://analyticstraining.com/2011/cluster-analysis-for-business/
7
Clustering ●Approaches o Mainly statistical o Hierarchical o Partitional o … ●Examples o k-means o affinity propagation Images from http://scikit-learn.org/stable/modules/clustering.html
8
Language Modeling ●A method for representing language in a machine-comprehensible form ●Approaches o Probabilistic language model Uses the probability of a sequence of words o Recently, neural language models are widely used Use a neural network to map words into real-valued vectors
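The probabilistic approach can be sketched with a bigram model, which approximates the probability of a word sequence as a product of conditional probabilities P(w_i | w_{i-1}) estimated from counts. The corpus and function names below are illustrative assumptions, and the sketch uses no smoothing, so any unseen bigram gives probability zero.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count context words and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])             # every token that acts as a context
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    return unigrams, bigrams

def sentence_probability(sentence, unigrams, bigrams):
    """Approximate P(w1..wn) as the product of P(w_i | w_{i-1})."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0  # context never seen in training
        prob *= bigrams[(prev, word)] / unigrams[prev]
    return prob

# Hypothetical toy corpus.
corpus = ["the cat sat", "the dog sat", "the cat ran"]
unigrams, bigrams = train_bigram_model(corpus)
p_seen = sentence_probability("the cat sat", unigrams, bigrams)
p_unseen = sentence_probability("the dog ran", unigrams, bigrams)
```

Here `p_seen` is nonzero because every bigram in "the cat sat" occurs in the corpus, while `p_unseen` is zero because "dog ran" never occurs; practical models add smoothing to avoid this.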
9
POS Tagging ●Every word has its part-of-speech tag o noun, verb, adjective, adverb, … o e.g. What is the airspeed of an unladen swallow? What/WP is/VBZ the/DT airspeed/NN of/IN an/DT unladen/JJ swallow/VB o e.g. 아버지가 방에 들어가신다 ("Father enters the room") 아버지/NNG 가/JKS 방/NNG 에/JKB 들어가/VV 시/EPH 다/EFN ●Approaches o classifier, sequence model, rule-based, ... ●A partly easy problem o Many words are unambiguous o Even the simplest baseline (tagging each word with its most frequent tag) reaches about 90% accuracy o State-of-the-art methods reach about 97% accuracy
10
Parsing ●Syntactic structure o Constituency (phrase structure) o Dependency ●Parsing resolves the structural ambiguity of sentences ●Approaches o Pre-1990: defining symbolic grammars by hand o After that: statistical methods, driven by the rise of annotated data (e.g. Penn Treebank)
11
Sentiment Analysis ●Detection of attitudes ●Types of sentiment analysis o Whether the attitude is positive/negative o Rank the attitude from 1 to 5 o Or more complex types
12
Sentiment Analysis ●Approaches o Classification o Regression o Using a lexicon (e.g. WordNet) ●Why sentiment analysis? o For companies, to know consumers’ opinions on a product o For politicians, to know people’s opinions on a candidate or an issue ●Also known as o Opinion extraction, opinion mining, sentiment mining, subjectivity analysis
13
Named Entity Recognition ●Important sub-task of information extraction ●Find and classify names in text ●Approaches o Sequence model o Lexicon o Classification
14
Why Index? ●Many operations are based on statistical approaches o They need a large number of documents ●Retrieving documents by the words they contain is a very frequent task o The word is the common unit of many operations
16
Apache Lucene
17
Lucene Overview ●Lucene? o Open-source Java full-text search library o Makes it easy to add search functionality to an application or website o Does not care about the source of the data, its format, or even its language, as long as it can be converted to text ●Main Capabilities o Creation, maintenance, and accessibility of the Lucene inverted index
18
●Basic Process 1.Adds content to a full-text index 2.Performs queries on this index 3.Returns results ranked by a. The relevance to the query b. An arbitrary field i. e.g., Last modified date
19
How to Make Content Searchable ●Search engines generally: a. Extract tokens from content b. Optionally transform the tokens depending on needs: stemming, expanding with synonyms (usually done at query time), removing tokens (stopwords), adding metadata c. Store tokens and related metadata (position, etc.) in a data structure optimized for searching, called an inverted index
20
Terms ●Inverted Index o Searches an index instead of searching the text directly o Converts a page-centric structure (page->words) into a keyword-centric data structure (word->pages)
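The page->words to word->pages conversion can be sketched in a few lines. This is a simplified illustration with made-up page contents and whitespace tokenization, not Lucene's actual on-disk structure.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Turn {page_id: text} into {word: set of page_ids}."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

# Hypothetical pages (page-centric view).
pages = {
    1: "lucene is a search library",
    2: "solr is built on lucene",
    3: "mysql is a database",
}
index = build_inverted_index(pages)

# Keyword-centric lookup: which pages contain "lucene"?
lucene_pages = sorted(index["lucene"])
```

A lookup is now a single dictionary access instead of a scan over every page.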
21
●Documents o The unit of search and index o An index consists of one or more Documents o Content can be from various sources SQL/NoSQL database, a file system, websites o e.g.) Lucene index of a database table of users ● Each user = Lucene Document
22
Terms ●Fields o A Document consists of one or more Fields o Simply a name-value pair, e.g. Title : Avengers
23
Terms ●Fields o Types: Keyword ●Not analyzed, but indexed and stored ●Original value should be preserved in its entirety ●e.g. file system paths, dates, ... UnIndexed ●Neither analyzed nor indexed, but stored as is ●For values you need to display with search results but will never search directly ●e.g. database primary keys, ... UnStored ●Analyzed and indexed, but not stored ●For large amounts of text that do not need to be retrieved in their original form ●e.g. bodies of web pages, any other type of text document Text ●Analyzed and indexed ●If the data is a String, stored ●If the data is from a Reader, not stored
24
An example of Lucene Fields
25
Terms ●Attributes o Tokenized Analyze the content, extracting Tokens and adding them to the inverted index o Stored Keep the content in a storage data structure for use by the application
26
Lucene Architecture
27
Lucene Functionality 1.Language Analysis 2.Indexing 3.Querying 4.Ancillary Features The Core of Lucene
28
Language Analysis
29
●Overview o The process of converting raw text into indexable tokens o Analyzer = Tokenizer + TokenFilter classes Lucene provides many Analyzers out-of-the-box ● StandardAnalyzer, WhitespaceAnalyzer, etc. Tokenizer for chunking the input into Tokens TokenFilter can further modify the Tokens o Easy to add your own o Done on both the content to be indexed and the query
30
Language Analysis ●Input o Contents (documents) to be indexed o Queries to be searched ●Output o Appropriate internal representation as needed Input Output
31
Language Analysis 1.Optional character filtering and normalization a. e.g. removing diacritics 2.Tokenization a. “Time is an illusion. Lunch time doubly so.” ==> [“Time”, “is”, “an”, “illusion.”, “Lunch”, “time”, “doubly”, “so.”]
32
Language Analysis 3.Token filtering a. Stopword removal i. Remove words too common to be useful ii. e.g. and, a, the, but, … b. Stemming i. Chop off the ends of words to map different forms of a word to a single form ii. e.g. lazy, laziness -> lazi
33
Language Analysis 3.Token filtering (cont.) c. Lemmatization Remove inflectional endings only and return the base or dictionary form of a word (lemma) e.g. better, best -> good d. N-gram creation For approximate matching e.g. “This is my car” ⇒ [“This”, “is”, “my”, “car”], [“This is”, “is my”, “my car”], [“This is my”, “is my car”]
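The analysis steps above (tokenization, stopword removal, n-gram creation) can be sketched as a small pipeline. This is an illustrative approximation, not Lucene's Analyzer API: unlike the whitespace tokenization shown on the slide, this sketch lowercases the text and strips punctuation, and the stopword list is an assumed sample.

```python
import re

# Assumed sample stopword list; real analyzers ship larger, per-language lists.
STOPWORDS = {"and", "a", "an", "the", "but", "is"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stopwords(tokens):
    """Drop tokens that are too common to be useful."""
    return [t for t in tokens if t not in STOPWORDS]

def word_ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("Time is an illusion. Lunch time doubly so.")
content_tokens = remove_stopwords(tokens)
bigrams = word_ngrams(tokenize("This is my car"), 2)
trigrams = word_ngrams(tokenize("This is my car"), 3)
```

Stemming and lemmatization are left out because they need language-specific rules or dictionaries.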
34
Indexing
35
Indexing ●Indexing o Prepare / Add text to Lucene o Optimized for searching ●Lucene Indexing o Well-known inverted index representation o Keeps adjacent non-inverted data on a per-document basis ●Key Point o Lucene only indexes Strings Convert whatever file format we have into something Lucene can use
36
Indexing with Lucene ●Overview o Fast: over 200 GB/hour o Incremental and “near-realtime” o Multi-threaded o Beyond full-text: numbers, dates, binary,... o Customize what is indexed (“analysis”) o Customize index format (“codecs”)
37
Indexing ●Document Model o A flat ordered list of fields with content o Fields have name, content data, float weight, and other attributes o Does not need to have a unique identifier
38
Indexing ●Store terms and documents in arrays
39
Indexing ●Insertions? o Insertion = write a new segment o Merge segments when there are too many of them o Concatenate docs, merge term dictionaries and postings lists (merge sort!)
40
Indexing ●Deletions? o Deletion = turn a bit off o Ignore deleted documents when searching and merging (reclaims space) o Merge policies favor segments with many deletions
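The insertion and deletion behavior described above can be sketched with append-only segments and a per-document "live" flag. This is a toy model under simplifying assumptions: the class names are hypothetical, documents are plain strings, and a merge collapses all segments at once, whereas Lucene's merge policies pick subsets of segments.

```python
class Segment:
    """An immutable batch of documents plus one 'live' flag per document."""
    def __init__(self, docs):
        self.docs = tuple(docs)              # never modified after creation
        self.live = [True] * len(self.docs)  # deletion only flips a flag

class TinyIndex:
    def __init__(self, max_segments=3):
        self.segments = []
        self.max_segments = max_segments

    def add_batch(self, docs):
        """Insertion = write a new segment; merge when there are too many."""
        self.segments.append(Segment(docs))
        if len(self.segments) > self.max_segments:
            self.merge()

    def delete(self, doc):
        """Deletion = turn a bit off; the document stays stored until a merge."""
        for seg in self.segments:
            for i, stored in enumerate(seg.docs):
                if stored == doc:
                    seg.live[i] = False

    def merge(self):
        """Copy only live documents into one new segment (reclaims space)."""
        survivors = [d for seg in self.segments
                     for d, alive in zip(seg.docs, seg.live) if alive]
        self.segments = [Segment(survivors)]

    def search(self, word):
        """Ignore deleted documents when searching."""
        return [d for seg in self.segments
                for d, alive in zip(seg.docs, seg.live)
                if alive and word in d.split()]

index = TinyIndex(max_segments=2)
index.add_batch(["red apple", "green pear"])
index.add_batch(["red plum"])
index.delete("red apple")
hits_before_merge = index.search("red")
segments_before_merge = len(index.segments)
index.add_batch(["red cherry"])  # third segment exceeds the limit and triggers a merge
hits_after_merge = index.search("red")
```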
41
Indexing ●Updates require writing a new segment o Single-doc updates are costly, bulk updates preferred o Writes are sequential ●Segments are never modified in place o Filesystem-cache-friendly o Lock-free! ●Terms are deduplicated o Saves space for high-frequency terms ●Docs are uniquely identified by an ord o Useful for cross-API communication o Lucene can use several indexes in a single query ●Terms are uniquely identified by an ord o Important for sorting: compare longs, not strings o Important for faceting
42
●Term vectors o Per-document inverted index o Useful for more-like-this (related search terms) o Sometimes used for highlighting
43
Indexing ●Numeric/binary doc values o Per doc and per field single numeric values o Useful for sorting and custom scoring o Norms are numeric doc values
44
Indexing ●Sorted (set) doc values o Ordinal-enabled per-doc and per-field values Sorted: single-valued, useful for sorting Sorted set: multi-valued, useful for faceting
45
Indexing ●Stored fields vs Doc values o Optimized for different access patterns get many field values for a few docs: stored fields get a few field values for many docs: doc values
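The two access patterns can be pictured as a row-oriented layout versus a column-oriented layout. The sketch below only illustrates that idea with in-memory Python structures and made-up documents; it does not reflect Lucene's actual file formats.

```python
# Hypothetical documents; the list position serves as the doc id.
documents = [
    {"title": "Great coffee", "stars": 5, "year": 2009},
    {"title": "Cold meal", "stars": 2, "year": 2011},
    {"title": "Nice hotel", "stars": 4, "year": 2007},
]

# Row-oriented layout (like stored fields): one record per document,
# cheap when fetching many field values for a few documents.
stored_fields = [dict(doc) for doc in documents]

# Column-oriented layout (like doc values): one array per field,
# cheap when reading one field value for many documents.
doc_values = {
    field: [doc[field] for doc in documents]
    for field in ("stars", "year")
}

def fetch_document(doc_id):
    """Return every stored field of a single document (e.g. to display a hit)."""
    return stored_fields[doc_id]

def sort_doc_ids_by(field):
    """Sort all doc ids by one field, touching only that field's column."""
    column = doc_values[field]
    return sorted(range(len(column)), key=column.__getitem__)

top_hit = fetch_document(0)
ids_by_year = sort_doc_ids_by("year")
```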
46
Indexing ●Lucene APIs
47
Querying
48
●Lucene Query Parser converts strings into Java objects that can be used for searching ●Query objects can also be constructed programmatically ●Native support for many types of queries o Keyword o Phrase o Wildcard o Many more
49
Core Searching classes ●IndexSearcher ●Term o Basic unit for searching, consists of the field and the value of that field ●Query o TermQuery, BooleanQuery, PhraseQuery, PrefixQuery, PhrasePrefixQuery, RangeQuery, FilteredQuery, and SpanQuery ●Hits o Simple container of pointers to ranked search results
50
Types of Queries 1.TermQuery a. Useful for retrieving documents by a key b. When the expression consists of a single word 2.PrefixQuery a. Matches documents containing terms beginning with a specified string b. When it ends with an asterisk(*) in query expressions 3.RangeQuery a. Facilitates searches from a starting term through an ending term
51
Types of Queries 4.BooleanQuery o A container of Boolean clauses o A clause is a subquery that can be optional, required, or prohibited 5.PhraseQuery o An index contains positional information of terms o Uses this information to locate documents where terms are within a certain distance of one another 6.FuzzyQuery o Matches terms similar to a specified term
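To make the query types concrete, here is a minimal sketch that evaluates term, prefix, boolean, and two-word phrase queries against a small positional inverted index. The function names mirror the Lucene class names only loosely and are not Lucene's API; the documents are made up.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} using whitespace tokenization."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def term_query(index, term):
    """Documents containing the exact term."""
    return set(index.get(term, {}))

def prefix_query(index, prefix):
    """Documents containing any term that starts with the prefix."""
    hits = set()
    for term, postings in index.items():
        if term.startswith(prefix):
            hits |= set(postings)
    return hits

def boolean_query(index, required=(), prohibited=()):
    """Documents containing all required terms and no prohibited term."""
    if not required:
        return set()
    hits = set.intersection(*(term_query(index, t) for t in required))
    for term in prohibited:
        hits -= term_query(index, term)
    return hits

def phrase_query(index, first, second):
    """Documents where `second` appears immediately after `first`."""
    hits = set()
    for doc_id in term_query(index, first) & term_query(index, second):
        first_positions = set(index[first][doc_id])
        if any(p - 1 in first_positions for p in index[second][doc_id]):
            hits.add(doc_id)
    return hits

# Hypothetical documents.
docs = {
    1: "good coffee and a good meal",
    2: "coffee in china",
    3: "meal with cold coffee",
}
index = build_positional_index(docs)
boolean_hits = boolean_query(index, required=("meal", "coffee"), prohibited=("china",))
phrase_hits = phrase_query(index, "good", "coffee")
prefix_hits = prefix_query(index, "me")
```

The phrase query is the only one that needs the stored positions; the others work on document sets alone.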
52
Querying ●Support a variety of query options o Ability to filter, page, and sort results o Pseudo relevance feedback ●Over 50 different kinds of query representations ●Several query parsers ●A query parsing framework
53
Various Types of Queries
54
Analysis and Search Relevancy
55
Lucene Tutorial 1.Download Lucene from http://lucene.apache.org/java 2.Write Code a. Indexing Side i. Write code to add Documents to index b. Search Side i. Write code to transform user query into Lucene Query instances ii. Submit Query to Lucene to search iii. Display results
56
Basic Application
58
Apache Solr
59
Relationship between Lucene & Solr ●Engine & Car o Lucene A programmatic library which you can't use as-is o Solr A complete application which you can use out-of-box
60
Solr Overview ●Solr? o Web application o Enterprise search platform built on Lucene o Highly reliable, scalable, fault tolerant ●Solr is not just an HTTP wrapper around Lucene o It adds many functionalities to Lucene o Some features were implemented in Solr before they became available in Lucene
61
Solr vs. Lucene ●Solr uses Lucene library, but extends it o Data-driven schemaless mode o Faceted search and filtering o Geospatial search o Performance optimizations o Monitoring o Rich document parsing o and so on...
62
Solr Functionality ●Advanced full-text search ●Scalability & Fault tolerance ●Open interfaces ●Administration interfaces ●Easy monitoring ●Easy configuration ●Near real-time indexing ●Extensible plugins
63
Scaling ●On distributed systems, Solr provides... o High scalability o Fault tolerance ●Built on Apache Zookeeper o Coordinator for distributed systems o Centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
64
Open Interfaces ●REST-like API o Invoke diverse operations via HTTP requests ●XML, JSON, CSV, binary format o Send data in these formats o Receive data in these formats ●Easy integration with any language
65
Web Interfaces ●Provides administrative and monitoring features
66
Web Interfaces (cont.) ●Provides querying interfaces ●Provides various querying options
67
Solr Query ●Solr query supports… o keyword matching o wildcard matching o proximity matching o range search o assigning different weights on search conditions o function query
68
Demo
69
●RDBMS vs. Text search platform o Does one size fit all? o Comparison of features and performance between the two systems o MySQL (RDBMS) vs. Solr (Text search platform)
70
MySQL vs. Solr ●MySQL o RDBMS used for general purposes ●Solr o Search platform, targeting text retrieval only ●Questions o Will the performance difference between the two systems be significant? o Does Solr have other advantages over traditional DBMSs?
71
Settings ●Yelp review dataset o https://www.yelp.com/dataset_challenge/dataset o Dataset we used in course projects o Provided in JSON format o Over 1.5 million reviews
72
Importing Data ●There is no direct method to import JSON data into MySQL o We have to insert the records one by one, o Or load them into the DB after converting them to a CSV file ●In Solr, rich document parsing is available o XML, JSON, PDF, Word, etc. o Powered by Apache Tika
73
Importing Data ●In MySQL, … o Convert to CSV o And load
74
Importing Data ●In Solr… o Single line command
75
Test #1: Matching Single Term ●Retrieve documents that have the word ‘cuisine’ in their contents ●In SQL, o SELECT * FROM review WHERE text LIKE '%cuisine%'; ●In Solr query, o HTTP request to ‘/select’, with parameter o q=text:cuisine
76
Test #1: Matching Single Term ●Video clip: MySQL
77
Test #1: Matching Single Term Video clip: Solr
78
Test #2: Matching with Conditions ●Retrieve documents satisfying the conditions below o contains ‘meal’ in text o contains ‘coffee’ in text o does not contain ‘china’ in text o star rating is over 3 o written before 2012 o sorted by date, in ascending order o up to 500 documents
79
Test #2: Matching with Conditions
●In MySQL,
    SELECT * FROM review
    WHERE text LIKE '%meal%' AND text LIKE '%coffee%' AND text NOT LIKE '%china%'
      AND stars > 3 AND date < '2012-01-01 00:00:00.000'
    ORDER BY date ASC LIMIT 500;
●In Solr, HTTP request to ‘/select’, with parameters
    q=text:meal AND text:coffee AND -text:china AND stars:{3 TO *} AND date:{* TO 2012-01-01T00:00:00Z}
    sort=date ASC
    rows=500
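When the Solr request is sent from code, the parameters have to be URL-encoded. The sketch below builds such a request URL with the Python standard library only. The host, port, and core name (`reviews`) are assumptions for illustration, and `build_select_url` is a hypothetical helper, not part of any Solr client.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Assumed endpoint: a local Solr instance with a hypothetical core named "reviews".
SOLR_SELECT_URL = "http://localhost:8983/solr/reviews/select"

def build_select_url(base_url, q, sort=None, rows=None):
    """Return a /select URL with properly percent-encoded query parameters."""
    params = {"q": q}
    if sort is not None:
        params["sort"] = sort
    if rows is not None:
        params["rows"] = rows
    return base_url + "?" + urlencode(params)

query = ("text:meal AND text:coffee AND -text:china "
         "AND stars:{3 TO *} AND date:{* TO 2012-01-01T00:00:00Z}")
url = build_select_url(SOLR_SELECT_URL, query, sort="date asc", rows=500)

# Decoding the URL again recovers the original parameter values.
decoded = parse_qs(urlsplit(url).query)
```

The resulting `url` could then be fetched with any HTTP client; the sketch stops before the network call so that it runs without a Solr server.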
80
Test #2: Matching with Conditions ●Video clip: MySQL
81
Test #2: Matching with Conditions ●Video clip: Solr
82
Test #3: Proximity Search ●Proximity search o Matching term occurrences within a specified distance o e.g. ‘hotel’ and ‘california’ within distance 4: “This hotel is located in California”, “Welcome to the Hotel California, such a lovely place”
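The idea of proximity matching can be sketched by comparing token positions. This uses a simplified notion of distance (the absolute difference between token positions) and is not Lucene's or Solr's exact slop computation; the third example sentence is made up.

```python
import re

def within_distance(text, term_a, term_b, max_distance):
    """True if some occurrence of term_a is at most max_distance tokens from term_b."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    positions_a = [i for i, t in enumerate(tokens) if t == term_a]
    positions_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(a - b) <= max_distance
               for a in positions_a for b in positions_b)

near_1 = within_distance("This hotel is located in California",
                         "hotel", "california", 4)
near_2 = within_distance("Welcome to the Hotel California, such a lovely place",
                         "hotel", "california", 4)
# Hypothetical counterexample: both terms occur, but far apart.
far = within_distance("The hotel staff said their favorite state on the west coast is California",
                      "hotel", "california", 4)
```

A plain substring filter such as SQL `LIKE` can tell that both words occur, but it cannot express this kind of positional constraint directly.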
83
Test #3: Proximity Search ●In MySQL o Can we do it…? ●In Solr o text:"hotel california"~4 o or, {!surround} text:3w(hotel, california)
84
Test #4: Faceted Search ●Solr supports faceted search feature o This allows users to explore information by applying multiple filters o Dynamic clustering of search results into categories that let users drill into search results by any value in any field. ●Popular technique for commercial applications
85
Test #4: Faceted Search ●When to use? o I want to find a specific item o but it is hard to define exactly what I want to find ●With faceted search, we can remove irrelevant candidates by applying filters
86
Test #4: Faceted Search ●Define filters o Star rating o Date written: from 2006-01-01 to 2010-01-01, by 3-month intervals ●Query o GET request to ‘/select’, with parameters
    q=*
    facet=true
    facet.field=stars
    facet.date=date
    f.date.facet.date.start=2006-01-01T00:00:00Z
    f.date.facet.date.end=2010-01-01T00:00:00Z
    f.date.facet.date.gap=+3MONTH
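What these facet parameters ask for can be sketched in plain Python: a count of matching documents per star rating and per 3-month interval. The review records and function names below are made up for illustration; Solr computes such counts from its index rather than by scanning documents.

```python
from collections import Counter
from datetime import date

def quarter_start(d):
    """Return the first day of the 3-month interval containing d."""
    first_month = 3 * ((d.month - 1) // 3) + 1
    return date(d.year, first_month, 1)

def facet_counts(reviews, start, end):
    """Count reviews per star rating, and per 3-month interval within [start, end)."""
    stars = Counter(r["stars"] for r in reviews)
    periods = Counter(quarter_start(r["date"])
                      for r in reviews if start <= r["date"] < end)
    return stars, periods

# Hypothetical review records.
reviews = [
    {"stars": 4, "date": date(2007, 7, 15)},
    {"stars": 4, "date": date(2007, 8, 2)},
    {"stars": 5, "date": date(2007, 10, 1)},
    {"stars": 2, "date": date(2011, 1, 1)},  # outside the date facet range
]
star_facet, date_facet = facet_counts(reviews, date(2006, 1, 1), date(2010, 1, 1))
```

Because the facet range starts on 2006-01-01 with a 3-month gap, the intervals line up with calendar quarters, which is what `quarter_start` relies on.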
87
Test #4: Faceted Search
88
●q=* ●facet=true ●facet.field=stars ●facet.date=date ●f.date.facet.date.start=2006-01-01T00:00:00Z ●f.date.facet.date.end=2010-01-01T00:00:00Z ●f.date.facet.date.gap=+3MONTH ●fq=stars:4 ●fq=date:{2007-07-01T00:00:00Z TO 2007-07-01T00:00:00Z+3MONTH}
89
Test #4: Faceted Search ●In MySQL?
    SELECT stars, COUNT(*) FROM review GROUP BY stars;
    SELECT YEAR(date),
      (CASE WHEN MONTH(date) >= 1 AND MONTH(date) < 4 THEN 1
            WHEN MONTH(date) >= 4 AND MONTH(date) < 7 THEN 2
            WHEN MONTH(date) >= 7 AND MONTH(date) < 10 THEN 3
            WHEN MONTH(date) >= 10 AND MONTH(date) <= 12 THEN 4 END) AS period,
      COUNT(*)
    FROM review
    WHERE date >= '2006-01-01 00:00:00' AND date < '2010-01-01 00:00:00'
    GROUP BY YEAR(date), period;
●Long and messy!
90
Test #5: Language Analysis ●For an efficient text retrieval, language analysis techniques are used o Stemming o Synonyms o Stopword removal o etc.
91
Test #5: Language Analysis ●In Solr, we can apply filters on both the index and the query o Some filters are applied automatically by default, depending on the language o For English, the Porter stemmer is used by default Of course, we can choose a different stemmer
92
Test #5: Language Analysis ●‘a nice hotel’, ‘hotel with niceness’ o They will give us the same results, due to stemming and stopword removal
93
Test #5: Language Analysis ●Are these possible in MySQL? o Almost impossible with MySQL alone o Should be done at the application level, not the DB level
94
Results ●RDBMS vs. Text search platform o Response time Using its indices, the text search platform retrieved documents faster o Rich search functionality The text search platform offers features such as proximity search and faceted search o Language analysis The text search platform applies filters on the index and the query to match synonyms, inflected forms of terms, etc.
95
Conclusions ●For text retrieval, Solr outperforms MySQL o What if updates occur frequently? o What if we need to find documents by something other than words? o What if we need complex join operations? ●Does one size fit all? o An RDBMS can be a good choice for general purposes, but specialized systems exist for specific domains ●We have to select a suitable system o If you are a database engineer who has to build a text retrieval system, a text retrieval engine might be a good choice