1 Data Mining: Text Mining. 2 Information Retrieval Techniques Index Terms (Attribute) Selection: Stop list Word stem Index terms weighting methods Terms.

Slides:



Advertisements
Similar presentations
Chapter 5: Introduction to Information Retrieval
Advertisements

INFO624 - Week 2 Models of Information Retrieval Dr. Xia Lin Associate Professor College of Information Science and Technology Drexel University.
Text Databases Text Types
Indexing. Efficient Retrieval Documents x terms matrix t 1 t 2... t j... t m nf d 1 w 11 w w 1j... w 1m 1/|d 1 | d 2 w 21 w w 2j... w 2m 1/|d.
Introduction to Information Retrieval
Data Mining: Concepts and Techniques Mining Text Data
Data Warehousing & Mining with Business Intelligence: Principles and Algorithms Data Warehousing & Mining with Business Intelligence: Principles and Algorithms.
Intelligent Information Retrieval 1 Vector Space Model for IR: Implementation Notes CSC 575 Intelligent Information Retrieval These notes are based, in.
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model.
Learning for Text Categorization
IR Models: Overview, Boolean, and Vector
Information Retrieval Ling573 NLP Systems and Applications April 26, 2011.
K nearest neighbor and Rocchio algorithm
Search and Retrieval: More on Term Weighting and Document Ranking Prof. Marti Hearst SIMS 202, Lecture 22.
Query Operations: Automatic Local Analysis. Introduction Difficulty of formulating user queries –Insufficient knowledge of the collection –Insufficient.
Database Management Systems, R. Ramakrishnan1 Computing Relevance, Similarity: The Vector Space Model Chapter 27, Part B Based on Larson and Hearst’s slides.
CSM06 Information Retrieval Lecture 3: Text IR part 2 Dr Andrew Salway
Ch 4: Information Retrieval and Text Mining
Modern Information Retrieval Chapter 2 Modeling. Can keywords be used to represent a document or a query? keywords as query and matching as query processing.
Modeling Modern Information Retrieval
Hinrich Schütze and Christina Lioma
Vector Space Model Any text object can be represented by a term vector Examples: Documents, queries, sentences, …. A query is viewed as a short document.
Computer comunication B Information retrieval. Information retrieval: introduction 1 This topic addresses the question on how it is possible to find relevant.
1 CS 430 / INFO 430 Information Retrieval Lecture 3 Vector Methods 1.
Vector Space Model CS 652 Information Extraction and Integration.
Information retrieval Finding relevant data using irrelevant keys Example: database of photographic images sorted by number, date. DBMS: Well structured.
Retrieval Models II Vector Space, Probabilistic.  Allan, Ballesteros, Croft, and/or Turtle Properties of Inner Product The inner product is unbounded.
Modern Information Retrieval Chapter 2 Modeling. Can keywords be used to represent a document or a query? keywords as query and matching as query processing.
Information Retrieval
WMES3103 : INFORMATION RETRIEVAL INDEXING AND SEARCHING.
Recuperação de Informação. IR: representation, storage, organization of, and access to information items Emphasis is on the retrieval of information (not.
Query Operations: Automatic Global Analysis. Motivation Methods of local analysis extract information from local set of documents retrieved to expand.
Chapter 5: Information Retrieval and Web Search
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
Advanced Multimedia Text Classification Tamara Berg.
Modeling (Chap. 2) Modern Information Retrieval Spring 2000.
Web search basics (Recap) The Web Web crawler Indexer Search User Indexes Query Engine 1 Ad indexes.
1 Vector Space Model Rong Jin. 2 Basic Issues in A Retrieval Model How to represent text objects What similarity function should be used? How to refine.
Lecture 18 Text Data Mining MW 4:00PM-5:15PM Dr. Jianjun Hu CSCE822 Data Mining and Warehousing University of South.
CSE 6331 © Leonidas Fegaras Information Retrieval 1 Information Retrieval and Web Search Engines Leonidas Fegaras.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Latent Semantic Analysis Hongning Wang Recap: vector space model Represent both doc and query by concept vectors – Each concept defines one dimension.
Basic Machine Learning: Clustering CS 315 – Web Search and Data Mining 1.
Term Frequency. Term frequency Two factors: – A term that appears just once in a document is probably not as significant as a term that appears a number.
1 CS 430: Information Discovery Lecture 3 Inverted Files.
Chapter 6: Information Retrieval and Web Search
Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.
1 Computing Relevance, Similarity: The Vector Space Model.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Ranking in Information Retrieval Systems Prepared by: Mariam John CSE /23/2006.
CPSC 404 Laks V.S. Lakshmanan1 Computing Relevance, Similarity: The Vector Space Model Chapter 27, Part B Based on Larson and Hearst’s slides at UC-Berkeley.
Comparing and Ranking Documents Once our search engine has retrieved a set of documents, we may want to Rank them by relevance –Which are the best fit.
Web search basics (Recap) The Web Web crawler Indexer Search User Indexes Query Engine 1.
Web- and Multimedia-based Information Systems Lecture 2.
Vector Space Models.
CIS 530 Lecture 2 From frequency to meaning: vector space models of semantics.
Presented By Amarjit Datta
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
Overview on Web Mining and Recommendation June 13, CENG 770.
Plan for Today’s Lecture(s)
Text & Web Mining 9/22/2018.
Multimedia Information Retrieval
Representation of documents and queries
Text Categorization Rong Jin.
Text Categorization Assigning documents to a fixed set of categories
From frequency to meaning: vector space models of semantics
Introduction to Information Retrieval
Chapter 5: Information Retrieval and Web Search
Boolean and Vector Space Retrieval Models
Presentation transcript:

1 Data Mining: Text Mining

2 Information Retrieval Techniques Index Terms (Attribute) Selection: Stop list Word stem Index terms weighting methods Terms  Documents Frequency Matrices Information Retrieval Models: Boolean Model Vector Model Probabilistic Model

3 Boolean Model Consider that index terms are either present or absent in a document As a result, the index term weights are assumed to be all binaries A query is composed of index terms linked by three connectives: not, and, and or e.g.: car and repair, plane or airplane The Boolean model predicts that each document is either relevant or non-relevant based on the match of a document to the query

4 Keyword-Based Retrieval A document is represented by a string, which can be identified by a set of keywords Queries may use expressions of keywords E.g., car and repair shop, tea or coffee, DBMS but not Oracle Queries and retrieval should consider synonyms, e.g., repair and maintenance Major difficulties of the model Synonymy: A keyword T does not appear anywhere in the document, even though the document is closely related to T, e.g., data mining Polysemy: The same keyword may mean different things in different contexts, e.g., mining

5 Similarity-Based Retrieval in Text Data Finds similar documents based on a set of common keywords Answer should be based on the degree of relevance based on the nearness of the keywords, relative frequency of the keywords, etc. Basic techniques Stop list Set of words that are deemed “irrelevant”, even though they may appear frequently E.g., a, the, of, for, to, with, etc. Stop lists may vary when document set varies

6 Similarity-Based Retrieval in Text Data Word stem Several words are small syntactic variants of each other since they share a common word stem E.g., drug, drugs, drugged A term frequency table Each entry frequent_table(i, j) = # of occurrences of the word t i in document d i Usually, the ratio instead of the absolute number of occurrences is used Similarity metrics: measure the closeness of a document to a query (a set of keywords) Relative term occurrences Cosine similarity:{0,1}

7 Indexing Techniques indexing Maintains two hash- or B+-tree indexed tables: document_table (direct index): a set of document records term_table (inverted index): a set of term records, Answer query: Find all docs associated with one or a set of terms + easy to implement – do not handle well synonymy and polysemy, and posting lists could be too long (storage could be very large) Signature file Associate a signature with each document A signature is a representation of an ordered list of terms that describe the document Order is obtained by frequency analysis, stemming and stop lists

8 Text Classification Motivation Automatic classification for the large number of on-line text documents (Web pages, s, corporate intranets, etc.) Classification Process Data preprocessing Definition of training set and test sets Creation of the classification model using the selected classification algorithm Classification model validation Classification of new/unknown text documents Text document classification differs from the classification of relational data Document databases are not structured according to attribute- value pairs

9 Text Classification K-Nearest Neighbors Find the top k most similar documents in the training set to a given new document Apply majority voting among the top k documents

10 Text Categorization Pre-given categories and labeled document examples (Categories may form hierarchy) Classify new documents A standard classification (supervised learning ) problem Categorization System … Sports Business Education Science … Sports Business Education

Text mining Clustering Automatically group related documents based on their contents No predetermined training sets or taxonomies Generate a taxonomy at runtime Association Analysis Process Collect sets of keywords or terms that occur frequently together and then find the association or correlation relationships among them 11

12 Vector Space Model Represent a doc by a term vector Term: basic concept, e.g., word or phrase Each term defines one dimension N terms define a N-dimensional space Element of vector corresponds to term weight E.g., d = (x 1,…,x N ), x i is “importance” of term i New document is assigned to the most likely category based on vector similarity.

13 Vector Space Model Documents and user queries are represented as m-dimensional vectors, where m is the total number of index terms in the document collection. The degree of similarity of the document d with regard to the query q is calculated as the correlation between the vectors

14 VS Model: Illustration Java Microsoft Starbucks C2C2 Category 2 C1C1 Category 1 C3C3 Category 3 new doc

15 What VS Model Does Not Specify How to select terms to capture “basic concepts” Word stopping e.g. “a”, “the”, “always”, “along” Word stemming e.g. “computer”, “computing”, “computerize” => “compute” Latent semantic indexing How to assign weights Not all words are equally important: Some are more indicative than others e.g. “linear algebra” vs. “mathematics” How to measure the similarity

16 How to Assign Weights Two-fold heuristics based on frequency TF (Term frequency) More frequent within a document  more relevant to semantics IDF (Inverse document frequency) Less frequent among documents  more discriminative

17 TF Weighting Weighting: More frequent => more relevant to topic e.g. “query” vs. “commercial” Raw TF= f(t,d): how many times term t appears in doc d Normalization: Document length varies => relative frequency preferred e.g., Maximum frequency normalization

18 IDF Weighting Ideas: Less frequent among documents  more discriminative Formula: n — total number of docs k — # docs with term t appearing (the DF document frequency)

19 TF-IDF Weighting TF-IDF weighting : weight(t, d) = TF(t, d) * IDF(t) Freqent within doc  high tf  high weight Selective among docs  high idf  high weight Recall VS model Each selected term represents one dimension Each doc is represented by a feature vector Its t-term coordinate of document d is the TF-IDF weight Many complex and more effective weighting variants exist in practice

20 How to Measure Similarity? Given two document Similarity definition dot product normalized dot product (or cosine)

21 Illustrative Example text mining travel map search engine govern president congress IDF(faked) doc12(4.8) 1(4.5) 1(2.1) 1(5.4) doc21(2.4 ) 2 (5.6) 1(3.3) doc3 1 (2.2) 1(3.2) 1(4.3) newdoc1(2.4) 1(4.5) doc3 text mining search engine text travel text map travel government president congress doc1 doc2 …… To whom is newdoc more similar? Sim(newdoc,doc1)=4.8* *4.5 Sim(newdoc,doc2)=2.4*2.4 Sim(newdoc,doc3)=0

Privacy-Preserving Text Mining Cloud Computing Most exciting shift in computing Great efficiency and minimum cost Motivation to outsource Security & Privacy Issues Primary obstacle Sensitive data must be protected

Motivation Security & Privacy Data encryption How to search? Not too meaningful as you cannot apply efficient search Searchable Encryption Prebuilt encrypted search index Trapdoors

Problem Definition Entities Set of users Semi-trusted server (honest but curious) Data owner Documents stored encrypted Encrypted searchable index Authorized users have access to trapdoors Search is done over the searchable index via queries generated using the trapdoors

The Big Picture Encrypted files Secure Index 2. Query 3. Ordered matching items Data Owner Users 1. Trapdoors Cloud Server

Privacy Requirements Query Privacy  terms that are searched for Document Privacy  features in docs Search Pattern  equality and/or common features between different queries Access Pattern  collection of documents that are accessed