
Presentation transcript:

COMP527: Data Mining
Introduction to Text Mining
M. Sulaiman Khan (mskhan@liv.ac.uk)
Dept. of Computer Science, University of Liverpool
January 29, 2009

COMP527: Data Mining
Introduction to the Course
Introduction to Data Mining
Introduction to Text Mining
General Data Mining Issues
Data Warehousing
Classification: Challenges, Basics
Classification: Rules
Classification: Trees
Classification: Trees 2
Classification: Bayes
Classification: Neural Networks
Classification: SVM
Classification: Evaluation
Classification: Evaluation 2
Regression, Prediction
Input Preprocessing
Attribute Selection
Association Rule Mining
ARM: A Priori and Data Structures
ARM: Improvements
ARM: Advanced Techniques
Clustering: Challenges, Basics
Clustering: Improvements
Clustering: Advanced Algorithms
Hybrid Approaches
Graph Mining, Web Mining
Text Mining: Challenges, Basics
Text Mining: Text-as-Data
Text Mining: Text-as-Language
Revision for Exam

Today's Topics
Information Retrieval (IR)
  What is IR?
  Typical IR Process
Data Mining on Text
Text Mining
  What is Text Mining?
  Typical Text Mining Process
Applications

What is Information Retrieval?
IR is concerned with retrieving textual records, not data items as relational databases do, nor (specifically) with finding patterns as data mining does.
Examples:
SQL: Find rows where the text column LIKE "%information retrieval%"
DM: Find a model in order to classify document topics.
IR: Find documents with text that contains the word Information adjacent to Retrieval, Protocol or SRW, but not Google.

What is Information Retrieval?
IR focuses on finding the records that are most appropriate or relevant to the user's request.
The supremacy of Google can be attributed primarily to its PageRank algorithm, which scores web pages by their link structure and helps order results by relevance to the user's query.
A share price of $741.79 (on 2007-11-06, up from $471.80 on 2006-11-03) says this topic is important to understand!
IR also focuses on finding these records as quickly as possible.
Not only does Google find relevant pages, it finds them fast, for many thousands (maybe millions?) of concurrent users.

IR = Google??
So is "Google" the answer to the question of "Information Retrieval"?
No! Google has a good answer for how to search the web, but there are many more sources of data, and many more interesting questions.
Many other examples, including:
Library catalogues
XML searching
Distributed searching
Query languages

IR Processes: Discovery
Research topics exist for each box and arrow!
[Diagram; recoverable labels: Search Engine, User, Need, Query, Information]

IR Processes: Ingestion
Compare to the KDD process we looked at last time!
[Diagram; recoverable labels: Documents, Search Engine, Target Documents, Records, Preprocessed Documents, Information]

Document Indexing
What information do we need to store?
Query: Documents containing Information and Retrieval but not Protocol
We need to find which documents contain which words.
We could perform this query using a document/term matrix, with one row per document and one column per term.
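
The query above can be sketched against a Boolean document/term matrix. This is only an illustration: the three documents, the `contains` helper and the whitespace tokenisation are assumptions made for the example, not part of the original slides.

```python
# Hypothetical three-document collection; the texts are made up for illustration.
docs = {
    "d1": "information retrieval finds relevant documents",
    "d2": "the SRW protocol supports information retrieval",
    "d3": "data mining finds patterns in data",
}

# Vocabulary: every distinct lower-cased word in the collection, sorted.
vocab = sorted({w for text in docs.values() for w in text.lower().split()})

# Boolean document/term matrix: one row per document, one column per term.
matrix = {
    doc_id: [term in text.lower().split() for term in vocab]
    for doc_id, text in docs.items()
}


def contains(doc_id, term):
    """Look up one cell of the matrix; unknown terms count as absent."""
    return term in vocab and matrix[doc_id][vocab.index(term)]


# Query: Information AND Retrieval AND NOT Protocol
hits = [
    d for d in docs
    if contains(d, "information")
    and contains(d, "retrieval")
    and not contains(d, "protocol")
]
print(hits)  # ['d1']
```

A real index would store this matrix sparsely (for instance as an inverted index from term to document list), since most cells are empty.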

Document Indexing
It is also useful to know the frequency of the term in the document.
Each row in the matrix is a vector. This is useful for data mining functions, because the document has been reduced to a series of numbers rather than words.
Our new matrix holds a term count in each cell instead of a yes/no value.
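
A minimal sketch of such a term-frequency matrix follows. The two documents and the whitespace tokenisation are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical documents, made up for the example.
docs = {
    "d1": "information retrieval retrieves information",
    "d2": "data mining mines data from data sets",
}

# Count how often each term occurs in each document.
counts = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

# Shared vocabulary gives every document vector the same columns.
vocab = sorted({term for c in counts.values() for term in c})

# Each row is a numeric vector: the document reduced to term counts.
tf_matrix = {
    doc_id: [c[term] for term in vocab] for doc_id, c in counts.items()
}

print(vocab)
for doc_id, row in tf_matrix.items():
    print(doc_id, row)
```

Because every row has the same length and column order, the rows can be passed directly to classification or clustering code as feature vectors.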

Precision and Recall
Common evaluation measures for IR relevance ranking:
Precision: Number Relevant and Retrieved / Number Retrieved
Recall: Number Relevant and Retrieved / Number Relevant
F Score: recall * precision / ((recall + precision) / 2)
The ideal situation is that all, and only, the relevant documents are retrieved.
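
The three measures can be computed as in the sketch below. The function name `evaluate`, the document identifiers and the choice to return 0.0 when a denominator would be zero are assumptions for this example.

```python
def evaluate(retrieved, relevant):
    """Return (precision, recall, f_score) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant AND retrieved

    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0

    if precision + recall == 0:
        f_score = 0.0
    else:
        # Same formula as on the slide; algebraically 2PR / (P + R).
        f_score = recall * precision / ((recall + precision) / 2)
    return precision, recall, f_score


# Hypothetical result list and relevance judgements.
p, r, f = evaluate(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d5"],
)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.5 0.667 0.571
```

Here two of the four retrieved documents are relevant (precision 1/2), and two of the three relevant documents were found (recall 2/3).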

Topics of Interest
Format Processing: Extraction of text from different file formats
Indexing: Efficient extraction/storage of terms from text
Query Languages: Formulation of queries against those indexes
Protocols: Transporting queries from client to server
Relevance Ranking: Determining the relevance of a document to the user's query
Metasearch: Cross-searching multiple document sets with the same query
GridIR: Using the grid (or other massively parallel infrastructure) to perform IR processes
Multimedia IR: IR techniques on multimedia objects, compound digital objects...

Data Mining on Text
All of the Data Mining functions can be applied to textual data, using the term as the attribute and its frequency as the value.
Classification: Classify a text into subjects, genres, quality, reading age, ...
Clustering: Cluster together similar texts
Association Rule Mining: Find words that frequently appear together; find texts that are frequently cited together
The key challenge is the very large number of terms (e.g. the number of different words across all documents).
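
As a rough illustration of the association rule mining case, the sketch below counts word pairs that occur together in at least `min_support` documents. It is only the first, frequent-pair step of association rule mining, not a full algorithm such as A Priori; the documents and the threshold are assumptions for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical documents, made up for the example.
docs = [
    "data mining finds patterns",
    "text mining uses data",
    "information retrieval finds documents",
    "data mining on text",
]

pair_counts = Counter()
for text in docs:
    # Each distinct term counts once per document; sorting gives every
    # pair a canonical (alphabetical) order.
    terms = sorted(set(text.split()))
    pair_counts.update(combinations(terms, 2))

min_support = 2  # a pair must co-occur in at least this many documents
frequent_pairs = {
    pair: n for pair, n in pair_counts.items() if n >= min_support
}
print(frequent_pairs)
```

The number of candidate pairs grows with the square of the vocabulary size, which is one concrete form of the "very large number of terms" challenge.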

What's Text Mining then?
So, we've looked at Data Mining and IR... What's Text Mining then?
Good question. There is no canonical definition yet, but a definition similar to that of Data Mining could be applied:
The non-trivial extraction of previously unknown, interesting facts from an (invariably large) collection of texts.
So it sounds like a combination of IR and Data Mining, but actually the process involves many other steps too.
Before we look at what actually happens, let's look at why it's different...

Text Mining vs Data Mining
Data Mining finds a model for the data based on the attributes of the items. The only attributes of text are the words that make up the text. As we saw for IR, this creates a very sparse matrix.
Even if we create that matrix, what sort of patterns could we find?
Classification: We could classify texts into pre-defined classes (e.g. spam / not spam)
Association Rule Mining: Finding frequent sets of words (e.g. if 'computer' appears 3+ times, then 'data' appears at least once)
Clustering: Finding groups of similar documents (IR?)
None of these fit our definition of Text Mining.

Text Mining vs IR
Information Retrieval finds documents that match the user's query.
Even if we matched at a sentence level rather than a document level, all we would do is retrieve matching sentences; we're not discovering anything new.
The relevance ranking is important, but it still just matches information we already knew... it just orders it appropriately.
IR (typically) treats a document as a big bag of words... but doesn't care about the meaning of the words, just whether they exist in the document.
IR doesn't fit our definition of Text Mining either.

Text Mining Process
How would one find previously unknown facts from a bunch of text?
Need to understand the meaning of the text!
Part of speech of words
Subject/Verb/Object/Preposition/Indirect Object
Need to determine that two entities are the same entity.
Need to find correlations of the same entity.
Form logical chains: Milk contains Magnesium. Magnesium stimulates receptor activity. Inactive receptors cause Headaches --> Milk is good for Headaches. (fictional example!)

Part of Speech Tagging
First we need to tag the text with the part of speech for each word.
e.g.: Rob/noun teaches/verb the/article course/noun
How could we do this? By learning a model for the language!
This is essentially a data mining classification problem: should the system classify the word as a noun, a verb, an adjective, etc.?
There are lots of different tags, often based on a set called the Penn Treebank. (NN = Noun, VB = Verb, JJ = Adjective, RB = Adverb, etc.)
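
To make the "learn a model" idea concrete, here is a deliberately simple baseline: tag each word with the tag it carried most often in a hand-tagged training set. The training sentences, the function names and the default tag for unseen words are assumptions for this sketch; the tags NNP (proper noun), VBZ (verb, third person singular present) and DT (determiner) are Penn Treebank tags beyond the four listed above. Real taggers also use the surrounding context, which this sketch ignores.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged training set (hypothetical), Penn Treebank-style tags.
tagged_sentences = [
    [("Rob", "NNP"), ("teaches", "VBZ"), ("the", "DT"), ("course", "NN")],
    [("the", "DT"), ("course", "NN"), ("covers", "VBZ"), ("text", "NN")],
]


def train(sentences):
    """Learn, for every word, the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag(model, words, default="NN"):
    """Tag each word; unseen words fall back to the default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]


model = train(tagged_sentences)
result = tag(model, "Rob teaches the module".split())
print(result)
# [('Rob', 'NNP'), ('teaches', 'VBZ'), ('the', 'DT'), ('module', 'NN')]
```

The word "module" never appears in the training data, so it receives the default tag; guessing "noun" for unknown words is a common fallback but is an assumption here.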

Deep Parsing
Now we need to discover the phrases and parts of each clause.
Rob/noun teaches/verb the/article course/noun
(Subject: Rob Verb: teaches (Object: the+course))
The phrase sections are often expressed as trees:
( TOP ( S ( NP ( DT This ) ( JJ crazy ) ( NN sentence ) ) ( VP ( VBD amused ) ( NP ( NNP Rob ) ) ( PP ( IN for ) ( NP ( DT a ) ( JJ few ) ( NNS minutes ) ) ) ) ) )
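
A bracketed tree of this kind can be read back into a nested structure with a small recursive parser. The sketch below assumes a balanced, well-formed string and represents each node as a `(label, children)` tuple; both the representation and the function names are choices made for this example.

```python
def parse_tree(text):
    """Parse a balanced bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "each node must start with '('"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())      # nested phrase
            else:
                children.append(tokens[pos])  # a word of the sentence
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return node()


def leaves(tree):
    """Collect the words of the sentence, left to right."""
    _label, children = tree
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


tree = parse_tree(
    "(S (NP (DT This) (JJ crazy) (NN sentence)) (VP (VBD amused) (NP (NNP Rob))))"
)
print(tree[0])       # S
print(leaves(tree))  # ['This', 'crazy', 'sentence', 'amused', 'Rob']
```

Walking the resulting structure is how later steps can pick out, for example, the noun phrase that acts as the subject of the clause.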

Entity Recognition
Once we've parsed the text for linguistic structure, we need to identify the real world objects referred to.
Rob teaches the course
Rob: (Sanderson, Robert D. b.1976-07-20 Rangiora/New Zealand)
the course: Comp527 2006/2007, University of Liverpool, UK
This is typically done via lookups in very large thesauri or 'ontologies', specific to the domain being processed (e.g. medical, historical, current events, etc.)
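
The lookup step can be sketched with a dictionary standing in for the ontology. Everything in `ONTOLOGY` (the identifiers and the two surface forms) is hypothetical, and real systems must also resolve ambiguity, which a plain lookup cannot do.

```python
# Hypothetical mini-ontology: surface form -> (identifier, description).
ONTOLOGY = {
    "rob": ("person:sanderson-robert", "Sanderson, Robert D."),
    "the course": ("course:comp527", "Comp527, University of Liverpool"),
}


def recognise(text, ontology=ONTOLOGY):
    """Return (surface form, identifier) pairs found in the text."""
    words = text.lower().split()
    max_len = max(len(key.split()) for key in ontology)
    found = []
    i = 0
    while i < len(words):
        # Prefer the longest surface form that starts at position i.
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in ontology:
                found.append((phrase, ontology[phrase][0]))
                i += n
                break
        else:
            i += 1  # no entity starts here; move on one word
    return found


entities = recognise("Rob teaches the course")
print(entities)
# [('rob', 'person:sanderson-robert'), ('the course', 'course:comp527')]
```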

Fact Extraction
There will normally be a lot more text to parse:
Rob Sanderson, a lecturer at the University of Liverpool, teaches a masters level course on data mining (Comp527)
Rob is a lecturer
Rob is at the University of Liverpool
Rob teaches a course
The course is called Comp527
The course is masters level
The course is about data mining

Correlation
Rob Sanderson, a lecturer at the University of Liverpool, teaches a masters level course on data mining (Comp527)
Data mining is about finding models to describe data sets.
--> The University of Liverpool has a course about finding models to describe data sets.
(Not very interesting or novel in this case, but that's the process)
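
The correlation step can be sketched as a join over extracted facts stored as (subject, relation, object) triples. The triple representation, the relation names and the `correlate` function are assumptions for this example; the facts themselves are the ones from the slide.

```python
# Extracted facts as (subject, relation, object) triples.
facts = [
    ("University of Liverpool", "has a course about", "data mining"),
    ("data mining", "is about", "finding models to describe data sets"),
]


def correlate(facts):
    """Substitute an object with what that object 'is about'."""
    derived = []
    for subj, rel, obj in facts:
        for subj2, rel2, obj2 in facts:
            # Join where one fact's object is another fact's subject.
            if obj == subj2 and rel2 == "is about":
                derived.append((subj, rel, obj2))
    return derived


for subj, rel, obj in correlate(facts):
    print(f"{subj} {rel} {obj}.")
# University of Liverpool has a course about finding models to describe data sets.
```

The join only works because both facts name the shared entity identically, which is why entity recognition has to come first.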

Applications
Search engines of all types are based on IR.
But where would you use text mining?
Most research so far is on medical data sets ... because this is the most profitable!
If you could correlate facts to find a cure for cancer, you would be very VERY rich!
So ... lots of people are trying to do just that for various values of 'cancer'.
It is also because of the wide availability of ontologies and datasets, in particular abstracts for medical journal articles (PubMed/Medline).

Applications
More application areas:
News feeds
Terrorism detection
Social sciences analysis
Historical text analysis
Corpus linguistics
'Net Nanny' filters
etc.