Text Based Information Retrieval - Text Mining PKB - Antonie.


Background
Humans have difficulty processing huge amounts of information
Computers do better at mathematics
  – Why not also use computers to process huge amounts of information?
A large body of text to search:
  – Terrorist attacks in 1995?
  – Relation between terrorist movements and bombs?
Relates to Information Retrieval, Data Mining and Text Mining

Terminology
Data Mining
  – A step in the knowledge discovery process, consisting of particular algorithms (methods), that produces a particular enumeration of patterns (models) over the data.
  – Data mining is a process of discovering advantageous patterns in data.
Knowledge Discovery Process
  – The process of using data mining methods (algorithms) to extract (identify) what is deemed knowledge according to the specifications of measures and thresholds, using a database along with any necessary preprocessing or transformations.

What kind of data in Data Mining?
Relational Databases
Data Warehouses
Transactional Databases
Advanced Database Systems
  – Object-Relational
  – Multimedia
  – Text
  – Heterogeneous and Distributed
  – WWW
Data Mining Applications:
  – Market analysis
  – Risk analysis and management
  – Fraud detection and detection of unusual patterns (outliers)
  – Text mining (newsgroups, email, documents) and Web mining
  – Stream data mining

Knowledge Discovery

Required effort for each KDD step
Arrows indicate the direction in which we hope the effort will shift.

What Is Text Mining?
“The objective of Text Mining is to exploit information contained in textual documents in various ways, including … discovery of patterns and trends in data, associations among entities, predictive rules, etc.” (Grobelnik et al., 2001)
“Another way to view text data mining is as a process of exploratory data analysis that leads to heretofore unknown information, or to answers for questions for which the answer is not currently known.” (Hearst, 1999)
“The non-trivial extraction of implicit, previously unknown, and potentially useful information from (large amounts of) textual data.”
An exploration and analysis of textual (natural-language) data by automatic and semi-automatic means to discover new knowledge.

Text Mining (2)
What is “previously unknown” information?
  – Strict definition: information that not even the writer knows.
  – Lenient definition: rediscover the information that the author encoded in the text, e.g., automatically extracting a product’s name from a web page.

Text Mining Methods
Information Retrieval
  – Indexing and retrieval of textual documents
Information Extraction
  – Extraction of partial knowledge in the text
Web Mining
  – Indexing and retrieval of textual documents and extraction of partial knowledge using the web
Clustering
  – Generating collections of similar text documents

Text Mining Applications
Spam filtering
News feeds: discover what is interesting
Medical: identify relationships and link information from different medical fields
Marketing: discover distinct groups of potential buyers and make suggestions for other products
Industry: identify groups of competitors' web pages
Job seeking: identify parameters in searching for jobs

Information Retrieval (1)
Given:
  – A source of textual documents
  – A well-defined, limited query (text based)
Find:
  – Sentences with relevant information
  – Extract the relevant information and ignore non-relevant information (important!)
  – Link related information and output it in a predetermined format
Examples: news stories, e-mails, web pages, photographs, music, statistical data, biomedical data, etc.
Information items can be in the form of text, image, video, audio, numbers, etc.

Information Retrieval (2)
Two basic information retrieval (IR) processes:
  – Browsing or navigation system: the user skims the document collection by jumping from one document to another via hypertext or hypermedia links until a relevant document is found
  – Classical IR system: question answering system
      Query: question in natural language
      Answer: directly extracted from the text of the document collection
Text Based Information Retrieval:
  – Information item (document): in text format (written/spoken) or has a textual description
  – Information need (query): usually in text format

Classical IR System Process

Intelligent Information Retrieval
Meaning of words
  – Synonyms: “buy” / “purchase”
  – Ambiguity: “bat” (baseball vs. mammal)
Order of words in the query
  – hot dog stand in the amusement park
  – hot amusement stand in the dog park
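The word-order point can be checked directly. The sketch below is only an illustration (the helper name `bag_of_words` is made up here, and splitting on whitespace is a simplifying assumption): the two example queries contain exactly the same words, so a retrieval system that ignores order cannot tell them apart.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text and count each whitespace-separated word."""
    return Counter(text.lower().split())

query_a = "hot dog stand in the amusement park"
query_b = "hot amusement stand in the dog park"

# Both queries hold the same multiset of words, so an order-blind
# comparison treats them as identical.
same_bag = bag_of_words(query_a) == bag_of_words(query_b)

# Comparing the word sequences themselves does see the difference.
same_sequence = query_a.lower().split() == query_b.lower().split()

print(same_bag, same_sequence)
```

An "intelligent" system would need some order-aware signal (phrases, word positions) on top of the plain word counts.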

Why Mine the Web?
Enormous wealth of textual information on the Web
  – Book/CD/Video stores (e.g., Amazon)
  – Restaurant information (e.g., Zagats)
  – Car prices (e.g., Carpoint)
Lots of data on user access patterns
  – Web logs contain the sequence of URLs accessed by users
Possible to retrieve “previously unknown” information
  – People who ski also frequently break their legs.
  – Restaurants that serve seafood in California are likely to be outside San Francisco.

Mining the Web
[Diagram labels: Web Spider, Documents source, Query, IR / IE System, Ranked Documents (1. Doc1, 2. Doc2, 3. Doc3)]
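A minimal sketch of the "query in, ranked documents out" step, under stated assumptions: the scoring rule (number of distinct query words found in a document) and the sample texts are invented for illustration, and the slides do not say which ranking function the IR / IE system uses.

```python
def tokenize(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def rank_documents(query, documents):
    """Return (doc_id, score) pairs, best match first.

    The score is the number of distinct query terms the document contains.
    Documents that match no query term are left out.
    """
    query_terms = set(tokenize(query))
    scored = []
    for doc_id, text in documents.items():
        score = len(query_terms & set(tokenize(text)))
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties are broken by doc_id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical document source, as a web spider might have collected it.
documents = {
    "Doc1": "restaurants that serve sea food in california",
    "Doc2": "car prices and dealer listings",
    "Doc3": "sea food recipes",
}
ranking = rank_documents("sea food california", documents)
print(ranking)
```

Real systems usually weight terms rather than just counting matches, but the overall shape (score every document against the query, then sort) stays the same.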

What is Web Clustering?
Given:
  – A source of textual documents
  – A similarity measure, e.g., how many words are common in these documents
Find:
  – Several clusters of documents that are relevant to each other
[Diagram labels: Documents source, Similarity measure, Clustering System, Doc]
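As a rough sketch of how the shared-word similarity measure could drive a clustering system: the greedy single-pass grouping below, the `threshold` parameter and the sample documents are all assumptions made for illustration, not the method used in the slides.

```python
def shared_words(doc_a, doc_b):
    """Similarity measure from the slide: how many words two documents have in common."""
    return len(set(doc_a.lower().split()) & set(doc_b.lower().split()))

def cluster(documents, threshold=2):
    """Greedy single pass over the documents.

    Each document joins the first cluster whose first member shares at
    least `threshold` words with it; otherwise it starts a new cluster.
    """
    clusters = []
    for doc in documents:
        for group in clusters:
            if shared_words(doc, group[0]) >= threshold:
                group.append(doc)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([doc])
    return clusters

docs = [
    "cheap car prices online",
    "sea food restaurant menu",
    "used car prices compared",
    "best sea food restaurant",
]
groups = cluster(docs)
for group in groups:
    print(group)
```

The result depends on document order and on the threshold, which is one reason more careful clustering algorithms are normally preferred.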

Text characteristics
Large textual database
  – Efficiency consideration: over 2,000,000,000 web pages; almost all publications are also in electronic form
High dimensionality (sparse input)
  – Consider each word/phrase as a dimension
Dependency
  – Relevant information is a complex conjunction of words/phrases, e.g., document categorization, pronoun disambiguation

Text characteristics
Ambiguity
  – Word ambiguity: pronouns (he, she …); “buy”, “purchase”
  – Semantic ambiguity: The king saw the rabbit with his glasses. (? meanings)
Noisy data
  – Example: spelling mistakes
Not well-structured text
  – Chat rooms: “r u available ?”, “Hey whazzzzzz up”
  – Speech

Text mining process
Text preprocessing
  – Syntactic/semantic text analysis
Feature generation
  – Bag of words
Feature selection
  – Simple counting
  – Statistics
Text/data mining
  – Classification: supervised learning
  – Clustering: unsupervised learning
Analyzing results

Syntactic / Semantic text analysis
Part-of-speech (POS) tagging
  – Find the corresponding POS for each word, e.g., John (noun) gave (verb) the (det) ball (noun)
Word sense disambiguation
  – Context based or proximity based
  – Very accurate
Parsing
  – Generates a parse tree (graph) for each sentence
  – Each sentence is a stand-alone graph

Feature Generation: Bag of words
Text document is represented by the words it contains (and their occurrences)
  – e.g., “Lord of the rings” → {“the”, “Lord”, “rings”, “of”}
  – Highly efficient
  – Makes learning far simpler and easier
  – Order of words is not that important for certain applications
Stemming: identifies a word by its root
  – Reduces dimensionality
  – e.g., flying, flew → fly
  – Use the Porter algorithm
Stop words: the most common words are unlikely to help text mining
  – e.g., “the”, “a”, “an”, “you” …
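The feature-generation steps can be sketched in a few lines. This is an illustration under assumptions: `TOY_STEMS` is a hypothetical lookup table standing in for a real stemmer (it is not the Porter algorithm), the stop-word list contains only the four words given on the slide, and punctuation is not handled.

```python
from collections import Counter

# Stop words taken from the slide's examples.
STOP_WORDS = {"the", "a", "an", "you"}

# Hypothetical lookup standing in for a real stemmer such as Porter's.
TOY_STEMS = {"flying": "fly", "flew": "fly"}

def features(text):
    """Bag of words: lowercase, drop stop words, map words to their stem, count."""
    words = text.lower().split()
    kept = [w for w in words if w not in STOP_WORDS]
    return Counter(TOY_STEMS.get(w, w) for w in kept)

bag = features("Lord of the rings")
stemmed = features("Flying birds flew")
print(bag)
print(stemmed)
```

Both stop-word removal and stemming shrink the vocabulary, which is the dimensionality reduction the slide refers to.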

Feature selection
Reduce dimensionality
  – Learners have difficulty addressing tasks with high dimensionality
Irrelevant features
  – Not all features help! e.g., the existence of a noun in a news article is unlikely to help classify it as “politics” or “sport”
Use weighting
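The slide does not say which weighting scheme is meant. One widely used option is TF-IDF, sketched below as an assumption: a term's count in a document is multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term, so words that appear everywhere get a weight of zero.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document.

    weight = (count of term in document) * log(N / document frequency)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in counts.items()}
        )
    return weights

# Invented sample documents.
docs = ["the election vote", "the football match", "the vote count"]
w = tf_idf(docs)
print(w[0])
```

Features whose weight stays near zero across the collection are natural candidates to drop.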

Text Mining: Classification definition
Given: a collection of labeled records (training set)
  – Each record contains a set of features (attributes) and the true class (label)
Find: a model for the class as a function of the values of the features
Goal: previously unseen records should be assigned a class as accurately as possible
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it
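A small sketch of dividing a labeled data set into training and test sets. The function name `split_records`, the 25% test fraction, the fixed seed and the sample records are all illustrative assumptions.

```python
import random

def split_records(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and split it into (training set, test set)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Each record is (text, label); the data here is made up.
records = [(f"document {i}", "sport" if i % 2 else "politics") for i in range(8)]
train, test = split_records(records)
print(len(train), len(test))
```

The model is built from `train` only; `test` is held back so that accuracy is measured on records the model has not seen.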

Text Mining: Clustering definition
Given: a set of documents and a similarity measure among documents
Find: clusters such that:
  – Documents in one cluster are more similar to one another
  – Documents in separate clusters are less similar to one another
Goal:
  – Finding a correct set of documents
Similarity measures:
  – Euclidean distance if attributes are continuous
  – Other problem-specific measures, e.g., how many words are common in these documents
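For continuous attributes, the Euclidean distance between two feature vectors is the square root of the summed squared differences. A minimal sketch (the vectors are invented, for example term weights of two documents):

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

d = euclidean_distance([1.0, 0.0, 2.0], [4.0, 4.0, 2.0])
print(d)
```

A smaller distance means more similar documents, so a distance is turned into a similarity by inverting its ordering.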

Supervised vs. Unsupervised Learning
Supervised learning (classification)
  – Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  – New data is classified based on the training set
Unsupervised learning (clustering)
  – The class labels of the training data are unknown
  – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Evaluation: What Is Good Classification?
Correct classification: the known label of a test sample is identical to the class result from the classification model
Accuracy ratio: the percentage of test set samples that are correctly classified by the model
A distance measure between classes can be used
  – e.g., classifying a “football” document as a “basketball” document is not as bad as classifying it as “crime”
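Both evaluation ideas can be sketched as follows. The class distances in `CLASS_DISTANCE` are hypothetical numbers chosen for illustration; the slides give no actual values.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of test samples whose predicted class equals the known label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical distances between classes: confusing two sports counts as a
# smaller error than confusing a sport with crime.
CLASS_DISTANCE = {
    frozenset(["football", "basketball"]): 0.2,
    frozenset(["football", "crime"]): 1.0,
    frozenset(["basketball", "crime"]): 1.0,
}

def mean_error_cost(true_labels, predicted_labels):
    """Average class distance over the test set; correct predictions cost 0."""
    total = 0.0
    for t, p in zip(true_labels, predicted_labels):
        if t != p:
            total += CLASS_DISTANCE[frozenset([t, p])]
    return total / len(true_labels)

truth = ["football", "football", "crime", "basketball"]
pred_a = ["basketball", "football", "crime", "basketball"]  # one mild mistake
pred_b = ["crime", "football", "crime", "basketball"]       # one severe mistake

print(accuracy(truth, pred_a), accuracy(truth, pred_b))
print(mean_error_cost(truth, pred_a), mean_error_cost(truth, pred_b))
```

The two predictions have the same accuracy, yet the distance-based cost separates them, which is the point of using a distance measure between classes.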

Evaluation: What Is Good Clustering?
Good clustering method: produces high-quality clusters with...
  – high intra-class similarity
  – low inter-class similarity
The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
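One way to put numbers on intra-class and inter-class similarity is to average a pairwise similarity within clusters and across clusters. The sketch below assumes the shared-word count as the similarity measure and uses invented documents; other measures would slot in the same way.

```python
from itertools import combinations

def shared_words(a, b):
    """Number of words two documents have in common."""
    return len(set(a.split()) & set(b.split()))

def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0

def intra_cluster_similarity(clusters):
    """Average similarity over all pairs of documents in the same cluster."""
    return mean(
        shared_words(a, b)
        for cluster in clusters
        for a, b in combinations(cluster, 2)
    )

def inter_cluster_similarity(clusters):
    """Average similarity over all pairs of documents in different clusters."""
    return mean(
        shared_words(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    )

clusters = [
    ["car prices online", "used car prices"],
    ["sea food menu", "sea food restaurant"],
]
intra = intra_cluster_similarity(clusters)
inter = inter_cluster_similarity(clusters)
print(intra, inter)
```

A good clustering should give a clearly higher intra value than inter value.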

Text Classification: An Example
[Diagram labels: Training Set (text, class), Learn Classifier, Model, Test Set (text)]

Decision Tree: A Text Example
[Diagram labels: text, class; Splitting Attributes: English (Yes / No), MarSt (Married / Single, Divorced), Income (< 80K / > 80K); leaves: YES, NO]
The splitting attribute at a node is determined by a specific attribute selection algorithm.

Classification by DT Induction
Decision tree
  – A flow-chart-like tree structure
  – Internal node denotes a test on an attribute
  – Branch represents an outcome of the test
  – Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases:
  – Tree construction
  – Tree pruning: identify and remove branches that reflect noise or outliers
Use of decision tree: classifying an unknown sample
  – Test the attribute values of the sample against the decision tree
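Classifying an unknown sample with an already built tree can be sketched as a walk from the root to a leaf. The tree below is hypothetical (its attributes `contains_goal` and `contains_election` are invented for a text example), and tree construction and pruning are not shown.

```python
# Hypothetical tree. Internal nodes are dicts naming the attribute to test
# and giving one branch per outcome; leaves are plain class labels.
TREE = {
    "attribute": "contains_goal",
    "branches": {
        True: "sport",
        False: {
            "attribute": "contains_election",
            "branches": {True: "politics", False: "other"},
        },
    },
}

def classify(sample, node):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while isinstance(node, dict):
        outcome = sample[node["attribute"]]
        node = node["branches"][outcome]
    return node

label = classify({"contains_goal": False, "contains_election": True}, TREE)
print(label)
```

Each loop iteration performs one attribute test, so the cost of classifying a sample is bounded by the depth of the tree.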

Summary
Text is tricky to process, but “ok” results are easily achieved
There exist several text mining systems
  – e.g., D2K - Data to Knowledge
Additional intelligence can be integrated with text mining
  – One may play with any phase of the text mining process

There are many other scientific and statistical text mining methods that have been developed but are not covered in this talk.
Also, it is important to study the theoretical foundations of data mining.
  – Data Mining: Concepts and Techniques / J. Han & M. Kamber
  – Machine Learning / T. Mitchell