Text Mining Text Classification Text Clustering 2004. 11.

Presentation transcript:

Text Mining Text Classification Text Clustering

Contents
- Introduction
- Related Technologies
  - Feature selection
  - Text classification
  - Text clustering

Introduction (1/2)
- Text classification (categorization)
  - Sorting new items into existing structures
    - general topic hierarchies
    - folders
    - general file system
  - Information filtering/push
    - Mail filtering (spam vs. not)
    - Customized push service

Categorization Document Categorization

Clustering (topic discovery) Document Clustering

Introduction (2/2)
- Difference from data mining
  - Analyzes both raw data and textual information at the same time
  - Requires complicated feature selection technologies
  - May include linguistic, lexical, and contextual techniques

Text classification example
[Figure: sample documents are used for (1) learning a classifier; the classifier then performs (2) classification of unknown documents into categories A. fun, B. business, C. private]

Process
- Construction of vocabulary (optional)
- Extraction
  - Keep incoming documents in the system
- Parsing
  - Stemming
  - Vector model, bag-of-words
- Feature selection (reduction)
- Learning
  - Off-line process: build model parameters
- Categorization
  - On-line process
- Re-learning
  - On-line process

Extraction
- Databases
  - Documents
    - Incoming/Training/Categorized documents
  - Dictionary
    - Stopwords
    - Particles, endings
    - …

Stemming
- Table lookup
  - Record in a table every stem related to the search terms
- N-gram stemmer
- Affix removal
  - Remove prefixes, suffixes, endings, etc. to extract the root
  - Porter algorithm
    - 'ies' → 'y'
    - 'es' → 's'
    - 's' → NULL
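The three suffix rules listed under affix removal can be sketched in a few lines. This is only an illustration of those rules exactly as the slide gives them, trying them in order and applying the first match; the full Porter algorithm has many more steps and conditions. The names `SUFFIX_RULES` and `strip_suffix` are hypothetical.

```python
# Suffix rules as listed on the slide: (suffix, replacement), tried in order.
# Assumption: only the first matching rule is applied to a word.
SUFFIX_RULES = [("ies", "y"), ("es", "s"), ("s", "")]


def strip_suffix(word: str) -> str:
    """Return `word` with the first matching suffix rule applied."""
    for suffix, replacement in SUFFIX_RULES:
        # Require at least one character to remain before the suffix.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + replacement
    return word


for w in ["ponies", "queries", "cats", "data"]:
    print(w, "->", strip_suffix(w))
```

Keeping the rules in an ordered list matters: the longest suffix has to be tested first, otherwise 's' → NULL would fire on every plural before the more specific rules get a chance.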

Extraction
- Index terms
  - Mainly nouns (noun phrases)
  - Also adjectives (adjective phrases) and verbs (verb phrases)
- Indexing methods
  - Statistical techniques
    - Use word occurrence statistics (term frequency)
  - Linguistic techniques
    - Morphological analysis, syntactic analysis

Extraction
- Characteristics of Korean documents
  - Word spacing is applied loosely
  - Compound noun decomposition problem
    - 대학생선교회 → 대학 + 생선 + 교회 or 대학생 + 선교회 (university + fish + church, or university student + mission association)
  - Conjugation and contraction of predicates
    - Syllable analysis required
  - Stopword handling
  - Spelling handling
- Korean case frames suitable as index terms
  - Noun: e.g., 정보 (information)
  - Noun + noun: e.g., 정보검색 (information retrieval)
  - Noun + particle + noun: e.g., 정보의 검색 (retrieval of information)
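The compound noun ambiguity can be reproduced with a dictionary-based splitter that enumerates every way of cutting a string into known words. This is a minimal sketch, not a real morphological analyzer: `TOY_DICTIONARY` is a hypothetical five-word dictionary chosen only so that both readings of 대학생선교회 are found.

```python
def segmentations(text, dictionary):
    """Return every way to split `text` into consecutive dictionary entries."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            # Extend this prefix with every segmentation of the remainder.
            for tail in segmentations(text[end:], dictionary):
                results.append([head] + tail)
    return results


# Hypothetical toy dictionary (an assumption for illustration only).
TOY_DICTIONARY = {"대학", "생선", "교회", "대학생", "선교회"}

for parts in segmentations("대학생선교회", TOY_DICTIONARY):
    print(" + ".join(parts))
```

Both splits are valid as far as the dictionary is concerned, which is why extra evidence such as term statistics or linguistic analysis is needed to choose between them.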

Feature Selection (reduction): Curse of dimensionality
- Removal of stopwords
- Feature selection
  - Zipf's Law
  - DF (document frequency)-based
  - χ² statistics-based
  - Mutual information
  - …
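Stopword removal and DF-based selection can be combined in one pass over tokenized documents. The sketch below is one plausible reading of "DF-based": keep a term only if it is not a stopword and its document frequency lies within a band. The function name and the thresholds `min_df` and `max_df_ratio` are assumptions, not values from the slides.

```python
from collections import Counter


def select_by_document_frequency(docs, stopwords, min_df=2, max_df_ratio=0.9):
    """Keep non-stopword terms whose document frequency is within range."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    max_df = max_df_ratio * len(docs)
    return sorted(
        term for term, n in df.items()
        if term not in stopwords and min_df <= n <= max_df
    )


docs = [
    ["the", "stock", "market", "fell"],
    ["the", "market", "rallied"],
    ["the", "team", "won", "the", "match"],
    ["the", "team", "lost"],
]
print(select_by_document_frequency(docs, stopwords={"the"}))
```

Dropping terms that occur in only one document removes most of the vocabulary in practice, which is how this kind of filter addresses the dimensionality problem named in the slide title.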

Feature Selection (reduction): Curse of dimensionality
- Stopwords

Feature Selection (reduction): Curse of dimensionality
- Zipf's Law

Feature Selection (reduction): Curse of dimensionality
- χ² statistics-based

           C     not C
  t        p     r
  not t    q     s
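The slide's formula did not survive, so the sketch below uses the standard χ² statistic for a 2×2 term/class contingency table. It assumes that p counts documents in class C containing term t, that s counts documents outside C without t, and that q and r are the two off-diagonal counts; the statistic is the same whichever off-diagonal cell is called q. The function name `chi_square` is hypothetical.

```python
def chi_square(p, q, r, s):
    """Chi-square statistic for a 2x2 term/class contingency table.

    p and s are the diagonal cells (t in C, not-t outside C);
    q and r are the off-diagonal cells.
    """
    n = p + q + r + s
    denominator = (p + q) * (r + s) * (p + r) * (q + s)
    if denominator == 0:
        return 0.0  # a row or column is empty: no evidence of association
    return n * (p * s - q * r) ** 2 / denominator


print(chi_square(10, 10, 10, 10))  # term independent of the class
print(chi_square(10, 0, 0, 10))    # term perfectly tied to the class
```

A term is then scored against each class and kept if its score is high, since a large value means the term's occurrence depends strongly on class membership.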

Parsing: representing documents
- Vector representation
  - term frequency
  - document frequency
  - weights
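One common way to combine term frequency and document frequency into weights is TF-IDF. The slides do not say which weighting scheme was used, so the formula below (raw term frequency times the natural log of N / df) is an assumption, and `tfidf_vectors` is a hypothetical name.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: weight} vector."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency: one count per document
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)    # term frequency within this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


docs = [["text", "mining", "text"], ["data", "mining"]]
for vector in tfidf_vectors(docs):
    print(vector)
```

A term that appears in every document, such as "mining" here, gets weight zero, which is the intended effect: it cannot help tell documents apart.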

Classification Model: machine learning approach
[Figure: observed training documents → Learner → model (hypothesis) parameters → Classifier; unknown documents → Classifier → categorized documents]

Classification Model: machine learning approach
- Naïve Bayesian classification
- Nearest neighbor classification
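Nearest neighbor classification can be sketched with bag-of-words vectors and cosine similarity: an unknown document takes the majority label of its k most similar training documents. The similarity measure, the value of k, and the toy training set are all assumptions for illustration; the slides specify none of them.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def knn_classify(query_tokens, training, k=3):
    """Return the majority label among the k most similar training documents."""
    query = Counter(query_tokens)
    ranked = sorted(training,
                    key=lambda item: cosine(query, Counter(item[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy training set.
training = [
    (["cheap", "pills", "offer"], "spam"),
    (["limited", "offer", "cheap"], "spam"),
    (["meeting", "agenda", "monday"], "not spam"),
    (["project", "meeting", "notes"], "not spam"),
]
print(knn_classify(["cheap", "offer", "today"], training, k=3))
```

There is no separate learning step here: the training documents themselves are the model, so all the cost falls on classification time.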

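Classification example
Input text: "The researchers claimed that this is the first time in the world that so-called therapeutic cloning has been demonstrated experimentally: an embryo is created from the patient's own genes and used to culture healthy cells in the laboratory, which are then injected back into the patient. The method has drawn the attention of the medical community because the body does not reject the injected cells."
Extracted index terms: patient, self, patient-self, gene, use, embryo, use, laboratory, health, cell, culture, patient, injection, therapeutic cloning, experiment, proof, this time, world, first, world-first, researchers, claim, method, injection, cell, human body, rejection, reaction, medical community, interest
Candidate categories:
- Veterinary medicine
- Medicine, biotechnology, pharmacy
- Dentistry
- Biology, microbiology
- Sex
- Disease, symptoms, death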

Learning the text classifier
- Before system starts
  - Define categories (class, topic)
  - Learn from representative documents for each defined category
- During system operation
  - Incremental learning for each classifier
  - Define new categories by clustering uncategorized documents

Machine Learning based approach (Basic architecture)

Machine Learning Methods
- Similarity-based
  - K-Nearest Neighbor
- Decision Trees
- Statistical Learning
  - Naïve Bayes
  - Bayes Nets
- Support Vector Machines
- Artificial Neural Networks
- Others
  - Hierarchical classification
  - Expectation-Maximization technique
  - Variants of Boosting
  - Active learning

Naïve Bayes Text Classifier
- Classification model of NB classifiers
- Class prior estimate
- Word probability estimate
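The slide's formulas are missing, so the following is a sketch of one standard formulation rather than the presenter's exact model. It assumes a multinomial Naïve Bayes classifier in which the class prior is the fraction of training documents in a class and the word probability uses add-one (Laplace) smoothing. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Estimate class priors and smoothed word probabilities."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = sum(class_counts.values())
    # Class prior estimate: share of training documents in each class.
    priors = {c: n / n_docs for c, n in class_counts.items()}
    # Word probability estimate with add-one smoothing (an assumption).
    word_probs = {}
    for c in class_counts:
        total = sum(word_counts[c].values())
        word_probs[c] = {w: (word_counts[c][w] + 1) / (total + len(vocab))
                         for w in vocab}
    return priors, word_probs


def classify_nb(tokens, priors, word_probs):
    """Pick the class with the highest log posterior; unseen words are skipped."""
    def log_score(c):
        return math.log(priors[c]) + sum(
            math.log(word_probs[c][w]) for w in tokens if w in word_probs[c])
    return max(priors, key=log_score)


training = [
    (["quarterly", "sales", "report"], "business"),
    (["sales", "meeting", "budget"], "business"),
    (["weekend", "party", "music"], "fun"),
]
priors, word_probs = train_nb(training)
print(classify_nb(["sales", "budget"], priors, word_probs))
```

Working in log space avoids underflow when many small word probabilities are multiplied for a long document.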

Uses of Clustering in IR
- Clustering as representation (abstraction)
  - Clustering is unsupervised learning of the underlying structure, classes
  - Clustering can be used to transform representations
    - documents are represented by class membership as well as individual terms
  - Can be viewed as dimensionality reduction
    - especially term clustering (e.g., word variant clusters)
- Clustering for browsing
  - Clustering has been proposed as a technique for organizing documents for browsing, interaction and visualization
    - constructing hypertext
    - clustering the results of searches
    - iterative clustering of the collection (e.g., Scatter/Gather)
    - clustering the web
  - Also has been used to group terms for browsing
    - automatic thesauri
    - topic summaries
- Clustering for topic discovery

Introduction
- Text clustering
  - Summarization of large text data
  - Discovering new categories

Abstraction of a set of documents
- Documents within a cluster → "relevant"

Information Retrieval (browsing)
- Clustering of query results (Scatter/Gather)

Clustering for Topic discovery (Evolution of topic hierarchy)
[Figure: topic hierarchy A ("Movie & Film", ...) evolves into topic hierarchy B, with topics "Movie & Film", "Plays", "Film Festivals", "Screen Plays", "Movie", "Genres", "Film Festival", "Horror", "Science Fiction". Labels: Reorganization, New topic discovery, Concept drift, Change of viewpoint]

Clustering Algorithm
- Two general methodologies
  - Hierarchical
    - pairs of items or clusters are successively linked to produce larger clusters (agglomerative)
    - or start with the whole set as a cluster and successively divide sets into smaller partitions (divisive)
  - Non-hierarchical: divide a set of N items into M clusters (top-down)
- Graph partitioning
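The agglomerative variant can be sketched directly from its description: start with one cluster per item and repeatedly link the closest pair. The sketch assumes 2-D points, Euclidean distance, single-link merging, and a target cluster count of at least 1 as the stopping rule; none of these choices come from the slides.

```python
def agglomerative(points, num_clusters):
    """Single-link agglomerative clustering of 2-D points.

    Assumes 1 <= num_clusters <= len(points).
    """
    clusters = [[p] for p in points]  # every item starts as its own cluster

    def distance(a, b):
        # Single link: distance between the closest pair of members.
        return min(((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
                   for x1, y1 in a for x2, y2 in b)

    while len(clusters) > num_clusters:
        # Find the closest pair of clusters and link them.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerative(points, 3))
```

Recording each merge instead of stopping at a fixed count would yield the full hierarchy, which is what distinguishes this family from the non-hierarchical methods that fix M clusters up front.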

Supervised Clustering (for Topic Discovery)
[Figure: clustering a document collection yields clusters A', B', C', D', E'; human knowledge relates the clusters to topics (categories)]