
1 Text Classification 2004. 4. 26.

2 Contents. Introduction to textual data classification (categorization). Related technologies: feature selection; text classification.

3 Introduction (1/2). Text classification (categorization): sorting new items into existing structures (general topic hierarchies, email folders, a general file system); information filtering/push (mail filtering, i.e. spam vs. not spam; customized push services).

4 Introduction (2/2). Differences from data mining: analyzes both raw data and textual information at the same time; requires complicated feature selection techniques; may include linguistic, lexical, and contextual techniques.

5 Text classification example: e-mail. [Figure: 1. learning, where sample documents train the classifier; 2. classification, where the classifier assigns unknown documents to the categories A. fun, B. business, C. private.]

6 Process. Construction of vocabulary (optional). Extraction: keep incoming documents in the system. Parsing: stemming; vector model, bag-of-words. Feature selection (reduction). Learning: off-line process, builds model parameters. Categorization: on-line process. Re-learning: on-line process.
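A minimal sketch of the bag-of-words step in the parsing stage. The function name `bag_of_words` and the simple lowercase alphanumeric tokenizer are assumptions made for illustration; the slides do not specify a tokenizer.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return term counts for a document, ignoring word order."""
    # Assumed tokenizer: lowercase runs of ASCII letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


counts = bag_of_words("Spam, spam and eggs")
print(counts["spam"], counts["eggs"])
```

Word order is discarded on purpose: the vector model only needs how often each term occurs in a document.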

7 Extraction. Databases: documents (incoming, training, and categorized documents); dictionary (stopwords, postpositional particles, word endings, …).

8 Stemming. Table lookup: record every stem related to a search term in a table. N-gram stemmer. Affix removal: strip prefixes, suffixes, and word endings to extract the root; Porter algorithm, e.g. 'ies' → 'y', 'es' → 's', 's' → NULL.
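A minimal sketch of affix removal using only the three suffix rules listed on this slide. The helper name `strip_suffix` and the first-match policy are assumptions; the full Porter algorithm has further steps and conditions, so the results of this sketch are not always dictionary words.

```python
# Suffix rules as listed on the slide, tried in order (first match wins).
SUFFIX_RULES = [("ies", "y"), ("es", "s"), ("s", "")]


def strip_suffix(word):
    """Apply the first matching suffix rule; leave the word unchanged otherwise."""
    for suffix, replacement in SUFFIX_RULES:
        # The length guard keeps a word that consists only of the suffix.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + replacement
    return word


print(strip_suffix("queries"))  # query
print(strip_suffix("cats"))     # cat
```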

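9 Extraction. Index terms: mainly nouns (noun phrases); also adjectives (adjective phrases) and verbs (verb phrases). Indexing methods: statistical techniques, which use term occurrence statistics (term frequency); linguistic techniques, which use morphological and syntactic analysis.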

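10 Extraction. Characteristics of Korean documents: word spacing is used loosely; compound nouns are ambiguous to decompose, e.g. 대학생선교회 can split as 대학 + 생선 + 교회 (university + fish + church) or 대학생 + 선교회 (university student + mission society); predicates inflect and contract; syllable-level analysis is needed; stopword handling; spelling handling. Korean case frames suitable as index terms: noun, e.g. 정보 (information); noun + noun, e.g. 정보검색 (information retrieval); noun + particle + noun, e.g. 정보의 검색 (retrieval of information).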
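The compound noun ambiguity can be reproduced with a small dictionary-based segmenter. This is only a sketch: the function `segmentations` and the five-word lexicon are hypothetical, and a real system would need a full dictionary plus a way to rank the candidate splits.

```python
def segmentations(text, lexicon):
    """Return every way to split text into consecutive lexicon words."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


# Hypothetical lexicon containing only the words from the slide's example.
lexicon = {"대학", "생선", "교회", "대학생", "선교회"}
for split in segmentations("대학생선교회", lexicon):
    print(" + ".join(split))
```

Both splits from the slide are produced, which is why dictionary lookup alone cannot resolve the decomposition.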

11 Feature Selection (reduction): Curse of dimensionality. Removal of stopwords. Feature selection: Zipf's Law; DF (document frequency)-based; χ² statistics-based; mutual information; …
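A minimal sketch of stopword removal followed by DF-based selection. The stopword list, the function name `select_features`, and the thresholds `min_df` and `max_df_ratio` are illustrative assumptions, not values from the slides.

```python
from collections import Counter

# Hypothetical stopword list; a real system would load a much larger one.
STOPWORDS = {"the", "a", "of", "and", "to", "is"}


def select_features(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms whose document frequency lies between the two thresholds."""
    cleaned = [[t for t in doc if t not in STOPWORDS] for doc in docs]
    # Document frequency: the number of documents containing each term.
    df = Counter(t for doc in cleaned for t in set(doc))
    max_df = max_df_ratio * len(docs)
    return {t for t, n in df.items() if min_df <= n <= max_df}


docs = [
    ["the", "offer", "spam"],
    ["offer", "meeting"],
    ["the", "offer", "spam"],
    ["spam", "budget"],
]
print(sorted(select_features(docs)))  # ['offer', 'spam']
```

Terms that occur in only one document are dropped as too rare to generalize, and the upper bound drops terms that occur almost everywhere.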

12 Feature Selection (reduction) : Curse of dimensionality Stopwords

13 Feature Selection (reduction) : Curse of dimensionality Zipf’s Law
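Zipf's law says that a term's frequency is roughly inversely proportional to its frequency rank, so the product rank × frequency stays roughly constant across a corpus. The sketch below tabulates that product; the function name `rank_frequency` and the toy token list are assumptions for illustration.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, term, frequency, rank * frequency) rows, most frequent first."""
    ranked = Counter(tokens).most_common()
    return [
        (rank, term, freq, rank * freq)
        for rank, (term, freq) in enumerate(ranked, start=1)
    ]


# Toy data built so that rank * frequency is exactly constant.
tokens = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
for row in rank_frequency(tokens):
    print(row)
```

A table like this is one way to decide frequency cut-offs, since the highest-ranked terms tend to be function words and the long tail is mostly very rare terms.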

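14 Feature Selection (reduction): Curse of dimensionality. χ² statistics-based. [Table: 2×2 contingency table of term t against category C; columns C and ¬C; row t with cells p, r; row ¬t with cells q, s.]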
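A sketch of the χ² score for one term and one category, using the standard formula for a 2×2 contingency table. The cell layout is an assumption, since the slide's table did not survive extraction: p = documents in C containing t, r = documents outside C containing t, q = documents in C without t, s = documents outside C without t.

```python
def chi_square(p, r, q, s):
    """Chi-square statistic for the 2x2 table [[p, r], [q, s]].

    Rows: term present / term absent. Columns: in category / not in category.
    """
    n = p + r + q + s
    denom = (p + r) * (q + s) * (p + q) * (r + s)
    if denom == 0:
        return 0.0  # a row or column is empty, so the score is undefined
    return n * (p * s - r * q) ** 2 / denom


print(chi_square(10, 10, 10, 10))  # 0.0: term and category are independent
print(chi_square(20, 0, 0, 20))    # 40.0: term perfectly predicts the category
```

Terms are then ranked by this score and the highest-scoring ones are kept as features.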

15 Parsing: representing documents. Vector representation: term frequency; document frequency; weights.
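One common way to combine term frequency and document frequency into weights is tf × log(N / df). The slide does not say which weighting was used, so the formula and the function name `tfidf_vectors` below are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per tokenized document."""
    n_docs = len(docs)
    # Document frequency: the number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


docs = [["spam", "offer", "offer"], ["meeting", "offer"]]
for vec in tfidf_vectors(docs):
    print(vec)
```

A term that occurs in every document, such as "offer" here, gets weight 0 because it does not help distinguish documents.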

16 Classification Model: machine learning approach. [Figure: a learner builds a model (hypothesis) and its parameters from observed training documents; the classifier applies the model to unknown documents and outputs categorized documents.]

17 Classification Model: machine learning approach. Naïve Bayesian classification. Nearest neighbor classification.
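A minimal sketch of a naïve Bayesian text classifier, assuming the multinomial variant with add-one (Laplace) smoothing; the slides do not state which variant was used. The class name `NaiveBayes`, its methods, and the toy e-mail data are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    def __init__(self):
        self.class_docs = Counter()               # documents seen per class
        self.term_counts = defaultdict(Counter)   # term counts per class
        self.vocab = set()

    def learn(self, tokens, label):
        """Update the counts with one labeled document."""
        self.class_docs[label] += 1
        self.term_counts[label].update(tokens)
        self.vocab.update(tokens)

    def classify(self, tokens):
        """Return the class with the highest log posterior score."""
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                if t in self.vocab:  # skip terms never seen in training
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


nb = NaiveBayes()
nb.learn(["party", "weekend"], "fun")
nb.learn(["meeting", "budget"], "business")
print(nb.classify(["budget", "meeting"]))  # business
```

Because `learn` only updates counts, new labeled documents can be added at any time without retraining from scratch.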

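18 Classification example. Input text: "The researchers claimed that this is the first time in the world that so-called therapeutic cloning, in which an embryo is created from the patient's own genes and then used to culture healthy cells in the laboratory for injection back into the patient, has been demonstrated experimentally; the method has attracted the attention of the medical community because the body does not reject the injected cells." Extracted index terms: patient, self, patient-self, gene, use, embryo, use, laboratory, health, cell, culture, patient, injection, therapeutic cloning, experiment, demonstration, this time, world, first, world-first, researchers, claim, method, injection, cell, human body, rejection, reaction, medical community, interest. Category scores: veterinary medicine 0.191149; medicine, biotechnology, pharmacy 0.134847; dentistry 0.114641; biology, microbiology 0.109833; sex 0.099062; disease, symptoms, death 0.084554; ...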

19 Learning the text classifier. Before the system starts: define the categories (classes, topics); learn from representative documents for each defined category. During system operation: incremental learning for each classifier; define new categories by clustering uncategorized documents.

20 Machine Learning Methods. Similarity-based: k-nearest neighbor. Decision trees. Statistical learning: naïve Bayes; Bayes nets. Support vector machines. Artificial neural networks. ... Others: hierarchical classification; expectation-maximization technique; variants of boosting; active learning.
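As an illustration of the similarity-based family, here is a minimal k-nearest neighbor sketch. It assumes cosine similarity over raw term counts and a simple majority vote; the function names `cosine` and `knn_classify` and the toy training set are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query, training, k=3):
    """Label the query by majority vote among its k most similar documents."""
    query_vec = Counter(query)
    ranked = sorted(
        training,
        key=lambda item: cosine(query_vec, Counter(item[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


training = [
    (["meeting", "budget"], "business"),
    (["budget", "report"], "business"),
    (["party", "weekend"], "fun"),
]
print(knn_classify(["budget", "meeting"], training, k=1))  # business
```

No model is built ahead of time: all the work happens at classification time, which is the main trade-off against model-based methods.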

