Text Mining with Machine Learning Techniques
Ping-Tsun Chang
Intelligent Systems Laboratory, Computer Science and Information Engineering, National Taiwan University
Text Analysis
–Language Identification
–Classification
–Clustering
–Summarization
–Feature Selection
Text Mining
Text mining is about looking for patterns in natural-language text.
–Natural Language Processing
It may be defined as the process of analyzing text to extract information from it for particular purposes.
–Information Extraction
–Information Retrieval
Text Mining and Knowledge Management
A recent study indicated that 80% of a company's information is contained in text documents
–emails, memos, customer correspondence, and reports
The ability to distill this untapped source of information provides a substantial competitive advantage for a company in the era of the knowledge-based economy.
Text Mining Applications
Customer profile analysis
–mining incoming emails for customers' complaints and feedback
Patent analysis
–analyzing patent databases for major technology players, trends, and opportunities
Information dissemination
–organizing and summarizing trade news and reports for personalized information services
Company resource planning
–mining a company's reports and correspondence for activities, status, and problems reported
Text Categorization: Problem Definition
Text categorization is the problem of automatically assigning predefined categories to free-text documents
–Document classification
–Web page classification
–News classification
Information Retrieval
Full text is hard to process, but it is a complete representation of a document.
Logical view of documents
Models
–Boolean Model
–Vector Model
–Probabilistic Model
Can we think of text as patterns?
Evaluation
Contingency table of retrieval results:
                Relevant   Not relevant
Retrieved          a            b
Not retrieved      c            d
Precision = a / (a + b); Recall = a / (a + c)
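The precision and recall definitions above follow directly from the contingency table; a minimal sketch (the counts are invented for illustration):

```python
def precision_recall(a, b, c, d):
    """Precision = a/(a+b); recall = a/(a+c), straight from the 2x2 table."""
    return a / (a + b), a / (a + c)

# Hypothetical counts: 30 retrieved-and-relevant, 10 retrieved-but-irrelevant,
# 20 relevant-but-missed, 940 correctly ignored.
p, r = precision_recall(a=30, b=10, c=20, d=940)
print(p, r)  # 0.75 0.6
```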
Pattern Recognition
Sensing → Segmentation → Feature Extraction → Classification → Post-Processing → Decision
Pattern Classification
(Figure: two classes C1 and C2 in a two-dimensional feature space spanned by f1 and f2.)
Machine Learning
Using computers to help us induce patterns from large, complex amounts of data
Bayesian Learning
Instance-Based Learning
–k-Nearest Neighbors
Neural Networks
Support Vector Machine
Feature Selection (I)
Information Gain: IG(t) = H(C) − P(t)·H(C|t) − P(¬t)·H(C|¬t), the reduction in class entropy from observing whether term t occurs.
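Information gain for a term can be computed directly from that definition; a minimal sketch over an invented toy corpus (documents as term sets with class labels):

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def information_gain(docs, term):
    """IG(t) = H(C) - P(t) H(C|t) - P(~t) H(C|~t).
    docs: list of (set_of_terms, class_label) pairs."""
    with_t = [c for terms, c in docs if term in terms]
    without_t = [c for terms, c in docs if term not in terms]
    n = len(docs)
    return (entropy([c for _, c in docs])
            - len(with_t) / n * (entropy(with_t) if with_t else 0.0)
            - len(without_t) / n * (entropy(without_t) if without_t else 0.0))

# Toy corpus: "ball" perfectly separates the classes, "goal" does not.
corpus = [({"ball", "goal"}, "sports"), ({"ball", "team"}, "sports"),
          ({"vote", "law"}, "politics"), ({"vote", "goal"}, "politics")]
print(information_gain(corpus, "ball"))  # 1.0
print(information_gain(corpus, "goal"))  # 0.0
```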
Feature Selection (II)
Mutual Information: MI(t, c) = log [ P(t ∧ c) / (P(t) · P(c)) ]
Chi-Square: χ²(t, c) = N (AD − CB)² / [(A + C)(B + D)(A + B)(C + D)]
where A, B, C, D are the cells of the term/category contingency table and N = A + B + C + D.
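The chi-square statistic above is a one-liner once the four contingency counts are known; a sketch with invented counts:

```python
def chi_square(A, B, C, D):
    """Chi-square term/category association statistic.
    A = docs in c containing t, B = docs outside c containing t,
    C = docs in c without t,   D = docs outside c without t."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom

# Hypothetical counts: the term co-occurs strongly with the category.
print(chi_square(A=40, B=10, C=10, D=40))  # 36.0
```

A large value means the term and category are far from independent, making the term a good feature for that category.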
Weighting Scheme
TF·IDF: w(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) the number of documents containing term t.
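The TF·IDF weighting can be sketched in a few lines; the corpus below is invented for illustration:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each term as tf * log(N / df); returns one sparse vector per document."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: number of docs containing the term
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

docs = [["apple", "apple", "fruit"], ["fruit", "salad"], ["apple", "salad"]]
vecs = tfidf_vectors(docs)
# "apple" occurs twice in doc 0 and appears in 2 of 3 documents,
# so its weight there is 2 * log(3/2).
```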
Similarity Evaluation
Cosine-like schema: sim(d_i, d_j) = (d_i · d_j) / (|d_i| |d_j|)
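The cosine measure between two document vectors d_i and d_j is straightforward over the sparse term-weight dictionaries produced by a scheme like TF·IDF; a minimal sketch:

```python
import math

def cosine(d_i, d_j):
    """cos(d_i, d_j) = (d_i . d_j) / (|d_i| |d_j|) over sparse term-weight dicts."""
    dot = sum(w * d_j.get(t, 0.0) for t, w in d_i.items())
    norm_i = math.sqrt(sum(w * w for w in d_i.values()))
    norm_j = math.sqrt(sum(w * w for w in d_j.values()))
    return dot / (norm_i * norm_j)

print(cosine({"a": 1.0, "b": 1.0}, {"a": 1.0, "b": 1.0}))  # ~1.0 (identical)
print(cosine({"a": 1.0}, {"b": 1.0}))                      # 0.0 (no shared terms)
```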
Machine Learning Approaches: Bayesian Classifier
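The slide names the Bayesian classifier without detail; a minimal multinomial naive Bayes sketch with Laplace smoothing (one common instantiation, assumed here; the corpus is invented):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Count class and per-class word frequencies. docs: (word_list, class) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, c in docs:
        class_docs[c] += 1
        word_counts[c].update(words)
        vocab.update(words)
    return class_docs, word_counts, vocab, sum(class_docs.values())

def classify_nb(model, words):
    """argmax_c log P(c) + sum_w log P(w|c), with add-one smoothing."""
    class_docs, word_counts, vocab, n = model
    best, best_lp = None, -math.inf
    for c in class_docs:
        total = sum(word_counts[c].values())
        lp = math.log(class_docs[c] / n)
        for w in words:
            lp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

corpus = [(["ball", "goal", "team"], "sports"), (["goal", "match"], "sports"),
          (["vote", "law"], "politics"), (["vote", "senate"], "politics")]
model = train_nb(corpus)
print(classify_nb(model, ["goal", "team"]))  # sports
```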
Machine Learning Approaches: kNN Classifier
(Figure: an unlabeled document d is assigned the majority class of its k nearest neighbors.)
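The kNN rule from the figure can be sketched directly: rank labeled documents by cosine similarity to d and take a majority vote (training data invented for illustration):

```python
import math
from collections import Counter

def knn_classify(labeled, d, k=3):
    """Majority class among the k labeled docs most cosine-similar to d.
    labeled: list of (term-weight dict, class) pairs; d: term-weight dict."""
    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0
    neighbours = sorted(labeled, key=lambda dc: cosine(dc[0], d), reverse=True)[:k]
    return Counter(c for _, c in neighbours).most_common(1)[0][0]

train = [({"ball": 1.0, "goal": 1.0}, "sports"), ({"goal": 1.0}, "sports"),
         ({"vote": 1.0, "law": 1.0}, "politics"), ({"law": 1.0}, "politics")]
print(knn_classify(train, {"goal": 1.0, "ball": 1.0}, k=3))  # sports
```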
Machine Learning Approaches: Support Vector Machine
Basic hypotheses: the consistent hypotheses of the version space
Project the original training data in input space X to a higher-dimensional feature space F via a Mercer kernel K
Compare: SVM and Traditional Learners
SVMs search the hypothesis space directly!
(Figure: the posterior over hypotheses narrows from P(h) to P(h|D1) to P(h|D1 ∧ D2) as training data accumulate.)
SVM Learning in Feature Spaces
Example: a mapping from input space X to feature space F
Support Vector Machine (cont'd)
Nonlinear
–Example: the XOR problem
Natural language is nonlinear!
(Figure: XOR points in the (f1, f2) plane, which no single line can separate.)
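The XOR example can be made concrete: the four points are not linearly separable in (f1, f2), but a hand-rolled feature map adding the product f1·f2 (a toy stand-in for the implicit mapping a kernel provides) makes them separable by a single plane. The separating plane below is chosen by hand for illustration:

```python
# XOR points: class +1 where the inputs differ, -1 where they agree.
points = [((0, 0), -1), ((0, 1), +1), ((1, 0), +1), ((1, 1), -1)]

def phi(x):
    """Map (f1, f2) to (f1, f2, f1*f2) -- a toy stand-in for a kernel's
    implicit feature map into a higher-dimensional space."""
    f1, f2 = x
    return (f1, f2, f1 * f2)

def g(z):
    """A plane in the mapped space: f1 + f2 - 2*f1*f2 - 0.5 = 0 separates XOR."""
    f1, f2, f12 = z
    return f1 + f2 - 2 * f12 - 0.5

for x, y in points:
    assert (g(phi(x)) > 0) == (y > 0)
print("XOR is linearly separable after the mapping")
```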
Support Vector Machine (cont'd)
Consistent hypotheses
Maximum margin
Support vectors
Statistical Learning Theory
Generator: draws inputs x from P(x)
Supervisor: returns labels y according to P(y|x)
Learner: outputs an estimate y* = F(x)
Support Vector Machine: Linear Discriminant Functions
Linear discriminant space; separating hyperplane
(Figure: a hyperplane in the (y1, y2) plane with g(y) > 1 on one side and g(y) < −1 on the other.)
Learning of Support Vector Machine
Maximize the margin ⇔ minimize ||a||, since the margin equals 2 / ||a||
Optimal hyperplane
Version Space
The version space V ⊆ H is the subset of the hypothesis space H consistent with the training data.
Support Vector Machine Active Learning
Why Support Vector Machines?
–Text categorization involves large amounts of data
–Traditional learners can overfit
–Language is complex and nonlinear
Why Active Learning?
–Labeling instances is time-consuming and costly
–It reduces the need for labeled training instances
Active Learning: History
Text Classification [Rocchio, 71] [Dumais, 98]
Support Vector Machine [Vapnik, 82]
Text Classification with Support Vector Machines [Joachims, 98] [Dumais, 98]
Pool-Based Active Learning [Lewis & Gale, 94] [McCallum & Nigam, 98]
The Nature of Statistical Learning Theory [Vapnik, 95]
Automated Text Categorization Using Support Vector Machine [Kwok, 98]
Active Learning
Pool-based active learning has a pool U of unlabeled instances.
An active learner l has three components (f, q, X):
–f: a classifier x → {−1, 1}
–q: a querying function q(X) that, given the labeled training set X, decides which instance in U to query next
–X: the labeled training data
Active Learning (cont'd)
The main difference lies in the querying component q: how do we choose the next unlabeled instance to query?
(Figure: the version space resulting from each query.)
Active Learner
The optimal active learner l* always queries the instance whose corresponding hyperplane in parameter space W halves the area of the current version space.
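Exact version-space halving is intractable in practice; a common approximation (the "simple margin" heuristic, an assumption here rather than something stated on the slide) queries the pool instance closest to the current hyperplane. A minimal sketch, assuming a linear classifier with weights w and bias b:

```python
def query_next(w, b, pool):
    """Simple-margin heuristic: pick the unlabeled point closest to the
    hyperplane w.x + b = 0, approximating a halving of the version space."""
    def margin(x):
        return abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return min(pool, key=margin)

# Invented pool: the second point lies nearest the hyperplane x1 + x2 = 0.
pool = [(3.0, 3.0), (0.1, -0.2), (-4.0, 1.0)]
print(query_next(w=(1.0, 1.0), b=0.0, pool=pool))  # (0.1, -0.2)
```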
Experiments
Bayesian Classifier
Comparison of Learning Methods
(Figure: precision versus training-data size for SVM, kNN, NB, and NNet.)
Conclusions
Text mining extracts knowledge from text.
The Support Vector Machine is nearly the best statistics-based machine learning method.
Natural language understanding is still an open problem.