CSC 9010: Text Mining Applications: Document-Based Techniques (Dr. Paula Matuszek, ©2012 Paula Matuszek)


1. CSC 9010: Text Mining Applications: Document-Based Techniques
Dr. Paula Matuszek, Paula.Matuszek@villanova.edu, Paula.Matuszek@gmail.com, (610) 647-9789

2. Document Classification
Document classification: assign documents to pre-defined categories.
Examples:
- process email into work, personal, junk
- process documents from a newsgroup into "interesting", "not interesting", "spam and flames"
- process transcripts of bugged phone calls into "relevant" and "irrelevant"
Issues:
- real-time?
- how many categories per document? flat or hierarchical?
- categories defined automatically or by hand?

3. Document Classification
Usually:
- relatively few categories
- well defined; a person could do the task easily
- categories don't change quickly
Flat vs. hierarchy:
- simple classification is into mutually exclusive document collections
- richer classification is into a hierarchy with multiple inheritance: broader and narrower categories; documents can go more than one place
- merges into search interfaces such as PubMed

4. Classification: Automatic
Statistical approaches. A set of "training" documents defines the categories:
- the underlying representation of a document is derived from its text: the bag-of-words (BOW) features we discussed last time
- a classification model is trained using machine learning
- individual documents are classified by applying the model
Requires relatively little effort to create categories. Accuracy is heavily dependent on the training examples. Typically limited to flat, mutually exclusive categories.

5. Classification: Manual
Natural language / linguistic techniques. Categories are defined by people:
- the underlying representation of a document is typically a stream of tokens
- the category description contains an ontology of terms and relations, plus pattern-matching rules
- individual documents are classified by pattern matching
Defining categories can be very time-consuming and typically takes some experimentation to "get it right", but this approach can handle much more complex structures.

6. Automatic Classification Framework (based on http://u.cs.biu.ac.il/~koppel/TextCateg2010Course.htm)
Documents -> Preprocessing -> Feature Extraction -> Feature Filtering -> Applying classification algorithms -> Performance measurement

7. Preprocessing
Preprocessing: transform documents into a representation suitable for the classification task.
- remove HTML or other tags
- remove stop words
- perform word stemming (remove suffixes)
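The preprocessing steps above can be sketched in a few lines of Python. This is a minimal stdlib-only illustration: the stop-word list is a tiny made-up sample, and the suffix-stripping is a crude stand-in for a real stemmer such as Porter's.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it"}  # tiny illustrative list
SUFFIXES = ("ing", "ed", "es", "s")  # crude stand-in for a real stemmer (e.g. Porter)

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)          # strip HTML-style tags
    tokens = re.findall(r"[a-z]+", text.lower())  # lowercase and tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]
    stemmed = []
    for t in tokens:
        for suf in SUFFIXES:
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[:-len(suf)]
                break
        stemmed.append(t)
    return stemmed

print(preprocess("<p>The cats were chasing the mice</p>"))  # ['cat', 'were', 'chas', 'mice']
```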

8. Feature Extraction
The most crucial decision you'll make! Features must relate to the categories.
1. Topic: words, phrases, ...?
2. Author: stylistic features
3. Sentiment: adjectives, ...?
4. Spam: specialized vocabulary

9. Feature Filtering
Feature selection: remove non-informative terms from the documents.
=> improves classification effectiveness
=> reduces computational complexity
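One common filtering heuristic (an assumption here; the slide does not name a specific method) is to drop terms by document frequency: terms appearing in very few documents are likely noise, and terms appearing in nearly every document don't discriminate between classes. A sketch:

```python
from collections import Counter

def filter_features(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms whose document frequency is neither too rare nor near-universal."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    n = len(docs)
    return {t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio}

docs = [["cat", "sat", "mat"], ["cat", "ran"], ["dog", "ran"], ["cat", "dog"]]
print(sorted(filter_features(docs)))  # ['cat', 'dog', 'ran']
```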

10. Evaluation
We need to know how well our classification system is performing.
- Recall: % of documents in a class which are correctly classified as that class.
  r_i = (correctly classified as i) / (total which are actually i)
- Precision: % of documents classified into a class which are actually in that class.
  p_i = (correctly classified as i) / (total classified as i)
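The two formulas can be computed directly from paired true and predicted labels. A small sketch with made-up spam/ham labels:

```python
def precision_recall(true_labels, predicted_labels, cls):
    # tp = documents correctly classified as cls
    tp = sum(1 for t, p in zip(true_labels, predicted_labels) if t == cls and p == cls)
    predicted = sum(1 for p in predicted_labels if p == cls)  # total classified as cls
    actual = sum(1 for t in true_labels if t == cls)          # total actually cls
    return tp / predicted, tp / actual

true = ["spam", "spam", "ham", "spam", "ham"]
pred = ["spam", "ham", "ham", "spam", "spam"]
p, r = precision_recall(true, pred, "spam")
print(p, r)  # 2 of 3 predicted spam are spam; 2 of 3 actual spam were found
```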

11. [Figure: Venn diagram showing the corpus, the documents classified into the category, the documents actually in the category, and their overlap, the correctly categorized documents.]

12. Combined Effectiveness
Ideally, we want a measure that combines both precision and recall:
F1 = 2pr / (p + r)
- for perfect precision and recall, F1 = 1
- if either precision or recall drops, so does F1
- degenerate classifiers score badly: accept nothing and recall is 0, so F1 = 0; accept everything and precision falls toward the base rate, dragging F1 down
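A direct implementation of the formula, with a guard for the degenerate p = r = 0 case:

```python
def f1(p, r):
    # F1 = 2pr / (p + r); defined as 0 when both precision and recall are 0
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

print(f1(1.0, 1.0))  # 1.0: perfect precision and recall
print(f1(0.5, 0.5))  # 0.5
print(f1(1.0, 0.0))  # 0.0: a recall of zero drags F1 to zero
```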

13. Measuring Individual Features
If we have a large feature set, we may be interested in which features are actually useful.
- informative features: the features which give us the biggest separation between two classes
- we can probably omit the least informative features without impacting performance
- caution: this measures correlation, not causation...

14. Choice of Evaluation Measure
For many tasks, F1 gives the best overall measure:
- sorting news stories
- deciding genre or author
But it depends on your domain: a spam filter should favor precision (don't discard good mail), while flagging important email should favor recall (don't miss any).

15. Evaluation: Overfitting
Training a model means predicting the classification for our training set given the data in that set.
- degrees of freedom: with 10 cases and 10 features I can always predict perfectly
- the model may capture chance variations in the set
This leads to overfitting: the model is too closely matched to the exact data set it has been given. Overfitting is more likely with a large number of features or a small training set.

16. Evaluation: Training and Test Sets
To avoid (or at least detect) overfitting, we always use separate training and test sets.
- the model is trained on one set of examples
- evaluation measures are calculated on a different set
The sets should be comparable, and each should be representative of the overall corpus.
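A minimal shuffled split, assuming a flat list of examples. Real corpora often also need stratification so that each set is representative of the class distribution, which this sketch omits:

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=42):
    """Shuffle, then hold out a fraction of examples for evaluation."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train, test = train_test_split(list(range(100)))
print(len(train), len(test))  # 80 20
```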

17. Some Classification Methods
Common classification algorithms include:
- nearest-neighbor (KNN) methods
- decision trees
- naive Bayes classifiers
- linear classifiers (e.g., SVMs)

18. K-Nearest-Neighbor Algorithm (based on http://www.iro.umontreal.ca/~nie/IFT6255/Classification.ppt)
Principle: points (documents) that are close together in the feature space belong to the same class.

19. K-Nearest-Neighbor Algorithm
- Measure the similarity between the test document and each training document: a count of shared words, or tf*idf variants.
- Select the k nearest neighbors of the test document among the training examples: using more than one neighbor avoids the error of a single atypical training example; k is typically 3 or 5.
- Assign the test document to the class which contains most of the neighbors.
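The three steps can be sketched directly. This version uses the count of shared words as the similarity, as described above (tf*idf weighting would be a drop-in replacement); the example documents and labels are made up:

```python
from collections import Counter

def knn_classify(test_doc, training, k=3):
    """training: list of (token_set, label). Similarity = count of shared words."""
    # Step 1 & 2: rank training documents by similarity, keep the k nearest.
    nearest = sorted(training, key=lambda ex: len(test_doc & ex[0]), reverse=True)[:k]
    # Step 3: majority vote among the neighbors' classes.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

training = [
    ({"goal", "match", "score"}, "sports"),
    ({"match", "team", "win"}, "sports"),
    ({"election", "vote", "party"}, "politics"),
    ({"vote", "senate", "bill"}, "politics"),
]
print(knn_classify({"team", "score", "win"}, training, k=3))  # sports
```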

20. Analysis of the KNN Algorithm
Advantages:
- effective
- can handle large, sparse vectors
- "training time" is short
- can be incremental
Disadvantages:
- classification time is long
- it is difficult to find the optimal value of k

21. Decision Tree Algorithm
The decision tree associated with the documents:
- the root node contains all documents
- each internal node is a subset of documents, separated according to one attribute
- each arc is labeled with a predicate which can be applied to the attribute at the parent
- each leaf node is labeled with a class

22. Example Decision Tree
[Figure: a decision tree over a text document. The root tests the number of occurrences of "Villanova" (0, 1, >1); a lower node tests whether the document contains "Wildcats". The leaves are labeled Irrelevant, Sports article, Academic article, and General article.]

23. Decision Trees for Text
Each node tests a single variable, so decision trees are not useful for a very large, very sparse vector such as BOW. Features might include:
- other document characteristics, like diversity
- counts for a small subset of terms: the most frequent, by tf*idf, or from a domain-based ontology

24. Creating a Decision Tree
- At each node, choose the function which provides maximum separation.
- If all examples at a new node are one class, stop for that node.
- Recur on each mixed node.
- Stop when no choice improves separation, or when you reach a predefined level.
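"Maximum separation" is commonly measured as information gain, the drop in class entropy after a split; that metric is an assumption here, since the slide does not name one. A sketch for a boolean feature, with a made-up example echoing the "Villanova" tree:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, feature):
    """examples: list of (feature_dict, label); split on a boolean feature."""
    yes = [lbl for feats, lbl in examples if feats.get(feature)]
    no = [lbl for feats, lbl in examples if not feats.get(feature)]
    before = entropy([lbl for _, lbl in examples])
    # Weighted entropy of the two branches after the split.
    after = sum(len(part) / len(examples) * entropy(part) for part in (yes, no) if part)
    return before - after

examples = [
    ({"villanova": True}, "relevant"),
    ({"villanova": True}, "relevant"),
    ({"villanova": False}, "irrelevant"),
    ({"villanova": False}, "irrelevant"),
]
print(information_gain(examples, "villanova"))  # 1.0: a perfect split
```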

25. Analysis of the Decision Tree Algorithm
Advantages:
- easy to understand
- easy to train
- classification is fast
Disadvantages:
- training time is relatively expensive
- a document is only connected with one branch
- once a mistake is made at a higher level, the whole subtree below it is wrong
- not suited for very high dimensionality

26. Bayesian Methods
- Based on probability; used widely in probabilistic learning and classification.
- Uses the prior probability of each category, given no information about an item.
- Categorization produces a posterior probability distribution over the possible categories, given a description of the item.

27. Naive Bayes
Bayes' theorem says we can determine the probability of an event C given another event x, based on the overall probability of event C and the probability of event x given event C:
P(C|x) = P(x|C) * P(C) / P(x)
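A worked numeric instance of the formula, with hypothetical spam-filter numbers chosen for illustration:

```python
# Hypothetical numbers: 40% of mail is spam; the word "free" appears in
# 50% of spam and 10% of non-spam.
p_spam = 0.4
p_free_given_spam = 0.5
p_free_given_ham = 0.1

# Total probability of seeing "free" at all: P(x).
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

# Bayes' theorem: P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))  # 0.769
```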

28. Naive Bayes Algorithm
Estimate the probability of each class for a document:
- compute the posterior probability (Bayes' rule)
- assume word independence (the "naive" assumption)
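A compact multinomial naive Bayes sketch. The add-one smoothing and the tiny spam/ham corpus are assumptions for illustration; the sum of per-word log probabilities is exactly the independence assumption at work:

```python
from collections import Counter
from math import log

def train_nb(docs):
    """docs: list of (tokens, label). Returns log priors and smoothed log likelihoods."""
    class_counts = Counter(label for _, label in docs)
    word_counts = {c: Counter() for c in class_counts}
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    priors = {c: log(n / len(docs)) for c, n in class_counts.items()}
    likelihoods = {}
    for c, counts in word_counts.items():
        total = sum(counts.values()) + len(vocab)  # add-one smoothing denominator
        likelihoods[c] = {w: log((counts[w] + 1) / total) for w in vocab}
    return priors, likelihoods, vocab

def classify_nb(tokens, priors, likelihoods, vocab):
    # Posterior (up to a constant): prior * product of word likelihoods,
    # computed as a sum of logs; unseen words are skipped.
    scores = {c: priors[c] + sum(likelihoods[c][w] for w in tokens if w in vocab)
              for c in priors}
    return max(scores, key=scores.get)

docs = [(["free", "win", "prize"], "spam"), (["meeting", "notes"], "ham"),
        (["free", "prize"], "spam"), (["project", "meeting"], "ham")]
model = train_nb(docs)
print(classify_nb(["free", "prize", "meeting"], *model))  # spam
```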

29. Analysis of the Naive Bayes Algorithm
Advantages:
- works well on numeric and textual data
- easy implementation and computation
- has been effective in practice: the typical spam filter, for instance
Disadvantages:
- the conditional independence assumption is in fact naive: it is usually violated by real-world data
- performs poorly when features are highly correlated

30. Linear Regression
- Classic linear regression: predict the value of some variable based on a weighted sum of other variables.
- A very common statistical technique for prediction.
- E.g.: predict college GPA with a weighted sum of SAT verbal and quantitative scores, high school GPA, and a "high school quality" measure.
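The slide's GPA example written out as a weighted sum. The coefficients here are invented for illustration; in practice they would be fit by least squares on historical data:

```python
# Hypothetical weights; a real model would learn these from data.
def predict_gpa(sat_verbal, sat_quant, hs_gpa, hs_quality):
    return 0.3 + 0.001 * sat_verbal + 0.001 * sat_quant + 0.5 * hs_gpa + 0.2 * hs_quality

print(round(predict_gpa(600, 650, 3.6, 2.0), 2))  # 3.75
```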

31. Linear Scoring Methods
- A generalization of linear regression to much higher dimensionality.
- The goal is binary separation of instances into two classes.
- The best known is the SVM (support vector machine): the classifier is a separating hyperplane, and the support vectors are the training examples which define the plane.
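A linear classifier in this style scores a document by a weighted sum of its features and thresholds the score; the weight vector defines the separating hyperplane. The feature names and weights below are made up for illustration (an SVM would learn the weights by maximizing the margin):

```python
# Made-up weights: positive pushes toward "spam", negative toward "ham".
weights = {"free": 2.0, "prize": 1.5, "meeting": -2.0, "notes": -1.0}
bias = -0.5

def linear_score(token_counts):
    # Weighted sum of feature values plus a bias term.
    return bias + sum(weights.get(w, 0.0) * c for w, c in token_counts.items())

def classify(token_counts):
    # The sign of the score gives the side of the hyperplane.
    return "spam" if linear_score(token_counts) > 0 else "ham"

print(classify({"free": 1, "prize": 1}))    # spam
print(classify({"meeting": 1, "notes": 2}))  # ham
```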

32. [figure-only slide; no transcript text]

33. Support Vector Machines (based on http://www.iro.umontreal.ca/~nie/IFT6255/Classification.ppt)
Main idea of SVMs: find the linear separating hyperplane which maximizes the margin, i.e., the optimal separating hyperplane (OSH).

34. SVMs
Advantages:
- handle very large dimensionality
- empirically, have been shown to work well with text classification
Disadvantages:
- sensitive to noise, such as mislabeled training examples
- binary only (but multiple SVMs can be trained for multi-class problems)
- implementation is complex: the variety of implementation choices (similarity measure, kernel, etc.) can require extensive tuning

35. Summary
Document classification is a common task. Manual rules provide outstanding results and allow complex structures, but are very expensive to implement. Automated methods use labeled cases to train a model:
- decision trees and decision rules are easy to understand, but require a good feature set tuned to the domain
- nearest neighbor is simple to implement and quick to train, but slow to classify; it can handle incremental training cases
- naive Bayes is easy to implement and works well in some domains, but can have problems with highly correlated features
- SVMs are more complex to implement, but handle very large dimensionality well and have proven to be the best choice in many text domains

