Slide 1: CES 514 Lecture 11, April 28, 2010. Neural networks; case studies of naïve Bayes and decision trees; text classification.
Slide 2: Artificial Neural Networks (ANN)
Output Y is 1 if at least two of the three inputs are equal to 1.
Slide 3: Neural Network with One Neuron
Rosenblatt's perceptron (1958), also known as a threshold logic unit.
Slide 4: Artificial Neural Networks (ANN)
- The model is an assembly of interconnected nodes and weighted links.
- The output node sums its input values according to the weights of its links.
- The result is compared against a threshold t (the perceptron model).
Slide 5: Training a Single Neuron
Rosenblatt's algorithm.
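Rosenblatt's update rule can be sketched in a few lines; the learning rate, epoch cap, and the toy data set below are illustrative assumptions, not taken from the slides:

```python
# Sketch of Rosenblatt's perceptron training rule: predict with a threshold
# unit and adjust the weights only when a training example is misclassified.

def train_perceptron(data, labels, epochs=100, lr=1.0):
    """data: list of input tuples; labels: +1 / -1. Returns (weights, bias)."""
    w = [0.0] * len(data[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(data, labels):
            s = sum(wi * xi for wi, xi in zip(w, x)) + b
            pred = 1 if s > 0 else -1
            if pred != y:  # Rosenblatt's rule: update only on mistakes
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                errors += 1
        if errors == 0:  # converged: a separating plane has been found
            break
    return w, b

# The earlier slide's concept: output is 1 iff at least two of three inputs are 1.
data = [(x1, x2, x3) for x1 in (0, 1) for x2 in (0, 1) for x3 in (0, 1)]
labels = [1 if sum(x) >= 2 else -1 for x in data]
w, b = train_perceptron(data, labels)
```

Because "at least two of three inputs are 1" is linearly separable, the loop is guaranteed to terminate with zero training errors.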
Slide 6: Linearly Separable Instances
Rosenblatt's algorithm converges and finds a separating plane when the data set is linearly separable. The simplest example of a concept that is not linearly separable is exclusive-OR (the parity function).
Slide 7: Classifying Parity with More Neurons
A neural network with a sufficient number of neurons can classify any finite, consistent data set correctly.
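To make this concrete for XOR, here is a minimal two-layer threshold network that classifies parity on two inputs; the weights and thresholds are hand-chosen for illustration, not learned:

```python
# A two-layer threshold network computing XOR, showing that one hidden
# layer of extra neurons suffices for a non-linearly-separable concept.

def step(s):
    return 1 if s > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)    # hidden unit 1: x1 OR x2
    h2 = step(x1 + x2 - 1.5)    # hidden unit 2: x1 AND x2
    return step(h1 - h2 - 0.5)  # output: h1 AND NOT h2, i.e. exactly one input is 1

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))
```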
Slide 8: General Structure of an ANN
Training an ANN means learning the weights of its neurons.
Slide 9: Algorithm for Learning an ANN
- Initialize the weights (w0, w1, ..., wk).
- Adjust the weights so that the output of the ANN is consistent with the class labels of the training examples.
  - Objective function: the total error between the ANN outputs and the training labels (e.g., the sum of squared errors).
  - Find the weights wi that minimize this objective function, e.g., with the backpropagation algorithm.
- Details: Nilsson's ML notes, Chapter 4 (PDF).
Slide 10: WEKA
Slide 11: WEKA Implementation
WEKA has implementations of all the major data mining algorithms, including:
- decision trees (CART, C4.5, etc.)
- the naïve Bayes algorithm and its variants
- nearest neighbor classifiers
- linear classifiers
- support vector machines
- clustering algorithms
- boosting algorithms, etc.
Slide 12: WEKA Tutorials
http://sentimentmining.net/weka/ contains videos showing how to use WEKA for various data mining applications.
Slide 13: A Case Study in Classification
CES 514 course project from 2007 (Olson). Consider a board game (e.g., checkers or backgammon). Given a position, we want to determine how strong one player's position (say, black's) is. Can we train a classifier to learn this from a training set? As usual, the problems are:
- choice of attributes
- creating labeled samples
Slide 14: Peg Solitaire, a One-Player Version of Checkers
To win, the player must remove all but one peg. A position from which a win can be achieved is called a solvable position.
Slide 15: Square Board and a Solvable Position
Winning move sequence: (3, 4, 5), (5, 13, 21), (25, 26, 27), (27, 28, 29), (21, 29, 37), (37, 45, 53), (83, 62, 61), (61, 53, 45)
Slide 16: How to Choose Attributes?
1. Number of pegs (pegs).
2. Number of first moves available to any peg on the board (first_moves).
3. Number of rows having 4 pegs separated by single vacant positions (ideal_row).
4. Number of columns having 4 pegs separated by single vacant positions (ideal_col).
5. Number of two-move opening sequences available to any peg on the board (first_two).
6. Percentage of the total number of pegs in quadrant one (quad_one).
7. Percentage of the total number of pegs in quadrant two (quad_two).
Slide 17: List of Attributes (continued)
8. Percentage of the total number of pegs in quadrant three (quad_three).
9. Percentage of the total number of pegs in quadrant four (quad_four).
10. Number of pegs isolated by one vacant position (island_one).
11. Number of pegs isolated by two vacant positions (island_two).
12. Number of rows having 3 pegs separated by single vacant positions (ideal_row_three).
13. Number of columns having 3 pegs separated by single vacant positions (ideal_col_three).
Slide 24: Summary of Performance
Slide 25: Text Classification
Text classification has many applications:
- spam email detection
- automated tagging of streams of news articles (e.g., Google News)
- online advertising: what is this Web page about?
Data representation:
- "Bag of words" is most commonly used: either counts or binary.
- Can also use "phrases" (e.g., bigrams) for commonly occurring combinations of words.
Classification methods:
- Naïve Bayes is widely used (e.g., for spam email): fast and reasonably accurate.
- Support vector machines (SVMs): typically the most accurate method in research studies, but more complex computationally.
- Logistic regression (regularized): not as widely used, but can be competitive with SVMs (e.g., Zhang and Oles, 2002).
Slide 26: Types of Labels/Categories/Classes (Ch. 13)
Assigning labels to documents or web pages:
- Labels are most often topics, such as Yahoo categories: "finance", "sports", "news>world>asia>business".
- Labels may be genres: "editorials", "movie-reviews", "news".
- Labels may be an opinion on a person or product: "like", "hate", "neutral".
- Labels may be domain-specific: "interesting-to-me" vs. "not-interesting-to-me"; "contains adult language" vs. "doesn't"; language identification (English, French, Chinese, ...).
Slide 27: Common Data Sets Used for Evaluation
- Reuters: 10,700 labeled documents; 10% of documents have multiple class labels.
- Yahoo! Science hierarchy: 95 disjoint classes with 13,598 pages.
- 20 Newsgroups: 18,800 labeled USENET postings; 20 leaf classes, 5 root-level classes.
- WebKB: 8,300 documents in 7 categories such as "faculty", "course", "student".
Slide 28: Practical Issues
Tokenization:
- Convert each document to word counts (the "bag of words").
- A word token is any nonempty sequence of characters.
- For HTML (etc.), formatting must be removed first.
Canonical forms, stopwords, stemming:
- Remove capitalization.
- Stopwords: remove very frequent words (a, the, and, ...); a standard list can be used.
- Very rare words can also be removed, e.g., words that occur in k or fewer documents (e.g., k = 5).
Data representation:
- e.g., a sparse 3-column format (document, term, count) for bag of words.
- Inverted indices etc. can also be used.
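The pipeline above (lowercase, tokenize, drop stopwords, emit the sparse 3-column rows) can be sketched as follows; the tiny stopword list is illustrative, not a standard one:

```python
# Minimal bag-of-words pipeline: canonical form (lowercasing), tokenization,
# stopword removal, then the sparse (doc_id, term, count) representation.
import re
from collections import Counter

STOPWORDS = {"a", "the", "and", "of", "to"}  # illustrative, not a standard list

def bag_of_words(doc_id, text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # strip case and punctuation
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [(doc_id, term, n) for term, n in sorted(counts.items())]

rows = bag_of_words(1, "The cat and the dog: the cat sat.")
print(rows)  # [(1, 'cat', 2), (1, 'dog', 1), (1, 'sat', 1)]
```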
Slide 29: Challenges of Text Classification
Machine-learning classification techniques were developed for structured data; text is less systematic:
- lots of features and lots of noise
- no fixed number of columns
- no categorical attribute values
- data scarcity
- a larger number of class labels
- hierarchical relationships between classes
Slide 30: Techniques
Nearest neighbor classifier:
- A lazy learner: remembers all training instances.
- Decision on a test document: the distribution of labels over the training documents most similar to it.
- Assigns large weights to rare terms.
- Feature selection removes terms in the training documents that are statistically uncorrelated with the class labels.
Bayesian classifier:
- Fit a generative term distribution Pr(d|c) to each class c of documents.
- Testing: the distribution most likely to have generated a test document is used to label it.
Slide 31: Stochastic Language Models (Sec. 13.2.1)
Model the probability of generating strings (each word in turn) in a language (commonly, all strings over an alphabet Σ). E.g., a unigram model M:
  P(the) = 0.2, P(a) = 0.1, P(man) = 0.01, P(woman) = 0.01, P(said) = 0.03, P(likes) = 0.02, ...
For the string s = "the man likes the woman", multiply the per-word probabilities:
  P(s | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008
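The slide's computation can be reproduced directly; the model M below is the unigram table listed above:

```python
# Unigram language model: the probability of a string under model M is the
# product of the per-word probabilities.
M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}

def p_string(words, model):
    p = 1.0
    for w in words:
        p *= model.get(w, 0.0)  # words outside the model get probability 0 here
    return p

s = "the man likes the woman".split()
print(p_string(s, M))  # ~8e-08, matching the slide's 0.00000008
```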
Slide 32: Stochastic Language Models (Sec. 13.2.1)
Model the probability of generating any string. Two unigram models:
  Model M1: P(the) = 0.2, P(class) = 0.01, P(sayst) = 0.0001, P(pleaseth) = 0.0001, P(yon) = 0.0001, P(maiden) = 0.0005, P(woman) = 0.01
  Model M2: P(the) = 0.2, P(class) = 0.0001, P(sayst) = 0.03, P(pleaseth) = 0.02, P(yon) = 0.1, P(maiden) = 0.01, P(woman) = 0.0001
For the string s = "the class pleaseth yon maiden":
  under M1: 0.2 × 0.01 × 0.0001 × 0.0001 × 0.0005
  under M2: 0.2 × 0.0001 × 0.02 × 0.1 × 0.01
so P(s | M2) > P(s | M1).
Slide 33: Using Multinomial Naive Bayes Classifiers to Classify Text: Basic Method (Sec. 13.2)
- Attributes are text positions; values are words. This gives too many possibilities, so:
- assume that classification is independent of the positions of the words, and
- use the same parameters for each position.
- The result is the bag-of-words model (over tokens).
Slide 34: Naive Bayes: Learning (Sec. 13.2)
- From the training corpus, extract the Vocabulary.
- Calculate the required P(cj) and P(xk | cj) terms:
  For each cj in C do
    docsj ← the subset of documents for which the target class is cj
    P(cj) ← |docsj| / (total number of documents)
    Textj ← a single document formed by concatenating all documents in docsj
    For each word xk in Vocabulary:
      nk ← number of occurrences of xk in Textj
      P(xk | cj) ← (nk + 1) / (n + |Vocabulary|), where n is the total number of word occurrences in Textj (add-one smoothing)
Slide 35: Naive Bayes: Classifying (Sec. 13.2)
- positions ← all word positions in the current document that contain tokens found in Vocabulary.
- Return cNB, where
  cNB = argmax over cj in C of [ P(cj) × product over i in positions of P(xi | cj) ]
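The learning and classification steps on the last two slides can be sketched together; the add-one smoothing and the toy four-document corpus are assumptions for illustration:

```python
# Multinomial Naive Bayes: learn P(c) and smoothed P(x_k | c_j) from a corpus,
# then classify a document by the highest (log-space) score.
import math
from collections import Counter

def train_nb(docs):
    """docs: list of (token_list, class). Returns (logprior, loglik, vocab)."""
    classes = Counter(c for _, c in docs)
    vocab = {t for toks, _ in docs for t in toks}
    logprior = {c: math.log(n / len(docs)) for c, n in classes.items()}
    loglik = {}
    for c in classes:
        text_c = Counter()                       # the class's "mega-document"
        for toks, cls in docs:
            if cls == c:
                text_c.update(toks)
        total = sum(text_c.values())
        loglik[c] = {t: math.log((text_c[t] + 1) / (total + len(vocab)))
                     for t in vocab}             # add-one (Laplace) smoothing
    return logprior, loglik, vocab

def classify_nb(tokens, logprior, loglik, vocab):
    scores = {c: logprior[c] + sum(loglik[c][t] for t in tokens if t in vocab)
              for c in logprior}
    return max(scores, key=scores.get)

docs = [("chinese beijing chinese".split(), "c"),
        ("chinese chinese shanghai".split(), "c"),
        ("chinese macao".split(), "c"),
        ("tokyo japan chinese".split(), "j")]
print(classify_nb("chinese chinese chinese tokyo japan".split(), *train_nb(docs)))  # c
```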
Slide 37: Naive Bayes: Time Complexity (Sec. 13.2)
- Training time: O(|D| Lave + |C||V|), where Lave is the average length of a document in D. This assumes all counts are pre-computed in O(|D| Lave) time during one pass through the data. It is generally just O(|D| Lave), since usually |C||V| < |D| Lave.
- Test time: O(|C| Lt), where Lt is the average length of a test document.
- Very efficient overall: linearly proportional to the time needed just to read in all the data.
Slide 38: Underflow Prevention: Using Logs (Sec. 13.2)
- Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
- Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
- The class with the highest final un-normalized log probability score is still the most probable.
- Note that the model is now just a max over sums of weights.
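A quick numeric illustration of the underflow point (the per-word probability 1e-5 and the document length of 100 are arbitrary choices):

```python
# Multiplying 100 probabilities of 1e-5 underflows float64 to exactly 0.0;
# summing their logs stays a perfectly usable finite number.
import math

p = 1e-5
direct, logsum = 1.0, 0.0
for _ in range(100):        # e.g. a 100-word document
    direct *= p
    logsum += math.log(p)

print(direct)   # 0.0 -- underflowed
print(logsum)   # about -1151.3 -- still fine for comparing classes
```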
Slide 39: Naive Bayes Classifier
Simple interpretation:
- Each conditional parameter log P(xi | cj) is a weight indicating how good an indicator xi is for cj.
- The prior log P(cj) is a weight indicating the relative frequency of cj.
- The sum is then a measure of how much evidence there is for the document being in the class.
- We select the class with the most evidence.
Slide 40: Two Naive Bayes Models
Model 1: multivariate Bernoulli
- One feature Xw for each word in the dictionary; Xw = true in document d if w appears in d.
- Naive Bayes assumption: given the document's topic, the appearance of one word in the document tells us nothing about the chances that another word appears.
- This is the model used in the binary independence model in classic probabilistic relevance feedback on hand-classified data.
Slide 41: Two Models (continued)
Model 2: multinomial (class-conditional unigram)
- One feature Xi for each word position in the document; the feature's values are all the words in the dictionary. The value of Xi is the word in position i.
- Naïve Bayes assumption: given the document's topic, the word in one position in the document tells us nothing about the words in other positions.
- Second assumption: word appearance does not depend on position, i.e., P(Xi = w | c) = P(Xj = w | c) for all positions i, j, words w, and classes c. There is just one multinomial feature predicting all words.
Slide 42: Parameter Estimation
- Multivariate Bernoulli model: estimate P(Xw = true | cj) as the fraction of documents of topic cj in which word w appears.
- Multinomial model: estimate P(Xi = w | cj) as the fraction of times word w appears among all words in documents of topic cj.
  - Can create a mega-document for topic j by concatenating all documents in this topic, and use the frequency of w in the mega-document.
Slide 43: Classification
Multinomial vs. multivariate Bernoulli? The multinomial model is almost always more effective in text applications.
Slide 48: Feature Selection: Why? (Sec. 13.5)
- Text collections have a large number of features: 10,000 to 1,000,000 unique words, and more.
- Feature selection may make using a particular classifier feasible: some classifiers can't deal with hundreds of thousands of features.
- It reduces training time: training time for some methods is quadratic or worse in the number of features.
- It can improve generalization (performance): it eliminates noise features and so avoids overfitting.
Slide 49: Feature Selection: How? (Sec. 13.5)
Two ideas:
- Hypothesis-testing statistics: are we confident that the value of one categorical variable is associated with the value of another? The chi-square test (χ²).
- Information theory: how much information does the value of one categorical variable give you about the value of another? Mutual information (MI).
They're similar, but χ² measures confidence in the association (based on available statistics), while MI measures the extent of the association (assuming perfect knowledge of the probabilities).
Slide 50: The χ² Statistic (CHI) (Sec. 13.5.2)
χ² sums (fo − fe)² / fe over all table entries: is the observed number what you'd expect given the marginals? Observed counts fo, with expected counts fe in parentheses:

                   Class = auto    Class ≠ auto
  Term = jaguar    2 (0.25)        3 (4.75)
  Term ≠ jaguar    500 (502)       9500 (9498)

The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the χ² value for .999 confidence).
Slide 51: The χ² Statistic (Sec. 13.5.2)
There is a simpler formula for a 2×2 χ² table. With
  A = #(t, c), B = #(t, ¬c), C = #(¬t, c), D = #(¬t, ¬c), and N = A + B + C + D:
  χ² = N (AD − BC)² / ((A + B) (A + C) (B + D) (C + D))
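The 2×2 formula makes the jaguar/auto numbers easy to check. Note that the exact closed form gives about 12.8; a value of 12.9 arises when summing the cells with rounded expected counts:

```python
# Chi-square for a 2x2 term/class contingency table, using the closed-form
# formula from the slide.
def chi2_2x2(A, B, C, D):
    """A=#(t,c), B=#(t,not c), C=#(not t,c), D=#(not t,not c)."""
    N = A + B + C + D
    return N * (A * D - B * C) ** 2 / ((A + B) * (A + C) * (B + D) * (C + D))

# jaguar/auto: 2 docs contain both, 3 contain only the term,
# 500 contain only the class, 9500 contain neither.
value = chi2_2x2(2, 3, 500, 9500)
print(round(value, 1))  # 12.8 > 10.83, so reject the null at .999 confidence
```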
Slide 52: Feature Selection via Mutual Information (Sec. 13.5.1)
In the training set, choose the k words that best discriminate (give the most information about) the categories. For each word w and each category c, the mutual information between the word and the class is:
  MI(w, c) = Σ over ew ∈ {0,1} and ec ∈ {0,1} of p(ew, ec) log [ p(ew, ec) / (p(ew) p(ec)) ]
where ew indicates the presence or absence of w and ec indicates membership in c.
Slide 53: Feature Selection via MI (Sec. 13.5.1)
For each category we build a list of the k most discriminating terms. For example (on 20 Newsgroups):
- sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, ...
- rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, ...
Greedy: does not account for correlations between terms. Why?
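The MI score above can be computed from 2×2 document counts; this is a sketch of the standard definition, and the counts used below are made up for illustration:

```python
# Mutual information between a term and a class, from document counts
# n11, n10, n01, n00 (term present/absent x in-class/not-in-class).
import math

def mutual_information(n11, n10, n01, n00):
    N = n11 + n10 + n01 + n00
    mi = 0.0
    for term_in, in_class, n in ((1, 1, n11), (1, 0, n10), (0, 1, n01), (0, 0, n00)):
        if n == 0:
            continue  # a zero cell contributes nothing (0 * log 0 -> 0)
        p_joint = n / N
        p_t = (n11 + n10) / N if term_in else (n01 + n00) / N
        p_c = (n11 + n01) / N if in_class else (n10 + n00) / N
        mi += p_joint * math.log2(p_joint / (p_t * p_c))
    return mi

# A term concentrated in one class is far more discriminating than a term
# spread evenly across classes (which carries zero information).
print(mutual_information(49, 1, 1, 49))    # ~0.86
print(mutual_information(25, 25, 25, 25))  # 0.0
```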
Slide 54: Feature Selection (Sec. 13.5)
Mutual information:
- clear information-theoretic interpretation
- may select rare, uninformative terms
Chi-square:
- statistical foundation
- may select very slightly informative frequent terms that are not very useful for classification
Just use the commonest terms?
- no particular foundation
- in practice, this is often 90% as good
Slide 55: Greedy Inclusion Algorithm
Most commonly used in text. Algorithm:
1. Compute, for each term, a measure of discrimination among the classes.
2. Arrange the terms in decreasing order of this measure.
3. Retain a number of the best terms or features for use by the classifier.
Greedy because the measure of discrimination of a term is computed independently of the other terms. Over-inclusion has only mild effects on accuracy.
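The greedy inclusion steps amount to a sort-and-truncate; a minimal sketch, where the scores are made-up values standing in for chi-square or MI:

```python
# Greedy inclusion: score each term independently of the others,
# sort in decreasing order of the score, and retain the top k.
def select_features(term_scores, k):
    ranked = sorted(term_scores, key=term_scores.get, reverse=True)
    return ranked[:k]

scores = {"car": 3.2, "the": 0.01, "engine": 2.7, "of": 0.02}
print(select_features(scores, 2))  # ['car', 'engine']
```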
Slide 56: Feature Selection: Performance
Effect of feature selection on Bayesian classifiers: a Bayesian classifier cannot overfit much.
Slide 57: Naive Bayes vs. Other Methods (Sec. 13.6)
Slide 59: Benchmarks for Accuracy
- Reuters: 10,700 labeled documents; 10% of documents have multiple class labels.
- OHSUMED: 348,566 abstracts from medical journals.
- 20NG: 18,800 labeled USENET postings; 20 leaf classes, 5 root-level classes.
- WebKB: 8,300 documents in 7 academic categories.