Published by Erin Clark. Modified over 6 years ago.
1
Implementation Details of the Text Classification Project
Prerak Sanghvi Computer Science and Engineering Department State University of New York at Buffalo Spring 2001
2
Feature Selection Step
We select keywords from the text by scoring every word; the score used here is Information Gain. For each unique word, we record the number of documents in each class in which the word occurs.
3
Feature Selection Step - Algorithm
for each document d in training set
    for each word w in d
        if w has been encountered before
            increment the document count for Category(d) in the record for w
        else
            create a new data record for w
for each word w
    using the record for w, calculate Information Gain
select the NUM_KEYWORDS words with the highest Information Gain
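The counting and selection steps above can be sketched in Python. This is a minimal illustration, not the project's actual code: the input layout (a list of `(category, words)` pairs), the function names, and the pluggable `score` callable are all assumptions made for the example. Each word is counted once per document, following the statement that the record holds the number of documents in which the word occurs.

```python
from collections import defaultdict


def count_document_frequencies(training_set):
    """Build one record per word: category -> number of documents containing it.

    training_set is assumed to be an iterable of (category, words) pairs;
    this layout is hypothetical and not taken from the slides.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for category, words in training_set:
        # set() ensures a word is counted once per document
        for word in set(words):
            counts[word][category] += 1
    return counts


def select_keywords(counts, score, num_keywords):
    """Return the num_keywords words whose records score highest.

    score is any callable mapping a word's record to a number,
    for example an Information Gain function.
    """
    ranked = sorted(counts, key=lambda w: score(counts[w]), reverse=True)
    return ranked[:num_keywords]
```

Passing the scoring function as a parameter keeps the counting step independent of how Information Gain is estimated.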
4
Feature Selection
Word Cat1 Cat2 Cat3 Cat4 …. Cat20
Nation 5 15 4 3 1
God 12 13 7 9
Soccer 6 2 19
News 10
5
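Information Gain
\[
G(t) = -\sum_{i=1}^{m} \Pr(c_i)\log\Pr(c_i)
       + \Pr(t)\sum_{i=1}^{m} \Pr(c_i \mid t)\log\Pr(c_i \mid t)
       + \Pr(\bar{t})\sum_{i=1}^{m} \Pr(c_i \mid \bar{t})\log\Pr(c_i \mid \bar{t})
\]
Here m is the number of categories, w is the number of unique words, and \mathrm{Cat}_i(t) is the number of documents in category i that contain word t.
\[
\Pr(c_i) = \frac{1}{20}
\]
\[
\Pr(t) = \frac{\sum_{i=1}^{m} \mathrm{Cat}_i(t)}{\sum_{i=1}^{m}\sum_{j=1}^{w} \mathrm{Cat}_i(j)}
\]
\[
\Pr(c_i \mid t) = \frac{\mathrm{Cat}_i(t)}{\sum_{k=1}^{m} \mathrm{Cat}_k(t)}
\]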
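The Information Gain estimate can be sketched in Python from the per-word records. This is an illustration under stated assumptions rather than the project's implementation: the function name and the `counts` layout (word -> category -> document count) are hypothetical, the uniform prior is written as 1/m instead of the fixed 1/20, and the probabilities for the case where the word is absent are assumed to be estimated from the remaining counts, since the slide does not spell them out.

```python
import math


def information_gain(word, counts, categories):
    """Estimate G(word) from counts[word][category] document counts."""
    m = len(categories)

    def plogp(p):
        # p * log(p), with the usual convention 0 * log(0) = 0
        return p * math.log(p) if p > 0 else 0.0

    # Per-category totals over all words, and the grand total
    cat_totals = {
        c: sum(record.get(c, 0) for record in counts.values())
        for c in categories
    }
    grand_total = sum(cat_totals.values())
    word_total = sum(counts[word].get(c, 0) for c in categories)
    rest_total = grand_total - word_total

    p_t = word_total / grand_total
    p_not_t = 1.0 - p_t

    # First term: entropy of the uniform class prior Pr(c_i) = 1/m
    gain = -sum(plogp(1.0 / m) for _ in categories)

    # Second term: class distribution given that the word occurs
    if word_total > 0:
        gain += p_t * sum(
            plogp(counts[word].get(c, 0) / word_total) for c in categories
        )

    # Third term (assumed estimate): class distribution over the other counts
    if rest_total > 0:
        gain += p_not_t * sum(
            plogp((cat_totals[c] - counts[word].get(c, 0)) / rest_total)
            for c in categories
        )
    return gain
```

A word spread evenly across categories scores close to zero, while a word concentrated in one category scores higher, which is why high-scoring words make useful keywords.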
6
Classification Algorithm
KeyWord Cat1 Cat2 Cat3 Cat4 …. Cat20
Nation 5 15 4 3 1
God 12 13 7 9
Soccer 6 2 19
News 10