Implementation Details of the Text Classification Project

Prerak Sanghvi, Computer Science and Engineering Department, State University of New York at Buffalo, Spring 2001

Feature Selection Step
Keywords are selected from the text by scoring each word; the score used here is Information Gain. For each unique word, we record the number of documents in each class in which the word occurs.

Feature Selection Step - Algorithm
for each document d in training set
    for each word w in d
        if w has been encountered before
            increment the document count for Category(d) in the record for w
        else
            create a new data record for w
for each word w
    using the record for w, calculate Information Gain
select the NUM_KEYWORDS words with the highest Information Gain
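The counting pass above can be sketched in Python as follows. This is a minimal illustration, not the project's actual code: the function name `count_document_frequencies`, the `(category_index, words)` input layout, and the sample documents are all assumptions made for the example. It also assumes that a newly created record immediately counts the current document, and that a word is counted at most once per document.

```python
def count_document_frequencies(training_set, num_categories):
    """Build one record per word: a list of per-category document counts.

    training_set is assumed to be an iterable of (category_index, words)
    pairs; this layout is hypothetical and chosen only for the sketch.
    """
    records = {}
    for category, words in training_set:
        # set() ensures a word is counted once per document,
        # since the records hold document counts, not term counts.
        for w in set(words):
            if w not in records:
                # First time this word is seen: create its record.
                records[w] = [0] * num_categories
            records[w][category] += 1
    return records


# Made-up toy data with two categories (0 and 1).
docs = [
    (0, ["nation", "news", "nation"]),
    (1, ["nation", "god"]),
    (1, ["soccer", "news"]),
]
records = count_document_frequencies(docs, num_categories=2)
print(records["nation"])
```

Each record plays the role of one row in the word-by-category table: one document count per category.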

Feature Selection

Word    Cat1 Cat2 Cat3 Cat4 …. Cat20
Nation  5 15 4 3 1
God     12 13 7 9
Soccer  6 2 19
News    10

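Information Gain

$$G(t) = -\sum_{i=1}^{m} \Pr(c_i)\log\Pr(c_i) + \Pr(t)\sum_{i=1}^{m} \Pr(c_i \mid t)\log\Pr(c_i \mid t) + \Pr(\bar{t})\sum_{i=1}^{m} \Pr(c_i \mid \bar{t})\log\Pr(c_i \mid \bar{t})$$

Here Cat_i(t) is the number of documents in category i that contain word t, m is the number of categories, and w is the number of unique words:

$$\Pr(c_i) = \frac{1}{20}$$

$$\Pr(t) = \frac{\sum_{i=1}^{m} Cat_i(t)}{\sum_{i=1}^{m}\sum_{j=1}^{w} Cat_i(j)}$$

$$\Pr(c_i \mid t) = \frac{Cat_i(t)}{\sum_{k=1}^{m} Cat_k(t)}$$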

Classification Algorithm
KeyWord Cat1 Cat2 Cat3 Cat4 …. Cat20
Nation  5 15 4 3 1
God     12 13 7 9
Soccer  6 2 19
News    10
