Document Filtering
Social Web, 3/17/2010
Jae-wook Ahn
Classification Problem
- Put an item into a specific category
- Spam filtering: spam or no-spam
- Topic categorization
- Recommendation: interested or not interested
Classification: Methods
- Rule-based
  - E.g., lots of capital letters
  - Spammers learn too
  - Customizability
- Statistical learning
Classification: Learning
(Figure: labeled training documents (spam / no-spam) are used to train/learn a classifier, which then classifies/filters unlabeled target documents.)
Tokenization
- Text to words
- Words to a dictionary
- Why a dictionary? Word-to-frequency (or occurrence) look-up
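As a concrete illustration, a minimal tokenizer sketch; the length cutoffs and lowercasing are illustrative choices, not requirements:

```python
import re

def getwords(doc):
    # Split the text on runs of non-alphanumeric characters
    words = re.split(r'\W+', doc)
    # Lowercase and keep medium-length words; the dictionary acts
    # as a word -> occurrence look-up table
    return {w.lower(): 1 for w in words if 2 < len(w) < 20}
```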
Training
- "Train" the classifier with example texts = calculate frequency distributions
- class classifier: (pp. 119)
  - fc: feature → category (frequency)
  - cc: category → frequency
  - getfeatures: tokenizer (plug in any method)
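A minimal sketch of such a classifier class, modeled on the book's (pp. 119) and using the getwords tokenizer from the previous slide; the exact method names here are illustrative:

```python
class Classifier:
    def __init__(self, getfeatures):
        self.fc = {}                      # feature -> {category: count}
        self.cc = {}                      # category -> number of documents
        self.getfeatures = getfeatures    # pluggable tokenizer

    def incf(self, f, cat):
        # Increase the count of a feature within a category
        self.fc.setdefault(f, {}).setdefault(cat, 0)
        self.fc[f][cat] += 1

    def incc(self, cat):
        # Increase the count of documents in a category
        self.cc[cat] = self.cc.get(cat, 0) + 1

    def fcount(self, f, cat):
        return self.fc.get(f, {}).get(cat, 0)

    def catcount(self, cat):
        return self.cc.get(cat, 0)

    def totalcount(self):
        return sum(self.cc.values())

    def train(self, item, cat):
        # Count every feature of this item under the given category
        for f in self.getfeatures(item):
            self.incf(f, cat)
        self.incc(cat)
```

Training is then just `cl = Classifier(getwords)` followed by calls like `cl.train('buy cheap drugs now', 'spam')`.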
Probability Calculation
- Pr(word|classification)
- Ex. Pr("drug"|spam) = 80 docs / 100 total spam docs = 0.8
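Continuing the sketch, this conditional probability is just a ratio of counts (a hypothetical fprob method on the Classifier above):

```python
    # (a method of the Classifier sketch above)
    def fprob(self, f, cat):
        # Pr(feature | category): fraction of documents in this
        # category that contain the feature
        if self.catcount(cat) == 0:
            return 0
        return self.fcount(f, cat) / self.catcount(cat)
```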
Weighted Probability
- Doc1[… money …](s), Doc2[… money …](s), Doc3[… money …](s), Doc4[……](s), Doc5[……](ns)
- Pr("money"|spam) = 3/4 = 0.75
- Pr("money"|no-spam) = 0/1 = 0
- Pr = 0.5 ("we don't know") may be better than Pr = 0 ("never")
- Ex. after finding just one spam instance of a word, the weighted estimate stays close to 0.5 instead of jumping to 1.0
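A sketch of the weighted probability, blending an assumed prior (0.5 with weight 1) with the observed estimate; after one spam observation of "money", fprob would be 1.0, but the weighted value is (1 × 0.5 + 1 × 1.0) / (1 + 1) = 0.75:

```python
    # (a method of the Classifier sketch above)
    def weightedprob(self, f, cat, prf, weight=1.0, ap=0.5):
        basicprob = prf(f, cat)          # current estimate, e.g. fprob
        # How much evidence do we have for this feature overall?
        totals = sum(self.fcount(f, c) for c in self.cc)
        # Blend the assumed prior with the observed probability
        return (weight * ap + totals * basicprob) / (weight + totals)
```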
Naive Bayesian Classifier
- Goal = Pr(Category|Document)
- Ex. Pr(Spam|Doc1) = 0.001, Pr(No-spam|Doc1) = 0.5 → Doc1 = No-spam
- What we have is Pr(Feature|Category)
- Process: Pr(Feature|Category) → Pr(Document|Category) → Pr(Category|Document)
Pr(Document|Category)
- Pr(Document|Category) = Pr(Feature1|Cat) × Pr(Feature2|Cat) × … × Pr(FeatureN|Cat)
- Pr(A ∧ B) = Pr(A) × Pr(B)
  - Assumption: A and B are independent of each other
  - Not true in practice ("social" and "web" co-occur more than independence predicts; likewise "social" and "probability" less)
  - But still useful
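A sketch of the document probability under the independence assumption; it simply multiplies the (weighted) per-feature probabilities:

```python
    # (a method of the Classifier sketch above)
    def docprob(self, item, cat):
        p = 1.0
        for f in self.getfeatures(item):
            # Naive assumption: features are independent given the category
            p *= self.weightedprob(f, cat, self.fprob)
        return p
```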
Pr(Category|Document)
- Pr(A|B) = Pr(B|A) × Pr(A) / Pr(B) (Thomas Bayes)
- Pr(Category|Document) = Pr(Document|Category) × Pr(Category) / Pr(Document)
- Pr(Category) = # of docs in the category / total # of docs
- Pr(Document) = constant (the same for every category, so it can be ignored when comparing)
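Putting Bayes' rule into the sketch; the Pr(Document) denominator is dropped because it is identical for every category being compared:

```python
    # (a method of the Classifier sketch above)
    def prob(self, item, cat):
        catprob = self.catcount(cat) / self.totalcount()   # Pr(Category)
        return self.docprob(item, cat) * catprob           # ∝ Pr(Cat|Doc)
```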
Choosing a Category
- Take the one with the highest probability
- What if Pr(Spam|Doc) = 0.000001 and Pr(No-spam|Doc) = 0.0000005?
- The answer may be "not sure"
Choosing a Category: Thresholding
- If Pr(Spam|Doc) > 3 × Pr(No-spam|Doc), then spam; otherwise fall back to a default
- This margin requirement is more reasonable than simply taking the maximum
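A thresholded decision sketch; it assumes the classifier also keeps a thresholds dictionary (e.g. cl.thresholds = {'spam': 3.0}) set by the user:

```python
    # (a method of the Classifier sketch above; assumes self.thresholds,
    # e.g. {'spam': 3.0}, meaning spam must beat the runner-up 3-to-1)
    def classify(self, item, default='unknown'):
        probs = {cat: self.prob(item, cat) for cat in self.cc}
        best = max(probs, key=probs.get)
        threshold = getattr(self, 'thresholds', {}).get(best, 1.0)
        for cat, p in probs.items():
            if cat != best and p * threshold > probs[best]:
                return default          # margin too small: "not sure"
        return best
```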
Persisting the Trained Classifier
- The classifier so far: dictionaries in memory (fc, cc)
- They disappear after quitting the Python interpreter
- Should be saved to disk
- MySQL: client/server RDBMS
- SQLite: file-based RDBMS
Persisting the Trained Classifier: Python shelve
- Put/get any Python object into/out of disk files
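A minimal shelve sketch, assuming cl is a trained Classifier instance and 'classifier.db' is an illustrative filename:

```python
import shelve

# Save the in-memory dictionaries to disk
with shelve.open('classifier.db') as db:
    db['fc'] = cl.fc
    db['cc'] = cl.cc

# Later, in a fresh interpreter session, restore them
with shelve.open('classifier.db') as db:
    cl.fc = db['fc']
    cl.cc = db['cc']
```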
Persisting the Trained Classifier: DBM, GDBM, BSDDB
- The Unix database interface and its successors
- Disk-based dictionary
- GDBM: GNU dbm
- BSDDB: Berkeley DB (hash, B-tree)
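For comparison, a small sketch using Python's dbm module (which selects whatever backend is available, such as gdbm); unlike shelve, a dbm file stores only strings/bytes, so arbitrary objects must be serialized first (shelve is essentially pickle layered on dbm):

```python
import dbm

# A disk-based dictionary; keys and values must be str or bytes
with dbm.open('counts.db', 'c') as db:    # 'c' = create if missing
    db['money'] = '3'
    print(db['money'])                    # b'3' (values come back as bytes)
```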
Improved Features
- So far, features = words
- Phrases (n-grams): "social web", "spam filter", etc.
- Attributes: Has_Many_Uppercases = True
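A hypothetical extractor sketch combining all three feature types; the 30% uppercase cutoff is an arbitrary illustration:

```python
import re

def getfeatures(doc):
    words = [w.lower() for w in re.split(r'\W+', doc) if 2 < len(w) < 20]
    feats = dict.fromkeys(words, 1)
    # Phrase features: adjacent word pairs (bigrams)
    for w1, w2 in zip(words, words[1:]):
        feats[w1 + ' ' + w2] = 1
    # Attribute feature: lots of capital letters in the raw text
    uppercase = sum(1 for c in doc if c.isupper())
    if uppercase > 0.3 * len(doc):
        feats['Has_Many_Uppercases'] = 1
    return feats
```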
Alternative Methods
- Supervised learning methods: neural networks, support vector machines, decision trees
- Software packages: Weka, R, SPSS Clementine, etc.
Weka Example
- Example data: weather conditions → to play or not to play?
- 4 attributes, 1 class variable
Parsing RSS Feeds
- Problem: extract texts from the RSS structure
- RSS feeds are XML
- Parsers: SAX, DOM, ready-made parsers
SAX and DOM
- SAX (Simple API for XML): serial-access parser
  - A stream of XML data goes in
  - Event-driven parsing
- DOM (Document Object Model)
  - Uses the hierarchical document structure for parsing
SAX Example
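The slide's original code is not preserved; the following is a minimal SAX sketch that collects <title> texts from an RSS file ('feed.xml' is a placeholder):

```python
import xml.sax

class RSSHandler(xml.sax.ContentHandler):
    """Collects the text of every <title> element as the stream goes by."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def startElement(self, name, attrs):    # event: opening tag
        if name == 'title':
            self.in_title = True
            self.titles.append('')

    def endElement(self, name):             # event: closing tag
        if name == 'title':
            self.in_title = False

    def characters(self, content):          # event: text node
        if self.in_title:
            self.titles[-1] += content

handler = RSSHandler()
xml.sax.parse('feed.xml', handler)          # 'feed.xml' is a placeholder
print(handler.titles)
```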
DOM Example
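Likewise, a minimal DOM sketch for the same task; the entire tree is built in memory first, then walked:

```python
from xml.dom.minidom import parse

dom = parse('feed.xml')                     # 'feed.xml' is a placeholder
for item in dom.getElementsByTagName('item'):
    title = item.getElementsByTagName('title')[0]
    # Concatenate the text children of the <title> node
    text = ''.join(n.data for n in title.childNodes
                   if n.nodeType == n.TEXT_NODE)
    print(text)
```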
Ready-made Parser
- Universal Feed Parser <http://www.feedparser.org>
Universal Feed Parser
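A basic usage sketch (the feed URL is a placeholder):

```python
import feedparser

d = feedparser.parse('http://example.org/feed.xml')   # placeholder URL
print(d.version)            # e.g. 'rss20' or 'atom10'
print(d.feed.title)         # normalized regardless of feed format
for entry in d.entries:
    print(entry.title, entry.link)
```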
Core Attributes
- Follows RSS/Atom syntax normalization
- However, the underlying element is not always literally called "updated"; the normalized date can come from any of:
  - /atom10:feed/atom10:updated
  - /atom03:feed/atom03:modified
  - /rss/channel/pubDate
  - /rss/channel/dc:date
  - /rdf:RDF/rdf:channel/dc:date
  - /rdf:RDF/rdf:channel/dcterms:modified
Advanced Features
- Date parsing
- HTML sanitization
- Content normalization
- Namespace handling
- and more...
Date Parsing
- Parses various date formats into Python 9-tuples
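For example (placeholder URL again), the normalized date comes back as a time.struct_time, which is a 9-tuple:

```python
import feedparser

d = feedparser.parse('http://example.org/feed.xml')   # placeholder URL
updated = d.feed.get('updated_parsed')   # None if the feed has no date
if updated:
    # A time.struct_time: (year, month, day, hour, min, sec, wday, yday, dst)
    print(updated.tm_year, updated.tm_mon, updated.tm_mday)
```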
Summary
- Document filtering is a classification problem
- Statistical learning-based methods (naive Bayesian classifier)
- RSS parsing: XML parsers (SAX, DOM) and ready-made RSS parsers