Published by Angel Warner. Modified over 6 years ago.
1
Document Filtering Social Web 3/17/2010 Jae-wook Ahn
2
Classification Problem
Put an item into a specific category
Spam filtering: spam or no-spam
Topic categorization
Recommendation: interested or not interested
3
Classification — Methods
Rule-based
  E.g. lots of capital letters
  Spammers learn too
  Customizability
Statistical learning
4
Classification — Learning
[Diagram: training data (documents labeled Spam or Nospam) → Train/Learn → Classifier; target data (unlabeled documents) → Classify/Filter → documents labeled Spam or Nospam]
5
Tokenization
Text to word
Word to dictionary
Why dictionary? Word to frequency (or occurrence) look-up
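The tokenization step above can be sketched as follows. This is a minimal illustration, not the lecture's own code: the regular expression and the word-length cutoffs are assumptions chosen for the example.

```python
import re

def getfeatures(doc):
    """Split text into lowercase words and return a word -> 1 dictionary."""
    words = re.findall(r"[a-z]+", doc.lower())
    # Assumed cutoffs: drop very short and very long tokens.
    # A dictionary removes duplicates and gives fast look-up by word.
    return {w: 1 for w in words if 2 < len(w) < 20}

features = getfeatures("Cheap drugs! Buy cheap drugs now")
print(sorted(features))
```

Using a dictionary here records occurrence rather than frequency; storing a running count per word instead of 1 would give the frequency variant.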
6
Training
“Train” classifier with example texts = calculate frequency distribution
class classifier: (pp. 119)
  fc: feature → category (frequency)
  cc: category → frequency
  getfeatures: tokenizer (plug in any method)
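A minimal sketch of such a classifier, assuming the fc and cc dictionaries described above. The class layout, the train method name, and simple_tokenizer are illustrative choices, not a reproduction of the code on pp. 119.

```python
from collections import defaultdict

class Classifier:
    def __init__(self, getfeatures):
        # fc: feature -> category -> number of documents containing the feature
        self.fc = defaultdict(lambda: defaultdict(int))
        # cc: category -> number of training documents
        self.cc = defaultdict(int)
        # Any tokenizer can be plugged in here.
        self.getfeatures = getfeatures

    def train(self, item, cat):
        """Update both frequency tables with one labeled document."""
        for feature in self.getfeatures(item):
            self.fc[feature][cat] += 1
        self.cc[cat] += 1

def simple_tokenizer(doc):
    # Hypothetical tokenizer: unique lowercase whitespace-separated words.
    return set(doc.lower().split())

cl = Classifier(simple_tokenizer)
cl.train("cheap drug offer", "spam")
cl.train("drug prices", "spam")
cl.train("meeting at noon", "no-spam")
print(cl.fc["drug"]["spam"], cl.cc["spam"])
```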
7
Probability Calculation
Pr(word|classification) Ex. Pr(“drug”|spam) = 80 docs / total 100 spam docs = 0.8
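As a small sketch of this calculation, the function below divides the number of documents in a category that contain the word by the total number of documents in that category. The name fprob and the zero-count guard are assumptions made for the example.

```python
def fprob(docs_with_word, docs_in_category):
    """Estimate Pr(word | category) from document counts."""
    if docs_in_category == 0:
        return 0.0  # assumed convention when the category has no training data
    return docs_with_word / docs_in_category

# The slide's example: "drug" appears in 80 of 100 spam documents.
p_drug_given_spam = fprob(80, 100)
print(p_drug_given_spam)
```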
8
Weighted Probability
Doc1[… money …](s), Doc2[… money …](s), Doc3[… money …](s), Doc4[……](s), Doc5[……](ns)
Pr(“money”|spam) = 3/4 = 0.75
Pr(“money”|no-spam) = 0/1 = 0
Pr = 0.5 (we don’t know) may be better than Pr = 0 (never)
Ex. After finding one spam instance
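One common way to implement this idea is to blend an assumed prior probability with the observed one, weighting the observed value by how often the word has been seen. The formula, the weight of 1.0, and the function name below are assumptions for illustration; the slide itself only states that starting from 0.5 may be better than 0.

```python
def weighted_prob(basic_prob, times_seen, weight=1.0, assumed_prob=0.5):
    """Blend the observed probability with an assumed prior.

    times_seen is the number of times the word appeared across all categories.
    """
    return (weight * assumed_prob + times_seen * basic_prob) / (weight + times_seen)

never_seen = weighted_prob(0.0, 0)           # no evidence: stays at the prior
spam_after_one = weighted_prob(1.0, 1)       # word seen once, in a spam document
no_spam_after_one = weighted_prob(0.0, 1)    # same word, no-spam category
print(never_seen, spam_after_one, no_spam_after_one)
```

With these assumed settings, a word that has been seen once in spam moves from 0.5 toward 1 for spam and toward 0 for no-spam, without jumping to either extreme.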
9
Naive Bayesian Classifier
Goal = Pr(Category|Document)
Ex. Pr(Spam|Doc1) = 0.001, Pr(No-spam|Doc1) = 0.5 → Doc1 = No-spam
What do we have? Pr(Feature|Category)
Process = Pr(Feature|Category) → Pr(Document|Category) → Pr(Category|Document)
10
Pr(Document|Category)
Pr(Document|Category) = Pr(Feature1|Cat) * Pr(Feature2|Cat) * Pr(Feature3|Cat) * … * Pr(FeatureN|Cat)
Pr(A ^ B) = Pr(A) * Pr(B)
Assumption: A and B are independent of each other
Not true in practice: “social” is more likely to appear with “Web” than with “probability”
But still useful
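Under the independence assumption, the document probability is just the product of the per-feature probabilities. The sketch below uses made-up feature probabilities purely to show the arithmetic; doc_prob is a hypothetical helper name.

```python
def doc_prob(feature_probs):
    """Multiply Pr(feature | category) over all features in a document."""
    result = 1.0
    for p in feature_probs:
        result *= p
    return result

# Illustrative values for three features of one document in one category.
p_doc_given_spam = doc_prob([0.8, 0.5, 0.25])
print(p_doc_given_spam)
```

For long documents this product becomes very small, so implementations often sum logarithms instead; that variation is not covered by the slides.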
11
Pr(Category|Document)
Pr(A|B) = Pr(B|A) * Pr(A) / Pr(B) (Bayes' theorem, Thomas Bayes)
Pr(Category|Document) = Pr(Document|Category) * Pr(Category) / Pr(Document)
Pr(Category) = # of docs in Cat / total # of docs
Pr(Document) = constant across categories, so it can be ignored when comparing them
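A short sketch of this step, assuming Pr(Document|Category) has already been computed. Because Pr(Document) is the same for every category, the example scores each category by Pr(Document|Category) * Pr(Category) and then normalizes; the function names and the numbers are illustrative.

```python
def category_score(p_doc_given_cat, docs_in_cat, total_docs):
    """Pr(Document|Category) * Pr(Category), i.e. the posterior up to a constant."""
    prior = docs_in_cat / total_docs
    return p_doc_given_cat * prior

def normalize(scores):
    """Divide by the sum, which plays the role of Pr(Document)."""
    total = sum(scores.values())
    return {cat: s / total for cat, s in scores.items()}

scores = {
    "spam": category_score(0.10, 4, 5),      # 4 of 5 training docs are spam
    "no-spam": category_score(0.02, 1, 5),   # 1 of 5 training docs is no-spam
}
posteriors = normalize(scores)
print(posteriors)
```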
12
Choosing a Category Take one with the highest probability
What if Pr(Spam|Doc) = , Pr(No-spam|Doc) = ? Answer may be “Not sure”
13
Choosing a Category Thresholding
If Pr(Spam|Doc) > 3 * Pr(No-spam|Doc), then spam → more reasonable than simply taking the highest probability
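The thresholding rule can be sketched as below. The function name, the per-category thresholds dictionary, and the "unknown" default are assumptions for the example; only the factor of 3 for spam comes from the slide.

```python
def classify(probs, thresholds, default="unknown"):
    """Pick the best category only if it beats every other by its threshold."""
    best = max(probs, key=probs.get)
    factor = thresholds.get(best, 1.0)
    for cat, p in probs.items():
        if cat == best:
            continue
        # The winner must be strictly greater than factor * every competitor.
        if p * factor >= probs[best]:
            return default
    return best

thresholds = {"spam": 3.0}
clear_spam = classify({"spam": 0.7, "no-spam": 0.2}, thresholds)
weak_spam = classify({"spam": 0.4, "no-spam": 0.2}, thresholds)
clear_ham = classify({"spam": 0.2, "no-spam": 0.3}, thresholds)
print(clear_spam, weak_spam, clear_ham)
```

A higher threshold for spam makes the filter more conservative: a legitimate message is less likely to be discarded, at the cost of more "unknown" results.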
14
Persisting Trained Classifier
Classifier so far
  Dictionaries in memory: fc, cc
  Disappears after quitting the Python interpreter
  Should be saved to disk
MySQL: client/server RDBMS
SQLite: file-based RDBMS
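As one possible sketch, the fc and cc counts can be kept in SQLite through Python's built-in sqlite3 module. The table layout and helper names are assumptions for illustration, and the example uses an in-memory database so that it leaves no file behind; passing a file path to connect would persist the data.

```python
import sqlite3

# ":memory:" keeps the example self-contained; use a path like "classifier.db" to persist.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS fc ("
    "feature TEXT, category TEXT, count INTEGER, "
    "PRIMARY KEY (feature, category))"
)
conn.execute("CREATE TABLE IF NOT EXISTS cc (category TEXT PRIMARY KEY, count INTEGER)")

def incf(feature, category):
    """Increase the count of a feature/category pair by one."""
    conn.execute("INSERT OR IGNORE INTO fc VALUES (?, ?, 0)", (feature, category))
    conn.execute(
        "UPDATE fc SET count = count + 1 WHERE feature = ? AND category = ?",
        (feature, category),
    )

def fcount(feature, category):
    """Return how often a feature was seen in a category (0 if never)."""
    row = conn.execute(
        "SELECT count FROM fc WHERE feature = ? AND category = ?",
        (feature, category),
    ).fetchone()
    return row[0] if row else 0

incf("money", "spam")
incf("money", "spam")
conn.commit()
print(fcount("money", "spam"), fcount("money", "no-spam"))
```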
15
Persisting Trained Classifier
Python shelve
Put/get any picklable Python object into disk files
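A minimal sketch of saving and restoring the two dictionaries with shelve. The file name and the sample counts are made up, and a temporary directory is used so the example does not touch existing files.

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "classifier_shelf")

fc = {"money": {"spam": 3}}
cc = {"spam": 4, "no-spam": 1}

# Save: a shelf behaves like a dictionary whose values are pickled to disk.
with shelve.open(path) as db:
    db["fc"] = fc
    db["cc"] = cc

# Load: reopen the same file, for example in a later interpreter session.
with shelve.open(path) as db:
    restored_fc = db["fc"]
    restored_cc = db["cc"]

print(restored_fc, restored_cc)
```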
16
Persisting Trained Classifier
DBM, GDBM, BSDDB
Unix database interface and its successors
Disk-based dictionary
GDBM: GNU dbm
BSDDB: Berkeley DB (Hash, B-Tree)
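In Python 3 the generic dbm module selects whichever DBM-style backend is available, so a disk-based dictionary can be sketched as below. The file name and the stored counts are illustrative, and which backend gets used depends on the installation.

```python
import dbm
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "counts")

# "c" opens the database for reading and writing, creating it if needed.
with dbm.open(path, "c") as db:
    # Keys and values are stored as bytes; str is encoded automatically.
    db["spam"] = "4"
    db["no-spam"] = "1"

with dbm.open(path, "r") as db:
    spam_count = int(db["spam"])
    has_other = "other" in db

print(spam_count, has_other)
```

Unlike shelve, a plain DBM file stores only strings or bytes, so structured values such as nested dictionaries need to be serialized first.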
17
Improved Features
So far, features = words
Phrases (n-gram): “social web”, “spam filter”, etc.
Attribute: Has_Many_Uppercases = True
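A sketch of a richer feature extractor that adds word pairs and an uppercase attribute to the single-word features. The function name and the 30% uppercase cutoff are assumptions chosen for the example.

```python
def getfeatures_extended(doc):
    """Return single words, adjacent word pairs, and an uppercase flag."""
    words = doc.split()
    lowered = [w.lower() for w in words]

    features = {w: 1 for w in lowered}

    # Bigrams: each pair of adjacent words becomes one feature.
    for first, second in zip(lowered, lowered[1:]):
        features[first + " " + second] = 1

    # Attribute feature: assumed cutoff of more than 30% all-uppercase words.
    uppercase = sum(1 for w in words if w.isupper())
    if words and uppercase / len(words) > 0.3:
        features["Has_Many_Uppercases"] = 1
    return features

shouting = getfeatures_extended("FREE MONEY from the social web")
calm = getfeatures_extended("the social web")
print(sorted(shouting))
```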
18
Alternative Methods
Supervised learning methods: Neural network, Support Vector Machine, Decision Tree
Software packages: Weka, R, SPSS Clementine, etc.
19
Weka Example
Example data: weather condition → to play or not to play?
4 attributes, 1 class variable
20
Weka Example
21
Weka Example
22
Weka Example
23
Parsing RSS Feeds
Problem: extract texts from RSS structure
They are XML
Parsers: SAX, DOM, out-of-box parser
24
SAX and DOM
SAX (Simple API for XML): serial access parser
  Stream of XML data goes in
  Event-driven parsing
DOM (Document Object Model)
  Uses the hierarchical structure for parsing
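To make the contrast concrete, the sketch below extracts titles from a tiny made-up RSS string in both styles, using Python's standard xml.sax and xml.dom.minidom modules. The feed content and the TitleHandler class are illustrative, not taken from the slides.

```python
import xml.sax
from xml.dom import minidom

RSS = """<rss version="2.0"><channel><title>Example feed</title>
<item><title>First post</title></item>
<item><title>Second post</title></item>
</channel></rss>"""

class TitleHandler(xml.sax.ContentHandler):
    """SAX: react to start/end/character events as the stream is read."""
    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False

handler = TitleHandler()
xml.sax.parseString(RSS.encode("utf-8"), handler)
sax_titles = handler.titles

# DOM: load the whole document as a tree, then query it.
dom = minidom.parseString(RSS)
dom_titles = [node.firstChild.data for node in dom.getElementsByTagName("title")]

print(sax_titles, dom_titles)
```

SAX keeps only the current event in memory, which suits large feeds; DOM holds the whole tree, which makes navigation easier.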
25
SAX Example
26
DOM Example
27
Ready-made Parser
Universal Feed Parser
28
Universal Feedparser
29
Core Attributes
Follows RSS/ATOM syntax normalization
However, not always updated
  /atom10:feed/atom10:updated
  /atom03:feed/atom03:modified
  /rss/channel/pubDate
  /rss/channel/dc:date
  /rdf:RDF/rdf:channel/dc:date
  /rdf:RDF/rdf:channel/dcterms:modified
30
Advanced features Date parsing HTML sanitization Content normalization
Namespace handling and more...
31
Date Parsing
Parses various date formats to Python 9-tuples
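To show what a 9-tuple looks like, the sketch below parses one RSS-style (RFC 2822) date with Python's standard library. It is a stand-in for illustration only, not the Universal Feed Parser's own date handling, and to_nine_tuple is a hypothetical helper name.

```python
from email.utils import parsedate_to_datetime

def to_nine_tuple(date_string):
    """Parse an RFC 2822 date and return it as a UTC 9-tuple.

    Fields: year, month, day, hour, minute, second, weekday, yearday, isdst.
    """
    dt = parsedate_to_datetime(date_string)
    return tuple(dt.utctimetuple())

parsed = to_nine_tuple("Wed, 17 Mar 2010 10:00:00 GMT")
print(parsed)
```

The same tuple layout is used by time.struct_time, so the result can be passed to functions such as time.strftime.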
32
Summary Document filtering — classification problem
Statistical learning-based methods RSS parsing — XML-parsers, RSS parsers