1 Hierarchical Classification of Documents with Error Control Chun-Hung Cheng, Jian Tang, Ada Wai-chee Fu, Irwin King

2 Overview Abstract Problem Description Document Classification Model Error Control Schemes –Recovery oriented scheme –Error masking scheme Experiments Conclusion

3 Abstract Traditional document classification (flat classification) involves only a single classifier –Single classifier takes care of everything –Slow and high overhead

4 Abstract Hierarchical document classification –Class hierarchy –Use one classifier at each internal node

5 Abstract Advantage –Better performance Disadvantage –Wrong final result if the document is misclassified at any node

6 Abstract Introduce error control mechanism Approach 1 (recovery oriented) –Detect and correct misclassification Approach 2 (error masking) –Mask errors by using multiple versions of classifiers

7 Problem Description [Diagram: the class taxonomy, the training documents, and the class-doc relation (a table with columns class | doc_id) are fed to the training system, which produces statistics and feature terms]

8 Problem Description [Diagram: the classification system takes the statistics, the feature terms, and the incoming documents, and outputs the target class]

9 Problem Description Objective: Achieve –Higher accuracy –Fast performance Our proposed algorithms provide a good trade-off between accuracy and performance

10 Document Classification Model Formally, we use a model from [Chakrabarti et al. 1997] Based on naive Bayesian network For simplicity, we study a single node classifier. [Diagram: node c with child classes c_1, c_2, …, c_n]

11 z_{i,d}: number of occurrences of term i in the incoming document d. P_{j,c}: probability that a word in class c is term j (estimated using the training data). Probability that an incoming document d belongs to c is [formula not preserved in the transcript]
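The formula on this slide did not survive, so the following is only a sketch of the standard multinomial naive Bayes posterior written in the slide's symbols, not necessarily the exact expression the authors showed. The prior \pi(c_i) of child class c_i under node c is assumed notation that does not appear in the slides, and the product runs over the terms j.

```latex
p(c_i \mid c, d) =
  \frac{\pi(c_i) \prod_{j} P_{j,c_i}^{\, z_{j,d}}}
       {\sum_{k=1}^{n} \pi(c_k) \prod_{j} P_{j,c_k}^{\, z_{j,d}}}
```

The multinomial coefficient of d is the same for every child class, so it cancels between numerator and denominator.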

12 Feature Selection Previous formula involves all the terms Feature selection reduces cost by using only the terms with good discriminating power Use the training sets to identify the feature terms

13 Fisher's Index Fisher's index indicates the discriminating power of a term Good discriminating power: large interclass distance, small intraclass distance [Diagram: distributions of term weight w(t) for classes c_1 and c_2, with the interclass distance and the intraclass distance marked]
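As a minimal sketch of the idea on this slide, the function below scores one term as interclass distance divided by intraclass distance. The exact formula is an assumption (a commonly used form of Fisher's discriminant), and the function name `fisher_index` and its input layout are hypothetical.

```python
from itertools import combinations


def fisher_index(freqs_by_class):
    """Score one term's discriminating power.

    freqs_by_class maps a class name to a list with the term's
    (relative) frequency in each training document of that class.
    Assumed form: sum of squared differences between class means,
    divided by the sum of per-class variances.
    """
    means = {c: sum(xs) / len(xs) for c, xs in freqs_by_class.items()}
    # Interclass distance: how far apart the class means are.
    inter = sum((means[a] - means[b]) ** 2 for a, b in combinations(means, 2))
    # Intraclass distance: how spread out each class is around its mean.
    intra = sum(
        sum((x - means[c]) ** 2 for x in xs) / len(xs)
        for c, xs in freqs_by_class.items()
    )
    if intra == 0:
        return float("inf") if inter > 0 else 0.0
    return inter / intra


# A term that separates the classes well scores higher than one that does not.
good = fisher_index({"c1": [0.8, 1.0], "c2": [0.0, 0.2]})
poor = fisher_index({"c1": [0.0, 1.0], "c2": [0.1, 0.9]})
```

Feature selection would then keep the terms with the highest scores; how many to keep is a tuning choice the slides do not specify.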

14 Document Classification Model Consider only feature terms in the classification function p(c_i | c, d) Pick the largest probability among all c_i Use one classifier in each internal node [Diagram: node c with child classes c_1, c_2, …, c_n]
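The sketch below illustrates this slide under stated assumptions: a multinomial naive Bayes classifier restricted to feature terms at each internal node, and a top-down walk that picks the most probable child at every level. Laplace smoothing, the function names, and the toy taxonomy are all hypothetical and not taken from the slides.

```python
import math
from collections import Counter


def train_node(docs_by_child, feature_terms):
    """Estimate priors and term probabilities for the children of one node.

    Assumes every child has at least one training document.
    Laplace (add-one) smoothing is an assumption of this sketch.
    """
    total_docs = sum(len(docs) for docs in docs_by_child.values())
    model = {}
    for child, docs in docs_by_child.items():
        counts = Counter(t for doc in docs for t in doc if t in feature_terms)
        total = sum(counts.values())
        model[child] = {
            "log_prior": math.log(len(docs) / total_docs),
            "log_p": {
                t: math.log((counts[t] + 1) / (total + len(feature_terms)))
                for t in feature_terms
            },
        }
    return model


def classify_node(model, doc):
    """Return (best child, posterior over children) using feature terms only."""
    z = Counter(doc)  # z[t] = occurrences of term t in the document
    scores = {
        child: m["log_prior"] + sum(z[t] * lp for t, lp in m["log_p"].items())
        for child, m in model.items()
    }
    top = max(scores.values())
    norm = sum(math.exp(s - top) for s in scores.values())
    posterior = {c: math.exp(s - top) / norm for c, s in scores.items()}
    return max(posterior, key=posterior.get), posterior


def classify_top_down(models, root, doc):
    """Walk from the root, using one classifier per internal node."""
    path = [root]
    while path[-1] in models:
        best, _ = classify_node(models[path[-1]], doc)
        path.append(best)
    return path


# Toy taxonomy: root -> {sports, tech}, tech -> {hardware, software}.
models = {
    "root": train_node(
        {"sports": [["ball", "goal"], ["ball", "team"]],
         "tech": [["cpu", "code"], ["code", "chip"]]},
        {"ball", "code"},
    ),
    "tech": train_node(
        {"hardware": [["cpu", "chip"]], "software": [["code", "bug"]]},
        {"cpu", "chip", "code", "bug"},
    ),
}
path = classify_top_down(models, "root", ["code", "bug", "code"])
```

Working in log space avoids underflow when documents are long. A wrong choice at any level cannot be undone by this plain walk, which is the weakness the error control schemes address.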

15 Recovery Oriented Scheme Database system –Failure in DBMS –Restart from a consistent state Document classification –Error detected –Restart from a correct class (High Confidence Ancestor, or HCA)

16 Recovery Oriented Scheme In practice, –Rollback is slow –Identify wrong paths and avoid them To identify wrong paths, –Define closeness indicator (CI) –A path is considered wrong when CI falls below a threshold

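17 Recovery Oriented Scheme Define the distance between the HCA and the current node = 2 [Diagram: class tree with the HCA and the wrong path marked]

18 Recovery Oriented Scheme Define the distance between the HCA and the current node = 2 [Diagram: class tree with the HCA and the wrong path marked]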
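The slides do not give the recovery algorithm in detail, so the following is only one possible reading of it. When the closeness indicator drops below the threshold, the walk returns to the ancestor `hca_distance` levels up (the HCA), marks the branch taken from there as a wrong path, and continues with the remaining branches. Using the best child's score as the CI is an assumption, and all names and numbers here are hypothetical.

```python
def classify_with_recovery(children, score, doc, root,
                           threshold=0.6, hca_distance=2):
    """Top-down classification with rollback to a high confidence ancestor.

    children: dict mapping a node to its list of child nodes.
    score:    callable (node, doc) -> dict of child -> probability.
    The CI is approximated by the best child's probability (assumption).
    """
    path = [root]
    banned = set()  # (parent, child) edges identified as wrong paths
    while children.get(path[-1]):
        node = path[-1]
        candidates = {
            c: p for c, p in score(node, doc).items()
            if (node, c) not in banned
        }
        if not candidates:
            break  # every branch here was ruled out; stop at this node
        best = max(candidates, key=candidates.get)
        ci = candidates[best]
        if ci < threshold and len(path) > 1:
            # Error detected: restart from the HCA and avoid the old branch.
            back = min(hca_distance, len(path) - 1)
            hca_index = len(path) - 1 - back
            banned.add((path[hca_index], path[hca_index + 1]))
            del path[hca_index + 1:]
            continue
        path.append(best)
    return path


# Toy example with fixed scores; the document really belongs under B.
children = {"root": ["A", "B"], "A": ["A1", "A2"], "A1": ["x", "y"],
            "B": ["B1"], "B1": ["z"]}
scores = {"root": {"A": 0.55, "B": 0.45}, "A": {"A1": 0.7, "A2": 0.3},
          "A1": {"x": 0.5, "y": 0.5}, "B": {"B1": 1.0}, "B1": {"z": 1.0}}
result = classify_with_recovery(children, lambda n, d: scores[n], None, "root")
```

In this toy run the walk first follows root, A, A1, finds no confident child at A1, rolls back two levels to root, and then descends through B. Each rollback bans one more edge, so the loop terminates.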

19 Error Masking Scheme Software Fault Tolerance –Run multiple versions of software –Majority voting Document Classification –Run classifiers of different designs –Majority voting

20 O-Classifier Traditional classifier

21 N-classifier Skip some intermediate levels

22 Error Masking Scheme Run three classifiers in parallel –O-classifier –N-classifier –O-classifier using new feature length This selection minimizes the time wasted waiting for the slowest classifier
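A minimal sketch of the voting step, assuming each classifier is a plain callable that returns a class label. The tie-breaking rule (fall back to the first classifier when no label has a strict majority) is an assumption, since the slides do not say how disagreements among all three are resolved.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def majority_vote(predictions, fallback_index=0):
    """Return the label chosen by more than half of the classifiers.

    If there is no strict majority, fall back to one designated
    classifier (an assumption of this sketch).
    """
    label, votes = Counter(predictions).most_common(1)[0]
    if votes > len(predictions) / 2:
        return label
    return predictions[fallback_index]


def classify_masked(classifiers, doc):
    """Run all classifier versions in parallel and vote on the result."""
    with ThreadPoolExecutor(max_workers=len(classifiers)) as pool:
        predictions = list(pool.map(lambda clf: clf(doc), classifiers))
    return majority_vote(predictions)


# Three stand-in classifiers; the second one makes an error that is masked.
versions = [lambda d: "software", lambda d: "hardware", lambda d: "software"]
masked = classify_masked(versions, ["code", "bug"])
```

Because the vote has to wait for every version, the total latency is that of the slowest classifier, which is why the choice of versions matters.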

23 Experiments Data Sets –US Patents: preclassified, rich text content, highly hierarchical 3 sets collected –3 levels/large number of documents –4 levels/large number of documents –7 levels/small number of documents

24 Experiments Algorithms compared –Simple hierarchical –TAPER –Flat –Recovery oriented –Error masking Generally, –flat is the slowest and the most accurate –simple hierarchical is the fastest and the least accurate

25 Accuracy: 3 levels/large

26 Accuracy: 4 levels/large

27 Accuracy: 7 levels/small

28 Performance: 3 levels/large

29 Performance: 4 levels/large

30 Performance: 7 levels/small

31 Conclusion Real-life applications –Large taxonomy –Flat classification is too slow Our algorithm is faster than flat classification with as few as 4 levels Performance gain widens as the number of levels increases A good trade-off between accuracy and performance for most applications

32 Thank You The End
