AN OPTIMIZED APPROACH FOR kNN TEXT CATEGORIZATION USING P-TREES Imad Rahal and William Perrizo Computer Science Department North Dakota State University.

Presentation transcript:

AN OPTIMIZED APPROACH FOR kNN TEXT CATEGORIZATION USING P-TREES Imad Rahal and William Perrizo Computer Science Department North Dakota State University Fargo, ND

2 Outline
- The Text Categorization problem
- The P-tree technology
- Vector Space Model
- Proposed Solution
  - Intervalization (discretization)
  - P-tree representation
  - Similarity measures
  - Categorization algorithm
- Performance analysis study

3 Text categorization problem
Text categorization (also called topic spotting or text classification) is the process of assigning categories or labels to documents based entirely on their contents.
Problems:
- Text has no explicit structure, unlike other data (e.g. relational data); information is described freely in the documents
- (After introducing structure) a huge number of features

4 Motivation
Increase in the number of text documents (on the Internet!):
- Medical articles
- Research publications
- E-mails
- News reports (e.g. Reuters)
- Others
Most algorithms fail to scale up because of the curse of dimensionality.
Most algorithms suffer from relatively low accuracy.

5 The P-tree technology
A tree-like data structure that stores numeric (and categorical) relational data in bit-compressed format by splitting each attribute into its bit positions and representing each bit position by a P-tree.

6 Transformation to binary

7 Each binary column will form a P-tree
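To make the bit-splitting concrete, here is a minimal Python sketch of the transformation. It represents each bit position as a plain list of 0/1 values rather than as a compressed quadrant tree, so it illustrates only the bit-slicing step, not the P-tree compression itself; the 8-bit width and the sample values are assumptions for illustration.

```python
# A minimal sketch (not the compressed quadrant-tree structure itself):
# each attribute is split into its bit positions, giving one bit column per position.
def bit_slice(column, bits=8):
    """Split a column of unsigned integers into `bits` bit columns,
    most significant position first."""
    return [[(value >> pos) & 1 for value in column]
            for pos in range(bits - 1, -1, -1)]

# Example: a single 8-bit attribute A over 4 tuples (values are made up).
A = [225, 9, 117, 64]
slices = bit_slice(A)                       # slices[0] is the bit column for position 7
for i, s in enumerate(slices):
    print(f"P_A,{7 - i}: {s}")              # one (uncompressed) P-tree per bit position
```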

8 AND and OR operations

9 Complement operation

10 P-trees are characterized by:
- One-time creation cost
- Compression
- High-speed processing (ANDing, no DB scans)
The latest benchmark on P-tree ANDing has shown a speed of 6 ms for two 1320x1320 images (i.e. two bit sequences, each containing 1.6 million bits, represented using P-trees).

11 We have 8 P-trees in total for attribute A in the previous example: PA,7, PA,6, PA,5, PA,4, PA,3, PA,2, PA,1 and PA,0.
To query for a certain attribute value, say A = 11100001, we AND the corresponding bit P-trees, using the complement P' wherever the bit is 0:
PA,11100001 = PA,7 & PA,6 & PA,5 & P'A,4 & P'A,3 & P'A,2 & P'A,1 & PA,0
We can also use varying bit precision. To query for A = 001, we compute:
PA,001 = P'A,3 & P'A,2 & PA,0
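Below is a hedged sketch, in the same uncompressed bit-column style as the previous example, of how such a value query reduces to ANDing bit P-trees and their complements. The handling of reduced precision (matching only the most significant bits that are supplied) is one plausible reading of the varying-precision query and is an assumption of the sketch.

```python
# Value query by ANDing bit columns and their complements (uncompressed sketch).
def bit_slice(column, bits=8):              # as in the previous sketch
    return [[(v >> p) & 1 for v in column] for p in range(bits - 1, -1, -1)]

def value_ptree(slices, value_bits):
    """Mark rows whose most significant bits equal the string `value_bits`.
    Passing fewer bits than the attribute width gives a lower-precision match."""
    result = [1] * len(slices[0])                             # pure-1 start (all rows)
    for col, wanted in zip(slices, value_bits):               # most significant bit first
        col = col if wanted == "1" else [1 - b for b in col]  # complement P' for 0-bits
        result = [r & c for r, c in zip(result, col)]         # P-tree AND
    return result

slices = bit_slice([225, 9, 117, 64])
mask = value_ptree(slices, "11100001")      # rows where A = 11100001
print(mask, "root count =", sum(mask))      # root count = number of matching rows
partial = value_ptree(slices, "111")        # only the top 3 bits are constrained
```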

12 Vector Space Model
- Each document is represented as a vector whose dimensions are the terms in the initial document collection.
- Each vector coordinate corresponds to a term and has a numeric value that represents the term's relevance to the document; usually, higher values imply higher relevance.

13 Three popular weighting schemes are: binary, TF, and TF*IDF.
- The binary scheme uses the values 1 and 0 to indicate whether or not a term occurs in the document.
- The term frequency (TF) scheme counts the occurrences of a term in a document. Measures are usually normalized to help overcome the problems associated with varying document length.

14 The TF*IDF scheme multiplies the coordinate value derived by the TF scheme by a global weight called the inverse document frequency (IDF). The IDF measure for term t is defined as log(N/Nt), where N is the total number of documents and Nt is the number of documents containing t. Cosine normalization is usually applied to the resulting vectors.
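As a small illustration of the three weighting schemes and the cosine normalization, the following Python sketch computes binary, TF and TF*IDF weights over a toy three-document collection; the documents and tokenization are made up, and no particular IR library is assumed.

```python
# Toy illustration of the binary, TF and TF*IDF schemes with cosine normalization.
import math
from collections import Counter

docs = [["ptree", "knn", "text", "knn"],        # made-up tokenized documents
        ["vector", "space", "text"],
        ["knn", "vector", "categorization"]]

N = len(docs)
df = Counter(t for d in docs for t in set(d))           # Nt: documents containing t
idf = {t: math.log(N / n_t) for t, n_t in df.items()}   # IDF = log(N / Nt)

def binary_weights(doc):
    return {t: 1 for t in set(doc)}                     # 1 if the term occurs, else absent (0)

def tf_weights(doc):
    return dict(Counter(doc))                           # raw term frequencies

def tfidf_weights(doc):
    w = {t: f * idf[t] for t, f in Counter(doc).items()}    # TF * IDF
    norm = math.sqrt(sum(v * v for v in w.values()))        # cosine normalization
    return {t: v / norm for t, v in w.items()} if norm else w

print(tfidf_weights(docs[0]))
```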

15 Proposed solution
- Model 1: Classification over the binary representation is fast but not accurate.
- Model 2: Classification using exact counts (TF, IDF, normalized TF, ...) is more accurate but slower (very high-dimensional space).
This can be viewed as a concept hierarchy.

16 Work along this hierarchy by using intervals:
- Better speed than Model 2 (approaching Model 1)
- Better accuracy than Model 1 (approaching Model 2)

17 An example
- Say we are using TF (values normalized to the range [0,1]).
- Divide the range into 4 intervals: None, Low, Medium, High.
- Each interval is represented by a string of bits (we have four intervals, so we need 2 bits): None = "00", Low = "01", Medium = "10" and High = "11" (note the order among them).
- Each bit position is represented by a P-tree, so we have 2 P-trees for every dimension.
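A small sketch of this intervalization step is given below: normalized TF values are mapped to 2-bit interval codes and split into two bit columns per term. The interval boundaries used here are the ones reported later in the experiments (I0=[0,0], I1=(0,0.25], I2=(0.25,0.75], I3=(0.75,1]); everything else is illustrative.

```python
# Intervalization sketch: normalized TF values -> 2-bit interval codes -> two bit columns.
# The boundaries follow the 4-interval set used later in the experiments:
# I0 = [0,0], I1 = (0,0.25], I2 = (0.25,0.75], I3 = (0.75,1].
def interval_code(tf):
    """Map a normalized TF value in [0,1] to a 2-bit code: None/Low/Medium/High."""
    if tf == 0:
        return 0b00          # None
    if tf <= 0.25:
        return 0b01          # Low
    if tf <= 0.75:
        return 0b10          # Medium
    return 0b11              # High

def term_ptrees(tf_column):
    """Return the two bit columns (high bit, low bit) for one term dimension."""
    codes = [interval_code(v) for v in tf_column]
    return ([(c >> 1) & 1 for c in codes],   # P_t,1 : high-order interval bit
            [c & 1 for c in codes])          # P_t,0 : low-order interval bit

high, low = term_ptrees([0.0, 0.12, 0.5, 0.9])
print(high, low)                             # [0, 0, 1, 1] [0, 1, 0, 1]
```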

18

19 kNN Algorithm
- Used to find the k most similar points (referred to as the k neighbours) to some given point P in some space, and then to assign a proper class to P using the class labels of those k neighbours.
- Usually proceeds by selecting the neighbours first (selection phase) and then assigning the class label (voting phase).

20 Categorization Algorithm: Selection Phase
1. Initialize a P-tree, Pnn, to contain only pure-1 quadrants (i.e. all entries in it are 1's) – the identity P-tree.
2. Order the set S of all term P-trees in descending order, from term P-trees representing higher interval values in dnew to those representing lower ones.
3. For every term P-tree, Pt, in S do the following:
   - AND Pnn with Pt.
   - If the root count of the result is less than k, expand Pt by removing the rightmost bit from the interval value (i.e. intervals 01 and 00 become 0, and intervals 10 and 11 become 1). This can be done by recalculating Pt while disregarding the rightmost-bit P-tree. Repeat this step until the root count of Pnn AND Pt is at least k – this is guaranteed to happen at the latest when all the bits are disregarded.
   - Otherwise, store the result in Pnn.
End of selection phase
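The following Python sketch mirrors the selection phase over uncompressed bit columns for the 2-bit interval case described above. It assumes the training documents have already been intervalized into (high-bit, low-bit) columns per term (as in the term_ptrees() sketch earlier) and that dnew is a mapping from terms to 2-bit interval codes; real P-trees would perform the same ANDs on compressed trees.

```python
# Selection-phase sketch over uncompressed bit columns (2-bit interval codes).
# `train` maps each term to its (high-bit, low-bit) columns over the training
# documents (as produced by term_ptrees() above); `d_new` maps each term of the
# new document to its 2-bit interval code.
def ptree_and(a, b):
    return [x & y for x, y in zip(a, b)]

def term_match(train, term, code, drop_low_bit=False):
    """Rows whose interval value for `term` equals `code`; with drop_low_bit,
    only the high-order interval bit is compared (the expanded query)."""
    high, low = train[term]
    result = high if code & 0b10 else [1 - b for b in high]     # complement for 0-bits
    if not drop_low_bit:
        low_col = low if code & 0b01 else [1 - b for b in low]
        result = ptree_and(result, low_col)
    return result

def select_neighbours(train, d_new, k, n_docs):
    p_nn = [1] * n_docs                                          # identity P-tree (all 1's)
    # order term P-trees from higher to lower interval values in d_new
    for term, code in sorted(d_new.items(), key=lambda kv: -kv[1]):
        candidate = ptree_and(p_nn, term_match(train, term, code))
        if sum(candidate) < k:                                   # root count < k: expand Pt
            candidate = ptree_and(p_nn, term_match(train, term, code, drop_low_bit=True))
            if sum(candidate) < k:
                continue                                         # all bits disregarded: Pnn unchanged
        p_nn = candidate
    return p_nn                                                  # 1-bits mark the neighbours
```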

21 (Figure: Pnn, P3, P7, P6, P4, P5, P1, P2)

22 Categorization Algorithm: Voting Phase
1. For every class ci, loop through the dnew vector and do the following for every term tj in dnew:
   - Get the P-tree representing the neighbouring documents (Pnn from the selection phase) that have the same value for tj (Ptj) and class ci (Pi). This can be done by calculating Presult = Ptj AND Pnn AND Pi.
   - If the term under consideration has interval value Ij, multiply the root count of Presult by (Ij + 1). (If we want to neglect Ij = "00", do not add 1.)
   - Add the result to the counter of ci, w(ci).
2. Select the class ck having the largest counter w(ck) as the class of dnew.
End of voting phase
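A corresponding sketch of the voting phase is shown below, reusing ptree_and() and term_match() from the selection-phase sketch. The class P-trees are modelled as plain bit columns marking the training documents of each class; that representation, and the optional flag for neglecting the "00" interval, are assumptions of this sketch.

```python
# Voting-phase sketch, reusing ptree_and() and term_match() from the previous sketch.
# `class_ptrees` maps each class label to a bit column marking its training documents.
def vote(train, d_new, p_nn, class_ptrees, ignore_none=False):
    scores = {}
    for ci, p_i in class_ptrees.items():
        w = 0
        for term, code in d_new.items():
            p_tj = term_match(train, term, code)
            p_result = ptree_and(ptree_and(p_tj, p_nn), p_i)    # Ptj AND Pnn AND Pi
            weight = code if ignore_none else code + 1          # (Ij + 1); Ij alone neglects "00"
            w += sum(p_result) * weight                         # root count * interval weight
        scores[ci] = w
    return max(scores, key=scores.get)                          # class with the largest w(ci)
```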

23 Performance analysis study
Compared accuracy and speed to cosine-similarity kNN, and accuracy to the string kernels approach of Lodhi et al. (Journal of Machine Learning Research, Feb. 2002).
Speed:
- Used synthetic document x term matrices of different sizes

24

25 Accuracy
- Followed the sampling approach described in the string kernels work
- Tested over a subset of the Reuters collection (analysis over the whole dataset is still underway)
- Experimented on four classes, namely: acquisition, earn, corn, and crude
- Used k=3 and a 4-interval value set: I0=[0,0], I1=(0,0.25], I2=(0.25,0.75] and I3=(0.75,1]
- Averaged precision (not shown), recall (not shown) and F1-measures (2pr/(p+r)) for our approach and cosine kNN, and compared with string kernels
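For reference, the reported measure can be computed as in the small helper below; the per-class precision/recall pairs shown are made up for illustration only and are not the paper's results.

```python
# F1 = 2pr / (p + r), macro-averaged over the four classes.
# The precision/recall pairs below are made up for illustration only.
def f1(p, r):
    return 2 * p * r / (p + r) if (p + r) else 0.0

per_class = {"earn": (0.96, 0.94), "acq": (0.91, 0.89),
             "crude": (0.88, 0.90), "corn": (0.85, 0.87)}
macro_f1 = sum(f1(p, r) for p, r in per_class.values()) / len(per_class)
```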

26 F1-measure values (table comparing the P-tree based approach, KNN, and string kernels on the Earn, Acq, Crude, and Corn classes)

27 Compared to the KNN approach, we show much better results in terms of both speed and accuracy. The improvement in speed is mainly due to the complexity of the selection phase, O(n) vs. O(mn), where m is the size of the dataset (number of rows) and n is the number of dimensions, as well as to the P-tree ANDing speed.

28 As for accuracy, the KNN approach uses the angle between the vectors and considers all terms. Our approach uses ANDing to compare the closeness of each term's value and to ignore unneeded terms (those whose ANDing yields fewer than k neighbours).

29 As for the kernels approach, it would not be appropriate to compare speeds here because the two approaches are fundamentally different:
- Example-based vs. eager
- Context-sensitive vs. context-insensitive
In general, the accuracy results were very comparable.

30 The precision, recall and F1 measurements of the other two approaches spread over a wider range than ours do, which indicates that our P-tree based approach's accuracy is less variable across categories or classes, leading to more stable results in general.

31 Drawbacks
Needs tuning:
- We need to decide upon the number of intervals and their ranges ahead of time (analysis of varying these is still underway)
- Since this is a kNN algorithm, k must also be known ahead of time

32 Conclusion
We have shown higher accuracy due to:
- The use of sequential ANDing in selection
- Very fair voting
- Use of a closed neighbourhood (in case the root count is greater than k) – refer to Maleq Khan's thesis (Dec. 2001) for previous work

33 Better space utilization (reduced, compressed space):
- Reduced space due to intervalization (from 8 bits to 2 bits, a reduction by a factor of 4)
- Compression due to the use of P-trees
Higher speed, due to P-trees:
- No DB scans
- Based on the AND operation, which is among the fastest computer instructions

34 Future directions
- Solve the problem of random ANDing for term P-trees having the same values (information gain?)
- Test the effects of varying the number of intervals and their values over different datasets
- Analyze speed and accuracy results over large datasets (the whole Reuters collection)