Low/High Findability Analysis
Shariq Bashir
Vienna University of Technology
Seminar on 2nd February, 2009
Classifying Low/High Findable Documents
Data used in the Experiment:
– USPC Class 422 (Chemical apparatus and process disinfecting, deodorizing, preserving, or sterilizing) and USPC Class 423 (Chemistry of inorganic compounds).
– Total Documents: 54,353.
– Queries: 3-term queries (753,682 in total), generated using the Frequent Terms extraction concept (QG-FT).
– Retrieval System used: TFIDF.
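The QG-FT query generation step can be sketched roughly as follows. This is my own minimal Python reconstruction, not the original implementation: the frequency threshold, the function name `qgft_queries`, and the choice of "every combination of frequent terms" are all assumptions.

```python
from collections import Counter
from itertools import combinations

def qgft_queries(doc_terms, min_freq=2, query_len=3):
    """Sketch of QG-FT: build 3-term queries from a document's frequent terms.

    doc_terms: the (already tokenized) terms of one document.
    min_freq:  a term must occur at least this often to count as "frequent"
               (assumed threshold, not stated on the slides).
    """
    counts = Counter(doc_terms)
    frequent = sorted(t for t, c in counts.items() if c >= min_freq)
    # Every combination of `query_len` frequent terms becomes one query.
    return [" ".join(combo) for combo in combinations(frequent, query_len)]

queries = qgft_queries(["acid", "acid", "gas", "gas", "filter", "filter",
                        "valve", "valve", "the"])
# "the" occurs only once and is dropped; 4 frequent terms -> C(4,3) queries
```

Applied to every document in the collection, a scheme like this easily yields query sets on the scale of the 753,682 queries mentioned above.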
Patents Extracted for Analysis
Next, I extracted the bottom 173 patents (Low Findable documents) and the top 157 patents (High Findable documents) for analysis.
Features Extraction
Next, I try to extract features from these patents, so that we can classify Low or High Findable documents with a classification model, without performing a heavy findability measurement. Features that I considered useful are:
– Patent length (Claim section only).
– Number of two-term pairs in the Claim section with support greater than 2.
– Two-term pair frequencies in individual Patents.
– Two-term pair frequencies in the whole Collection.
– Two-term pair frequencies in its 30 most similar Patents.
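The first three features can be sketched in Python as follows. This is an illustrative reconstruction under one stated assumption: the slides do not define "support" of a two-term pair, so here it is read as the number of claim sentences in which both terms co-occur; the helper names are my own.

```python
from collections import Counter
from itertools import combinations

def pair_supports(sentences):
    """Support of each unordered term pair, taken here as the number of
    claim sentences in which both terms co-occur (assumed definition)."""
    support = Counter()
    for sent in sentences:
        for pair in combinations(sorted(set(sent)), 2):
            support[pair] += 1
    return support

def patent_features(claim_sentences):
    """F1-F3 for one patent, given its claim as a list of token lists."""
    terms = [t for sent in claim_sentences for t in sent]
    support = pair_supports(claim_sentences)
    frequent_pairs = {p: c for p, c in support.items() if c > 2}
    f1 = len(terms)                # F1: patent length (claims only)
    f2 = len(frequent_pairs)       # F2: number of pairs with support > 2
    # F3: average support of those pairs within this patent
    f3 = (sum(frequent_pairs.values()) / f2) if f2 else 0.0
    return f1, f2, f3
```

F4 and F5 follow the same pattern, with the support counted over the whole collection or over the patent's 30 nearest neighbours instead of a single patent.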
Features Analysis
Patent length (Claim section only). (First Feature)
Clearly, considering patent length alone, we cannot differentiate Low and High Findable documents: some short patents are highly findable, and many long patents are low findable.
Features Analysis
– Number of two-term pairs in the Claim section with support greater than 2. (Second Feature)
Again, this feature alone clearly cannot differentiate Low and High Findable documents. However, on High Findable Patents the support values are slightly higher.
Features Analysis
– Two-term pair frequencies in individual Patents, for pairs with support greater than 2 in the Claim section. (Third Feature)
– The main aim of checking this feature was to analyze whether patent writers try to hide their information (from Retrieval Systems) by lowering the frequencies of terms.
– Since there can be many pairs in each Patent, in the analysis I take the average of their support values.
Features Analysis
– The frequency is slightly higher for High Findable documents.
– However, some High Findable Patents still have low frequencies, and some Low Findable Patents have high frequencies.
Features Analysis
– Two-term pair frequencies in the whole Collection. (Fourth Feature)
– The main aim of checking this feature was to analyze the presence of rare term pairs in individual Patents.
– Since there can be many pairs in each Patent, in the analysis I take the average of their support values.
Features Analysis
– The frequency is clearly higher for High Findable documents.
– That means Low Findable Patents frequently use rare terms.
Features Analysis
– Two-term pair frequencies in the 30 most similar Patents. (Fifth Feature)
– In the last rare-terms analysis, I used the whole collection, treating it as a single cluster.
– For this feature, I create a cluster for every Patent using a K-NN approach.
– In K-NN, I consider only the 30 most similar Patents.
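The per-patent K-NN clustering can be sketched as below. The slides do not name a similarity measure, so cosine similarity over term-frequency vectors is an assumption here, as are the function names.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_cluster(target: Counter, collection: dict, k: int = 30):
    """IDs of the k patents most similar to `target`: its per-patent cluster.

    collection: mapping of patent id -> term-frequency Counter.
    """
    ranked = sorted(collection,
                    key=lambda pid: cosine(target, collection[pid]),
                    reverse=True)
    return ranked[:k]
```

F5 is then the average pair support computed only within this 30-patent cluster rather than the full 54,353-document collection.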
Features Analysis
– The frequency is higher for High Findable documents.
– That means the term pairs used in Low Findable Patents cannot be found in their most similar Patents.
Putting it all Together
Classifying Low/High Findable documents without using Findability Measurement. I used all these features of the Patents for training classification models. For classifier training, I used the WEKA toolkit. As class labels, I used L (for Low Findable) and H (for High Findable).
Sample Dataset
Columns: #, r(d), F1, F2, F3, F4, F5, Class. Class labels of the sample rows: H, H, L, H, H, L, L.
F1: Patent length (Claim section only).
F2: Number of two-term pairs in the Claim section with support greater than 2.
F3: Two-term pair frequencies in individual Patents.
F4: Two-term pair frequencies in the whole Collection.
F5: Two-term pair frequencies in its 30 most similar Patents.
Class: L (Low Findable), H (High Findable).
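A dataset of this shape can be handed to WEKA as an ARFF file; below is a minimal Python sketch of the export. The attribute names and the `write_arff` helper are my own illustration, not the original file.

```python
def write_arff(rows, path="findability.arff"):
    """Write feature rows to a WEKA ARFF file.

    rows: list of (f1, f2, f3, f4, f5, cls) tuples, cls in {"L", "H"}.
    Attribute names below are illustrative placeholders.
    """
    header = ["@relation findability"]
    for name in ("f1_claim_length", "f2_pairs_support_gt2",
                 "f3_avg_pair_freq_patent", "f4_avg_pair_freq_collection",
                 "f5_avg_pair_freq_30nn"):
        header.append(f"@attribute {name} numeric")
    header.append("@attribute class {L,H}")
    header.append("@data")
    lines = header + [",".join(map(str, row)) for row in rows]
    with open(path, "w") as fh:
        fh.write("\n".join(lines) + "\n")
```

The resulting file can be loaded directly in the WEKA Explorer for training the classifiers reported on the next slides.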
Accuracy with Multilayer Perceptron (Cross-Validation 100)
Correctly Classified Instances: … %
Incorrectly Classified Instances: … %
Kappa statistic: …
Mean absolute error: …
Root mean squared error: …
Relative absolute error: … %
Root relative squared error: … %
Total Number of Instances: 330
=== Detailed Accuracy By Class ===
TP Rate, FP Rate, Precision, Recall, F-Measure, and ROC Area for classes L and H, and the Weighted Avg.
Accuracy with J48
Correctly Classified Instances: … %
Incorrectly Classified Instances: … %
Kappa statistic: …
Mean absolute error: …
Root mean squared error: …
Relative absolute error: … %
Root relative squared error: … %
Total Number of Instances: 330
=== Detailed Accuracy By Class ===
TP Rate, FP Rate, Precision, Recall, F-Measure, and ROC Area for classes L and H, and the Weighted Avg.
Accuracy with Naïve Bayes
Correctly Classified Instances: … %
Incorrectly Classified Instances: … %
Kappa statistic: …
Mean absolute error: …
Root mean squared error: …
Relative absolute error: … %
Root relative squared error: … %
Total Number of Instances: 330
=== Detailed Accuracy By Class ===
TP Rate, FP Rate, Precision, Recall, F-Measure, and ROC Area for classes L and H, and the Weighted Avg.
Some Other Features could be:
– Frequency of term pairs in Referenced or Cited Patents.
– Frequency of term pairs in similar USPC classes.