Low/High Findability Analysis Shariq Bashir Vienna University of Technology Seminar on 2nd February, 2009.




Similar presentations
Mustafa Cayci INFS 795 An Evaluation on Feature Selection for Text Clustering.

Authorship Verification Authorship Identification Authorship Attribution Stylometry.
Analyzing Document Retrievability in Patent Retrieval Settings Shariq Bashir, and Andreas Rauber DEXA 2009, Linz,
Lecture 22: Evaluation April 24, 2010.
Mapping Between Taxonomies Elena Eneva 30 Oct 2001 Advanced IR Seminar.
Supervised classification performance (prediction) assessment Dr. Huiru Zheng Dr. Franscisco Azuaje School of Computing and Mathematics Faculty of Engineering.
Large Scale Findability Analysis Shariq Bashir PhD-Candidate Department of Software Technology and Interactive Systems.
Classification by Machine Learning Approaches - Exercise Solution Michael J. Kerner – Center for Biological Sequence.
Instance Based Learning. Nearest Neighbor Remember all your data When someone asks a question –Find the nearest old data point –Return the answer associated.
Patent Search QUERY Log Analysis Shariq Bashir Department of Software Technology and Interactive Systems Vienna.
Information retrieval Finding relevant data using irrelevant keys Example: database of photographic images sorted by number, date. DBMS: Well structured.
Active Learning Strategies for Drug Screening 1. Introduction At the intersection of drug discovery and experimental design, active learning algorithms.
Important Task in Patents Retrieval Recall is an Important Factor Given Query Patent -> the Task is to Search all Related Patents Patents have Complex.
Evaluating Retrieval Systems with Findability Measurement Shariq Bashir PhD-Student Technology University of Vienna.
Chapter 5 Data mining : A Closer Look.
Roots and Fractional Exponents. You know a square root means a number you take times itself to get a given answer.
Yoonjung Choi.  The Knowledge Discovery in Databases (KDD) is concerned with the development of methods and techniques for making sense of data.  One.
Data Mining – Credibility: Evaluating What’s Been Learned
CSCI 347 / CS 4206: Data Mining Module 06: Evaluation Topic 07: Cost-Sensitive Measures.
Today Evaluation Measures Accuracy Significance Testing
Evaluating Classifiers
Advanced Multimedia Text Classification Tamara Berg.
SVMLight SVMLight is an implementation of Support Vector Machine (SVM) in C. Download source from :
A Multivariate Biomarker for Parkinson’s Disease M. Coakley, G. Crocetti, P. Dressner, W. Kellum, T. Lamin The Michael L. Gargano 12 th Annual Research.
Slide Image Retrieval: A Preliminary Study Guo Min Liew and Min-Yen Kan National University of Singapore Web IR / NLP Group (WING)
Learning with Positive and Unlabeled Examples using Weighted Logistic Regression Wee Sun Lee National University of Singapore Bing Liu University of Illinois,
Evaluation – next steps
Improving Web Spam Classification using Rank-time Features September 25, 2008 TaeSeob,Yun KAIST DATABASE & MULTIMEDIA LAB.
WOW World of Walkover-weight “My God, it’s full of cows!” (David Bowman, 2001)
Data Mining: Classification & Predication Hosam Al-Samarraie, PhD. Centre for Instructional Technology & Multimedia Universiti Sains Malaysia.
Evaluating What’s Been Learned. Cross-Validation Foundation is a simple idea – “ holdout ” – holds out a certain amount for testing and uses rest for.
Experiments in Machine Learning COMP24111 lecture 5 Accuracy (%) A BC D Learning algorithm.
Classification Performance Evaluation. How do you know that you have a good classifier? Is a feature contributing to overall performance? Is classifier.
Relevance Detection Approach to Gene Annotation Aid to automatic annotation of databases Annotation flow –Extraction of molecular function of a gene from.
Data Management and Database Technologies 1 DATA MINING Extracting Knowledge From Data Petr Olmer CERN
Computational Intelligence: Methods and Applications Lecture 20 SSV & other trees Włodzisław Duch Dept. of Informatics, UMK Google: W Duch.
SDO Progress Presentation. Agenda Benchmark dataset – Acquisition – Future additions – Class balancing and problems Image Processing – Image parameters.
ICCS 2009 IDB Workshop, 18 th February 2010, Madrid 1 Training Workshop on the ICCS 2009 database Weighting and Variance Estimation picture.
Bing LiuCS Department, UIC1 Chapter 8: Semi-supervised learning.
WEKA Machine Learning Toolbox. You can install Weka on your computer from
Data Mining Practical Machine Learning Tools and Techniques By I. H. Witten, E. Frank and M. A. Hall Chapter 5: Credibility: Evaluating What’s Been Learned.
Machine Learning Tutorial-2. Recall, Precision, F-measure, Accuracy Ch. 5.
An Exercise in Machine Learning
Evaluating Classification Performance
ECE 471/571 - Lecture 19 Review 11/12/15. A Roadmap 2 Pattern Classification Statistical ApproachNon-Statistical Approach SupervisedUnsupervised Basic.
***Classification Model*** Hosam Al-Samarraie, PhD. CITM-USM.
Chapter 5: Credibility. Introduction Performance on the training set is not a good indicator of performance on an independent set. We need to predict.
A Brief Introduction and Issues on the Classification Problem Jin Mao Postdoc, School of Information, University of Arizona Sept 18, 2015.
Evaluating Classifiers Reading: T. Fawcett, An introduction to ROC analysis, Sections 1-4, 7 (linked from class website)An introduction to ROC analysis.
Improving the Classification of Unknown Documents by Concept Graph Morteza Mohagheghi Reza Soltanpour
Machine Learning in Practice Lecture 10 Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute.
ID Identification in Online Communities Yufei Pan Rutgers University.
Chapter 5: Credibility. Introduction Performance on the training set is not a good indicator of performance on an independent set. We need to predict.
Team: flyingsky Reporter: YanJie Fu & ChuanRen Liu Institution: Chinese Academy of Sciences.
IR 6 Scoring, term weighting and the vector space model.
Big Data Processing of School Shooting Archives
Heart Sound Biometrics for Continual User Authentication
ECE 471/571 - Lecture 19 Review 02/24/17.
Evaluating Classifiers
Performance Evaluation 02/15/17
Data Mining – Credibility: Evaluating What’s Been Learned
Find the Features of Noses
Features & Decision regions
Evaluation and Its Methods
Experiments in Machine Learning
CS539: Project 3 Zach Pardos.
Ensembles An ensemble is a set of classifiers whose combined results give the final decision. test feature vector classifier 1 classifier 2 classifier.
Large Scale Findability Analysis
Assignment 1: Classification by K Nearest Neighbors (KNN) technique
Neural Networks Weka Lab
Presentation transcript:

Low/High Findability Analysis Shariq Bashir Vienna University of Technology Seminar on 2nd February, 2009

Classifying Low/High Findable Documents – Data used in the experiment: USPC Class 422 (chemical apparatus and process disinfecting, deodorizing, preserving, or sterilizing) and USPC Class 423 (chemistry of inorganic compounds). Total documents: 54,353. Queries: 3-term queries (753,682 in total), generated with the Frequent Terms Extraction concept (QG-FT). Retrieval system used: TFIDF.
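
The slides do not spell out the QG-FT query generation in detail; the following is only a rough sketch of how 3-term queries might be built from frequent terms, assuming queries are formed by combining each document's most frequent terms (the function name qg_ft_queries and the top_k parameter are illustrative, not from the original work).

from collections import Counter
from itertools import combinations

def qg_ft_queries(documents, num_terms=3, top_k=10):
    # Combine each document's top_k most frequent terms into
    # num_terms-term queries.  Illustrative sketch only; the exact
    # QG-FT rules (term selection, pruning, deduplication) may differ.
    queries = set()
    for tokens in documents:                # documents: iterable of token lists
        frequent = [t for t, _ in Counter(tokens).most_common(top_k)]
        for combo in combinations(sorted(frequent), num_terms):
            queries.add(" ".join(combo))
    return queries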

Patents Extracted for Analysis – Next, I extract the bottom 173 patents (low findable documents) and the top 157 patents (high findable documents) by findability score for analysis.
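
For orientation, a minimal sketch of the findability scoring this selection rests on, assuming r(d) is simply the number of queries that return document d within a fixed rank cutoff; run_query (the interface to the TFIDF engine) and the cutoff value are assumptions, not details taken from the slides.

from collections import defaultdict

def findability_scores(run_query, queries, cutoff=100):
    # r(d): number of queries that retrieve document d within the top
    # `cutoff` results of the TFIDF engine (cumulative variant).
    r = defaultdict(int)
    for q in queries:
        for doc_id in run_query(q)[:cutoff]:
            r[doc_id] += 1
    return r

def split_low_high(r, n_low=173, n_high=157):
    # Sort documents by findability; return the bottom n_low (low
    # findable) and top n_high (high findable) document ids.
    ranked = sorted(r.items(), key=lambda kv: kv[1])
    return [d for d, _ in ranked[:n_low]], [d for d, _ in ranked[-n_high:]]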

Features Extraction – Next, I extract features from these patents so that we can classify low or high findable documents with a classification model, without running the expensive findability measurement. Features that I considered useful are: – Patent length (claim section only). – Number of two-term pairs in the claim section with support greater than 2. – Two-term pair frequencies within the individual patent. – Two-term pair frequencies in the whole collection. – Two-term pair frequencies in the patent's 30 most similar patents.
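
One convenient way to keep the five features and the class label together per patent is a small record type, sketched below; the field names are mine and purely illustrative.

from dataclasses import dataclass

@dataclass
class PatentFeatures:
    claim_length: int            # F1: claim-section length in terms
    num_claim_pairs: int         # F2: two-term pairs in the claim with support > 2
    pair_freq_patent: float      # F3: average pair frequency within the patent
    pair_freq_collection: float  # F4: average pair frequency in the whole collection
    pair_freq_knn: float         # F5: average pair frequency in the 30 most similar patents
    label: str                   # class: "L" (low findable) or "H" (high findable)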

Features Analysis – Patent length, claim section only (first feature). Clearly, considering patent length alone, we cannot differentiate low and high findable documents: some short patents are highly findable, and many longer patents are low findable.

Features Analysis – Number of two-term pairs in the claim section with support greater than 2 (second feature). Again, considering this feature alone, we cannot differentiate low and high findable documents. However, for high findable patents the support is slightly higher.
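
A minimal sketch of how such pairs could be counted, assuming "support" means the number of claim sentences in which both terms co-occur; the slides do not define support precisely, so this is an assumption.

from collections import Counter
from itertools import combinations

def claim_pairs_with_support(claim_sentences, min_support=3):
    # claim_sentences: list of token lists, one per claim sentence.
    # Keep the two-term pairs whose support (sentence co-occurrence
    # count, by assumption) is greater than 2.
    support = Counter()
    for sentence in claim_sentences:
        for pair in combinations(sorted(set(sentence)), 2):
            support[pair] += 1
    return {pair: s for pair, s in support.items() if s >= min_support}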

Features Analysis – Two-term pair frequencies within the individual patent, for pairs with support greater than 2 in the claim section (third feature). – The main aim of this feature was to analyze whether patent writers try to hide their information from retrieval systems by lowering the frequencies of terms. – Since there can be many pairs in each patent, I take the average of their support values in the analysis.
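
A sketch of the averaging step, under the simplifying assumption that a pair's frequency inside a patent is the smaller of the two terms' occurrence counts; the original feature may be computed differently.

from collections import Counter

def avg_pair_frequency_in_patent(pairs, patent_tokens):
    # F3 sketch: average, over the selected claim pairs, of the pair's
    # frequency in the patent's full text (min of the two term counts).
    if not pairs:
        return 0.0
    counts = Counter(patent_tokens)
    freqs = [min(counts[a], counts[b]) for a, b in pairs]
    return sum(freqs) / len(freqs)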

Features Analysis – The frequency is slightly higher for high findable documents. – However, some high findable patents still have low frequencies, and some low findable patents have high frequencies.

Features Analysis – Two-term pair frequencies in the whole collection (fourth feature). – The main aim of this feature was to analyze the presence of rare term pairs in individual patents. – Since there can be many pairs in each patent, I take the average of their support values in the analysis.
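
The collection-level variant could look like the sketch below, where every patent is represented by its set of terms and a pair's collection frequency is the number of patents containing both terms; document-level counting is my assumption.

def avg_pair_frequency_in_collection(pairs, collection_term_sets):
    # F4 sketch: for each selected pair, count the patents in the whole
    # collection that contain both terms, then average over the pairs.
    if not pairs:
        return 0.0
    counts = [sum(1 for doc in collection_term_sets if a in doc and b in doc)
              for a, b in pairs]
    return sum(counts) / len(counts)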

Features Analysis – The frequency is higher for high findable documents. – That means low findable patents frequently use rare terms.

Features Analysis – Two-term pair frequencies in the patent's 30 most similar patents (fifth feature). – In the previous rare-terms analysis, I used the whole collection, treating it as a single cluster. – For this feature, I create a cluster for every patent using a K-NN approach. – In K-NN, I consider only the 30 most similar patents.
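
A sketch of the K-NN neighbourhood construction, using scikit-learn's TFIDF vectorizer and cosine similarity as a stand-in for whatever similarity function the original experiments used.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_k_similar(patent_texts, k=30):
    # For every patent, return the indices of its k most similar patents
    # by TFIDF cosine similarity, excluding the patent itself.  For a
    # 54,000-document collection the similarity matrix should be
    # computed in blocks rather than all at once.
    tfidf = TfidfVectorizer().fit_transform(patent_texts)
    sims = cosine_similarity(tfidf)
    np.fill_diagonal(sims, -1.0)
    return np.argsort(-sims, axis=1)[:, :k]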

Features Analysis – The frequency is higher for high findable documents. – That means the term pairs used in low findable patents often cannot be found in their most similar patents.

Putting it all Together – Classifying low/high findable documents without using findability measurement. I used all these patent features to train classification models. For classification training, I used the WEKA toolkit. As class labels I used L (low findable) and H (high findable).
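
A small sketch of how the feature vectors could be handed to WEKA: write them to an ARFF file and run the classifiers from the command line. The file name, attribute names, and fold count are illustrative; the slides do not show how the data was actually prepared.

def write_arff(rows, path="findability.arff"):
    # rows: iterable of PatentFeatures (see the earlier sketch).
    header = [
        "@relation findability",
        "@attribute claim_length numeric",
        "@attribute num_claim_pairs numeric",
        "@attribute pair_freq_patent numeric",
        "@attribute pair_freq_collection numeric",
        "@attribute pair_freq_knn numeric",
        "@attribute class {L,H}",
        "@data",
    ]
    with open(path, "w") as f:
        f.write("\n".join(header) + "\n")
        for r in rows:
            f.write(f"{r.claim_length},{r.num_claim_pairs},{r.pair_freq_patent},"
                    f"{r.pair_freq_collection},{r.pair_freq_knn},{r.label}\n")

# The resulting file can then be evaluated from the WEKA command line, e.g.:
#   java -cp weka.jar weka.classifiers.functions.MultilayerPerceptron -t findability.arff -x 10
#   java -cp weka.jar weka.classifiers.trees.J48 -t findability.arff -x 10
#   java -cp weka.jar weka.classifiers.bayes.NaiveBayes -t findability.arff -x 10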

Sample Dataset – Columns: #, r(d), F1, F2, F3, F4, F5, Class. Class labels of the sample rows: H, H, L, H, H, L, L. F1: Patent length (claim section only). F2: Number of two-term pairs in the claim section with support greater than 2. F3: Two-term pair frequencies within the individual patent. F4: Two-term pair frequencies in the whole collection. F5: Two-term pair frequencies in the patent's 30 most similar patents. Class: L (low findable), H (high findable).

Multilayer Perceptron (with Cross-Validation 100) – WEKA evaluation summary over 330 instances: correctly and incorrectly classified instances (%), kappa statistic, mean absolute error, root mean squared error, relative absolute error (%), root relative squared error (%). === Detailed Accuracy By Class === TP Rate, FP Rate, Precision, Recall, F-Measure, and ROC Area for classes L and H, plus the weighted average.

Accuracy with J48 – the same WEKA evaluation summary over 330 instances, with detailed per-class accuracy for classes L and H and the weighted average.

Naïve Bayes – the same WEKA evaluation summary over 330 instances, with detailed per-class accuracy for classes L and H and the weighted average.

Some Other Features could be: frequency of term pairs in referenced or cited patents; frequency of term pairs in similar USPC classes.