

Patent Search Query Log Analysis
Shariq Bashir
Department of Software Technology and Interactive Systems
Vienna University of Technology

General Theme
• In automatic evaluation of IR systems, query generation plays an important role.
• The space of possible queries is generally very large.
• We need to understand how to generate reasonable queries.
• In this work, we study this issue with the help of a patent search query log.

Automatic Query Generation for Analysis
• Motivation/Problem
  – Patents contain a large number of terms.
  – Analyzing IR systems using all combinations of terms is difficult: it demands large processing time and can give a misleading picture.
  – A large share of term combinations are never used by users.
  – Question: how can we generate reasonable queries?

Query Log of Patent Search
• A patent search query log can help in generating queries for analysis.
• Patent search users are experienced; we can draw on their experience to generate effective queries.
• In query log analysis, we have the query patents on one side and their query logs on the other. This helps us to:
  – understand the types of terms that are mostly used for searching patents;
  – prune irrelevant terms.

Applications of Query Log Analysis
  – Analyzing bias of retrieval systems (findability of documents).
  – Selecting terms for query expansion.
  – Learning to rank for prior-art search.

Experiments (Query Log Dataset)
• Patent search query logs can be downloaded from the USPTO portal.
• They cannot be downloaded as a whole; each patent's log must be downloaded manually.
• They are available as scanned images, so OCR is needed to convert them to digital text.
• Further cleansing is needed to remove noise from the queries.
  – Some queries contain reference numbers of past queries.
  – Many queries contain numbers:
    · patent application numbers
    · IPC classes

Query Log Example

Queries contain references to earlier queries

Queries contain patent application numbers

Queries contain IPC classes
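
The cleansing step for the three kinds of noise shown above could be sketched as follows. This is only an illustration: the slides do not give the exact log format, so the patterns for query references (such as `L3` or `S12`), application numbers (such as `10/123,456`) and IPC classes (such as `H04L 29/06`), as well as the operator list and the name `extract_text_terms`, are all assumptions.

```python
import re

# All three patterns are assumptions about how noise appears in the OCR'd logs.
IPC_CLASS = re.compile(r"\b[A-H]\d{2}[A-Z]\b(?:\s*\d{1,4}/\d{2,6})?")
APPLICATION_NO = re.compile(r"\b\d{2}/\d{3},?\d{3}\b")
QUERY_REF = re.compile(r"\b[LS]\d+\b")

# Assumed set of search operators that are not content terms.
OPERATORS = {"and", "or", "not", "near", "adj", "with", "same"}


def extract_text_terms(query):
    """Strip assumed noise patterns and return the remaining text terms."""
    for pattern in (IPC_CLASS, APPLICATION_NO, QUERY_REF):
        query = pattern.sub(" ", query)
    words = re.findall(r"[a-z]+", query.lower())
    return [w for w in words if w not in OPERATORS]
```

For a hypothetical log line such as `L3 and (valve near pump) and H04L 29/06 and 10/123,456`, only the text terms `valve` and `pump` would remain.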

Experiments (Query Log Dataset)
• Query logs of 242 patents are used for analysis.
• … queries.
• Only the text queries were considered for analysis.

Query Log Analysis
• Given a query log, we analyze it on the basis of the following factors.
  1. Term frequencies of query terms: does the frequency of terms in patents matter in query formulation?
  2. Proximity/closeness of query terms in the patent text.
  3. Confidence of query terms in similar IPC classes.
  4. Number of retrieved documents.
[Diagram: Query Patent (Y) and Query Log of (Y). The difference between all terms of the query patent and only the query-log terms is studied to guide automatic query generation.]

Term Frequencies in Patents (1)
All terms of query patents:
  1. A large percentage of terms in patents have low frequency.
  2. Only a small percentage of terms have a frequency > 10.

Term Frequencies in Patents (1)
[Percentage out of total terms] selected in queries:
  1. Higher-frequency terms have a high percentage of selection in queries.
  2. Lower-frequency terms (frequency <= 5) have a very low percentage of selection.
Note from the previous slide that almost 75% of terms in patents have frequency <= 5.

Term Frequencies in Patents (1)
[Percentage out of query terms] appearing in the query log:
  1. Higher-frequency terms appear more often in the query log than lower-frequency terms (frequency <= 5).
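
The frequency comparison on these slides can be illustrated with a minimal sketch. It assumes the patent text is already tokenized and splits terms into two buckets at the frequency threshold of 5 used above; the function name and the two-bucket split are illustrative choices, not the exact procedure behind the reported numbers.

```python
from collections import Counter


def selection_rate_by_frequency(patent_tokens, query_terms, threshold=5):
    """Share of low- and high-frequency patent terms that also occur in queries."""
    term_freq = Counter(patent_tokens)
    query_terms = set(query_terms)

    # Split distinct patent terms into two buckets by their frequency.
    buckets = {"low": [], "high": []}
    for term, freq in term_freq.items():
        buckets["low" if freq <= threshold else "high"].append(term)

    rates = {}
    for name, terms in buckets.items():
        selected = sum(1 for term in terms if term in query_terms)
        rates[name] = selected / len(terms) if terms else 0.0
    return rates
```

Called with the tokens of one query patent and the terms of its query log, the function returns one selection rate per bucket, which can then be averaged over all patents.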

Term Proximity/Closeness in Query Log (2)
• Proximity refers to the closeness of two terms in the patent text.
• It helps in understanding whether term proximity matters in query formulation.
• Proximity of terms is calculated with two approaches:
  – minimum distance between two terms;
  – co-occurrence frequency within a window of fixed size.
• Term pairs are selected in two ways:
  – all term pairs of the query patent;
  – only term pairs that appear in the query log.
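
The two proximity measures can be sketched as below. The slides do not define them precisely, so this sketch assumes token positions as the unit of distance and counts co-occurrence as the number of position pairs that lie within the window; both choices, and the function names, are assumptions.

```python
def positions(tokens, term):
    """Indices at which `term` occurs in the token list."""
    return [i for i, token in enumerate(tokens) if token == term]


def min_distance(tokens, a, b):
    """Smallest token distance between any occurrence of `a` and of `b`."""
    pos_a, pos_b = positions(tokens, a), positions(tokens, b)
    if not pos_a or not pos_b:
        return None  # at least one term is absent from the text
    return min(abs(i - j) for i in pos_a for j in pos_b)


def cooccurrence_frequency(tokens, a, b, window=14):
    """Number of (a, b) occurrence pairs at most `window` tokens apart."""
    pos_b = positions(tokens, b)
    return sum(
        1 for i in positions(tokens, a) for j in pos_b if abs(i - j) <= window
    )
```

The default `window=14` matches the window size used in the co-occurrence analysis.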

Term Proximity/Closeness in Query Log
• With minimum distance:
  – Pairs with a smaller minimum distance appear in a larger percentage in the query log than pairs with a larger minimum distance.
  – This indicates that users focus more on terms that are closer together in the text.
  – Among all term pairs of patents, 71% have a minimum distance > 7.

Term Proximity/Closeness in Query Log
• With co-occurrence frequency (window size = 14):
  – Higher co-occurrence pairs make up a larger percentage (90%) of the query log than lower co-occurrence pairs (10%).
  – Almost 75% of all term pairs of patents have a co-occurrence frequency <= 1.

Frequency in Similar IPC Classes
• Query patents fall in many IPC classes.
• Patent users are usually experienced.
• Their terms are more target-oriented.
• We need to check the frequency of query-log term pairs in similar IPC classes.
  – Freq(IPC classes) = Freq / |q_d|
    Freq = frequency in similar IPC classes
    |q_d| = total number of retrieved documents
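
The ratio Freq / |q_d| can be computed as in the following sketch. It assumes that a retrieved document counts as being in a "similar" IPC class when it shares at least one class with the query patent; the slides do not state the similarity criterion, so that rule and the function name are assumptions.

```python
def ipc_support(retrieved_docs, query_patent_classes):
    """Fraction of retrieved documents that share an IPC class with the query patent.

    retrieved_docs maps a document id to the set of IPC classes of that document.
    """
    if not retrieved_docs:
        return 0.0
    query_patent_classes = set(query_patent_classes)
    # Freq: retrieved documents with at least one class in common (assumed rule).
    freq = sum(
        1 for classes in retrieved_docs.values()
        if query_patent_classes & set(classes)
    )
    # |q_d|: total number of retrieved documents.
    return freq / len(retrieved_docs)
```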

Support in IPC Classes
• The analysis indicates higher support for query-log term pairs in similar IPC classes than for all term pairs of patents.

Number of Retrieved Documents
• The number of retrieved documents denotes how many patents contain the query terms.
• The more common the query terms are, the larger the number of retrieved documents.
• This factor is analyzed with:
  – all term pairs of the patent;
  – all term pairs of the query log.
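
Counting retrieved documents for a term pair can be sketched with a small inverted index. The sketch assumes conjunctive (AND) matching, that is, a patent is retrieved only if it contains every query term; the slides do not specify the retrieval model, so this is an assumption, as are the function names.

```python
def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, tokens in docs.items():
        for term in set(tokens):
            index.setdefault(term, set()).add(doc_id)
    return index


def retrieved_count(index, terms):
    """Number of documents containing all given terms (assumed AND matching)."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return 0
    return len(set.intersection(*postings))
```

Running `retrieved_count` over all term pairs of a patent and over only the pairs from its query log gives the two distributions compared in this analysis.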

Number of Retrieved Documents
• The analysis indicates that term pairs of the query log retrieve a smaller number of patents than all term pairs of patents.

Conclusion
• For automatic IR system evaluation, query generation is an important factor.
• We believe that past query logs can help us understand this problem.
• Across different statistical factors, there is a large difference between random queries and user queries.
• We can consider these factors when generating queries automatically.

Thank You