A Study on Query Expansion Methods for Patent Retrieval
  Walid Magdy, Gareth Jones
  Centre for Next Generation Localisation, School of Computing, Dublin City University
  24 October 2011
Outline (Agenda)
  What is the Problem? (Motivation)
  Why Patents? (Patent Characteristics)
  Current Solutions (Prior Work)
  Testing Existing Approaches (Applying Standard QE)
  New Approach (Novel Method)
  Results (Outcome)
  Conclusion (Findings)
Why Patents?
  Challenging wording
    Using vague and general terms
    Strange combination of terms
  No defined query (what words to select for search?)
  Low retrieval effectiveness
  Recall-oriented IR task
  Hypothesis: QE leads to a better query/doc match, which leads to better results
Prior Work
  Pseudo Relevance Feedback (PRF) (Kishida K, NTCIR-3; Itoh H, NTCIR-4)
    QE using Rocchio formula: no significant improvement
    QE using Taylor formula: no significant improvement
    Reweighting query terms using PRF: no significant improvement
  Inter Query Expansion (QE) for Patent Invalidity Search (Takeuchi H. et al., NTCIR-5)
    QE for individual claims from the same patent topic: significant improvement, but not applicable to other patent search tasks
  Improving Retrievability for Patents (Bashir and Rauber, ECIR 2010)
    Enrich queries to improve the retrievability of patents with a low chance of retrieval, but not tested on a real patent search task
Testing QE for Prior-Art Patent Search
  CLEF-IP 2010: 1.35M patents from the EPO; 1.35K English patent topics
  Collection contains EN/FR/DE patents, with translations of titles and claims in the three languages
  Expand query by: PRF vs. WordNet
  Use (Magdy et al., 2011) as the baseline (BL), without citation extraction (full patent description section as query)
  MAP and PRES were used for evaluation
  BL: 0.14 MAP, 0.486 PRES
Applying Pseudo Relevance Feedback
  PRF as implemented in Indri was used
  Different numbers of feedback (FB) terms and docs were tested
  [Chart: MAP and PRES by number of FB terms and docs, compared with the BL MAP and BL PRES]
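To make the roles of the two PRF parameters (FB docs and FB terms) concrete, here is a minimal sketch. It does not reproduce Indri's implementation: it is a simple frequency-based illustration, and the function name `prf_expand`, the toy documents, and the tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter


def prf_expand(query_terms, ranked_docs, fb_docs=2, fb_terms=2):
    """Append the fb_terms most frequent new terms from the top fb_docs documents.

    ranked_docs is the initial ranking, each document given as a list of stemmed terms.
    This is a toy frequency-based scheme, not Indri's feedback model.
    """
    counts = Counter()
    for doc in ranked_docs[:fb_docs]:
        # Only count terms that are not already part of the query
        counts.update(t for t in doc if t not in query_terms)
    # Sort by descending count, then alphabetically so ties are deterministic
    best = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:fb_terms]
    return list(query_terms) + [term for term, _ in best]


# Hypothetical initial ranking for the query "wast heat"
ranking = [
    ["wast", "heat", "stream", "boiler"],
    ["heat", "stream", "exchang"],
    ["motor", "engin"],
]
expanded_query = prf_expand(["wast", "heat"], ranking, fb_docs=2, fb_terms=2)
print(expanded_query)
```

The feedback documents are assumed relevant without checking, so when the initial ranking has low precision the added terms come mostly from non-relevant documents.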
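Using WordNet for Expansion
  Expand terms in query using synonyms and hyponyms for nouns and verbs (NS = noun synonyms, NH = noun hyponyms, VS = verb synonyms, VH = verb hyponyms)
  Apply QE to a sample of 100 topics, then apply the best combination to the full 1.35K topic set
  Sample of 100 topics:
    Run            MAP value   MAP %change   PRES value   PRES %change
    Baseline       0.1668      NA            0.584        NA
    NS                         %                          %
    NS+NH                      %                          %
    NS+VS                      %                          %
    NS+NH+VS+VH                %                          %
  Full topic set:
    Run            MAP value   MAP %change   PRES value   PRES %change
    Baseline       0.1399      NA            0.486        NA
    WordNet (NS)               %                          %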
Standard QE Approaches
  PRF: significant degradation in retrieval effectiveness. This can be expected given the low initial retrieval precision.
  WordNet: statistically significant degradation of results, but with some successful instances (31% of topics). Large reduction in retrieval speed, since the average query is at least 5 times larger (34 times larger for NS+NH+VS+VH).
  A new effective and efficient QE method is required!
Automatically Generated SynSet
  Inputs: English fields and their French translations
  Steps: Align Sentences, Remove Stopwords, Stem Words, Align Terms, Backoff Alignment
  Outputs: EN-to-FR terms dictionary, FR-to-EN terms dictionary, EN-to-EN terms dictionary
  Example:
    EN sentence: process for eliminating foreign matter from a waste heat stream
    FR sentence: procédé pour éliminer de la matière étrangère d'un courant de chaleur perdue
    EN after stopword removal and stemming: process elimin foreign matter wast heat stream
    FR after stopword removal and stemming: procéd élimin mati étrangèr cour chaleur perdu
    Dictionary entries involving "elimin":
      elimin: élimin 0.71, elimin 0.13
      élimin: remov 0.71, elimin 0.14
      elimin: remov 0.6, elimin 0.16
      elimin: remov 0.85, elimin 0.15
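The slide does not state how the EN-to-EN dictionary is computed from the two bilingual dictionaries. One plausible reading, offered here only as an assumption, is pivoting: translate an English stem into French, translate each French stem back into English, sum the probability products, drop weak candidates, and renormalise. The function `pivot_synsets` and the `min_prob` threshold are hypothetical names for this sketch; the probabilities are the "elimin" entries from the slide.

```python
from collections import defaultdict

# Entries taken from the slide; everything else in this sketch is an assumption
en_fr = {"elimin": {"élimin": 0.71, "elimin": 0.13}}
fr_en = {"élimin": {"remov": 0.71, "elimin": 0.14}}


def pivot_synsets(en_fr, fr_en, min_prob=0.05):
    """Build EN-to-EN synsets by pivoting through French (assumed method)."""
    en_en = {}
    for source, fr_probs in en_fr.items():
        scores = defaultdict(float)
        for fr_term, p_fr in fr_probs.items():
            # French terms with no back-translation entry contribute nothing
            for target, p_en in fr_en.get(fr_term, {}).items():
                scores[target] += p_fr * p_en
        # Drop weak candidates, then renormalise so the weights sum to 1
        kept = {t: s for t, s in scores.items() if s >= min_prob}
        total = sum(kept.values())
        if total > 0:
            en_en[source] = {t: s / total for t, s in kept.items()}
    return en_en


synsets = pivot_synsets(en_fr, fr_en)
print(synsets)
```

Because only two dictionary entries are available here, the resulting weights should not be expected to reproduce the figures on the slide.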
Samples of the Output
  motor: motor 0.64, engin 0.36
  weight: weight 0.86, wt 0.14
  travel: travel 0.67, move 0.19, displac 0.14
  color: color 0.56, colour 0.25, dye 0.19
  link: link 0.4, connect 0.18, bond 0.17, crosslink 0.13, bind 0.12
  cloth: fabric 0.36, cloth 0.3, garment 0.2, tissu 0.14
  tube: tube 0.88, pipe 0.12
  area: area 0.4, zone 0.23, region 0.2, surfac 0.17
  game: set 0.6, game 0.4
  play: set 0.3, play 0.24, read 0.2, game 0.16, reproduc 0.1
SynSet QE Results
  8M parallel EN/FR sentences were extracted from the EPO patent collection to generate SynSets
  Two runs were performed:
    Usynset: expanding the query using SynSet without weights
    Wsynset: using SynSet probabilities as weights for the terms in the query
  Results:
    Run            MAP value   MAP %change   PRES value   PRES %change
    Baseline       0.1399      NA            0.486        NA
    Wsynset                    %                          %
    Usynset                    %                          %
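The difference between the two runs can be sketched as follows. This is a minimal illustration, not the authors' code: the function `expand` and the `(term, weight)` output format are assumptions, while the synset entries are taken from the sample output shown earlier.

```python
# SynSet entries copied from the sample output slide
SYNSETS = {
    "motor": {"motor": 0.64, "engin": 0.36},
    "weight": {"weight": 0.86, "wt": 0.14},
    "tube": {"tube": 0.88, "pipe": 0.12},
}


def expand(query_terms, synsets, weighted=True):
    """Return (term, weight) pairs for the expanded query.

    weighted=True  mimics Wsynset: each synonym carries its SynSet probability.
    weighted=False mimics Usynset: every synonym gets the same weight of 1.0.
    Terms without a SynSet entry are kept unchanged with weight 1.0.
    """
    expanded = []
    for term in query_terms:
        for synonym, prob in synsets.get(term, {term: 1.0}).items():
            expanded.append((synonym, prob if weighted else 1.0))
    return expanded


query = ["motor", "tube", "heat"]
wsynset_query = expand(query, SYNSETS, weighted=True)
usynset_query = expand(query, SYNSETS, weighted=False)
print(wsynset_query)
print(usynset_query)
```

How the weighted pairs are passed to the retrieval engine (for example through a weighted query operator) is left open here, since the slides do not specify it.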
SynSet Expansion
  Significantly better MAP but significantly worse PRES, i.e. better retrieval at very high ranks, but worse ranking of relevant results over all ranks and lower recall
  Some topics improved (34% of topics), but some degraded (39% of topics)
  Significantly more efficient than PRF and WordNet (query size is only 60% larger)
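A toy example can show how MAP and PRES may move in opposite directions for one topic. The sketch below assumes the usual definition of average precision and the PRES form 1 - (mean rank of relevant docs - (n+1)/2) / N_max, where the i-th relevant document, if not retrieved within the cutoff N_max, is placed at rank N_max + i. That placement rule, the cutoff of 10, and the two rankings are assumptions chosen for illustration, not data from the study.

```python
def average_precision(relevant_ranks, n_relevant):
    """Average precision from the 1-based ranks of the retrieved relevant docs."""
    ranks = sorted(relevant_ranks)
    return sum((i + 1) / rank for i, rank in enumerate(ranks)) / n_relevant


def pres(relevant_ranks, n_relevant, n_max):
    """PRES under the stated assumption about missing relevant documents."""
    found = sorted(r for r in relevant_ranks if r <= n_max)
    # Assumption: the i-th relevant doc, when missing, is placed at rank n_max + i
    missing = [n_max + i for i in range(len(found) + 1, n_relevant + 1)]
    mean_rank = (sum(found) + sum(missing)) / n_relevant
    return 1.0 - (mean_rank - (n_relevant + 1) / 2) / n_max


N_RELEVANT, N_MAX = 4, 10
baseline_ranks = [4, 5, 6, 7]   # all four relevant docs found, none near the top
expanded_ranks = [1, 2]         # two relevant docs at the top, two not retrieved

for name, ranks in [("baseline", baseline_ranks), ("expanded", expanded_ranks)]:
    ap = average_precision(ranks, N_RELEVANT)
    score = pres(ranks, N_RELEVANT, N_MAX)
    print(f"{name}: AP={ap:.3f} PRES={score:.3f}")
```

In this constructed case the expanded ranking has the higher average precision but the lower PRES, which is the pattern described above: gains at the very top of the ranking alongside a loss in recall.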
Deeper Look at SynSet
  No features were found with a high correlation to SynSet QE success
  Initial retrieval quality of the BL does not relate to the performance of QE
  [Table: Topic ID, Baseline, Wsynset, %change for 16 sample topics (IDs beginning "PAC"); two topics show a %change of ∞]
Conclusions
  PRF is not effective for patent prior-art search
  WordNet QE for patent search:
    Leads to overall significant degradation of retrieval
    Has some positive impact on the retrieval of some topics
    High computational cost
  SynSet QE for patent search:
    The most effective and efficient QE technique among those tested
    Significant improvement at very high ranks, but significant degradation of overall ranking and recall
    No indication of when it fails/succeeds
    SynSet can be used as a lexical resource for patent examiners
Future Work
  More analysis to better understand when QE fails/succeeds
  Applying SynSet to real patent examiners' queries rather than automatically formulated queries
  Combining different QE methods
  Alternative methods for query modification, for example query reduction (QR)
Please Check in CIKM Poster Session
  Magdy W. and G. J. F. Jones. An Efficient Method for Using Machine Translation Technologies in Cross-Language Patent Search.
  Ganguly D., J. Leveling, W. Magdy, and G. J. F. Jones. Query Reduction based on Pseudo-Relevant Documents.
Thank you