Published by Rosamund Dawson. Modified over 9 years ago.
1
Multiple Retrieval Models and Regression Models for Prior Art Search
Participating institution: Humboldt Universität zu Berlin - IDSL
Patrice Lopez, also at EPO Berlin, Germany
Laurent Romary, INRIA Gemo - Saclay, France; HUB-IDSL - Berlin, Germany
2
Plan
– Searching Scientific and Technical Documents
– Issues related to Prior Art Search
– Overview of PATATRAS
– Patent Document processing
– Combining metadata & text in four steps
– Results
– Future work
3
PATATRAS!!
PATent and Article Tracking, Retrieval and AccesS addresses Scientific and Technical Publications in general.
Scientific and Technical Publications have 5 dimensions:
1. metadata
2. document structure
3. textual content
4. supporting content
5. experimental data
How is this instantiated in patent publications?
4
Patent Publications
1. Metadata encode procedure-related data:
– Date, applicant, inventors, language(s)
– Classification: hierarchy of technical fields (IPC, ECLA (+ICO)), e.g. G06F17/30T2P2X
– Citations
(Diagram labels: Information retrieval; Query expansion)
5
EPO Citation Statistics
EPO Search Reports produced in the last 5 years (total 775,000)
Chart values: EC classified Pat: 95%, NPL: 24%, JP: 17%, 1%
6
Patent Publications
1. Metadata encode procedure-related data:
– Date, applicant, inventors, language(s)
– Classification: hierarchy of technical fields (IPC, ECLA (+ICO)), e.g. G06F17/30T2P2X
– Citations
2. Patent Document Structure: Title, Abstract, Claims, Description (description of prior art, "subjective" technical problem, description of embodiments)
– Strong interrelations between these structures
– Each of these structures serves different goals
(Diagram labels: Information retrieval; Query expansion)
7
Patent Publications
3. Textual Content of Patent: "Attornish", multilinguality
4. Supporting content: tables, mathematical and chemical formulas, citations, technical drawings, etc.
5. Experimental data: absent
8
PATATRAS!!
Scientific and Technical Publications have 5 dimensions:
1. metadata
2. document structure
3. textual content
4. supporting content
5. experimental data
What are the known practices in prior art search?
9
Prior art search
Motivation and approach:
(Diagram labels: Search report; Patent application; Prior art)
CLEF-IP biases
– Topic patents are granted patent publications (richer documents than applications: ECLA classes, extra citations, final claims)
– Result set extended to all EPO documents introduced during the prosecution of the application
– All documents without an EPO counterpart (via patent family) are discarded (patent applications never filed at the EPO and non-patent literature are omitted)
The patent examiner's real life
– Non-exhaustivity
– Recall-oriented search is a myth (titles and abstracts; lack of elaborate tools)
– Usage of classification for search (a priori restriction of the result set)
– Usage of metadata (thickets; patents are continuations of previous applications)
10
PATATRAS!!
Scientific and Technical Publications have 5 dimensions:
1. metadata
2. document structure
3. textual content
4. supporting content
5. experimental data
We investigated only 1 and 3 in CLEF-IP 2009.
However...
– How to combine metadata-based and text-content retrieval?
– How to combine results in different languages?
– How to combine different retrieval approaches?
11
Overview of PATATRAS
[Architecture diagram. Components shown:]
– Inputs: Patent Collection, Patent Topic
– Linguistic processing: Tokenization, POS Tagging, Phrase Extraction, Concept Tagging
– Indexes: Lemma en, Lemma fr, Lemma de, Phrase en, Concept
– Queries: Lemma en, Lemma fr, Lemma de, Phrase en, Concept
– Retrieval: Lemur 4.9 (KL divergence, Okapi BM25), producing Ranked Results (10)
– Init: Initial Working Set
– Merging: Ranked Merged Results
– Post-Ranking: Final Ranked Results
15
Patent Document Processing: Text Indexing
Sound linguistic processing as groundwork:
– No stemming: POS tagging & lemmatisation
– No stop words: only open grammatical categories are considered (N, V, Adj., Adv., numbers)
A total of 5 indexes:
– One word form* (lemma) index per language (en, fr, de)
– English phrase indexing (Dice coefficient)
– Conceptual indexing
*ISO/DIS 24611, Language resource management - Morpho-syntactic annotation framework
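The slide names the Dice coefficient for English phrase indexing but does not say how it was applied. As a minimal sketch, assuming it scores adjacent lemma pairs with Dice(x, y) = 2·f(x, y) / (f(x) + f(y)), phrase candidates could be selected as follows. The function name and the `min_dice` threshold are hypothetical, not taken from PATATRAS.

```python
from collections import Counter

def dice_bigrams(tokens, min_dice=0.5):
    """Score adjacent token pairs with the Dice coefficient.

    Dice(x, y) = 2 * f(x, y) / (f(x) + f(y)), where f counts occurrences
    in `tokens`. The threshold `min_dice` is an illustrative assumption.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (x, y), f_xy in bigrams.items():
        dice = 2.0 * f_xy / (unigrams[x] + unigrams[y])
        if dice >= min_dice:
            scores[(x, y)] = dice
    return scores

# Toy lemma sequence; a real run would use the POS-filtered lemmas.
lemmas = "prior art search use prior art and patent search".split()
phrases = dice_bigrams(lemmas)
```

On a realistic corpus the counts would come from the whole collection rather than from one document, so that only pairs that co-occur far more often than chance pass the threshold.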
16
Conceptual indexing
Creation of a multilingual terminological database based on a conceptual model* covering scientific & technical fields
Sources: MeSH, UMLS, Gene Ontology, SUMO, WordNet/WordNet-Domains/WOLF, Wikipedia en/fr/de
Merging of concepts based on:
– Domain matches (manual mappings between sources)
– Term matches
Represents terms, term variants, synonyms, acronyms and multilingual correspondences
Term disambiguation based on IPC class
2.6 million terms for en, 190,000 for de, 140,000 for fr
1.4 million concepts (71,000 realized in de, 65,000 in fr)
*ISO 16642:2003, Computer applications in terminology - Terminological markup framework
18
Limitations of text-only retrieval
Queries are based on all the textual content of the topic patent documents

Model   Index     Language   base     with citation text
KL      lemma     en         0.1068   0.1083
KL      lemma     fr         0.0611   0.0612
KL      lemma     de         0.0627   0.0634
KL      phrase    en         0.0717   0.0720
KL      concept   all        0.0671   0.0680
Okapi   lemma     en         0.0806   0.0813
Okapi   lemma     fr         0.0301   0.0303
Okapi   lemma     de         0.0598   0.0612
Okapi   phrase    en         0.0328   0.0330
Okapi   concept   all        0.0510   0.0516
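For readers unfamiliar with the "Okapi" rows, the following is a minimal sketch of textbook Okapi BM25 scoring over tokenised documents. It is not Lemur's implementation: the parameter values `k1=1.2` and `b=0.75` are common defaults chosen here as assumptions, and the IDF uses the non-negative `log(1 + ...)` variant.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for a tokenised query.

    `docs` is a non-empty list of token lists. k1 and b are assumed
    defaults, not values reported on the slides.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores

docs = [
    ["patent", "search", "prior", "art"],
    ["chemical", "formula"],
    ["patent", "claim"],
]
scores = bm25_scores(["prior", "art"], docs)
```

In the setting of the table, each query would hold all lemmas (or phrases, or concepts) of a topic patent, which makes queries far longer than in typical ad hoc retrieval.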
21
Patent Document Processing: Metadata
– Additional extraction of patents cited in the descriptions (regular expressions): 7,960 additional cited EP documents found in the XL set
– Metadata representation: basic normalization (author, applicant)
– Storage in a MySQL database (2.48 GB in total for the collection)
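The regular expressions used for citation extraction are not given on the slide. As an illustration only, a hypothetical pattern for EP publication numbers written in a few common styles ("EP-A-0 123 456", "EP 1234567 A1", "EP0123456") might look like this; the pattern, the normalised output format and the function name are all assumptions.

```python
import re

# Hypothetical pattern: "EP", an optional kind code such as A, A1 or B,
# then 6 or 7 digits that may be separated by single spaces or dots.
EP_CITATION = re.compile(
    r"\bEP[\s-]*(?:[AB]\d?[\s-]*)?(\d(?:[\s.]?\d){5,6})(?!\d)"
)

def extract_ep_citations(text):
    """Return normalised EP numbers cited in `text`, in order, without repeats."""
    found = []
    for match in EP_CITATION.finditer(text):
        digits = re.sub(r"\D", "", match.group(1))
        number = "EP" + digits.zfill(7)  # pad 6-digit numbers to 7 digits
        if number not in found:
            found.append(number)
    return found

description = (
    "A similar device is disclosed in EP-A-0 123 456 and in "
    "EP 1234567 A1; see also EP0123456."
)
cited = extract_ep_citations(description)
```

A production extractor would need further patterns for other offices (US, WO, JP, DE) and for the many free-text citation styles found in descriptions.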
22
Prior working sets
Goal: for a given patent topic, create the smallest set of patents containing the relevant documents
Iterative expansion from a core list of documents based on metadata: citation tree, common applicant/author, patent family relation, classifications → patent examiner's strategies
Result: micro-recall of 0.7303, approx. 2,600 documents per patent topic (415 results per topic after final cutoff)
Significant improvement of MAP results:

Model   Index    Language   with cit. text   with prior sets
KL      lemma    en         0.1083           0.1516 (+40%)
KL      lemma    de         0.0634           0.1145 (+81%)
KL      phrase   en         0.0720           0.1268 (+76%)
Okapi   lemma    en         0.0813           0.1365 (+68%)
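The iterative expansion and the micro-recall figure can be sketched as follows. This is a simplified illustration, assuming that all metadata relations (citation, family, shared applicant, shared class) have been flattened into one hypothetical `neighbours` mapping; the round limit and size cap are illustrative and do not come from the slides.

```python
def expand_working_set(core, neighbours, max_rounds=2, max_size=5000):
    """Grow a working set from `core` by following metadata links.

    `neighbours` maps a document id to the ids related to it through any
    metadata relation. max_rounds and max_size are assumed limits.
    """
    working = set(core)
    frontier = set(core)
    for _ in range(max_rounds):
        next_frontier = set()
        for doc in sorted(frontier):
            for related in sorted(neighbours.get(doc, ())):
                if related not in working and len(working) < max_size:
                    working.add(related)
                    next_frontier.add(related)
        if not next_frontier:
            break
        frontier = next_frontier
    return working

def micro_recall(working_sets, relevant_sets):
    """Relevant documents found, summed over topics, divided by all relevant ones."""
    found = sum(len(working_sets[t] & relevant_sets[t]) for t in relevant_sets)
    total = sum(len(r) for r in relevant_sets.values())
    return found / total if total else 0.0

links = {"EP1": {"EP2", "EP3"}, "EP2": {"EP4"}, "EP4": {"EP5"}}
working = expand_working_set(["EP1"], links, max_rounds=2)
recall = micro_recall({"t1": working}, {"t1": {"EP4", "EP5"}})
```

Micro-recall pools all topics before dividing, so topics with many relevant documents weigh more than they would in a per-topic average.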
25
Merging of results
Strong complementarities between the result sets
So many examples! → fully supervised ML
Regression model for estimating, for each patent topic, the pertinence of a result set
Features: language, query size, initial working set size, max./min. & range of retrieval scores, IPC main & class, average phrase length
Training set: 500, plus 4,131 additional patents from the collection
Linear combination of weights:
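The combination formula itself is not preserved in this transcript. The block below is therefore only a sketch of one plausible reading: each result set receives a weight (standing in for the regression model's pertinence estimate), scores are min-max normalised per result set, and a document's merged score is the weighted sum. The normalisation step, the names and the example weights are assumptions.

```python
def merge_result_sets(result_sets, weights):
    """Merge scored result sets by a weighted sum of normalised scores.

    `result_sets` maps a run name to {doc_id: retrieval_score};
    `weights` maps the same run names to an estimated pertinence.
    Min-max normalisation is an assumption made for this sketch.
    """
    merged = {}
    for name, scores in result_sets.items():
        if not scores:
            continue
        low, high = min(scores.values()), max(scores.values())
        span = (high - low) or 1.0  # all-equal scores normalise to 0
        for doc, score in scores.items():
            normalised = (score - low) / span
            merged[doc] = merged.get(doc, 0.0) + weights[name] * normalised
    # Highest merged score first; ties broken by document id.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))

runs = {
    "kl_lemma_en": {"EP1": 2.0, "EP2": 1.0},
    "okapi_lemma_de": {"EP2": 10.0, "EP3": 5.0},
}
ranking = merge_result_sets(runs, {"kl_lemma_en": 0.6, "okapi_lemma_de": 0.4})
```

Normalising per run matters because KL-divergence and BM25 scores live on different scales, so raw scores cannot be added directly.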
26
Merging of results

Feat.   LeastMedSq        MP                SMO               ν-SVM
f1      0.1681 (+5.8%)    0.1711 (+7.7%)    0.1706 (+7.4%)    0.1691 (+6.4%)
f1-6    0.1689 (+6.3%)    0.1797 (+13.1%)   0.1807 (+13.7%)   0.1976 (+24.3%)
all     0.1786 (+12.4%)   0.1898 (+19.4%)   0.2016 (+26.9%)   0.2281 (+43.5%)

Feature groups:
– f1: language
– f2-6: related to the retrieval score
– f7-8: IPC (domains)
– f9: average phrase length
29
Post-ranking
Regression model for estimating the pertinence of a patent in the result set for a given patent topic
Features: citation, number of common IPC & ECLA classes, probability of citation, same applicant & inventors
Training set: 500, plus 4,131 additional patents from the collection
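To make the feature list concrete, here is a hedged sketch of how a feature vector for one (topic, candidate) pair might be built from metadata. The record layout (`cited_ids`, `ipc`, `ecla`, `applicants`, `inventors`) is hypothetical, "citation" is read here as "the candidate is cited by the topic", and the "probability of citation" feature is left out because the slide does not define it.

```python
def post_ranking_features(topic, candidate):
    """Build metadata features for one candidate patent of a topic.

    Both arguments are plain dicts with a hypothetical layout; the
    features mirror the slide's list except "probability of citation".
    """
    shared = lambda key: set(topic[key]) & set(candidate[key])
    return {
        "cited_by_topic": float(candidate["id"] in topic["cited_ids"]),
        "common_ipc": float(len(shared("ipc"))),
        "common_ecla": float(len(shared("ecla"))),
        "same_applicant": float(bool(shared("applicants"))),
        "shared_inventors": float(len(shared("inventors"))),
    }

topic = {
    "cited_ids": {"EP0123456"},
    "ipc": ["G06F17/30", "H04L"],
    "ecla": ["G06F17/30T2P2X"],
    "applicants": ["ACME"],
    "inventors": ["A. Smith", "B. Jones"],
}
candidate = {
    "id": "EP0123456",
    "ipc": ["G06F17/30"],
    "ecla": [],
    "applicants": ["ACME"],
    "inventors": ["B. Jones"],
}
features = post_ranking_features(topic, candidate)
```

A regression model trained on such vectors would then output a pertinence estimate per candidate, which is used to re-rank the merged result list.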
30
Final Results

Measures      S        M        XL       en-XL    fr-XL    de-XL
MAP           0.2714   0.2783   0.2802   0.2358   0.1787   0.2092
Prec. at 5    0.2780   0.2766   0.2768   0.2365   0.1855   0.2122
Prec. at 10   0.1768   0.1748   0.1776   0.1575   0.1338   0.1467

On average, approx. 43 s per topic
Final runs (10,000 patent topics) for all, en, fr, de took 5 days on 4 machines
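The measures in the table follow standard definitions, sketched below for reference. This is the textbook formulation rather than the official CLEF-IP evaluation script, and the function names are chosen for illustration.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """MAP over a non-empty list of (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
ap = average_precision(ranked, relevant)
p_at_2 = precision_at_k(ranked, relevant, 2)
```

Because average precision divides by the number of relevant documents, relevant documents that are never retrieved lower the score, which is why the recall of the working sets bounds the achievable MAP.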
31
Conclusion
– We have proposed an architecture for retrieving Scientific and Technical Publications
– We have adapted the architecture to patent search practices
Needs:
– improve terminological representations
– address document structures
– refine query representations
Full text available in HAL: http://hal.archives-ouvertes.fr/hal-00411835/fr/