Published by Polly Benson. Modified over 9 years ago.
1
TREC-CHEM: The TREC Chemical IR Track
Mihai Lupu (1), John Tait (1), Jimmy Huang (2), Jianhan Zhu (3)
(1) Information Retrieval Facility
(2) York University
(3) University College London
Network of excellence co-funded by the 7th Framework Program of the European Commission, grant agreement number 258191
2
Agenda
Introduction
"Prior Art" Task (PA)
"Technology Survey" Task (TS)
Conclusions
3
Motivation
Increased awareness on the part of industry and regulatory authorities
  – Particularly in human-related chemistry (pharma and cosmetics)
  – Particularly in IP-related contexts
Increased availability of data and metadata
Demands from professional users that differ from those addressed by other evaluation campaigns
4
Partners
Collaboration
  – National Institute of Standards and Technology (US)
  – University College London (UK)
  – York University (Canada)
Support from
  – Royal Society of Chemistry
  – Open access publishers
  – Experts in the field
With the participation of
  – Research groups
5
Aims
Assess the available Chemical Retrieval tools
Generate interest among research groups for this domain
Stimulate participation from industry
Generate new Chemical Retrieval tools, at the intersection of chemoinformatics and text-mining
6
Data
2 collections
2009
  – 1.2 million patent documents
  – 50k scientific articles
  – text only
2010
  – 1.3 million patent documents
  – 172k scientific articles
  – text, images, structure information available
7
2010 Data
Patent data
  – Addition of WIPO patents
  – Addition of attachments (images, structure data)
Scientific articles
  – 3-fold increase, with attachments
  – Large mass from PubMed
  – Some directly from open access publishers: IUCrJnls, Oxford Publishers, Hindawi Publishers, MPCI
8
2010 Data
Patent data across IPC classes
  – Organic Chemistry
  – Medical or Veterinary science; Hygiene
  – Organic macromolecular compounds
  – BioChemistry
  – Physical or chemical processes or apparatus in general
  – Dyes; Paints; Polishes…
  – Petroleum; Gas..
9
Tasks
Technology Survey (TS)
  – Search for all potentially relevant documents, in both patents and scientific articles.
  – 30 manually defined and evaluated topics
Prior Art (PA)
  – Search for patents that may invalidate a given patent
  – 1000 automatically created and evaluated topics (1000 patent files)
10
PA topics
Tagline: recreate the citation list created by the patent examiner
topic = patent application document
evaluation based on
  – applicant's citations
  – examiner's report
  – opposition citations (if any)
only patent corpus used
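The automatic evaluation described on this slide can be sketched as a union of citation sources per topic patent. This is a minimal illustration only: the function name `build_pa_qrels`, the dictionary layout, and the document identifiers are assumptions made for the example, not the track's actual tooling.

```python
from typing import Dict, Iterable, Set


def build_pa_qrels(
    applicant: Dict[str, Iterable[str]],
    examiner: Dict[str, Iterable[str]],
    opposition: Dict[str, Iterable[str]],
) -> Dict[str, Set[str]]:
    """Merge the three citation sources into one relevant set per topic.

    Each argument maps a topic patent id to the patents cited for it
    by that source (hypothetical layout, chosen for illustration).
    """
    qrels: Dict[str, Set[str]] = {}
    for source in (applicant, examiner, opposition):
        for topic_id, cited in source.items():
            # A cited patent counts once, even if several sources list it.
            qrels.setdefault(topic_id, set()).update(cited)
    return qrels


# Usage with made-up identifiers.
example = build_pa_qrels(
    applicant={"PA-1": ["P100", "P200"]},
    examiner={"PA-1": ["P200", "P300"]},
    opposition={},
)
```

Because no human assessor is involved, such judgments can be produced for all 1000 topics at once, which is what makes the PA task fully automatic.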
11
PA topics
12
TS topics
topic = natural language information request
evaluation done manually by
  – junior evaluators (students, others)
  – senior evaluators (topic creators)
both patent and scientific articles requested
13
TS topics – example
TS-23
Titanium tetrafluoride for improving dental health
Titanium tetrafluoride can be used to prevent dental caries or tooth decay along with other fluoride containing compounds. We are specifically looking for the use of Titanium tetrafluoride for improving dental health or preventing decay.
titanium tetrafluoride tooth decay
A document will be considered RELEVANT if it refers to the use of titanium tetrafluoride for improving dental health, including caries or tooth decay.
A document will be considered HIGHLY RELEVANT when it is RELEVANT and it refers to the use of titanium tetrafluoride within a product such as toothpaste or mouthwash.
14
TS topics – example
TS-47
Structure Search
We are looking for patents and papers on use of the chemical described in TS-47.mol and TS-47.png for treating dementia.
A document will be considered RELEVANT if it refers to the use of chemical X for treating dementia.
There are no HIGHLY RELEVANT documents.
15
Participants
13 participants registered to download the data
PA
  – 4 groups submitted 10 runs
  – BiTeM Geneva, York University, Fraunhofer SCAI, Iowa University
TS
  – 2 groups submitted 12 runs
  – BiTeM Geneva, York University
16
Methods
Basic Probabilistic Model, Language Model and Vector Space Model
  – Different sections, weights on each section
  – BM25
Additional filtering/weighting based on IPC codes
Linguistic processing
  – Emphasis on noun phrases (NP)
Concept based search
  – Query expansion
  – Using Oscar3, MeSH
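To make "weights on each section" combined with BM25 concrete, here is a minimal sketch. It is not any participant's actual system: the section names, the weights, the parameter defaults `k1=1.2` and `b=0.75`, and the choice to fold section weights into the term frequency are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List

Doc = Dict[str, List[str]]  # section name -> tokens (hypothetical layout)


def weighted_tf(doc: Doc, weights: Dict[str, float]) -> Counter:
    """Term frequencies where each occurrence counts its section weight."""
    tf: Counter = Counter()
    for section, tokens in doc.items():
        w = weights.get(section, 1.0)  # unlisted sections get weight 1.0
        for tok in tokens:
            tf[tok] += w
    return tf


def bm25_scores(query: List[str], docs: List[Doc],
                weights: Dict[str, float],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every document against the query with section-weighted BM25."""
    tfs = [weighted_tf(d, weights) for d in docs]
    lengths = [sum(tf.values()) for tf in tfs]
    avg_len = sum(lengths) / len(lengths) if lengths else 0.0
    n = len(docs)
    scores = []
    for tf, length in zip(tfs, lengths):
        score = 0.0
        for term in set(query):
            df = sum(1 for t in tfs if t[term] > 0)
            if df == 0 or tf[term] == 0:
                continue  # term absent from the corpus or from this document
            idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * length / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Usage with toy documents: a title match counts double.
toy_docs = [
    {"title": ["titanium", "tetrafluoride"], "claims": ["dental", "caries"]},
    {"title": ["petroleum"], "claims": ["gas", "titanium"]},
    {"title": ["dye"], "claims": ["paint"]},
]
toy_scores = bm25_scores(["titanium"], toy_docs, {"title": 2.0, "claims": 1.0})
```

In this toy setup the first document ranks highest because its match is in the more heavily weighted title section. IPC-based filtering could be applied before or after scoring by dropping documents whose classification codes do not overlap with those of the topic.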
17
Methods
The addition of non-text data did not impact the methods
  – only 2 TS topics were purely structure based
TODO
  – define interesting structure based topics
  – find ways to solve them
18
Evaluation – PA topics
[Figure: citation diagram showing the Topic Patent, cited documents (D), and family members / siblings (F1, F2, F3), connected by "cites" links]
19
Evaluation PA topics – qrels
20
Evaluation TS topics
  – Because of low participation, the pooling method might have produced biased results
  – However, we still wanted to provide feedback to the 2 participating groups
  – Evaluated 6 topics: TS-21, TS-23, TS-30, TS-35, TS-36 and TS-43
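The bias concern follows from how a pool is built: only documents that some submitted run ranked highly are ever judged. A minimal sketch of pooling and sampling is given below. The pool depth, the sample size, and the use of uniform random sampling are assumptions for illustration; the slides do not state how the sample was actually drawn.

```python
import random
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int) -> Set[str]:
    """Union of the top-`depth` documents from every submitted run."""
    pool: Set[str] = set()
    for ranked in runs.values():
        pool.update(ranked[:depth])
    return pool


def sample_pool(pool: Set[str], sample_size: int, seed: int = 0) -> List[str]:
    """Pick the documents that assessors will judge (uniform sampling assumed)."""
    ordered = sorted(pool)  # fixed order so the seed gives a repeatable sample
    if sample_size >= len(ordered):
        return ordered
    return random.Random(seed).sample(ordered, sample_size)


# Usage with two made-up runs for one topic.
toy_runs = {
    "run_a": ["d1", "d2", "d3"],
    "run_b": ["d2", "d4", "d5"],
}
toy_pool = build_pool(toy_runs, depth=2)
to_judge = sample_pool(toy_pool, sample_size=2, seed=1)
```

With only two groups contributing runs, any relevant document that neither group retrieves near the top never enters the pool, so a future system that finds it would be scored as if it had returned a non-relevant document.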
21
Evaluation – TS interface
[Screenshot: TS topics evaluation interface]
22
Evaluation – TS interface
[Screenshot: TS topics evaluation interface]
23
Evaluation TS topics – qrels
Topic   #pooled   #sampled   #relevant   #highly relevant   #non relevant
TS-21   4500      616        16          2                  597
TS-23   4762      648        2           4                  641
TS-30   3852      525        5           3                  517
TS-35   6036      797        5           3                  789
TS-36   5048      679        62          13                 594
TS-43   6005      761        74          15                 672
24
Results – Prior Art Task
25
Results – TS Task
26
Results – TS Task
27
Conclusions & Outlook
This year, more than the last, was a dry run for the next campaign
Fixed test collection
24 TS topics still to use next year
Main objective for 2011
  – More collaboration between structure-based search and text-mining
28
Thank you
Questions?