Published by Winifred Hood. Modified over 9 years ago.
1
Decomposing Text Processing for Retrieval: Cheshire tries GRID@CLEF
Ray R Larson
School of Information
University of California, Berkeley
2
September 21, 2007 CLEF 2009 -- Corfu, Greece
GRID@CLEF Task
Capture intermediate stages of text processing for IR and export them in an XML format that can be integrated with other (compatible) systems.
This year's task looks at stages including tokenization and stemming, but does not address index creation or the actual retrieval process.
3
GRID@CLEF Task
One goal is for systems to be able to both export and import intermediate processing streams, and eventually to share them.
We also hope to use others’ streams as inputs for subtasks that we currently cannot do, or cannot do effectively, such as decompounding German words.
4
Adapting Cheshire II for GRID@CLEF
Cheshire II is a suite of C programs for IR comprising over 150K lines of code.
The main programs are the indexer and several server and client programs where retrieval is performed.
Since identical text processing must be used in both indexing and search, those modules are shared across several programs.
5
Adapting Cheshire II for GRID@CLEF
For this task we created a special version of the main Cheshire indexing program, which included:
A new module to output the XML streams.
A significant number of changes to the source code of particular modules.
Many changes involved passing more information into lower levels of the call hierarchy via new parameters.
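The kind of change described on this slide, threading an extra parameter down to the low-level text-processing routines so they can report intermediate results, can be sketched as follows. This is a minimal illustration in Python rather than Cheshire's C; the names `tokenize`, `index_document`, and `sink` are hypothetical and are not taken from the Cheshire II source, and the whitespace tokenizer is a stand-in for the real one.

```python
from typing import Callable, List, Optional

# Hypothetical callback type: receives (doc_id, position, token).
TokenSink = Callable[[int, int, str], None]


def tokenize(text: str, doc_id: int, sink: Optional[TokenSink] = None) -> List[str]:
    """Low-level routine: split on whitespace and optionally report each token."""
    tokens = text.split()
    if sink is not None:
        # The extra parameter lets the caller capture the raw token stream.
        for pos, tok in enumerate(tokens):
            sink(doc_id, pos, tok)
    return tokens


def index_document(text: str, doc_id: int, sink: Optional[TokenSink] = None) -> List[str]:
    """Higher-level routine: the new parameter is simply passed down the call chain."""
    return [t.lower() for t in tokenize(text, doc_id, sink)]


# Usage: collect the intermediate stream while indexing proceeds as before.
captured = []
terms = index_document("Cheshire tries GRID@CLEF", 1,
                       lambda d, p, t: captured.append((d, p, t)))
```

With `sink` left at its default of `None`, the routines behave exactly as they did before the parameter was added, which is one way a special "grid" build could coexist with normal indexing.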
6
Issues
The tasks assume a “bag of words” model.
Cheshire, however, is an SGML/XML search system, and the tasks as currently defined do not consider structural analysis and faceted indexing.
E.g., there is no provision for multiple indexes taken from different parts of the overall record, as determined by the SGML/XML tags.
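The contrast between the two views of a record can be sketched in a few lines of Python. This is an illustration only: the record and its tag names (`doc`, `headline`, `text`) are hypothetical, and the code does not reproduce how Cheshire II configures or builds its indexes.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical record; the tag names are illustrative only.
record = (
    "<doc><headline>Storm hits coast</headline>"
    "<text>The storm reached the coast on Friday</text></doc>"
)


def bag_of_words(xml_text):
    """Ignore structure: every token from every element goes into one list."""
    root = ET.fromstring(xml_text)
    return " ".join(root.itertext()).lower().split()


def per_tag_indexes(xml_text):
    """Keep structure: one token list per tag, as separate indexes would need."""
    root = ET.fromstring(xml_text)
    indexes = defaultdict(list)
    for elem in root.iter():
        if elem.text and elem.text.strip():
            indexes[elem.tag].extend(elem.text.lower().split())
    return dict(indexes)


flat = bag_of_words(record)
faceted = per_tag_indexes(record)
```

In the flat list a match on "storm" cannot be attributed to the headline or the body, whereas the per-tag dictionary keeps that distinction, which is the information a single token stream has no place for.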
7
Issues
There is no specification of how unique identifiers for tokens, documents, etc. are to be derived.
In Cheshire II the unique document identifier is just a serial number assigned in the first stage of collection processing (before any tokenizing, parsing, etc. is done).
There are also term ID numbers assigned to unique terms in an index, but not until a much later stage in our normal processing.
Other participants made different choices, revealing a challenge for interoperability.
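The timing difference between the two kinds of identifier can be sketched as below. This is a simplified Python illustration, not Cheshire II code: the `Collection` class, starting the serial numbers at 1, and numbering terms in order of first appearance are all assumptions made for the example.

```python
from itertools import count


class Collection:
    """Hypothetical collection: document IDs exist early, term IDs only after indexing."""

    def __init__(self):
        self._next_doc_id = count(1)  # serial numbers start at 1 (assumption)
        self.docs = {}
        self.term_ids = {}

    def add_document(self, raw_text):
        """First stage: assign a serial number before any tokenizing or parsing."""
        doc_id = next(self._next_doc_id)
        self.docs[doc_id] = raw_text
        return doc_id

    def build_index(self):
        """Much later stage: term IDs are assigned as unique terms are first seen."""
        postings = {}
        for doc_id, text in self.docs.items():
            for term in text.lower().split():
                term_id = self.term_ids.setdefault(term, len(self.term_ids) + 1)
                postings.setdefault(term_id, set()).add(doc_id)
        return postings


coll = Collection()
first = coll.add_document("storm coast")
second = coll.add_document("coast road")
ids_before_indexing = dict(coll.term_ids)  # still empty at this point
postings = coll.build_index()
```

A token stream exported during tokenization can therefore carry a document ID but not a term ID, and a system that derives its identifiers differently would not be able to match them up without an agreed convention.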
8
<circo xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://circo.dei.unipd.it/" xsi:schemalocation="http://circo.dei.unipd.it/ http://ims.dei.unipd.it/xml/circo-schema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/">
Cheshire II Grid Version
Copyright (c) 1990-2009 Regents of the University of California, All Rights Reserved.
Thu Aug 20 18:42:31 2009
<stream identifier="Cheshire_Raw_Tokens_/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar" chunked="false" chunk-number="0" last-chunk="false" digest-type="NONE">
<component identifier="cheshire_idxdata1" type="tokenizer" description="A tokenizer separates an input document into a stream of tokens.">
9
<resource identifier="/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1" mime-type="text/plain">
<stream identifier="Cheshire_Raw_Tokens_/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar" chunked="false" chunk-number="0" last-chunk="false" digest-type="NONE" />
<tokens>
<token identifier="Cheshire_Raw_Tokens_/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1-0" value="LA070294-0001">
  <resource identifier="/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1" mime-type="text/plain"/>
<token identifier="Cheshire_Raw_Tokens_/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1-1" value="LA070294">
  <resource identifier="/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1" mime-type="text/plain"/>
<token identifier="Cheshire_Raw_Tokens_/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1-2" value="056774">
  <resource identifier="/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1" mime-type="text/plain"/>
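The identifiers in this excerpt appear to follow a pattern: the resource identifier is the source path plus a document serial number, and each token identifier is the stream identifier plus that serial number and a token position. The Python sketch below generates elements of that shape. The pattern is inferred from the excerpt rather than stated in the slides, the function name `token_elements` is hypothetical, and the closing `</token>` tag is an assumption, since the excerpt does not show where token elements end.

```python
from xml.sax.saxutils import quoteattr


def token_elements(stream_id, resource_path, doc_serial, tokens):
    """Yield one <token> element per token, with identifiers built as
    <stream_id>-<doc_serial>-<position> (pattern inferred from the excerpt)."""
    resource_id = f"{resource_path}-{doc_serial}"
    for pos, value in enumerate(tokens):
        token_id = f"{stream_id}-{doc_serial}-{pos}"
        yield (
            f"<token identifier={quoteattr(token_id)} value={quoteattr(value)}>"
            f'<resource identifier={quoteattr(resource_id)} mime-type="text/plain"/>'
            "</token>"  # closing tag is an assumption; the excerpt omits it
        )


# Usage with shortened, made-up identifiers.
lines = list(token_elements("S", "/p/x.tar", 1, ["LA070294-0001", "a&b"]))
```

`quoteattr` escapes characters such as `&` and wraps the value in quotes, so token values taken from raw text cannot break the surrounding markup.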
10
Sizes of Output Files
11
Conclusions
The exercise turned out to be useful in uncovering unrecognized bugs in the system.
E.g., dual extraction for hyphenated terms was only extracting the first term of a hyphenated pair, not both.
Because the streams (so far) can be dumped as encountered, the performance impact is only the I/O overhead.
Revisiting the text processing of the system suggested some possible new functions at this level.
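The hyphenation bug can be illustrated with a small Python sketch. Both functions are hypothetical reconstructions, not Cheshire II code, and keeping the whole hyphenated form alongside its parts is an assumption. The token excerpt shown earlier may reflect the same behavior (`LA070294-0001` is followed by `LA070294` but not `0001`), though the slides do not say so explicitly.

```python
def extract_hyphenated(token):
    """Assumed intended behavior: keep the token and add every hyphen-separated part."""
    parts = [p for p in token.split("-") if p]
    if len(parts) < 2:
        return [token]
    return [token] + parts


def extract_hyphenated_buggy(token):
    """Behavior as described on the slide: only the first part of the pair is added."""
    parts = [p for p in token.split("-") if p]
    if len(parts) < 2:
        return [token]
    return [token, parts[0]]


fixed = extract_hyphenated("LA070294-0001")
buggy = extract_hyphenated_buggy("LA070294-0001")
```

A missing second term never shows up as an error during retrieval, only as lower recall, which is why dumping the token stream and reading it made the problem visible.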
12
Conclusions
The challenge will be to make the stream representations universal enough for sharing and combining different systems’ results for different stages.