
Decomposing Text Processing for Retrieval: Cheshire tries GRID@CLEF
Ray R Larson
School of Information
University of California, Berkeley

September 21, 2007 CLEF 2009 -- Corfu, Greece

GRID@CLEF Task
• Capture intermediate stages of text processing for IR and export them in an XML format that can be integrated with other (compatible) systems
• This year's task looks at stages including tokenization and stemming, but does not address index creation or the actual retrieval process
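As a rough illustration of what capturing intermediate stages can mean, the sketch below runs a text through two stages and records the output of each as its own stream. Everything here is an assumption made for illustration: the function names, the `streams` wrapper element, and the toy stemmer (which only strips a trailing "s") are hypothetical and are neither the Cheshire code nor the CIRCO schema.

```python
import xml.etree.ElementTree as ET


def tokenize(text):
    """Split on whitespace and lowercase (a deliberately simple tokenizer)."""
    return text.lower().split()


def toy_stem(tokens):
    """Placeholder stemmer: strips one trailing 's' from longer words.

    This is NOT a real stemming algorithm; it only stands in for one.
    """
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


def run_pipeline(text, stages):
    """Run each stage in order and keep every intermediate result as XML."""
    root = ET.Element("streams")  # hypothetical wrapper element
    data = text
    for name, stage in stages:
        data = stage(data)
        stream = ET.SubElement(root, "stream", {"component": name})
        for item in data:
            ET.SubElement(stream, "token", {"value": item})
    return root


if __name__ == "__main__":
    stages = [("tokenizer", tokenize), ("stemmer", toy_stem)]
    print(ET.tostring(run_pipeline("Cats chase mice", stages), encoding="unicode"))
```

Because each stage's output is stored separately, another system could in principle pick up the tokenizer stream and apply its own stemmer to it.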

GRID@CLEF Task
• One goal is for systems to be able to both export and import intermediate processing streams, and eventually to share them
• We also hope to be able to use others' streams as inputs for subtasks that we currently cannot do, or cannot do effectively
  - such as decompounding German words

Adapting Cheshire II for GRID@CLEF
• Cheshire II is a suite of C programs for IR comprising over 150K lines of code
• The main programs are the indexer and several server and client programs where retrieval is performed
• Since identical text processing must be used in both indexing and search, those modules are shared across several programs

Adapting Cheshire II for GRID@CLEF
• For this task we created a special version of the main Cheshire indexing program, which included:
  - A new module to output the XML streams
  - A significant number of changes to the source code for particular modules
  - Many changes involved passing more information into lower levels of the call hierarchy via new parameters

Issues
• The tasks assume "bag of words"
• But Cheshire is an SGML/XML search system, and the tasks as currently defined did not consider structural analysis and faceted indexing
  - E.g. there is no provision for multiple indexes taken from different parts of the overall records, as determined by the SGML/XML tags
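To make the contrast with "bag of words" concrete, here is a minimal sketch of tag-driven indexing, in which different elements of a record feed different indexes. The index names, the tag names (`HEADLINE`, `TEXT`), and the `index_record` function are all hypothetical choices for illustration, not Cheshire's configuration or API.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical mapping from index name to the tag that feeds it.
INDEX_TAGS = {"title": "HEADLINE", "body": "TEXT"}


def index_record(doc_id, xml_text, indexes):
    """Add the tokens found under each configured tag to that tag's index."""
    record = ET.fromstring(xml_text)
    for index_name, tag in INDEX_TAGS.items():
        for element in record.iter(tag):
            text = "".join(element.itertext())
            for token in text.lower().split():
                indexes[index_name][token].add(doc_id)


if __name__ == "__main__":
    # index name -> term -> set of document ids
    indexes = defaultdict(lambda: defaultdict(set))
    index_record(1, "<DOC><HEADLINE>Fire downtown</HEADLINE>"
                    "<TEXT>A fire broke out</TEXT></DOC>", indexes)
    print(sorted(indexes["title"]))
    print(sorted(indexes["body"]))
```

A single flat token stream per document cannot express which index a token belongs to, which is the gap this slide points at.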

Issues
• No specification of how unique identifiers for tokens, documents, etc. are to be derived
• In Cheshire II the unique document identifier is just a serial number assigned in the first stage of collection processing (before any tokenizing, parsing, etc. is done)
• There are also term ID numbers assigned to unique terms in an index
  - But not until a much later stage in our normal processing
• Other participants made different choices, revealing a challenge for interoperability
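The timing difference described above can be sketched as follows: document identifiers are handed out as soon as a document is registered, while term identifiers only come into existence when a term is first added to an index. The `Collection` class and its method names are hypothetical, and starting both counters at 1 is an assumption.

```python
from itertools import count


class Collection:
    """Toy model of when the two kinds of identifier get assigned."""

    def __init__(self):
        self._next_doc = count(1)  # assumption: serial numbers start at 1
        self.term_ids = {}         # filled only at index-creation time

    def register_document(self, raw_text):
        """First stage: assign a serial number before any text processing."""
        return next(self._next_doc), raw_text

    def term_id(self, term):
        """Later stage: assign an id the first time a term is indexed."""
        return self.term_ids.setdefault(term, len(self.term_ids) + 1)


if __name__ == "__main__":
    collection = Collection()
    doc_id, text = collection.register_document("fire downtown fire")
    print(doc_id, [collection.term_id(t) for t in text.split()])
```

A system that exports token streams before its index exists has document identifiers available but no term identifiers yet, so two systems can easily disagree on what a "token identifier" should be built from.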

<circo xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://circo.dei.unipd.it/" xsi:schemalocation="http://circo.dei.unipd.it/ http://ims.dei.unipd.it/xml/circo-schema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/">
  Cheshire II Grid Version
  Copyright (c) 1990-2009 Regents of the University of California, All Rights Reserved.
  Thu Aug 20 18:42:31 2009
  <stream identifier="Cheshire_Raw_Tokens_/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar" chunked="false" chunk-number="0" last-chunk="false" digest-type="NONE">
    <component identifier="cheshire_idxdata1" type="tokenizer" description="A tokenizer separates an input document into a stream of tokens.">

<resource identifier="/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1" mime-type="text/plain">
  <stream identifier="Cheshire_Raw_Tokens_/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar" chunked="false" chunk-number="0" last-chunk="false" digest-type="NONE" />
<tokens>
  <token identifier="Cheshire_Raw_Tokens_/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1-0" value="LA070294-0001">
    <resource identifier="/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1" mime-type="text/plain"/>
  <token identifier="Cheshire_Raw_Tokens_/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1-1" value="LA070294">
    <resource identifier="/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1" mime-type="text/plain"/>
  <token identifier="Cheshire_Raw_Tokens_/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1-2" value="056774">
    <resource identifier="/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1" mime-type="text/plain"/>
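The identifiers in this excerpt appear to follow a pattern: the resource identifier is the source path plus a document serial number, and each token identifier is a stream prefix, the resource identifier, and the token's position. The sketch below reproduces that pattern. The pattern is inferred from the excerpt only, the function name `token_elements` is hypothetical, and placing the resource reference inside the token element is an assumption, since the excerpt shows no closing tags.

```python
import xml.etree.ElementTree as ET

STREAM_PREFIX = "Cheshire_Raw_Tokens_"  # prefix as it appears in the excerpt


def token_elements(source_path, doc_serial, values):
    """Build token elements whose identifiers mimic the excerpt's pattern."""
    resource_id = f"{source_path}-{doc_serial}"
    elements = []
    for position, value in enumerate(values):
        token = ET.Element("token", {
            "identifier": f"{STREAM_PREFIX}{resource_id}-{position}",
            "value": value,
        })
        # Assumption: the resource reference is a child of the token.
        ET.SubElement(token, "resource", {
            "identifier": resource_id,
            "mime-type": "text/plain",
        })
        elements.append(token)
    return elements


if __name__ == "__main__":
    path = "/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar"
    for element in token_elements(path, 1, ["LA070294-0001", "LA070294", "056774"]):
        print(ET.tostring(element, encoding="unicode"))
```

Identifiers built this way embed a local file path, which illustrates why identifier derivation matters when streams are exchanged between sites.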

Sizes of Output Files

Conclusions
• The exercise turned out to be useful in uncovering unrecognized bugs in the system
  - E.g. dual extraction for hyphenated terms was only extracting the first term of a hyphenated pair, not both
• Because the streams (so far) can be dumped as encountered, the performance impact is only the I/O overhead
• Revisiting the text processing of the system suggested some new possible functions at this level
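One plausible reading of "dual extraction" is that a hyphenated word should yield the whole hyphenated form plus each of its parts. The sketch below shows that intended behavior under this assumption. The regular expression and the `extract_terms` function are hypothetical and are not the Cheshire implementation.

```python
import re

# A word is a run of letters/digits, optionally joined by single hyphens.
WORD = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")


def extract_terms(text):
    """Return terms in order; hyphenated words also contribute every part."""
    terms = []
    for match in WORD.finditer(text):
        word = match.group(0).lower()
        terms.append(word)
        if "-" in word:
            # The bug described above would correspond to keeping only
            # word.split("-")[0] here instead of all the parts.
            terms.extend(word.split("-"))
    return terms


if __name__ == "__main__":
    print(extract_terms("A well-known system"))
    # ['a', 'well-known', 'well', 'known', 'system']
```

Dumping the token stream makes this kind of omission visible, because the missing second part is simply absent from the output.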

Conclusions
• The challenge will be to make the stream representations universal enough for sharing and combining different system results for different stages
