Decomposing Text Processing for Retrieval: Cheshire tries. Ray R Larson, School of Information, University of California, Berkeley.

Presentation transcript:

Decomposing Text Processing for Retrieval: Cheshire tries
Ray R Larson, School of Information, University of California, Berkeley

September 21, 2007 CLEF Corfu, Greece

Task
- Capture intermediate stages of text processing for IR and export them in an XML format that can be integrated with other (compatible) systems
- This year looks at stages including tokenization and stemming, but does not address index creation or the actual retrieval process
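
As a rough illustration of the task, the sketch below runs two stages (tokenization, then stemming) and keeps each stage's output so that it could be exported separately. It is not Cheshire II code: the function names, the regular expression, and the toy suffix-stripping rule are all assumptions made for the example, and a real system would use a proper stemmer.

```python
import re


def tokenize(text):
    """Split text into raw tokens on runs of letters and digits (an assumed rule)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def toy_stem(token):
    """Lowercase and strip a few common suffixes. A stand-in, not a real stemmer."""
    word = token.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def run_stages(text):
    """Return the output of every stage, not just the final one."""
    raw_tokens = tokenize(text)
    stems = [toy_stem(t) for t in raw_tokens]
    return {"tokenizer": raw_tokens, "stemmer": stems}


stages = run_stages("Indexing shared streams")
print(stages["tokenizer"])
print(stages["stemmer"])
```

Keeping the per-stage outputs in one mapping is only a convenience here; the point is that each intermediate result is available for export before the next stage consumes it.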

Task
- One goal is for systems to be able to both export and import intermediate processing streams, and eventually to share them
- We also hope to use others' streams as inputs for subtasks that we currently cannot do, or cannot do effectively
  - Such as decompounding German words

Adapting Cheshire II for
- Cheshire II is a suite of C programs for IR comprising over 150K lines of code
- The main programs are the indexer and several server and client programs where retrieval is performed
- Since identical text processing must be used in both indexing and search, those modules are shared across several programs

Adapting Cheshire II for
- For this task we created a special version of the main Cheshire indexing program, which included:
  - A new module to output the XML streams
  - A significant number of changes to the source code for particular modules
  - Many changes involved passing more information into lower levels of the call hierarchy via new parameters
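
The last point can be pictured as follows: a low-level routine that once needed only the text now also receives the document context and an output sink, so it can report what it produces. This is a Python sketch of the idea only. Cheshire II itself is written in C, and every name here (`StreamContext`, `index_record`, `extract_tokens`) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StreamContext:
    """Extra information threaded down the call hierarchy (hypothetical)."""
    stream_name: str
    doc_id: int
    emitted: list = field(default_factory=list)


def extract_tokens(text, ctx=None):
    """Lowest level: split on whitespace; report each token if a context is given."""
    tokens = text.split()
    if ctx is not None:
        for position, token in enumerate(tokens):
            identifier = f"{ctx.stream_name}-{ctx.doc_id}-{position}"
            ctx.emitted.append((identifier, token))
    return tokens


def index_record(text, ctx=None):
    """Mid level: previously took only the text; now forwards the new parameter."""
    return extract_tokens(text, ctx)


ctx = StreamContext(stream_name="raw_tokens", doc_id=1)
index_record("LA070294 056774", ctx)
print(ctx.emitted)
```

Making the new parameter optional keeps the normal indexing path unchanged when no stream output is requested, which is one plausible way to limit the impact of such changes.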

Issues
- The tasks assume a "bag of words"
- Cheshire, however, is an SGML/XML search system, and the tasks as currently defined do not consider structural analysis and faceted indexing
  - E.g., there is no provision for multiple indexes taken from different parts of the overall records, as determined by the SGML/XML tags
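
To make the contrast concrete, the sketch below builds separate term lists from different elements of one record, which a single bag of words cannot express. The record layout and the index names (`title`, `body`) are invented for illustration and are not Cheshire II configuration.

```python
import xml.etree.ElementTree as ET

# A made-up record; real collections would have their own tag sets.
RECORD = "<doc><title>Grid retrieval</title><body>Streams of tokens</body></doc>"


def bag_of_words(record_xml):
    """Flatten the whole record into one term list, ignoring structure."""
    root = ET.fromstring(record_xml)
    return " ".join(root.itertext()).lower().split()


def faceted_terms(record_xml, index_paths):
    """Build one term list per index, each drawn from its own element path."""
    root = ET.fromstring(record_xml)
    indexes = {}
    for name, path in index_paths.items():
        terms = []
        for element in root.findall(path):
            terms.extend("".join(element.itertext()).lower().split())
        indexes[name] = terms
    return indexes


flat = bag_of_words(RECORD)
faceted = faceted_terms(RECORD, {"title": "title", "body": "body"})
print(flat)
print(faceted)
```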

Issues
- No specification of how unique identifiers for tokens, documents, etc. are to be derived
- In Cheshire II the unique document identifier is just a serial number assigned in the first stage of collection processing (before any tokenizing, parsing, etc. is done)
- There are also term ID numbers assigned to unique terms in an index
  - But not until a much later stage in our normal processing
- Other participants made different choices, revealing a challenge for interoperability
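
The ordering described here can be sketched in a few lines: document serial numbers are handed out as records are first read, while term IDs only appear once an index is being built. The class and function names are hypothetical, and the token identifier layout (stream name, document number, token position) is inferred from the sample output in these slides rather than from any specification.

```python
from itertools import count


class Collection:
    """Assigns document serial numbers at the first processing stage."""

    def __init__(self):
        self._next_doc = count(1)
        self.docs = {}

    def add(self, text):
        doc_id = next(self._next_doc)
        self.docs[doc_id] = text
        return doc_id


def token_identifier(stream_name, doc_id, position):
    """Build a token identifier as stream-doc-position (inferred layout)."""
    return f"{stream_name}-{doc_id}-{position}"


def build_term_ids(collection):
    """Much later stage: give each unique term an ID the first time it is seen."""
    term_ids = {}
    for doc_id in sorted(collection.docs):
        for term in collection.docs[doc_id].lower().split():
            term_ids.setdefault(term, len(term_ids) + 1)
    return term_ids


coll = Collection()
first = coll.add("grid streams")
second = coll.add("Grid tokens")
term_ids = build_term_ids(coll)
print(first, second, term_ids)
print(token_identifier("Cheshire_Raw_Tokens", first, 0))
```

Because term IDs depend on the order in which terms are first seen, two systems processing the same collection differently would not agree on them, which is the interoperability concern raised in the slide.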

<circo xmlns:xs="…" xmlns:xsi="…2001/XMLSchema-instance" xmlns="…"
       xsi:schemalocation="…" xmlns:dc="…">
  Cheshire II Grid Version
  Copyright (c) Regents of the University of California, All Rights Reserved.
  Thu Aug 20 18:42:
  <stream identifier="Cheshire_Raw_Tokens_/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar" chunked="false" chunk-number="0" last-chunk="false" digest-type="NONE">
    <component identifier="cheshire_idxdata1" type="tokenizer" description="A tokenizer separates an input document into a stream of tokens.">

<resource identifier="/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1" mime-type="text/plain">
  <stream identifier="Cheshire_Raw_Tokens_/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar" chunked="false" chunk-number="0" last-chunk="false" digest-type="NONE" />
<tokens>
  <token identifier="Cheshire_Raw_Tokens_/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1-0" value="LA ">
    <resource identifier="/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1" mime-type="text/plain"/>
  <token identifier="Cheshire_Raw_Tokens_/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1-1" value="LA070294">
    <resource identifier="/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1" mime-type="text/plain"/>
  <token identifier="Cheshire_Raw_Tokens_/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1-2" value="056774">
    <resource identifier="/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1" mime-type="text/plain"/>
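
Output of this shape could be produced with a streaming writer along the following lines. This is a hedged sketch using Python's standard `xml.sax.saxutils.XMLGenerator`. The element and attribute names are copied from the excerpt, but the nesting, the closing tags, and the helper `write_token_stream` are assumptions, and namespaces are omitted.

```python
import io
from xml.sax.saxutils import XMLGenerator


def write_token_stream(out, stream_id, resource_id, tokens):
    """Write one <tokens> block, emitting each token as soon as it is seen."""
    gen = XMLGenerator(out, encoding="utf-8")
    gen.startElement("tokens", {})
    for position, value in enumerate(tokens):
        # Attribute values are escaped by the generator (e.g. & becomes &amp;).
        gen.startElement("token", {
            "identifier": f"{stream_id}-{position}",
            "value": value,
        })
        gen.startElement("resource", {
            "identifier": resource_id,
            "mime-type": "text/plain",
        })
        gen.endElement("resource")
        gen.endElement("token")
    gen.endElement("tokens")


buffer = io.StringIO()
write_token_stream(buffer, "stream-1", "doc-1", ["LA070294", "R&D"])
xml_text = buffer.getvalue()
print(xml_text)
```

A streaming generator is used instead of building a full tree because token streams for a whole collection can be very large; nothing needs to be held in memory once it has been written.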

Sizes of Output Files

Conclusions
- The task turned out to be useful in uncovering unrecognized bugs in the system
  - E.g., dual extraction for hyphenated terms was extracting only the first term of a hyphenated pair, not both
- Because the streams (so far) can be dumped as encountered, the performance impact is only the I/O overhead
- Revisiting the text processing of the system suggested some possible new functions at this level
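
The hyphenation bug is easy to picture in miniature. The sketch below assumes that "dual extraction" means emitting both components of a hyphenated pair (the slide does not say whether the whole hyphenated form is also kept, so it is left out here). `extract_first_only` mimics the faulty behaviour and `extract_both` the intended one; both names are hypothetical.

```python
def extract_first_only(token):
    """Faulty behaviour: only the first component of a hyphenated pair survives."""
    return token.split("-")[:1]


def extract_both(token):
    """Intended behaviour: every non-empty component is extracted."""
    return [part for part in token.split("-") if part]


def dump_terms(tokens, extractor, out):
    """Append terms as they are encountered.

    In a real system this would be a write to the output stream,
    so the only added cost is I/O.
    """
    for token in tokens:
        for term in extractor(token):
            out.append(term)


buggy, fixed = [], []
dump_terms(["cross-language", "retrieval"], extract_first_only, buggy)
dump_terms(["cross-language", "retrieval"], extract_both, fixed)
print(buggy)
print(fixed)
```

Comparing the two dumped streams side by side shows the missing term at once, which is the kind of inspection that exporting intermediate streams makes possible.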

Conclusions
- The challenge will be to make the stream representations universal enough for sharing and combining different systems' results for different stages