Published by Marshall Fletcher. Modified over 8 years ago.
1
10/24/2002 R. Scott Cost - CADIP, UMBC
CARROT II: Collaborative Agent-based Routing and Retrieval of Text, Version 2
CADIP Fall Research Symposium, 2002
2
Mission
- Serve the current and future information needs of the community through the construction of a powerful yet flexible, high-bandwidth distributed IR system, which can integrate information from a variety of sources
- Create a testbed for research in a variety of IR issues
- Foster new and ongoing IR research at UMBC, CADIP’s affiliates and sponsor organization
3
Reports
The presentation consists of three reports:
- Project Progress and Status
- TREC Participation
- Current Student Research
4
1: Project Status
- Overview
- Current Status
- Progress
- Current Issues
- Goals
- Contact Details
- Summary
5
Overview
During the past year, the C2 Project has made substantial progress towards its current goals, and has continued to grow, both in size and in the variety of relevant research directions.
6
Status
Currently, we have:
- A DIR system which is portable, scalable, and which has the potential to support mixed collections of information sources.
- Nodes for classic IR, web search, crawling.
- A Java-based IR engine (WONDIR).
- An integrated version of the Telltale IR engine.
7
Progress
Since Last Review
- Completion of Telltale integration
- Advances in WONDIR’s scalability
- First formal C2 presentation, in Madrid
- First TREC participation
Since Last Symposium
- Full, working C2 system
- WONDIR IR Engine
- S. Kallurkar’s Master’s Thesis
8
Current Issues
Some issues of significant concern are:
- Scalability – Telltale and WONDIR need to index more data, and in less time.
- Metadata – Needs to be extended to support the integration and fusion of results from different sources.
- Semantic Web – How can we use semantic markup in queries and handle it in text?
- Streams – The logical extension of large, extremely dynamic corpora.
9
3/6/12 (from 9/2001)
- 3: Exercise system and prepare initial results for publication.
- 6: Expand system. Heavy evaluation, and preparation for debut.
- 12: Extensions (routing algorithms, fusion, metadata combination…).
10
Goals (3/6/12)
3
- Presentations at TREC
- Submissions to SIGIR, AAMAS and WWW
6
- Resolution of scaling problems, indexing 2G/node easily
- Integration of semantic markup, ‘magnification’
12
- Successful second round of TREC
- Integration and fusion of multiple source types
- Support for data streams
11
Summary
The C2 project is making steady progress towards its goal of high-bandwidth IR from distributed, heterogeneous sources.
12
For More Information …
For more details on the goals and design of the project, see the documents on the project site: http://www.csee.umbc.edu/~cost/carrot2/
C2 is powered by:
- Jackal – An Agent Communications Infrastructure.
- The WONDIR Engine.
- Telltale.
* The C2 project is supported in part by the U.S. Department of Defense.
13
2: TREC Participation
- Overview
- TREC
- TREC’s Web Track
- Topic Distillation
- Approach
- Results
- Plans
- Summary
14
Overview
This year, C2 made its first successful entry in the TREC event.
15
TREC
An annual event, organized by NIST, in which many IR groups gather to test their current systems’ ability to solve various IR problems. The TREC event is organized into tracks, each of which focuses on a particular type of problem or data.
16
TREC’s Web Track
Focus is web data.
Data set: a crawl of the .gov domain.
- 18.1 Gigabytes
- 1.25 Million documents
- Crawled early 2002
Two tasks:
- Homepage Finding
- Topic Distillation
17
Topic Distillation
Given an information need (query), find the best ‘resource page’ for that need. This is not necessarily the page which best matches the contents of the query; a page is also valued for its links to other valuable pages.
18
Approach
Given a collection of pages and a query:
- Compute query similarity to each page, using VSM and cosine similarity
- Consider 1000 top-ranked documents
- Decorate subcollection with similarities
- Employ a spreading activation function to propagate relevance
- Select the top ranked documents in the resulting graph
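The steps of this approach can be sketched roughly as follows. This is a minimal illustration, not the C2 implementation: the function names, the raw term-frequency weighting, and the spreading rule (one round, a fixed `decay`, relevance flowing from a linked page back to the page that links to it) are all assumptions made for the example, since the slides do not give the actual weight equation.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_pages(query, pages, top_k=1000):
    """Score every page against the query; keep the top_k as {page_id: score}."""
    q_vec = Counter(query.lower().split())
    scores = {
        page_id: cosine_similarity(q_vec, Counter(text.lower().split()))
        for page_id, text in pages.items()
    }
    best = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]
    return dict(best)


def spread_activation(scores, links, decay=0.5, rounds=1):
    """Propagate relevance over the link graph of the subcollection.

    ASSUMPTION: each page gains decay * (score of every page it links to).
    The direction, decay value and update rule are illustrative only.
    """
    current = dict(scores)
    for _ in range(rounds):
        updated = dict(current)
        for source, targets in links.items():
            if source not in current:
                continue  # page fell outside the top-ranked subcollection
            for target in targets:
                if target in current:
                    updated[source] += decay * current[target]
        current = updated
    return current
```

With this rule, a page that links to several well-matching pages can outrank pages that match the query equally well but link to nothing, which is the behaviour topic distillation rewards.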
19
Results
We submitted 5 runs:
- 2 Raw similarity
  - Flood query to all nodes
  - Send query to N best nodes
- 3 Integrating link topology information
  - Variations on the same weight equation
(last three runs based on similarity computed in first)
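The difference between the two raw-similarity runs can be illustrated with a small node-selection sketch. It assumes, purely for illustration, that each node is described by a term-frequency summary of its collection and that nodes are ranked by cosine similarity to the query; `select_nodes` and the node names are hypothetical and are not part of C2 or Jackal.

```python
import math
from collections import Counter


def _cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_nodes(query, node_metadata, n_best=None):
    """Return the ids of the nodes that should receive the query.

    n_best=None floods the query to every node; otherwise only the
    n_best nodes whose metadata best matches the query are returned.
    ASSUMPTION: metadata is a term-frequency Counter per node.
    """
    if n_best is None:
        return sorted(node_metadata)
    q_vec = Counter(query.lower().split())
    ranked = sorted(
        node_metadata,
        key=lambda node: (-_cosine(q_vec, node_metadata[node]), node),
    )
    return ranked[:n_best]
```

Flooding needs no metadata but contacts every node; sending to the N best nodes reduces traffic at the cost of depending on how well the metadata describes each collection.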
20
TREC Baseline Run (figure)
21
Baseline Diff. from Median (figure)
22
TREC TD Run (figure)
23
Plans for the Future
In preparation for next year’s competition:
- Improve scale
- Investigate work in propagating information (this was a new area for us)
- Employ ideas from ongoing work in scent and credibility.
24
Summary
For a first-time entry, C2 did reasonably well:
- Performance similar to median for baseline
- Performance below median with topology information
25
3: Student Research
- Overview
- Highlights
- Ongoing Research
- Spotlight on:
  - Data Fusion
  - Document Summarization
  - Query Caching
- Open Questions
- Summary
26
Overview
The C2 Project is a multi-faceted effort which encompasses a broad range of research questions. Many of these questions are currently being investigated by UMBC students, both within the context of the project’s goals, and as part of their own academic research.
27
Highlights
- Srikanth Kallurkar
- Yongmei Shi
- Hemali Majithia
- Christopher James
- Akshay Java
- Sachin Bhatkar
- Dayn Harum
- Sowjanya Rajavaram
- Matt Siegel
- Drew Ogle
28
Highlights: S. Kallurkar
- Ph.D. Student
- Topic: Results Fusion (Master’s Topic: Clustering)
- C2 Technical Lead
- Wrote the first C2 Master’s Thesis, on online clustering in a DIR system.
29
Highlights: Y. Shi
- Ph.D. Student
- Research: Document Summarization for Metadata
- Metadata expert in residence
- Developer – C2 Web Search Agent
- Implemented first infrastructure prototype
30
Highlights: H. Majithia
- M.S. Student
- Topic: Query Caching in DIR
- Collection Librarian, TREC Liaison
- Testing and Evaluation
- Developer – Query/Client agents
31
Highlights: C. James
- M.S. Student
- Topic: Inferring Document Credibility
- Java Performance Task Force
- Developer – GUI Query Interfaces
32
Highlights: A(kshay) Java
- M.S. Student
- Topic: Information Scent for Web Search
- Recently completed an internship at PARC
- Heading C2 task force on Java performance
- Developer – C2 Web Crawler agent
33
Highlights: S. Bhatkar
- M.S. Student
- Topic: Query Expansion/Enhancement
- Java Performance Task Force
34
Highlights: D. Harum
- M.S. Student
- Topic: Java Real Time Performance Monitoring (applied to WONDIR)
- Integrated monitoring code into SIRE file system, evaluated caching strategies.
35
Highlights: S. Rajavaram
- M.S. Student
- Topic: Protocols for Interaction in a Multi-Agent System
- Java Performance Task Force
- Newest member of the C2 team
36
Highlights: M. Siegel
- M.S. Student
- Employed by the Sponsor
- Worked on C2/Telltale integration
- Developer – Distributed file system layer
37
Highlights: T. Laufert
- M.S. Student
- Employed by the Sponsor
- Developer – Document flow visualization tools for C2
38
Highlights: D. Ogle
- Undergraduate Student
- Resident Telltale Engineer
- Integrated Telltale into the C2 system.
- Also provides Telltale support for ID group.
39
Spotlight: Data Fusion
Results fusion is an essential component in the success of a distributed IR system. It is especially difficult when information sources in the system vary widely in content and form.
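To make the difficulty concrete, here is a minimal sketch of one common baseline for fusing ranked lists whose scores are on different scales: min-max normalize each list, then sum the normalized scores per document (often called CombSUM). This is an assumption chosen for illustration; the slides do not say which fusion methods C2 uses or compares, and the function names are hypothetical.

```python
def min_max_normalize(results):
    """Rescale one source's {doc_id: score} map into the range [0, 1]."""
    if not results:
        return {}
    low = min(results.values())
    high = max(results.values())
    if high == low:
        # All scores equal: treat every document as equally relevant.
        return {doc: 1.0 for doc in results}
    return {doc: (score - low) / (high - low) for doc, score in results.items()}


def fuse(result_lists):
    """Merge several {doc_id: score} maps into one ranked list.

    Each source is normalized independently, then normalized scores are
    summed per document. Ties are broken by document id for determinism.
    """
    fused = {}
    for results in result_lists:
        for doc, score in min_max_normalize(results).items():
            fused[doc] = fused.get(doc, 0.0) + score
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))
```

Without the normalization step, a source that reports scores in the range 0 to 12 would dominate one that reports cosine similarities between 0 and 1, regardless of actual relevance.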
40
Spotlight: Document Summarization
Successful collection selection and comparison depend on accurate metadata. Document summarization may lead us to more compact and richer metadata descriptions of collections.
41
Spotlight: Query Caching
By caching query results and returning approximate answers, we hope to reduce the overhead of repeatedly processing similar queries in a distributed environment.
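One way to picture this idea is a cache that answers a new query with the stored results of a sufficiently similar earlier query. The sketch below is an assumption for illustration only: the class name, the use of Jaccard overlap between query term sets, and the `threshold` value are hypothetical and are not taken from the C2 work.

```python
class ApproximateQueryCache:
    """Cache that returns results of the most similar previously seen query."""

    def __init__(self, threshold=0.6):
        # Minimum Jaccard overlap for a cached answer to be reused.
        self.threshold = threshold
        self._entries = []  # list of (frozenset of terms, results)

    @staticmethod
    def _terms(query):
        return frozenset(query.lower().split())

    def put(self, query, results):
        """Store results, replacing any entry with the same term set."""
        terms = self._terms(query)
        self._entries = [(t, r) for t, r in self._entries if t != terms]
        self._entries.append((terms, results))

    def get(self, query):
        """Return cached results for the closest query, or None on a miss."""
        terms = self._terms(query)
        best_results, best_score = None, 0.0
        for cached_terms, results in self._entries:
            union = terms | cached_terms
            if not union:
                continue
            score = len(terms & cached_terms) / len(union)
            if score > best_score:
                best_results, best_score = results, score
        return best_results if best_score >= self.threshold else None
```

A hit avoids routing the query to remote nodes at all; the price is that the answer may be only approximately right for the new query, so the threshold trades saved work against answer quality.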
42
Open Issues
Semantic Web: There is still much to be done in integrating the semantic web into C2.
- Indexing and enhancement of marked data
- Use of markup in routing and fusion
- Presentation of mixed-type results
Data streams
43
Summary
In the past 2+ years, the C2 project has generated and sustained significant interest and research in both practical and theoretical aspects of Distributed Information Retrieval. By the end of the Fall semester, C2 will have produced 3 Master’s degrees, and will have contributed to several others.
44
Bibliography
- Cost et al., CARROT II: Collaborative Agent-based Routing and Retrieval of Text, Proceedings of the Fall 2001 CADIP Research Symposium.
- Cost et al., Integrating Distributed Information Sources with CARROT II, Proceedings of the Workshop on Cooperative Information Agents (CIA), 2002.
- Kallurkar, Document Migration in Distributed Information Retrieval, Masters Thesis for UMBC CSEE, 2002.
In Preparation:
- Cost et al., ---, Proceedings of the Fall 2002 CADIP Research Symposium.
- Cost, WONDIR.
- Harum, ---, Masters Project for UMBC CSEE.
- Java et al., Integrating Web Sources with Distributed IR.
- Kallurkar et al., Comparison of Results Fusion Methods.
- Majithia, Investigation of Caching Mechanisms in Multi-Agent Based Architecture for Distributed Information Retrieval Systems, Masters Thesis for UMBC CSEE.
45
Bibliography…
Also of note:
- T. Oates, V. Bhat, V. Shanbhag, Using Latent Semantic Analysis to Find Different Names for the Same Entity in Free Text, Proceedings of WIDM, CIKM ’02.
- U. Shah, Information Retrieval on the Semantic Web, Masters Thesis, UMBC CSEE, Spring 2002.
- U. Shah, T. Finin, A. Joshi, R. S. Cost, J. Mayfield, Information Retrieval on the Semantic Web, Proceedings CIKM ’02.