10/24/2002. R. Scott Cost - CADIP, UMBC. CARROT II: Collaborative Agent-based Routing and Retrieval of Text, Version 2. CADIP Fall Research Symposium, 2002.

Presentation transcript:

Mission
- Serve the current and future information needs of the community through the construction of a powerful yet flexible, high-bandwidth distributed IR system that can integrate information from a variety of sources
- Create a testbed for research in a variety of IR issues
- Foster new and ongoing IR research at UMBC, CADIP’s affiliates and sponsor organization

Reports
Presentation will consist of three reports:
- Project Progress and Status
- TREC Participation
- Current Student Research

1: Project Status
- Overview
- Current Status
- Progress
- Current Issues
- Goals
- Contact Details
- Summary

Overview
During the past year, the C2 Project has made substantial progress towards its current goals, and has continued to expand and thrive, both in size and in variety of relevant research directions.

Status
Currently, we have:
- A DIR system which is portable, scalable, and which has the potential to support mixed collections of information sources.
- Nodes for classic IR, web search, crawling.
- A Java-based IR engine (WONDIR).
- An integrated version of the Telltale IR engine.

Progress
Since Last Review
- Completion of Telltale integration
- Advances in WONDIR’s scalability
- First formal C2 presentation, in Madrid
- First TREC participation
Since Last Symposium
- Full, working C2 system
- WONDIR IR Engine
- S. Kallurkar’s Master’s Thesis

Current Issues
Some issues of significant concern are:
- Scalability: Telltale and WONDIR need to index more data, and in less time.
- Metadata: Needs to be extended to support the integration and fusion of results from different sources.
- Semantic Web: How can we use semantic markup in queries and handle it in text?
- Streams: The logical extension of large, extremely dynamic corpora.

3/6/12 (from 9/2001)
- 3: Exercise system and prepare initial results for publication.
- 6: Expand system. Heavy evaluation, and preparation for debut.
- 12: Extensions (routing algorithms, fusion, metadata combination…).

Goals (3/6/12)
3:
- Presentations at TREC
- Submissions to SIGIR, AAMAS and WWW
6:
- Resolution of scaling problems, indexing 2G/node easily
- Integration of semantic markup, ‘magnification’
12:
- Successful second round of TREC
- Integration and fusion of multiple source types
- Support for data streams

Summary
The C2 project is making steady progress towards its goal of high-bandwidth IR from distributed, heterogeneous sources.

For More Information …
For more details on the goals and design of the project, individuals are referred to documents on the Project site: st/carrot2/
C2 is powered by:
- Jackal: An Agent Communications Infrastructure.
- The WONDIR Engine.
- Telltale.
* The C2 project is supported in part by the U.S. Department of Defense.

2: TREC Participation
- Overview
- TREC
- TREC’s Web Track
- Topic Distillation
- Approach
- Results
- Plans
- Summary

Overview
This year, C2 made its first successful entry in the TREC event.

TREC
An annual event, organized by NIST, in which many IR groups gather to test their current systems’ ability to solve various IR problems. The TREC event is organized into tracks, each of which focuses on a particular type of problem or data.

TREC’s Web Track
- Focus is web data.
- Data set: a crawl of the .gov domain
  - Gigabytes
  - 1.25 million documents
  - Crawled early 2002
- Two tasks:
  - Homepage Finding
  - Topic Distillation

Topic Distillation
Given an information need (query), find the best ‘resource page’ for that need. This is not necessarily the page which best matches the contents of the query; a page also gains value from linking to other pages of value.

Approach
Given a collection of pages and a query:
1. Compute query similarity to each page, using VSM and cosine similarity
2. Consider 1000 top-ranked documents
3. Decorate subcollection with similarities
4. Employ a spreading activation function to propagate relevance
5. Select the top ranked documents in the resulting graph
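The steps of this approach can be illustrated with a small sketch. It is not the C2 implementation: the slides do not give the weighting scheme or the activation function, so the sketch assumes raw term counts for the vectors and a single propagation round in which each page receives a decayed share (the hypothetical `decay` parameter) of the similarity of the pages it links to.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_with_activation(query, pages, links, top_k=1000, decay=0.5):
    """Rank pages for a query, rewarding pages that link to similar pages.

    pages: {page_id: text}; links: {page_id: [linked page_ids]}.
    decay is an assumed parameter, not a value from the slides.
    """
    q_vec = Counter(query.lower().split())
    # Step 1: vector space similarity between the query and every page.
    sims = {p: cosine(q_vec, Counter(t.lower().split())) for p, t in pages.items()}
    # Steps 2-3: keep the top_k pages, each decorated with its similarity.
    top = sorted(sims, key=lambda p: (-sims[p], p))[:top_k]
    base = {p: sims[p] for p in top}
    # Step 4: one round of spreading activation along out-links that stay
    # inside the subcollection.
    activated = {}
    for p in base:
        linked = [t for t in links.get(p, []) if t in base and t != p]
        activated[p] = base[p] + decay * sum(base[t] for t in linked)
    # Step 5: rank by the propagated score.
    return sorted(activated.items(), key=lambda item: (-item[1], item[0]))
```

With `decay=0.0` the function reduces to plain similarity ranking, which makes it easy to compare a baseline run against a run that uses link topology.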

Results
We submitted 5 runs:
- 2 Raw similarity
  - Flood query to all nodes
  - Send query to N best nodes
- 3 Integrating link topology information
  - Variations on the same weight equation
(last three runs based on similarity computed in first)
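The difference between the two raw-similarity runs is the routing decision. The sketch below is illustrative only: it assumes each node advertises a term-weight summary (the hypothetical `node_summaries` mapping) and scores a node by summing the weights of the query terms, which may differ from how C2 actually selects nodes.

```python
def select_nodes(query_terms, node_summaries, n=None):
    """Return the node ids that should receive the query.

    node_summaries: {node_id: {term: weight}} (an assumed metadata format).
    n=None floods the query to every node; otherwise only the n
    best-scoring nodes are chosen.
    """
    if n is None:
        return sorted(node_summaries)

    def score(node):
        summary = node_summaries[node]
        return sum(summary.get(term, 0.0) for term in query_terms)

    # Highest score first; ties broken by node id for a stable result.
    ranked = sorted(node_summaries, key=lambda node: (-score(node), node))
    return ranked[:n]
```

Flooding costs one request per node for every query, while sending to the N best nodes reduces traffic but risks missing relevant documents held by nodes whose summaries score poorly.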

[Figure: TREC Baseline Run]

[Figure: Baseline Diff. from Median]

[Figure: TREC TD Run]

Plans for the Future
In preparation for next year’s competition:
- Improve scale
- Investigate work in propagating information (this was a new area for us)
- Employ ideas from ongoing work in scent and credibility.

Summary
For a first time entry, C2 did reasonably well:
- Performance similar to median for baseline
- Performance below median with topology information

3: Student Research
- Overview
- Highlights
- Ongoing Research
- Spotlight on:
  - Data Fusion
  - Document Summarization
  - Query Caching
- Open Questions
- Summary

Overview
The C2 Project is a multi-faceted effort which encompasses a broad range of research questions. Many of these questions are currently being investigated by UMBC students, both within the context of the project’s goals, and as part of their own academic research.

Highlights
- Srikanth Kallurkar
- Yongmei Shi
- Hemali Majithia
- Christopher James
- Akshay Java
- Sachin Bhatkar
- Dayn Harum
- Sowjanya Rajavaram
- Matt Siegel
- Drew Ogle

Highlights: S. Kallurkar
- Ph.D. Student
- Topic: Results Fusion (Masters Topic: Clustering)
- C2 Technical Lead
- Wrote the first C2 Masters Thesis, on online clustering in a DIR system.

Highlights: Y. Shi
- Ph.D. Student
- Research: Document Summarization for Metadata
- Metadata expert in residence
- Developer: C2 Web Search Agent
- Implemented first infrastructure prototype

Highlights: H. Majithia
- M.S. Student
- Topic: Query Caching in DIR
- Collection Librarian, TREC Liaison
- Testing and Evaluation
- Developer: Query/Client agents

Highlights: C. James
- M.S. Student
- Topic: Inferring Document Credibility
- Java Performance Task Force
- Developer: GUI Query Interfaces

Highlights: A(kshay). Java
- M.S. Student
- Topic: Information Scent for Web Search
- Recently completed an internship at PARC
- Heading C2 task force on Java performance
- Developer: C2 Web Crawler agent

Highlights: S. Bhatkar
- M.S. Student
- Topic: Query Expansion/Enhancement
- Java Performance Task Force

Highlights: D. Harum
- M.S. Student
- Topic: Java Real Time Performance Monitoring (applied to WONDIR)
- Integrated monitoring code into SIRE file system, evaluated caching strategies.

Highlights: S. Rajavaram
- M.S. Student
- Topic: Protocols for Interaction in a Multi-Agent System
- Java Performance Task Force
- Newest member of the C2 team

Highlights: M. Siegel
- M.S. Student
- Employed by the Sponsor
- Worked on C2/Telltale integration
- Developer: Distributed file system layer

Highlights: T. Laufert
- M.S. Student
- Employed by the Sponsor
- Developer: Document flow visualization tools for C2

Highlights: D. Ogle
- Undergraduate Student
- Resident Telltale Engineer
- Integrated Telltale into the C2 system.
- Also provides Telltale support for ID group.

Spotlight: Data Fusion
Results fusion is an essential component in the success of a distributed IR system. It is especially difficult when information sources in the system vary widely in content and form.
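To make the difficulty concrete: scores from different engines are not directly comparable, so they have to be put on a common scale before merging. The sketch below uses min-max normalization followed by score summation, a common baseline for results fusion. It is offered as an illustration under that assumption and is not the fusion method studied in the project; the source names in the usage example are placeholders.

```python
def fuse(result_lists):
    """Merge per-source result lists into one ranking.

    result_lists: {source_name: [(doc_id, raw_score), ...]}.
    Scores are min-max normalized within each source, then summed per
    document across sources.
    """
    fused = {}
    for results in result_lists.values():
        if not results:
            continue
        scores = [score for _, score in results]
        low, high = min(scores), max(scores)
        span = high - low
        for doc_id, score in results:
            # A source whose scores are all equal contributes 1.0 per document.
            normalized = (score - low) / span if span else 1.0
            fused[doc_id] = fused.get(doc_id, 0.0) + normalized
    # Highest fused score first; ties broken by document id.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))
```

For example, `fuse({"engine_a": [("d1", 12.0), ("d2", 6.0)], "engine_b": [("d2", 0.9), ("d4", 0.1)]})` ranks `d2` first because both sources returned it, even though its raw scores sit on very different scales.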

Spotlight: Document Summarization
Successful collection selection and comparison depend on accurate metadata. Document summarization may lead us to more compact and richer metadata descriptions of collections.

Spotlight: Query Caching
By caching query results and returning approximate answers, we hope to reduce the overhead of repeatedly processing similar queries in a distributed environment.
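One way to picture approximate answers from a cache is a lookup that accepts a sufficiently similar earlier query instead of requiring an exact match. The sketch below assumes Jaccard overlap of query terms as the similarity measure and a hypothetical `threshold` parameter; the slides do not say how C2 measures query similarity, so both are illustrative choices.

```python
class ApproximateQueryCache:
    """Cache that returns stored results for sufficiently similar queries."""

    def __init__(self, threshold=0.6):
        # Minimum Jaccard similarity for a cached entry to count as a hit.
        self.threshold = threshold
        self._entries = []  # list of (frozenset of terms, results)

    @staticmethod
    def _terms(query):
        return frozenset(query.lower().split())

    def put(self, query, results):
        self._entries.append((self._terms(query), results))

    def get(self, query):
        """Return results of the most similar cached query, or None on a miss."""
        terms = self._terms(query)
        best, best_sim = None, 0.0
        for cached_terms, results in self._entries:
            union = terms | cached_terms
            sim = len(terms & cached_terms) / len(union) if union else 0.0
            if sim > best_sim:
                best, best_sim = results, sim
        return best if best_sim >= self.threshold else None
```

A lower threshold saves more distributed round trips but returns answers that may fit the new query less well, which is the trade-off the spotlight describes.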

Open Issues
- Semantic Web: There is much to be done still in integrating issues of the semantic web into C2.
  - Indexing and enhancement of marked data
  - Use of markup in routing and fusion
  - Presentation of mixed-type results
- Data streams

Summary
In the past 2+ years, the C2 project has generated and sustained significant interest and research in both practical and theoretical aspects of Distributed Information Retrieval. By the end of the Fall semester, C2 will have earned 3 Masters degrees, and will have contributed to several others.

Bibliography
- Cost et al., CARROT II: Collaborative Agent-based Routing and Retrieval of Text, Proceedings of the Fall 2001 CADIP Research Symposium.
- Cost et al., Integrating Distributed Information Sources with CARROT II, Proceedings of the Workshop on Cooperative Information Agents (CIA),
- Kallurkar, Document Migration in Distributed Information Retrieval, Masters Thesis for UMBC CSEE,
In Preparation:
- Cost et al., ---, Proceedings of the Fall 2002 CADIP Research Symposium.
- Cost, WONDIR.
- Harum, ---, Masters Project for UMBC CSEE.
- Java et al., Integrating Web Sources with Distributed IR.
- Kallurkar et al., Comparison of Results Fusion Methods.
- Majithia, Investigation of Caching Mechanisms in Multi-Agent Based Architecture for Distributed Information Retrieval Systems, Masters Thesis for UMBC CSEE.

Bibliography…
Also of note:
- T. Oates, V. Bhat, V. Shanbhag, Using Latent Semantic Analysis to Find Different Names for the Same Entity in Free Text, Proceedings of WIDM, CIKM ’02.
- U. Shah, Information Retrieval on the Semantic Web, Masters Thesis, UMBC CSEE, Spring
- U. Shah, T. Finin, A. Joshi, R. S. Cost, J. Mayfield, Information Retrieval on the Semantic Web, Proceedings CIKM ’02.