IVia and Data Fountains: Open Source Internet Portal System and Metadata Generation Service for Amplifying the Efforts of Subject Experts Julie Mason,

Slides:

Advertisements

Similar presentations

Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan.

Advertisements

Chapter 5: Introduction to Information Retrieval

Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:

Natural Language Processing WEB SEARCH ENGINES August, 2002.

1. The Digital Library Challenge The Hybrid Library Today’s information resources collections are “hybrid” Combinations of - paper and digital format.

Advanced Searching Engineering Village.

Engineering Village ™ Basic Searching.

Chapter 2. Slide 1 CULTURAL SUBJECT GATEWAYS CULTURAL SUBJECT GATEWAYS Subject Gateways  Started as links of lists  Continued as Web directories  Culminated.

WWW Challenges : Supporting Users in Search and Navigation Natasa Milic-Frayling Microsoft Research, Cambridge UK SOFSEM 2004 January 28, 2004.

Information & Library Services Australian Education Index, British Education Index and ERIC Sally Giffen August 2006.

NSF – DLF – JISC/UKOLN Digital Library Service Registry Workshop National Science Foundation, Arlington, VA March 2006 The University of Illinois.

Information Retrieval in Practice

December 9, 2002 Cheshire II at INEX -- Ray R. Larson Cheshire II at INEX: Using A Hybrid Logistic Regression and Boolean Model for XML Retrieval Ray R.

Best Web Directories and Search Engines Order Out of Chaos on the World Wide Web.

Search Strategy – Scopus Margaret Vugrin, MSLS, AHIP January 1, 2015.

Search Strategies Online Search Techniques. Universal Search Techniques Precision- getting results that are relevant, “on topic.” Recall- getting all.

Introduction to Library Research Gabriela Scherrer Reference Librarian for English Languages and Literatures, University Library of Bern.

Exploring Bridges Between Open and Vetted Information Domains By Julie Mason, Data Fountains Service Manager Johannes Ruscheinski, Lead Programmer Wagner.

Introduction to Library Research Gabriela Scherrer Reference Librarian for English Languages and Literatures, University Library of Bern.

6/16/20151 Recent Results in Automatic Web Resource Discovery Soumen Chakrabartiv Presentation by Cui Tao.

Searching The Web Search Engines are computer programs (variously called robots, crawlers, spiders, worms) that automatically visit Web sites and, starting.

Introducing Symposia : “ The digital repository that thinks like a librarian”

Greenstone Digital Library Usage and Implementation By: Paul Raymond A. Afroilan Network Applications Team Preginet, ASTI-DOST.

SEARCH ENGINES By, CH.KRISHNA MANOJ(Y5CS021), 3/4 B.TECH, VRSEC. 8/7/20151.

Information Architecture Donna Maurer Usability Specialist.

Overview of Search Engines

An introduction to databases In this module, you will learn: What exactly a database is How a database differs from an internet search engine How to find.

New Web of Science Rachel Mangan Customer Education

CONTI’2008, 5-6 June 2008, TIMISOARA 1 Towards a digital content management system Gheorghe Sebestyen-Pal, Tünde Bálint, Bogdan Moscaliuc, Agnes Sebestyen-Pal.

Introduction to Library Research Gabriela Scherrer Reference Librarian for English Languages and Literatures, University Library of Bern.

CS621 : Seminar-2008 DEEP WEB Shubhangi Agrawal ( )‏ Jayalekshmy S. Nair ( )‏

Chapter 7 Web Content Mining Xxxxxx. Introduction Web-content mining techniques are used to discover useful information from content on the web – textual.

Indo-US Workshop, June23-25, 2003 Building Digital Libraries for Communities using Kepler Framework M. Zubair Old Dominion University.

OpenURL Link Resolvers 101

Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.

Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.

7. Approaches to Models of Metadata Creation, Storage and Retrieval Metadata Standards and Applications.

Web Searching Basics Dr. Dania Bilal IS 530 Fall 2009.

Search Engine By Bhupendra Ratha, Lecturer School of Library and Information Science Devi Ahilya University, Indore

Beyond Search Engines: Advanced Web Searching Subject Directories  Librarians’ Index to the Internet  Infomine Finding Databases on a Subject  The Invisible.

Preserving Digital Culture: Tools & Strategies for Building Web Archives : Tools and Strategies for Building Web Archives Internet Librarian 2009 Tracy.

NCSU Libraries Kristin Antelman NCSU Libraries June 24, 2006.

1 Information Retrieval Acknowledgements: Dr Mounia Lalmas (QMW) Dr Joemon Jose (Glasgow)

XP New Perspectives on The Internet, Sixth Edition— Comprehensive Tutorial 3 1 Searching the Web Using Search Engines and Directories Effectively Tutorial.

Search Engines. Search Strategies Define the search topic(s) and break it down into its component parts What terms, words or phrases do you use to describe.

CBSOR,Indian Statistical Institute 30th March 07, ISI,Kokata 1 Digital Repository support for Consortium Dr. Devika P. Madalli Documentation Research &

Introduction to Digital Libraries hussein suleman uct cs honours 2003.

2007. Software Engineering Laboratory, School of Computer Science S E Web-Harvest Web-Harvest: Open Source Web Data Extraction tool 이재정 Software Engineering.

WISER Social Sciences: Finding Quality Information on the Internet Angela Carritt and Penny Schenk Bodleian Law Library.

Topical Categorization of Large Collections of Electronic Theses and Dissertations Venkat Srinivasan & Edward A. Fox Virginia Tech, Blacksburg, VA, USA.

1 A Very Large Digital Library Technology Demonstration William Y. Arms Cornell University.

Search Engines By: Faruq Hasan.

Data Integration Hanna Zhong Department of Computer Science University of Illinois, Urbana-Champaign 11/12/2009.

1 Overview Finding and importing data sets –Searching for data –Importing data_.

Information Retrieval

JISC/NSF PI Meeting, June Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer.

Metadata and Meta tag. What is metadata? What does metadata do? Metadata schemes What is meta tag? Meta tag example Table of Content.

Web 2.0: Making the Web Work for You, Illustrated Unit A: Research 2.0.

Web Search Architecture & The Deep Web

1 CS 430: Information Discovery Lecture 26 Architecture of Information Retrieval Systems 1.

WISER: What’s new in Science SCOPUS, SCIRUS and Google Scholar Kate Williams and Juliet Ralph May 2006.

Chapter 8: Web Analytics, Web Mining, and Social Analytics

General Architecture of Retrieval Systems 1Adrienn Skrop.

SEMINAR ON INTERNET SEARCHING PRESENTED BY:- AVIPSA PUROHIT REGD NO GUIDED BY:- Lect. ANANYA MISHRA.

HOW DOES INFOMINE WEBSITE WORKS ENID FRANCO. INFOMINE is a website in which people at university or college levels can use the web as a library resource.

Information Retrieval in Practice

Information Retrieval

Data Mining Chapter 6 Search Engines

Introduction to Information Retrieval

Information Retrieval and Web Design

Presentation transcript:

iVia and Data Fountains: Open Source Internet Portal System and Metadata Generation Service for Amplifying the Efforts of Subject Experts Julie Mason, Data Fountains Service Manager University of California, Riverside LITA 2004 National Forum St. Louis, Missouri

SECTION I: Technologies and Architectures of iVia and Data Fountains SECTION II: Classification SECTION III: Preview of Data Fountains Interface iVia and Data Fountains

SECTION I:Technologies and Architectures of iVia and Data Fountains iVia and Data Fountains

Technologies and Architecture New and Interactive Collection Building Technologies to Amplify Expert Effort / New Uses of Expertise to Refine Collection Building Technologies  Focused crawling  Rich Text Identification and Harvest  Machine-based Classification  Foundation Record  Hybrid Collections Architecture  Usage in Existing, Closed, Relatively Homogeneous Collections  Cooperative Technology  Appropriately Scaled and Modularly Designed iVia and Data Fountains

 The new open source software platform designed to help INFOMINE and other virtual libraries scale well in terms of amplifying expert effort to enable the development of better and more representative collections.  Automated and semi-automated Internet resource identification and collection is made possible through focused crawling software.  Automated and semi-automated indexing or metadata generation is made possible through classifier software.  A hybrid, two-tiered collection is supported. The first tier are our expert created records and the second tier is made up of machine created records.  Brings together some of the best of expert created virtual library approaches with the best of automated approaches to collection building

iVia and Data Fountains Architecture overview of iVia

iVia and Data Fountains Master Database  MySQL-based SQL database  Contains both expert and robot generated records  Contains metadata (URL’s, subjects, keywords, authors, titles, …)  Contains full text in the form of compressed Web page content

iVia and Data Fountains The Adder Interface  Sophisticated Web interface for expert classification of Web pages  Password-protected with varying privilege levels  Allows both adding of new resources and editing of already existing ones  Has automatic resource duplicate checking  Contains an automatic metadata extractor  Configurable via a preferences screen

iVia and Data Fountains Crawlers  Add robot records to the master database  Assign metadata to crawled records  Three different types of crawlers in iVia/DF  Expert-guided crawler with drill-down and drill-out to crawl single sites  VL-crawler to crawl virtual libraries  Nalanda iVia Focused Crawler (NIFC) to crawl Web communities defined around a given topic

iVia and Data Fountains Search Engine  Public search interface (e.g.  Based on inverted index databases built from the contents of the master database on a nightly basis  Supports sophisticated searching through metadata and full-text  Nested boolean queries, truncated searches, word proximity searches, etc  Search results can be displayed in a wide variety of different themes (skins) that allow collaborating institutions to brand their interface

iVia and Data Fountains (under development this year)  A cooperative, cost-recovery based metadata generation service that will be an array of iVias, one for each participating project or subject community, and which will create metadata records for the participants.  A big emphasis, in addition to fully-automated resource discovery and metadata generation, will be on semi-automated approaches that strongly involve and amplify the efforts of collection experts. They, in turn, work to refine and perfect machine approaches and processes.  The metadata created can be bundled in differing “products” according to differing participant needs in terms of amount of metadata needed, type (natural language or controlled terminology), degree of relevance or comprehensiveness desired (highly relevant records or moderately relevant).

iVia and Data Fountains Architecture overview of DF

iVia and Data Fountains Seed Set Generator  Seed sets are sets of URL’s that define a topic of interest  Seed sets can be supplied in various formats by a client (e.g. simple text file with a list of URL’s)  Typically need around 200 highly topic-specific URL’s  Problem: most users would come up with only a few dozen  Solution: scout module uses a search engine such as Google to fatten up the user-provided initial set

iVia and Data Fountains Nalanda iVia Focused Crawler  Primarily developed by Dr. Soumen Chakrabarti (IIT Bombay), a leading crawler researcher  Sophisticated focused crawler using document classification methods and Web graph analysis techniques to stay on topic  Supports user interaction via URL pattern blacklisting etc  Uses an apprentice classifier to prioritize links that should be followed  Returns a list of URL’s likely to be on the initial seed set topic

iVia and Data Fountains

Distiller  Attempts to rank URL’s returned by the NIFC according to their relevance to the client-provided topic  Uses improved Kleinberg-like Web graph analysis to assign hub and authority values to each URL  Returns scores for each provided URL

iVia and Data Fountains Metadata Exporter  Final stage of DF  Provides clients with convenient data formats to incorporate the best on-topic URL’s into their own databases  Allows different amounts/quality of metadata to be exported based on the client’s selected service model  Supports various export types and file formats (simple URL lists, delimiter-separated file formats, XML file formats, MARC records and export via OAI-PMH)

iVia and Data Fountains Modular Architecture that Supports a Federated Array of Subject Specific Focused Crawlers and Classifiers

 INFOMINE is a virtual library containing over 100,000 links (A hybrid collection containing 26,000 librarian created links and 75,000 plus robot/crawler created links).  Founded in January of 1994 it is one of the first Web-based services offered by a library anywhere.  It is a cooperative effort of librarians from UC Riverside, other UCs (including UCLA, UCSC and the UC Shared Cataloging Project), three California State Universities, Wake Forest University and the University of Detroit. Special cooperative efforts are in process with the Library of Congress and NSDL. iVia and Data Fountains

iVia and Data Fountains SECTION II:Classification

iVia and Data Fountains  LCC: Library of Congress Categories  LCSH: Library of Congress Subject Headings  INFOMINE Subject Categories Biological, Agricultural, and Medical Sciences Business and Economics Cultural Diversity Electronic Journals Government Info Maps and Geographical Information Systems Physical Sciences, Engineering, and Mathematics Social Sciences and Humanities Visual and Performing Arts Classification: Example Subject Categories

iVia and Data Fountains Example

iVia and Data Fountains Example: Korea Rice Genome Database  Is it about… –Geography ? –Agriculture ? –Genetics ?  Which INFOMINE category do we put it in ? –Biological, Agricultural, and Medical Sciences  Pretty obvious, right ? –For humans, yes. But how do we automate it ?

iVia and Data Fountains Automating Document Classification We need a way to measure document similarity Each document is basically just a list of words, so we can count how frequently each word appears in it These word frequencies are one of many possible document attributes Document similarity is mathematically defined in terms of document attributes

iVia and Data Fountains Automating Document Classification  The previous slide contains 51 words –document6 –word, of 3 each –we, a, in, is, each 2 each –All other words 1 each  Note that we consider words such as word and words to be the same  We also don’t care about capitalization  In general, we’d also ignore non-descriptive words such as we, a, of, the, and so on

iVia and Data Fountains Automating Document Classification  Not an easy task –The distribution of words shows that the slide in question is not very rich in content The most frequent word (document) is not very descriptive The most descriptive word (classification) does not appear very frequently in the slide –How descriptive and how frequent a word should be depends on the category  The task is easier when: –we have a large number of content-rich documents –categories are characterized by very specific words which don’t appear very frequently in other categories

iVia and Data Fountains Automating Document Classification  Two documents sharing a large number of category-specific words are considered to be very similar to each other  Document similarity can thus be quantified and computed automatically  Documents can then be ranked by their similarity to each other  A large group of documents that are all very similar to each other can then be considered to define the category they belong to (the set of all such groups is called the Training Corpus)  One way to classify a document is then to put it in the same category as that of the training document that it’s most similar to

iVia and Data Fountains Automating Document Classification  The classification method just described is known as the Nearest Neighbor method  There are other methods, which may be more suited for the classification of documents from the Internet –Naïve Bayes –Support Vector Machine (SVM) –Logistic Regression  Infomine uses a flexible approach – supporting all of these methods – in an attempt to produce highly-accurate classifications

iVia and Data Fountains SECTION III:Preview of Data Fountains Interface