Analyzing the NSDL Collection
Peter Shin, Charles Cowart, Tony Fountain, Reagan Moore
San Diego Supercomputer Center


Introduction
  Education Impact and Evaluation Standing Committee (EIESC)
  Goal: Characterize the contents of the NSDL collection by topic, audience, and type.
  Motivation:
    Enable search of the NSDL collection by subject, audience, and type.
    Inform future collection activities.

Research Focus
  Intelligent Information Retrieval (IR) for the NSDL community: provide efficient discovery of and access to relevant materials.
  Support queries on audience, topic, and type.

Information Retrieval Techniques
  Keyword searches
    Find a document that contains the list of words in the search.
    Example: digestive system -> “Digestive System Web Resources for Students”
  Relevance-based search (probabilistic approach)
    Rather than matching the exact keywords, find documents containing words similar in meaning to those in the query.
    Example: break down food -> NCC Food Group’s Food Stuff; break down food body -> “Squirrel Tales”
  Text categorization
    Given the contents of a document, assign topic labels.
  Hybrid or combination approaches
    Mix two or more of the above.
    Example: first keyword search, then text categorization
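The keyword search described above can be sketched with a small inverted index. This is only an illustration, not the testbed's implementation: the function names `build_index` and `keyword_search` and the document ids are hypothetical, and the documents are reduced to the titles quoted in the examples.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def keyword_search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical mini-collection built from the titles used in the examples.
docs = {
    "d1": "Digestive System Web Resources for Students",
    "d2": "NCC Food Group's Food Stuff",
    "d3": "Squirrel Tales",
}
index = build_index(docs)
print(keyword_search(index, "digestive system"))
print(keyword_search(index, "break down food"))
```

With exact matching, the query "break down food" finds nothing in this mini-collection because no title contains "break" or "down", which is the gap a relevance-based search is meant to close.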

Challenges
  Not enough metadata, e.g., audience type (intended grade level)
  Need metadata standards: there is no structure in the metadata
    Example: “Odd/Even Number” has the subject “number senses” but is not labeled mathematics, algebra, or number theory
  No concept map or ontology to capture complex topic relationships
    Example: relationship between algebra and calculus
  Need to make annotation easy and accurate
    Assist or automate labeling the documents with standard annotations
  Possible errors in the existing hand-labeled documents
    Example: mismatch between the contents of the front pages and their metadata
  Computationally intensive
    Over 20,000 HTML documents with over 1,300,000 unique terms

Suggestions
  Involve the community to define methods, sample queries, and evaluation criteria; generate metadata; and perform comparative studies.
  Create a training/testing data set, annotated and checked for correctness, that can be used by all researchers.
  Provide a forum for sharing methods and results.
  Build an evaluation testbed: collect data, algorithms, tools, and results, plus hardware and software, and provide an online web portal interface to the resources.

Status of the NSDL testbed at SDSC
  Monthly web crawl of the NSDL sites
  Persistent archive of the harvested materials
  Processing pipeline for various IR techniques
  Software resources:
    Storage Resource Broker (SRB)
    NSDL Archive Service (web crawling)
    Various processing pipeline scripts that can run in parallel
    SVMlight by Thorsten Joachims
    Latent Semantic Indexing from Telcordia
    Latent Dirichlet Allocation by David Blei from UC Berkeley
    Cheshire: online catalog and full-text retrieval system (from UC Berkeley and University of Liverpool)
  Hardware resources:
    IBM DataStar: supercomputer with 10.4 teraflops of computing power
    TeraGrid: collection of supercomputers with high-throughput communication

Summary
  Metadata evaluation is important and challenging.
  Information retrieval techniques are promising.
  NSDL community involvement is necessary to define evaluation methods.
  A collaborative testbed would facilitate analysis.
  An initial testbed is under development at SDSC.

Latent Semantic Indexing (LSI)
  Assumption: If documents have many words in common, the documents are closely related.
  Applications:
    Search engine
    Archivist’s assistance
    Automated writing assessment
    Information filtering
  Drawbacks:
    Not scalable
    No incremental update
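The assumption behind LSI can be illustrated with plain cosine similarity between word-count vectors. This sketch is not LSI itself (LSI additionally applies a singular value decomposition to the term-document matrix); it only shows how shared words translate into a relatedness score. The helper names and the three sample texts are hypothetical.

```python
import math
from collections import Counter

def term_vector(text):
    """Represent a document as a bag of lowercase word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical sample documents.
d1 = term_vector("algebra equations and variables")
d2 = term_vector("solving algebra equations")
d3 = term_vector("squirrel habitat and food")

print(cosine(d1, d2) > cosine(d1, d3))  # documents sharing more words score higher
```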

Clustering before LSI
  Idea: Instead of searching in the whole space, search within the concept space.
  Task:
    Define the levels of granularity in the document space.
    Cluster the documents according to the concept space.
    Apply LSI within a cluster.

Process
  HTML documents
  Strip formatting
  Pick out content words using “stop lists”
  Stemming
  List of words
  Discard words that appear too frequently or too sparsely
  Term weighting
  Term-document matrix (each document in the matrix is a “vector”)
  Build concept clusters
  Apply LSI within a cluster
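The preprocessing steps on this slide, up to the term-document matrix, can be sketched as follows. Everything specific here is an assumption made for illustration: the tiny stop list, the crude suffix-stripping stand-in for a real stemmer, the `min_df` and `max_df_ratio` thresholds, and the choice of TF-IDF as the term weighting (the slide says only "term weighting").

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}  # hypothetical stop list

def strip_formatting(html):
    """Replace HTML tags with spaces, leaving only the text."""
    return re.sub(r"<[^>]+>", " ", html)

def crude_stem(word):
    """Stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def tokenize(html):
    """Strip formatting, drop stop words, and stem the remaining words."""
    words = re.findall(r"[a-z]+", strip_formatting(html).lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]

def term_document_matrix(html_docs, min_df=1, max_df_ratio=1.0):
    """Return (vocabulary, matrix); each matrix row is one document vector."""
    token_lists = [tokenize(d) for d in html_docs]
    n = len(token_lists)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for tokens in token_lists for t in set(tokens))
    # Discard terms that are too sparse or too frequent.
    vocab = sorted(t for t, c in df.items()
                   if c >= min_df and c / n <= max_df_ratio)
    matrix = []
    for tokens in token_lists:
        tf = Counter(tokens)
        matrix.append([tf[t] * math.log(n / df[t]) for t in vocab])  # TF-IDF
    return vocab, matrix

# Hypothetical sample pages.
pages = [
    "<p>The digestive system digests food</p>",
    "<b>Food</b> groups and food safety",
    "<h1>Squirrels</h1> eating food",
]
vocab, matrix = term_document_matrix(pages, max_df_ratio=0.9)
print(vocab)
```

In this sample, "food" occurs in every page and is therefore discarded as too frequent; clustering and LSI would then operate on the rows of `matrix`.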

Levels of Granularity
  Collection
  Document
  Section
  Subsection

Building a Concept Space
  [Figure: granules at the collection, document, section, and subsection levels, showing co-adjacent granules and progressively finer granularity]

Hypotheses
  Definition:
    Significant terms: defined by the frequency of words in a granule
  Hypotheses:
    As the granularity becomes finer, the number of significant terms in a granule goes down.
    Within one granule, the significant terms that overlap with the surrounding space decrease as one moves farther from it.
    The appropriate level of granularity for a piece of knowledge is the one at which the number of significant terms is at its maximum and the number of overlapping significant terms around the space is at its minimum.
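A minimal sketch of how significant terms and their overlap between granules might be counted. The slide does not say how frequency determines significance, so the `min_count` threshold, the function names, and the toy sections are all assumptions.

```python
from collections import Counter

def significant_terms(text, min_count=2):
    """Terms whose frequency within the granule reaches the (assumed) threshold."""
    counts = Counter(text.lower().split())
    return {term for term, count in counts.items() if count >= min_count}

def overlap(granule_a, granule_b, min_count=2):
    """Significant terms shared by two granules."""
    return (significant_terms(granule_a, min_count)
            & significant_terms(granule_b, min_count))

# Hypothetical document made of two sections (finer granules).
sections = [
    "algebra equation algebra equation variable variable",
    "calculus limit calculus limit variable variable",
]
document = " ".join(sections)

print(len(significant_terms(document)))     # coarser granule
print(len(significant_terms(sections[0])))  # finer granule
print(overlap(sections[0], sections[1]))    # terms shared by adjacent granules
```

On this toy input the document-level granule has more significant terms than either section, in line with the first hypothesis; real data would be needed to test it.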

Sample Data
  Web crawl: for each document, the web crawler written by Charles Cowart gathers content up to 20 levels deep.
  Size: 200 GB and 1.7 million files