Topical Categorization of Large Collections of Electronic Theses and Dissertations Venkat Srinivasan & Edward A. Fox Virginia Tech, Blacksburg, VA, USA.

Slides:



Advertisements
Similar presentations
Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 1: Introduction Alexander Gelbukh
Advertisements

Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan.
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
1 NDLTD Union Catalog Panel Session 1C, Auditorium Introduction and Opening Statement ETD 2011: 14 th Int. Symp. on ETDs Cape Town, South Africa Edward.
Lawrence Webley, Hussein Suleman, Tatenda Chipeperekwa University of Cape Town Department of Computer.
Search Engines. 2 What Are They?  Four Components  A database of references to webpages  An indexing robot that crawls the WWW  An interface  Enables.
A Quality Focused Crawler for Health Information Tim Tang.
ETD’s at the University of Saskatchewan or… David Fox & Darryl Friesen University of Saskatchewan October 4, 2003.
Merging Taxonomies. Assertion Creation and maintenance of large ontologies will require the capability to merge taxonomies This problem is similar to.
 Copyright 2005 Digital Enterprise Research Institute. All rights reserved. 1 The Architecture of a Large-Scale Web Search and Query Engine.
R utgers C ommunity R epository RU CORE Fedora Repository Object Datastreams.
Focused Crawling in Depression Portal Search: A Feasibility Study Thanh Tin Tang (ANU) David Hawking (CSIRO) Nick Craswell (Microsoft) Ramesh Sankaranarayana(ANU)
6/16/20151 Recent Results in Automatic Web Resource Discovery Soumen Chakrabartiv Presentation by Cui Tao.
Automatic Discovery and Classification of search interface to the Hidden Web Dean Lee and Richard Sia Dec 2 nd 2003.
A Topic Specific Web Crawler and WIE*: An Automatic Web Information Extraction Technique using HPS Algorithm Dongwon Lee Database Systems Lab.
Evolution of NBII Search-Based Technologies Oct 24, 2002 Donna Roy USGS Center for Biological Informatics.
Searching The Web Search Engines are computer programs (variously called robots, crawlers, spiders, worms) that automatically visit Web sites and, starting.
Web queries classification Nguyen Viet Bang WING group meeting June 9 th 2006.
Overview of Search Engines
Digital Library Architecture and Technology
CONTI’2008, 5-6 June 2008, TIMISOARA 1 Towards a digital content management system Gheorghe Sebestyen-Pal, Tünde Bálint, Bogdan Moscaliuc, Agnes Sebestyen-Pal.
Web Archives, IDEAL, and PBL Overview Edward A. Fox Digital Library Research Laboratory Dept. of Computer Science Virginia Tech Blacksburg, VA, USA 21.
How to participate in the Union Catalogue Project Hussein Suleman Sivulile – Open Access South Africa Advanced Information Management.
CS621 : Seminar-2008 DEEP WEB Shubhangi Agrawal ( )‏ Jayalekshmy S. Nair ( )‏
Web Categorization Crawler Mohammed Agabaria Adam Shobash Supervisor: Victor Kulikov Winter 2009/10 Design & Architecture Dec
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
Collaborative Research: Curriculum Development for Digital Library Education Presentation in May 1,2006
Detecting Semantic Cloaking on the Web Baoning Wu and Brian D. Davison Lehigh University, USA WWW 2006.
Web Searching Basics Dr. Dania Bilal IS 530 Fall 2009.
SEO  What is it?  Seo is a collection of techniques targeted towards increasing the presence of a website on a search engine.
Topical Crawlers for Building Digital Library Collections Presenter: Qiaozhu Mei.
CITIDEL: Computing & Information Technology Interactive Digital Educational Library Web Page: Contacts: Future.
Markup and Validation Agents in Vijjana – A Pragmatic model for Self- Organizing, Collaborative, Domain- Centric Knowledge Networks S. Devalapalli, R.
Seungwon Yang, Haeyong Chung, Chris North, and Edward A. Fox Virginia Tech, Blacksburg, VA USA 1ETD 2010, June 16-18, Austin, TX.
Autumn Web Information retrieval (Web IR) Handout #0: Introduction Ali Mohammad Zareh Bidoki ECE Department, Yazd University
1 NDLTD Welcome and Introduction ETD 2011: 14 th Int. Symp. on ETDs Cape Town, South Africa Edward A. Fox Executive Director, NDLTD,
Math Information Retrieval Zhao Jin. Zhao Jin. Math Information Retrieval Examples: –Looking for formulas –Collect teaching resources –Keeping updated.
Course grading Project: 75% Broken into several incremental deliverables Paper appraisal/evaluation/project tool evaluation in earlier May: 25%
Endeca: a faceted search solution for the library catalog Kristin Antelman & Emily Lynema UNC University Library Advisory Council June 15, 2006.
The SEE-GRID initiative is co-funded by the European Commission under the FP6 Research Infrastructures contract no SE4SEE A Grid-Enabled Search.
1 ETD Veterans Plenary Panel ETD 2010: 13 th Int. Symp. on ETDs Austin, TX June 18, 2010 Edward A. Fox Executive Director, NDLTD,
IUScholarWorks Technical Overview Randall Floyd Digital Library Program Programmer/Database Administrator.
The World Wide Web: Information Resource. Hock, Randolph. The Extreme Searcher’s Internet Handbook. 2 nd ed. CyberAge Books: Medford. (2007). Internet.
ETD-db: Workflow, the Short Story Edward A. Fox and Gail McMillan Virginia Tech Newcomers’ ETD 2009 University.
VIRGINIA TECH BLACKSBURG CS 4624 MUSTAFA ALY & GASPER GULOTTA CLIENT: MOHAMED MAGDY IDEAL Pages.
Design a full-text search engine for a website based on Lucene
Introduction to Information Retrieval Example of information need in the context of the world wide web: “Find all documents containing information on computer.
Digital Library The networked collections of digital text, documents, images, sounds, scientific data, and software that are the core of today’s Internet.
Digital Libraries1 David Rashty. Digital Libraries2 “A library is an arsenal of liberty” Anonymous.
1 Video Message: Welcome ETD 2015: 18 th Int’l Symposium on ETDs New Delhi, India Edward A. Fox Executive Director, Chairman of the Board NDLTD,
Introduction to Concept Maps Edward A. Fox and Rao Shen CS5604 Fall 2002 “Information Storage & Retrieval” Dept. of Computer Science Virginia Tech, Blacksburg,
- University of North Texas - DSCI 5240 Fall Graduate Presentation - Option A Slides Modified From 2008 Jones and Bartlett Publishers, Inc. Version.
1 ETDs for Life Panel ETD 2014: 17 th Int’l Symposium on ETDs Leicester, England Edward A. Fox Executive Director, NDLTD,
The World Wide Web: Information Resource. How a Search Engine works… How Search Works - YouTube
ESG-CET Meeting, Boulder, CO, April 2008 Gateway Implementation 4/30/2008.
Achieving Semantic Interoperability at the World Bank Designing the Information Architecture and Programmatically Processing Information Denise Bedford.
Development of the West Virginia University Electronic Theses & Dissertations System Presented By Haritha Garapati at ETD the 7 th International.
Bringing Order to the Web : Automatically Categorizing Search Results Advisor : Dr. Hsu Graduate : Keng-Wei Chang Author : Hao Chen Susan Dumais.
1 Improving the ETD Landscape ETD 2014: 17 th Int’l Symposium on ETDs Leicester, England Edward A. Fox Executive Director, NDLTD,
Presented By: Carlton Northern and Jeffrey Shipman The Anatomy of a Large-Scale Hyper-Textural Web Search Engine By Lawrence Page and Sergey Brin (1998)
WEB STRUCTURE MINING SUBMITTED BY: BLESSY JOHN R7A ROLL NO:18.
Crawling When the Google visit your website for the purpose of tracking, Google does this with help of machine, known as web crawler, spider, Google bot,
Information Organization: Overview
Prepared by Rao Umar Anwar For Detail information Visit my blog:
Federated & Meta Search
Panagiotis G. Ipeirotis Tom Barry Luis Gravano
ETDs for Life Panel ETD 2014: 17th Int’l Symposium on ETDs Leicester, England Edward A. Fox Executive Director, NDLTD,
CS & CS Capstone Project & Software Development Project
Information Retrieval and Web Design
Information Organization: Overview
Presentation transcript:

Topical Categorization of Large Collections of Electronic Theses and Dissertations Venkat Srinivasan & Edward A. Fox Virginia Tech, Blacksburg, VA, USA ETD 2009 – June 11, 2009

Outline Introduction Introduction Goals Goals Approach Approach Results Results Future Work Future Work

Introduction – great source Electronic submission of dissertations is increasingly preferred. Electronic submission of dissertations is increasingly preferred. ETDs are a great information source. ETDs are a great information source. Substantial amount of research on a topic Substantial amount of research on a topic Thorough literature review Thorough literature review Pointers to other resources (Reference section) Pointers to other resources (Reference section)

Introduction- under-utilized Yet ETDs are under-utilized. Yet ETDs are under-utilized. Research papers, books etc. are still major (and in some cases the only) sources of information for most people. Research papers, books etc. are still major (and in some cases the only) sources of information for most people. Most people (except grad students trained in this) don’t even think about reading a dissertation! Most people (except grad students trained in this) don’t even think about reading a dissertation! Possible causes Possible causes Access to ETDs not streamlined. Access to ETDs not streamlined. Users don’t know where to look for ETDs. Users don’t know where to look for ETDs. ETDs of interest could be buried in search engine results. ETDs of interest could be buried in search engine results. Some universities do not allow outside access to their ETD collection. Some universities do not allow outside access to their ETD collection.

Introduction - needs Efforts have been made to make ETDs more accessible. Efforts have been made to make ETDs more accessible. NDLTD, VTLS, Scirus, etc. provide means of access to ETDs from different universities. NDLTD, VTLS, Scirus, etc. provide means of access to ETDs from different universities. Not very feature rich and convenient: Not very feature rich and convenient: Users search for ETDs based on keywords. Users search for ETDs based on keywords. Don’t know what lies underneath (no idea about the size, topical coverage, etc. of ETD collections) Don’t know what lies underneath (no idea about the size, topical coverage, etc. of ETD collections) Not very amenable to browsing (users have to sift through search results) Not very amenable to browsing (users have to sift through search results)

Goals Provide a portal to ETD collections of more different universities Provide a portal to ETD collections of more different universities Provide value added services Provide value added services Categorize by topic Categorize by topic Support searching and browsing the collection using various criteria (by topic, keywords, date, author, etc.) Support searching and browsing the collection using various criteria (by topic, keywords, date, author, etc.)

Goals - priorities Set up infrastructure for crawling ETDs of various universities Set up infrastructure for crawling ETDs of various universities Come up with techniques for categorizing them into topical areas Come up with techniques for categorizing them into topical areas Set up a user-friendly search and browse interface Set up a user-friendly search and browse interface

Approach Crawl ETDs from various universities Crawl ETDs from various universities Develop a taxonomy Develop a taxonomy Categorize ETDs into topics in the taxonomy tree Categorize ETDs into topics in the taxonomy tree Index the ETDs Index the ETDs Develop a search and browse interface Develop a search and browse interface

Approach - crawling NDLTD’s Union Catalog - as starting point NDLTD’s Union Catalog - as starting point Dublin Core metadata gathered Dublin Core metadata gathered URLs used to crawl ETDs and other data from the respective universities’ websites URLs used to crawl ETDs and other data from the respective universities’ websites Custom crawlers written Custom crawlers written Technologies used: Perl, and other open source Perl libraries (WWW, Mechanize, etc.) Technologies used: Perl, and other open source Perl libraries (WWW, Mechanize, etc.) All metadata (Dublin Core metadata from Union Catalog, and the metadata obtained from respective universities) is stored in our MySQL backend database. All metadata (Dublin Core metadata from Union Catalog, and the metadata obtained from respective universities) is stored in our MySQL backend database.

Approach - taxonomy Need medium generality and specificity, as opposed to those from Proquest, DMOZ, or Wikipedia Need medium generality and specificity, as opposed to those from Proquest, DMOZ, or Wikipedia For example, DMOZ has more than 500,000 nodes ! For example, DMOZ has more than 500,000 nodes ! Solution? Solution? Prune the DMOZ category tree, and then enhance it using Proquest categorization system. Prune the DMOZ category tree, and then enhance it using Proquest categorization system.

TOP BusinessArtsComputersHealthScienceSociety Category Tree (only top 2 levels shown) Approach – taxonomy levels Level 1 Level 2

Approach - categorize Supervised classification approach used Supervised classification approach used Training set built by using topic labels as query to Google Training set built by using topic labels as query to Google 50 webpages retrieved and used for training Naïve Baye’s classifier for each node (to distinguish between its children) 50 webpages retrieved and used for training Naïve Baye’s classifier for each node (to distinguish between its children) ETD metadata used for categorization ETD metadata used for categorization Level-wise categorization Level-wise categorization

Category Tree Document Sets GoogleNaïve Bayes Classifiers Training Sets Web Interface ETD Collection Categorized ETDs Category label for each node used as query Top 50 webpages (for each node in the tree) Cleanup (stemming, stopword removal, etc.) Level-wise categorization ETD metadata used for categorization Browsing Training ETDs categorized into a node of the category tree (after classification) Approach (contd.) Algorithm Pipeline

Results Crawled metadata for all the ETDs from the NDLTD Union Catalog Crawled metadata for all the ETDs from the NDLTD Union Catalog ~800,000 ETDs in Union Catalog ~800,000 ETDs in Union Catalog 15 Dublin Core fields extracted and stored 15 Dublin Core fields extracted and stored Crawled ~200,000 dissertations from the respective universities (where permissible) and indexing is in progress Crawled ~200,000 dissertations from the respective universities (where permissible) and indexing is in progress Technology used: Lucene search engine Technology used: Lucene search engine More dissertations being crawled More dissertations being crawled

Results (contd.) Enhanced taxonomy developed Enhanced taxonomy developed Some subtrees are shown in the following few slides. Some subtrees are shown in the following few slides. The taxonomy currently is 4 levels deep and has ~200 nodes. The taxonomy currently is 4 levels deep and has ~200 nodes. It is being enhanced to be 5-6 levels deep. It is being enhanced to be 5-6 levels deep.

Results (contd.) Arts SpeechLiteratureArt History Classical Studies Visual Arts Performing Arts …… Enhanced taxonomy (some nodes from the “Arts” subtree shown) Enhanced taxonomy (some nodes from the “Arts” subtree shown) Level 2 Level 3

Results (contd.) Enhanced taxonomy (some nodes from the “Business” subtree shown) Enhanced taxonomy (some nodes from the “Business” subtree shown) Business E-CommerceAccounting Human Resources InvestingBankingManagement …… Level 2 Level 3

Results (contd.) Categorized >74K ETDs from 8 universities Categorized >74K ETDs from 8 universities MIT, Virginia Tech, Caltech, NCSU, Georgia Tech, Ohiolink, Rice, Texas A&M MIT, Virginia Tech, Caltech, NCSU, Georgia Tech, Ohiolink, Rice, Texas A&M Categorized into 5 topical areas (Arts, Business, Computers, Health, Science, Society) Categorized into 5 topical areas (Arts, Business, Computers, Health, Science, Society) Categorization into lower levels of category tree (levels 3 and 4, that is) is in progress Categorization into lower levels of category tree (levels 3 and 4, that is) is in progress

Results (contd.) Name of the University Total No. of ETDs Category ArtsBusinessComputersHealthScienceSociety MIT Virginia Tech Ohiolink Rice NCSU Texas A&M CalTech Georgia Tech TOTAL

Results (contd.) Algorithm is time efficient. Algorithm is time efficient. Training the classifier is done offline. Training the classifier is done offline. Classification is fast. Classification is fast. Classifying this collection of ~74,000 ETDs took <30 mins. Classifying this collection of ~74,000 ETDs took <30 mins. Hopefully classifiers developed can be applied to other data and in other systems. Hopefully classifiers developed can be applied to other data and in other systems.

Future Work Increase coverage Increase coverage Crawl more ETDs Crawl more ETDs Collaborate with universities and consortia to gain access to ETD collections Collaborate with universities and consortia to gain access to ETD collections Better categorization approaches Better categorization approaches Leverage query expansion techniques to build training set Leverage query expansion techniques to build training set Web interface to facilitate browsing and search Web interface to facilitate browsing and search User studies to measure the efficacy of the system User studies to measure the efficacy of the system

Questions ? Demo info available at