Presentation is loading. Please wait.

Presentation is loading. Please wait.

Topical Categorization of Large Collections of Electronic Theses and Dissertations Venkat Srinivasan & Edward A. Fox Virginia Tech, Blacksburg, VA, USA.

Similar presentations


Presentation on theme: "Topical Categorization of Large Collections of Electronic Theses and Dissertations Venkat Srinivasan & Edward A. Fox Virginia Tech, Blacksburg, VA, USA."— Presentation transcript:

1 Topical Categorization of Large Collections of Electronic Theses and Dissertations Venkat Srinivasan & Edward A. Fox Virginia Tech, Blacksburg, VA, USA ETD 2009 – June 11, 2009

2 Outline Introduction Introduction Goals Goals Approach Approach Results Results Future Work Future Work

3 Introduction – great source Electronic submission of dissertations is increasingly preferred. Electronic submission of dissertations is increasingly preferred. ETDs are a great information source. ETDs are a great information source. Substantial amount of research on a topic Substantial amount of research on a topic Thorough literature review Thorough literature review Pointers to other resources (Reference section) Pointers to other resources (Reference section)

4 Introduction- under-utilized Yet ETDs are under-utilized. Yet ETDs are under-utilized. Research papers, books etc. are still major (and in some cases the only) sources of information for most people. Research papers, books etc. are still major (and in some cases the only) sources of information for most people. Most people (except grad students trained in this) don’t even think about reading a dissertation! Most people (except grad students trained in this) don’t even think about reading a dissertation! Possible causes Possible causes Access to ETDs not streamlined. Access to ETDs not streamlined. Users don’t know where to look for ETDs. Users don’t know where to look for ETDs. ETDs of interest could be buried in search engine results. ETDs of interest could be buried in search engine results. Some universities do not allow outside access to their ETD collection. Some universities do not allow outside access to their ETD collection.

5 Introduction - needs Efforts have been made to make ETDs more accessible. Efforts have been made to make ETDs more accessible. NDLTD, VTLS, Scirus, etc. provide means of access to ETDs from different universities. NDLTD, VTLS, Scirus, etc. provide means of access to ETDs from different universities. Not very feature rich and convenient: Not very feature rich and convenient: Users search for ETDs based on keywords. Users search for ETDs based on keywords. Don’t know what lies underneath (no idea about the size, topical coverage, etc. of ETD collections) Don’t know what lies underneath (no idea about the size, topical coverage, etc. of ETD collections) Not very amenable to browsing (users have to sift through search results) Not very amenable to browsing (users have to sift through search results)

6 Goals Provide a portal to ETD collections of more different universities Provide a portal to ETD collections of more different universities Provide value added services Provide value added services Categorize by topic Categorize by topic Support searching and browsing the collection using various criteria (by topic, keywords, date, author, etc.) Support searching and browsing the collection using various criteria (by topic, keywords, date, author, etc.)

7 Goals - priorities Set up infrastructure for crawling ETDs of various universities Set up infrastructure for crawling ETDs of various universities Come up with techniques for categorizing them into topical areas Come up with techniques for categorizing them into topical areas Set up a user-friendly search and browse interface Set up a user-friendly search and browse interface

8 Approach Crawl ETDs from various universities Crawl ETDs from various universities Develop a taxonomy Develop a taxonomy Categorize ETDs into topics in the taxonomy tree Categorize ETDs into topics in the taxonomy tree Index the ETDs Index the ETDs Develop a search and browse interface Develop a search and browse interface

9 Approach - crawling NDLTD’s Union Catalog - as starting point NDLTD’s Union Catalog - as starting point Dublin Core metadata gathered Dublin Core metadata gathered URLs used to crawl ETDs and other data from the respective universities’ websites URLs used to crawl ETDs and other data from the respective universities’ websites Custom crawlers written Custom crawlers written Technologies used: Perl, and other open source Perl libraries (WWW, Mechanize, etc.) Technologies used: Perl, and other open source Perl libraries (WWW, Mechanize, etc.) All metadata (Dublin Core metadata from Union Catalog, and the metadata obtained from respective universities) is stored in our MySQL backend database. All metadata (Dublin Core metadata from Union Catalog, and the metadata obtained from respective universities) is stored in our MySQL backend database.

10 Approach - taxonomy Need medium generality and specificity, as opposed to those from Proquest, DMOZ, or Wikipedia Need medium generality and specificity, as opposed to those from Proquest, DMOZ, or Wikipedia For example, DMOZ has more than 500,000 nodes ! For example, DMOZ has more than 500,000 nodes ! Solution? Solution? Prune the DMOZ category tree, and then enhance it using Proquest categorization system. Prune the DMOZ category tree, and then enhance it using Proquest categorization system.

11 TOP BusinessArtsComputersHealthScienceSociety Category Tree (only top 2 levels shown) Approach – taxonomy levels Level 1 Level 2

12 Approach - categorize Supervised classification approach used Supervised classification approach used Training set built by using topic labels as query to Google Training set built by using topic labels as query to Google 50 webpages retrieved and used for training Naïve Baye’s classifier for each node (to distinguish between its children) 50 webpages retrieved and used for training Naïve Baye’s classifier for each node (to distinguish between its children) ETD metadata used for categorization ETD metadata used for categorization Level-wise categorization Level-wise categorization

13 Category Tree Document Sets GoogleNaïve Bayes Classifiers Training Sets Web Interface ETD Collection Categorized ETDs Category label for each node used as query Top 50 webpages (for each node in the tree) Cleanup (stemming, stopword removal, etc.) Level-wise categorization ETD metadata used for categorization Browsing Training ETDs categorized into a node of the category tree (after classification) Approach (contd.) Algorithm Pipeline

14 Results Crawled metadata for all the ETDs from the NDLTD Union Catalog Crawled metadata for all the ETDs from the NDLTD Union Catalog ~800,000 ETDs in Union Catalog ~800,000 ETDs in Union Catalog 15 Dublin Core fields extracted and stored 15 Dublin Core fields extracted and stored Crawled ~200,000 dissertations from the respective universities (where permissible) and indexing is in progress Crawled ~200,000 dissertations from the respective universities (where permissible) and indexing is in progress Technology used: Lucene search engine Technology used: Lucene search engine More dissertations being crawled More dissertations being crawled

15 Results (contd.) Enhanced taxonomy developed Enhanced taxonomy developed Some subtrees are shown in the following few slides. Some subtrees are shown in the following few slides. The taxonomy currently is 4 levels deep and has ~200 nodes. The taxonomy currently is 4 levels deep and has ~200 nodes. It is being enhanced to be 5-6 levels deep. It is being enhanced to be 5-6 levels deep.

16 Results (contd.) Arts SpeechLiteratureArt History Classical Studies Visual Arts Performing Arts …… Enhanced taxonomy (some nodes from the “Arts” subtree shown) Enhanced taxonomy (some nodes from the “Arts” subtree shown) Level 2 Level 3

17 Results (contd.) Enhanced taxonomy (some nodes from the “Business” subtree shown) Enhanced taxonomy (some nodes from the “Business” subtree shown) Business E-CommerceAccounting Human Resources InvestingBankingManagement …… Level 2 Level 3

18 Results (contd.) Categorized >74K ETDs from 8 universities Categorized >74K ETDs from 8 universities MIT, Virginia Tech, Caltech, NCSU, Georgia Tech, Ohiolink, Rice, Texas A&M MIT, Virginia Tech, Caltech, NCSU, Georgia Tech, Ohiolink, Rice, Texas A&M Categorized into 5 topical areas (Arts, Business, Computers, Health, Science, Society) Categorized into 5 topical areas (Arts, Business, Computers, Health, Science, Society) Categorization into lower levels of category tree (levels 3 and 4, that is) is in progress Categorization into lower levels of category tree (levels 3 and 4, that is) is in progress

19 Results (contd.) Name of the University Total No. of ETDs Category ArtsBusinessComputersHealthScienceSociety MIT29804653184765073757141555 Virginia Tech11976742627266512183317340 Ohiolink80201056350126713222887345 Rice66859372351181145241262 NCSU502628324514195122436114 Texas A&M483430236313635662115125 CalTech47745852139229309618 Georgia Tech358232133134885123323 TOTAL7470140633852171424252246371582

20 Results (contd.) Algorithm is time efficient. Algorithm is time efficient. Training the classifier is done offline. Training the classifier is done offline. Classification is fast. Classification is fast. Classifying this collection of ~74,000 ETDs took <30 mins. Classifying this collection of ~74,000 ETDs took <30 mins. Hopefully classifiers developed can be applied to other data and in other systems. Hopefully classifiers developed can be applied to other data and in other systems.

21 Future Work Increase coverage Increase coverage Crawl more ETDs Crawl more ETDs Collaborate with universities and consortia to gain access to ETD collections Collaborate with universities and consortia to gain access to ETD collections Better categorization approaches Better categorization approaches Leverage query expansion techniques to build training set Leverage query expansion techniques to build training set Web interface to facilitate browsing and search Web interface to facilitate browsing and search User studies to measure the efficacy of the system User studies to measure the efficacy of the system

22 Questions ? svenkat@vt.edu fox@vt.edu Demo info available at http://fox.cs.vt.edu/etdbrowse/


Download ppt "Topical Categorization of Large Collections of Electronic Theses and Dissertations Venkat Srinivasan & Edward A. Fox Virginia Tech, Blacksburg, VA, USA."

Similar presentations


Ads by Google