Improving the ETD Landscape ETD 2014: 17th Int’l Symposium on ETDs Leicester, England Edward A. Fox Executive Director, NDLTD, Virginia Tech, Blacksburg, VA USA
Outline Acknowledgments Why, what, who, how Improving, quality Related technical contributions DLs and DL curriculum
Acknowledgments Family, mentors, teachers, students Dissertations: Sung Hee Park, Venkat Srinivasan, Seungwon Yang NSF: IIS , , All those working with ETDs NDLTD, including its Members, Board, Committees, and Working Groups
Why, What, Who? Why? – enhance graduate education – expand global research collaboration What? – help students communicate more effectively – get ETDs for all TDs: next goal 5 million – help make ETDs open, accessible, preserved Who? – levels: students, faculty, staff, (grad) administrators – professions: CS, IT, LIS, librarians, archivists
How? Authoring systems, tools, methods Data and auxiliary information management aids Metadata creation software and techniques Submission, approval, refinement workflows Local access and information management Sharing, disseminating, discovering – OAI, data providers, harvesting – Regional/national, global institutions Services: access, preservation, adding value Add back files
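As an illustration of the OAI harvesting item above, here is a minimal sketch of how a harvester might build OAI-PMH `ListRecords` request URLs. The verb and argument names (`metadataPrefix`, `set`, `from`, `resumptionToken`) come from the OAI-PMH protocol; the function name and the base URL are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                     from_date=None, resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH data provider.

    base_url is a placeholder for a repository's OAI endpoint (assumption).
    """
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it is sent with the verb only, to fetch the next batch.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
        if from_date:
            params["from"] = from_date  # selective harvesting by datestamp
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint, for illustration only.
first_page = list_records_url("https://repository.example.edu/oai",
                              from_date="2014-01-01")
next_page = list_records_url("https://repository.example.edu/oai",
                             resumption_token="abc123")
```

A harvester would fetch `first_page`, parse the returned XML, and keep requesting with each resumption token until the provider returns none.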
Improving – 1 of 2 Context: Quality frameworks, references on quality Guidelines and documentation for all of this Works – XML + PDF + raw/original representations – Multimedia, software, simulations, websites, dynamic content Data, auxiliary information, references/bibliographies – Reproducibility Metadata – Completeness: subject classification, faculty by role – Authority info
Improving – 2 of 2 Local services – Training, assistance – IR, archives, archival consortia Global services – Browse, faceted search, full-text search – Recommend, CLIR, CBIR, summaries, topics – Linked data, hyperlinks, citation linking – Alerts, notifications, RSS feeds, filtering
Borgman et al Information Life Cycle (adapted) Authoring Modifying Classifying Tagging Recommending Indexing Storing Retrieving Distributing Networking Retention / Mining Filtering Using Downloading Citing Discovering
Quality and the Information Life Cycle
Quality Dimensions
Digital Library Service Taxonomy
Improve related movements Make related efforts work for graduate researchers, ETDs, and university ETD activities: Open access, institutional repositories Sharing references and citations: Zotero, … Sharing data, datasets, workflows; reproducible science: reproducibleresearch.net, … Building author profiles: ORCID, ISNI, … Digital libraries and DL education (DL2014)
Related technical contributions Broadly: new/better systems, user/usage studies, added services, improved practices Automatically assign topics or categories to ETDs or to portions (e.g., chapters) to aid browsing and (faceted) searching Build a union reference collection: by aiding authors (e.g., Hiberlink) and/or by automatic ETD text mining Enhanced information retrieval: cross language IR, content based IR (image/video/music) …
Topic determination Given a document, extract or generate generalized description of its topics Statistical approaches, e.g., LDA Knowledge based approaches, e.g., Xpantrac – Take a webpage or document – Use portions of it to build queries to a knowledge source (Web, Wikipedia, and ETD collection) – Combine, analyze, and summarize the results – Seungwon Yang, "Automatic Identification of Topic Tags from Texts Based on Expansion-Extraction Approach", Jan. 2014, Ph.D. dissertation,
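The knowledge-based steps above (build queries from portions of a document, query a knowledge source, combine and summarize the results) can be sketched roughly as follows. This is not the Xpantrac implementation: the small in-memory list stands in for the Web, Wikipedia, or an ETD collection, and the function names, stopword list, and overlap-based search are all illustrative assumptions.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "is", "on", "with", "by"}


def tokenize(text):
    """Lowercase, keep alphabetic runs, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def build_queries(text, size=3):
    """Cut the document into short groups of terms, one query per group."""
    tokens = tokenize(text)
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def search(query, knowledge_source, k=2):
    """Return up to k entries that share the most terms with the query."""
    scored = []
    for entry in knowledge_source:
        overlap = len(set(query) & set(tokenize(entry)))
        if overlap:
            scored.append((overlap, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored[:k]]


def topic_tags(text, knowledge_source, n_tags=3):
    """Expand the text via the knowledge source, then extract frequent terms."""
    counts = Counter()
    for query in build_queries(text):
        for entry in search(query, knowledge_source):
            counts.update(tokenize(entry))
    return [term for term, _ in counts.most_common(n_tags)]


# Toy knowledge source (assumption), standing in for Web or Wikipedia results.
KNOWLEDGE = [
    "digital library metadata harvesting",
    "digital library preservation",
    "protein folding simulation",
]
tags = topic_tags("harvesting metadata for a digital library", KNOWLEDGE)
```

Because tags are drawn from the retrieved entries rather than from the input alone, the output can include terms that never appear in the original document.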
ETD Classification: Venkat Srinivasan Enhance metadata by adding subject categories Hierarchical classification of ETDs (and chapters thereof) using Library of Congress categories Training data – OCLC’s WorldCat: records from 1M books have good labels but little metadata; labels on ETDs not usable – Results coming from queries each designed to describe a category – Need to balance negative and positive examples throughout the LoC taxonomy
ETD Classification: Algorithm Pipeline. Figure components: Category Tree, Google, Document Sets, Training Sets, Naïve Bayes Classifiers, ETD Collection, Categorized ETDs, Web Interface. Figure labels: category label for each node used as query; top 50 webpages (for each node in the tree); cleanup (stemming, stopword removal, etc.); training; level-wise categorization; ETD metadata used for categorization; ETDs categorized into a node of the category tree (after classification); browsing.
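A minimal sketch of level-wise categorization with Naïve Bayes classifiers over a category tree, under stated assumptions: the two-level tree, the one-line training texts (stand-ins for cleaned-up web pages retrieved per node), and all names below are hypothetical, and this is not the pipeline's actual code.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, label, tokens):
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def predict(self, tokens):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy category tree and training texts (assumptions).
TREE = {"root": ["Science", "Arts"], "Science": ["Physics", "Chemistry"]}
TRAINING = {
    "Science": ["experiment theory measurement"],
    "Physics": ["quantum particle energy force"],
    "Chemistry": ["molecule reaction acid compound"],
    "Arts": ["painting music sculpture poetry"],
}


def docs_under(node):
    """Training texts for a node plus all of its descendants."""
    docs = list(TRAINING.get(node, []))
    for child in TREE.get(node, []):
        docs.extend(docs_under(child))
    return docs


def build_classifiers():
    """One classifier per internal node, choosing among that node's children."""
    classifiers = {}
    for parent, children in TREE.items():
        nb = NaiveBayes()
        for child in children:
            for doc in docs_under(child):
                nb.train(child, doc.split())
        classifiers[parent] = nb
    return classifiers


def classify(tokens, classifiers, node="root"):
    """Walk down the tree one level at a time; return the path of labels."""
    path = []
    while node in classifiers:
        node = classifiers[node].predict(tokens)
        path.append(node)
    return path


CLASSIFIERS = build_classifiers()
physics_path = classify("quantum particle energy".split(), CLASSIFIERS)
arts_path = classify("painting poetry".split(), CLASSIFIERS)
```

Training each parent's classifier on the documents of a child and all its descendants is one simple way to keep upper levels of the tree supplied with examples; balancing positive and negative examples across levels is not addressed here.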
Reference Extraction and Databasing 1. How can we implement metadata schema for bibliographic information? 2. What machine learning methods are effective to extract reference sections including footnotes and chapter references? Sung Hee Park, "Discipline-Independent Text Information Extraction from Heterogeneous Styled References Using Knowledge from the Web", June 2013, VT CS Ph.D. dissertation
Dataflow of Reference Section Extraction. Figure components: ETD in PDF, pdf2txt, Feature Extraction (shown twice), Training data, Tagged data, Learning, Reference Section Extraction.
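To make the feature extraction stage concrete, here is a rough sketch of per-line features computed on text already converted from PDF. The specific features and regular expressions are assumptions, and the rule in `extract_reference_section` is only a stand-in for the learned model in the dataflow; it also ignores references that wrap across lines.

```python
import re

YEAR = re.compile(r"\b(19|20)\d{2}\b")
NUMBERED = re.compile(r"^\s*(\[\d+\]|\d+\.)\s")
HEADING = re.compile(r"^\s*(references|bibliography|works cited)\s*$", re.IGNORECASE)


def line_features(line):
    """Features a learner could use to tag a line as part of a reference section."""
    return {
        "is_heading": bool(HEADING.match(line)),
        "has_year": bool(YEAR.search(line)),
        "is_numbered": bool(NUMBERED.match(line)),
        "comma_count": line.count(","),
    }


def extract_reference_section(lines):
    """Rule-based stand-in for the trained model (assumption).

    Starts after the first heading line and keeps lines that look like
    citations, stopping at the first non-blank line that does not.
    """
    start = next((i for i, line in enumerate(lines)
                  if line_features(line)["is_heading"]), None)
    if start is None:
        return []
    section = []
    for line in lines[start + 1:]:
        feats = line_features(line)
        if feats["has_year"] or feats["is_numbered"]:
            section.append(line)
        elif line.strip():
            break
    return section


# Made-up sample text, for illustration only.
SAMPLE = [
    "Chapter 5 Conclusions",
    "We found these results in 2013.",
    "References",
    "[1] A. Author, A Title, 2010.",
    "[2] B. Writer, Another Title, 2012.",
    "Appendix A",
]
refs = extract_reference_section(SAMPLE)
```

In a learned version, the feature dictionaries from tagged training lines would be fed to a classifier, and the same `line_features` function would be applied to unseen ETDs at extraction time.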
ETD References: System Architecture. Figure components: ETD Repository, Users, Web App (e.g., ETD-db), Metadata with References, Searching / Browsing / Manipulating, Extracting Reference Sections, Union ETD References (?).
Discovery, Search Engines, Info. Retrieval (to be extended for images, etc.). Figure components: Documents (D), Query (Q), Search, Ranking, Results; best matches (Q with D) selected. Quality of many systems is low, with recall and precision both at only around 0.5, as opposed to the ideal of 1 for both.
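For reference, the recall and precision figures mentioned above are computed as follows. This is the standard set-based definition; the document identifiers are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved; recall = hits / relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list: 2 of the 4 retrieved documents are relevant,
# and 2 of the 4 relevant documents were retrieved.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5", "d6"])
# p == 0.5 and r == 0.5
```

A perfect system would return exactly the relevant set, giving 1 for both measures.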
Search Module Detail (features can be about text, images, …). Figure components: Query Q, Document D1, feature vector for Q, feature vectors for D1, Similarity Function, S = Sim(Q, D1). In CBIR (Content Based Image Retrieval), search is based on visual content of images – Color – Shape – Texture …
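One common choice for Sim(Q, D1) is cosine similarity between feature vectors. The sketch below pairs it with a coarse color histogram as an example of a CBIR color feature. Both choices are assumptions made for illustration; the slide does not say which similarity function or features are used.

```python
import math


def cosine_similarity(q, d):
    """S = Sim(Q, D): cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0


def color_histogram(pixels, bins=4):
    """Normalized RGB histogram with bins**3 cells.

    pixels is a list of (r, g, b) tuples with values in 0..255.
    """
    hist = [0.0] * bins ** 3
    for r, g, b in pixels:
        # Map each channel value 0..255 to a bin index 0..bins-1.
        idx = (r * bins // 256) * bins * bins + (g * bins // 256) * bins + (b * bins // 256)
        hist[idx] += 1
    total = len(pixels)
    return [h / total for h in hist] if total else hist


def rank(query_vec, docs):
    """Order document names by decreasing similarity to the query vector."""
    return sorted(docs, key=lambda name: cosine_similarity(query_vec, docs[name]),
                  reverse=True)


# Tiny synthetic "images" (assumptions): lists of RGB pixels.
query = color_histogram([(250, 10, 10)] * 4)
collection = {
    "blue_img": color_histogram([(10, 10, 250)] * 4),
    "red_img": color_histogram([(240, 20, 20)] * 3 + [(10, 10, 250)]),
}
ranking = rank(query, collection)
```

The same `rank` function works for text if the feature vectors are term weights instead of histogram cells, which is the sense in which features can be about text or images.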
DL Definitions: Informal 5S DLs are complex systems that help satisfy info needs of users (societies) provide info services (scenarios) organize info in usable ways (structures) present info in usable ways (spaces) communicate info with users (streams) Use this as: checklist, design guidelines, basis for formal description, specification for software implementation; e.g., Spaces help re GIS, VR
Digital Library Books Edward A. Fox and Jonathan P. Leidig, eds. Digital Library Applications: CBIR, Education, Social Networks, eScience/Simulation, and GIS. Morgan & Claypool Publishers, 2014, 175 p., Edward A. Fox and Ricardo da Silva Torres, eds. Digital Library Technologies: Complex Objects, Annotation, Ontologies, Classification, Extraction, and Security. Morgan & Claypool, 2014, 205 p., Rao Shen, Marcos Andre Goncalves, and Edward A. Fox. Key Issues Regarding Digital Libraries: Evaluation and Integration. Morgan & Claypool, 2013, 110 p., Edward A. Fox, Marcos Andre Goncalves, and Rao Shen. Theoretical Foundations for Digital Libraries: The 5S (Societies, Scenarios, Spaces, Structures, Streams) Approach. Morgan & Claypool, 2012, 180 p., supplementary website
DL Curriculum Project NSF awards to VT and UNC-CH: CS and LIS Project server: Wikiversity: ital_Libraries Table 1: Core DL Curriculum Table 2: Information Retrieval Packages Table 3: LucidWorks Big Data Software Table 4: Multimedia Software
DL Curriculum Module Template 1. Module name 2. Scope 3. Learning objectives 4. 5S characteristics of the module (streams, structures, spaces, scenarios, society) 5. Level of effort required (in-class and out-of-class time required for students) 6. Relationships with other modules (flow between modules) 7. Prerequisite knowledge/skills required (what the students need to know prior to beginning the module; completion optional; complete only if prerequisite knowledge/skills are not included in other modules) 8. Introductory remedial instruction (the body of knowledge to be taught for the prerequisite knowledge/skills required; completion optional) 9. Body of knowledge (theory + practice; an outline that could be used as the basis for class lectures) 10. Resources (required readings for students; additional suggested readings for instructor and students) 11. Exercises / Learning activities 12. Evaluation of learning objective achievement (graded exercises or assignments) 13. Glossary 14. Additional useful links 15. Contributors (authors of module, reviewers of module)
DL Curriculum Framework
DL Curriculum Modules - examples Module 1-b: History of digital libraries and library automation Module 2-c: File Formats, Transformation, and Migration Module 3-b: Digitization Module 4-b: Metadata Module 5-a: Architecture overviews …
Summary Scene
Conclusion: Improving together Who will help? What can we do? What knowledge and education is needed? What connections, integrations, collaborations can help with ETDs? Please comment and share! – Ed Fox