When the subjects of metadata embrace the statistical learning

Presentation transcript:

When the subjects of metadata embrace the statistical learning Anlin Yang, East Asian Cataloging Librarian University of Iowa Libraries

INVESTIGATION
The growing number of Chinese materials
The new change of subjects
IMPLEMENTATION
Current statistical learning methods/frameworks
The assumptions of applying statistical learning to the subjects of metadata
Note: Universally important, but share some data. After that, my challenges and choices on two aspects.

BACKGROUND: The Number of Chinese Volumes in U.S. University Libraries
Public University Libraries (25 Institutions)
Source: CEAL Statistics Data

BACKGROUND: The Number of Chinese Volumes in U.S. University Libraries
Private University Libraries (16 Institutions)
Source: CEAL Statistics Data

BACKGROUND: Satisfaction with Searching Non-Roman Scripts
Satisfaction with using controlled English subjects to find non-Roman scripts
Source: El-Sherbini, M., & Chen, S. (2011). An assessment of the need to provide non-Roman subject access to the library online catalog. Cataloging & Classification Quarterly, 49(6), 457-483. http://dx.doi.org/10.1080/01639374.2011.603108

BACKGROUND: Release Timeline of Some Controlled Vocabularies and Thesauri
1898: LCSH
1940: MeSH (for medicine)
1996: ERIC Thesaurus (for education)
1997: Transportation Research Thesaurus (based on NCHRP 20-32(2))
1998: Getty Vocabularies: The Art & Architecture Thesaurus (AAT), The Getty Thesaurus of Geographic Names (TGN), The Cultural Objects Name Authority (CONA), The Union List of Artist Names (ULAN)
2016: COAR Vocabularies: Resource type, to identify the genre of a research resource (October 2016); Access mode, to declare the degree of "openness" of a resource (draft, May 2017)
Trends: increasing update frequency; more detailed subject classification

What challenges do we meet with the subjects of metadata?
The continuously growing number of Chinese materials: how fast can we manage this metadata?
The language barriers in searching Chinese resources: how easily can we move between different languages?
The rapid changes in academic studies: how can we keep learning brand-new academic knowledge?
One hand: RDA; the other hand: original

Core Ideas for Us
Segmentation
Words / phrases discovery
Build lexicon

METHODOLOGY: Segmentation
Word Dictionary Model (WDM)
Diagram labels: sentences; machine reading; words; machine grabbing; a prior dictionary; matching with a prior dictionary; a ranking list of words
Source: Olivier, D. C. (1968). Stochastic grammars and language acquisition mechanisms: a thesis. Harvard University.
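
As a rough illustration of "matching with a prior dictionary" and producing "a ranking list of words", the sketch below uses forward maximum matching, a simple stand-in technique and not the Word Dictionary Model itself. The dictionary, the sample sentences, and the function name `forward_max_match` are all hypothetical.

```python
from collections import Counter

def forward_max_match(sentence, dictionary, max_len=4):
    """Greedy left-to-right matching: take the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words

# Hypothetical prior dictionary and sentences, for illustration only.
prior_dictionary = {"土地", "公有", "政策", "图书馆"}
sentences = ["土地政策", "公有土地", "图书馆政策"]

# Count matched words across all sentences and rank them by frequency.
counts = Counter(w for s in sentences for w in forward_max_match(s, prior_dictionary))
ranking = counts.most_common()
print(ranking)
```

Greedy matching is fast but cannot resolve ambiguous strings on its own, which is one reason to move to the statistical methods on the following slides.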

METHODOLOGY: Words / Phrases Discovery
The ambiguity of Chinese word segmentation
"土地公有政策" (Policy of Public Ownership of Land)
Correct segmentation: 土地 公有 政策 (land, public ownership, policy)
Another possible segmentation: 土地公 有 政策 (the earth god, has, policy)
Source: Chang, J. S., & Su, K. Y. (1997). An unsupervised iterative method for Chinese new lexicon extraction. International Journal of Computational Linguistics & Chinese Language Processing, 2(2), 97-148.
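
The ambiguity can be reproduced with a few lines of code. This is a minimal sketch that lists every segmentation consistent with a small dictionary; the dictionary contents and the function name `all_segmentations` are assumptions made for the example.

```python
def all_segmentations(text, dictionary):
    """Return every way to split text into words that all appear in dictionary."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            # Keep this head and segment the remainder recursively.
            for tail in all_segmentations(text[end:], dictionary):
                results.append([head] + tail)
    return results

# Hypothetical dictionary containing both the intended and the misleading words.
dictionary = {"土地", "土地公", "公有", "有", "政策"}
candidates = all_segmentations("土地公有政策", dictionary)
for seg in candidates:
    print(" ".join(seg))
```

Both readings from the slide come out as valid candidates, so a dictionary alone cannot choose between them; some score, such as a likelihood, is needed.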

METHODOLOGY: Segmentation
Top-down: Expectation-Maximization (EM) algorithm
Diagram labels: Parameters; Rethink
Source: Do, C. B., & Batzoglou, S. (2008). What is the expectation maximization algorithm? Nature Biotechnology, 26(8), 897. doi:10.1038/nbt1406
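
For reference, the two alternating steps of EM can be written in their standard general form. The symbols are assumptions chosen for this context and are not taken from the slide: X is the observed unsegmented text, Z is the hidden segmentation, and θ is the set of word probabilities.

```latex
\begin{align}
\text{E-step:}\quad & Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X,\, \theta^{(t)}}\big[\log p(X, Z \mid \theta)\big] \\
\text{M-step:}\quad & \theta^{(t+1)} = \arg\max_{\theta}\; Q(\theta \mid \theta^{(t)})
\end{align}
```

The two steps repeat until the parameters stop changing appreciably.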

METHODOLOGY: Segmentation
The EM algorithm and Chinese word discovery
Grab a sentence
Assume as many ways to segment it as possible
Share these segmentations with the other sentences
Acquire the segmentation with the highest likelihood
Source: Ge, X., Pratt, W., & Smyth, P. (1999, August). Discovering Chinese words from unsegmented text. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval (pp. 271-272). ACM.
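
The steps above can be sketched as a toy EM loop over a unigram word model. This is an illustrative simplification, not the implementation from Ge et al.: it enumerates segmentations by brute force, whereas practical systems use dynamic programming and log probabilities. The sentences, `max_len`, and all function names are hypothetical.

```python
from collections import defaultdict

def segmentations(text, max_len):
    """Yield every split of text into pieces of at most max_len characters."""
    if not text:
        yield []
        return
    for i in range(1, min(max_len, len(text)) + 1):
        for rest in segmentations(text[i:], max_len):
            yield [text[:i]] + rest

def em_word_discovery(sentences, max_len=2, iterations=5):
    # Candidate words: every substring of length 1..max_len, initially equiprobable.
    candidates = {s[i:j] for s in sentences for i in range(len(s))
                  for j in range(i + 1, min(i + max_len, len(s)) + 1)}
    theta = {w: 1.0 / len(candidates) for w in candidates}
    for _ in range(iterations):
        expected = defaultdict(float)
        for s in sentences:
            # E-step: weight each segmentation by the product of its word probabilities.
            segs = list(segmentations(s, max_len))
            weights = []
            for seg in segs:
                p = 1.0
                for w in seg:
                    p *= theta[w]
                weights.append(p)
            total = sum(weights)
            for seg, p in zip(segs, weights):
                for w in seg:
                    expected[w] += p / total
        # M-step: re-estimate word probabilities from the expected counts.
        norm = sum(expected.values())
        theta = {w: expected[w] / norm for w in candidates}
    return theta

# Hypothetical unsegmented "sentences" that share words with each other.
sentences = ["土地公有", "土地政策", "公有政策"]
theta = em_word_discovery(sentences)
for word, prob in sorted(theta.items(), key=lambda kv: -kv[1])[:3]:
    print(word, round(prob, 3))
```

Because the word probabilities are shared across sentences, strings that recur as units (such as 土地) gain probability, while accidental strings that straddle a word boundary (such as 地公) lose it.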

METHODOLOGY: Build Lexicon
TopWORDS: The unsupervised analysis of Chinese texts
Source: Deng, K., Bol, P. K., Li, K. J., & Liu, J. S. (2016). On the unsupervised analysis of domain-specific Chinese texts. Proceedings of the National Academy of Sciences, 113(22), 6154-6159. https://doi.org/10.1073/pnas.1516510113

METHODOLOGY: Build Lexicon
TopWORDS and its analysis results (search: Deng Lab, Tsinghua)
The word frequency of History of the Song Dynasty
The top topics from Sina bloggers
Source: Deng, K., Bol, P. K., Li, K. J., & Liu, J. S. (2016). On the unsupervised analysis of domain-specific Chinese texts. Proceedings of the National Academy of Sciences, 113(22), 6154-6159. https://doi.org/10.1073/pnas.1516510113

What can we get from statistical learning?
A lexicon or ranked list of words: to help us extract the subjects of metadata
Word frequency: to help us catch keywords and discard non-critical information
Probable text preferences and topics: to help us figure out the academic areas
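
As a small illustration of the word-frequency point, the sketch below ranks words from already segmented texts and drops a stop list of non-critical words. The sample data, the stop list, and the function name `keyword_candidates` are hypothetical.

```python
from collections import Counter

def keyword_candidates(segmented_texts, stopwords, top_n=2):
    """Rank words by frequency, ignoring words on the stop list."""
    counts = Counter(w for text in segmented_texts for w in text
                     if w not in stopwords)
    return counts.most_common(top_n)

# Hypothetical segmented texts; "的" stands in for non-critical function words.
segmented_texts = [
    ["土地", "政策", "的", "研究"],
    ["土地", "公有", "的", "政策"],
    ["土地", "改革"],
]
top_words = keyword_candidates(segmented_texts, stopwords={"的"})
print(top_words)
```

The highest-ranked words are only candidates; a cataloger would still decide which of them correspond to subjects.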

Assumption of Frameworks
Integration
Workflow
Cooperation

ASSUMPTION: Integration
Diagram labels: Statistical learning; Word frequency, text preferences & topics; Library job
Source: Deng, K., Bol, P. K., Li, K. J., & Liu, J. S. (2016). On the unsupervised analysis of domain-specific Chinese texts. Proceedings of the National Academy of Sciences, 113(22), 6154-6159. https://doi.org/10.1073/pnas.1516510113

ASSUMPTION: Workflow
Selection, Acquisition: run statistical learning machine; obtain word frequency, topic preference
Cataloging: subjects extraction; learn by machine
Marking / Shelving: subjects comparison; lexicon update
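
The "subjects comparison" and "lexicon update" steps might look something like the following sketch. It assumes the lexicon is a plain set of terms and that extracted terms arrive as a list; both the data and the function name `compare_with_lexicon` are hypothetical, and a real workflow would involve cataloger review before any update.

```python
def compare_with_lexicon(extracted_terms, lexicon):
    """Split extracted terms into known and new ones, and return the enlarged lexicon."""
    matched = sorted({t for t in extracted_terms if t in lexicon})
    new_terms = sorted({t for t in extracted_terms if t not in lexicon})
    updated = set(lexicon) | set(new_terms)
    return matched, new_terms, updated

# Hypothetical existing lexicon and machine-extracted terms.
lexicon = {"土地", "政策"}
extracted = ["土地", "公有", "政策", "公有"]
matched, new_terms, updated = compare_with_lexicon(extracted, lexicon)
print(matched, new_terms)
```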

ASSUMPTION: Cooperation
Diagram labels: Libraries; Report; Statistical Learning Labs; Update

Thank you! anlin-yang@uiowa.edu