JSTOR Tools for Linguists, 22nd June 2009. Michael Krot, Clare Llewellyn, Matt O’Donnell.

Presentation transcript:

JSTOR Tools for Linguists
22nd June 2009
Michael Krot, Clare Llewellyn, Matt O’Donnell

Tools for Linguists
Aim: To create a set of workflows that can extract data from JSTOR, then process or visualize this data in ways that are useful for linguists.
Participants:
- JSTOR: Michael Krot, Clare Llewellyn
- U. Michigan: Matthew Brook O’Donnell

Data for Research Service
The JSTOR archive:
- 4.8M journal articles
- 2.4M research articles
- 1.6M review articles
- ~14 billion words
- 31M+ pages of OCR’d text
- Multidisciplinary: content is organized into 50 disciplines
- High-quality bibliographic and structural metadata, including 40M+ parsed reference citations
The Data for Research service brings much of this content into easy reach of researchers:
- Powerful search tools
- Convenient data retrieval options

Data for Research Service
- A self-serve tool for obtaining research data from the JSTOR archive
  - Provided by a web interface enabling researchers to identify content of interest in the JSTOR archive and to retrieve associated datasets for research purposes
- A researcher-oriented exploration tool complementing the search and browse capabilities offered by the JSTOR main site
  - Exposes additional fields for enhanced searching and results filtering
  - Provides data visualizations for viewing aggregate and document-level data
  - Links to the JSTOR main site are provided for documents in search results
  - Authentication and authorization are required to view article contents

Data for Research – Explore Tool

Data for Research Service
Application Programming Interface (API)
- Provides support for programmatic searching and data retrieval
- Utilizes RESTful protocols for ease of use: plain URL requests, XML responses
- Standards-based search protocol
  - SRU (Search/Retrieve via URL): lightweight successor to the Z39.50 protocol
  - CQL (Contextual Query Language): formal language defining search syntax
- Data retrieval using a simple REST protocol
  - Provides access to the back-end content repository
  - Resource Oriented Architecture (ROA)
  - Stateless: requests contain all relevant information
  - Uses HTTP methods (GET, POST) for operations
  - ?view=
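To make the "plain URL requests" idea concrete, here is a minimal sketch of how an SRU searchRetrieve request carrying a CQL query might be assembled. The endpoint `https://example.org/sru`, the helper name `build_sru_url`, and the `dc.title` / `dc.date` index names are assumptions for illustration only; the slides do not give the actual DfR address or its supported indexes. The parameter names (`operation`, `version`, `query`, `startRecord`, `maximumRecords`) come from the SRU standard.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint: a placeholder, NOT the real DfR address.
BASE_URL = "https://example.org/sru"


def build_sru_url(cql_query, start_record=1, maximum_records=10,
                  base_url=BASE_URL):
    """Return a searchRetrieve URL that carries the whole request.

    Every piece of state (query, paging) travels in the URL itself,
    which is what makes the request stateless.
    """
    params = {
        "operation": "searchRetrieve",   # standard SRU operation name
        "version": "1.1",                # SRU protocol version
        "query": cql_query,              # the CQL search expression
        "startRecord": start_record,     # 1-based offset of first record
        "maximumRecords": maximum_records,
    }
    return base_url + "?" + urlencode(params)


# Index names below are assumed; a real server advertises its own.
url = build_sru_url('dc.title = "corpus linguistics" and dc.date > 1990')
print(url)

# Decoding the URL again shows that the CQL query survives intact.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0])
```

The resulting URL could be fetched with any HTTP client using GET; the XML response format is not described in the slides, so parsing is left out here.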

Data for Research Service
Data views available in DfR Beta3
- Bibliographic metadata: Dublin Core
- Word frequencies: list of distinct words and their occurrence counts
- N-grams (specifically, word n-grams)
  - An n-gram is a sub-sequence of n items from a given sequence
  - Bigrams, trigrams, and quadgrams are provided by DfR
- Keywords: auto-extracted keywords based on their TF*IDF weight
  - TF*IDF (Term Frequency * Inverse Document Frequency) is a statistical measure used to evaluate how important a word is to a document in a collection or corpus
- References (citations out): raw text for identified references
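The n-gram and TF*IDF definitions above can be illustrated with a short sketch. This is one common textbook formulation (relative term frequency times the natural log of inverse document frequency); the slides do not say which TF*IDF variant DfR uses, and the function names and sample sentences are made up for the example.

```python
import math
from collections import Counter


def word_ngrams(tokens, n):
    """Return all contiguous word n-grams of length n as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(documents):
    """Compute TF*IDF weights for a list of tokenized documents.

    Returns one {word: weight} dict per document, using
    tf = count / document length and idf = ln(N / document frequency).
    This is an assumed variant, not necessarily the one DfR applies.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Toy documents, invented for illustration.
docs = [
    "the corpus contains academic text".split(),
    "the parser tags the text".split(),
    "the corpus is large".split(),
]

bigrams = word_ngrams(docs[0], 2)
print(bigrams)

weights = tf_idf(docs)
# Words found in every document get weight 0; rarer words rank higher.
print(sorted(weights[0].items(), key=lambda item: -item[1]))
```

A word such as "the", which occurs in every document, receives a weight of zero, while a word confined to one document is ranked highest; that is the behaviour that makes the measure useful for keyword extraction.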

Components for API Interaction
** Need to clarify the stuff from Bernie
1) Primary component: JSTOR API interface
2) Persistent SEASR web service
   1) HTTP Listener
   2) HTTP Responder

Tools and resulting data most likely to be of interest to:
Computational linguists
- For use in a range of NLP applications; large discipline-specific datasets open up incredible options in computational semantics, tagging, parsing, text mining, etc.
- Numerous applications for a JSTOR-derived academic n-gram set (the 1-million-word 1960s Brown corpus is still used as a source of frequency information!)
Corpus and applied linguists
- The study of distinctive vocabulary and phraseology (lexical patterns of 2+ grams) in and across academic disciplines is currently limited by the lack and size of available data
- Finding words and phrases distinctive to or strongly associated with specific disciplines (statistically identified ‘key words’) requires frequency information from large samples
- Need for discipline-specific frequency lists in teaching and testing of English for Academic Purposes (EAP)

Workflow
1) Define the search terms to create the data set(s)
2) Submit a query to the JSTOR API and receive a response
3) Download the data set(s) for one or more of the data views
4) Conduct analysis using SEASR components
5) Create visualizations using SEASR components

Comparing the Data
Different data views:
- Word counts
- Bigrams
- Trigrams
- Quadgrams
- Key terms
- References
Different data sets:
- Different searches in JSTOR: different journal, discipline, dates
- Compare your own data set with one from JSTOR
Use components to analyze or compare the data:
- Calculate differences in sets
- Extract specific entities, for example proper nouns
- Extract key differences
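One way to "calculate differences in sets" of word counts is a keyness score. The sketch below uses the log-likelihood statistic that is widely used in corpus linguistics for identifying key words; this is a technique chosen for illustration, not one the slides name, and it is not a SEASR component. The function names and the toy frequency lists are invented.

```python
import math
from collections import Counter


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood keyness of one word across two corpora.

    freq_* are the word's counts, size_* the total token counts.
    Higher values mean the word's frequency differs more strongly.
    """
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    if freq_a > 0:
        score += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        score += freq_b * math.log(freq_b / expected_b)
    return 2 * score


def key_words(counts_a, counts_b, top=10):
    """Rank words that are relatively more frequent in A than in B.

    Both arguments are non-empty {word: count} mappings, such as a
    word-frequency data view downloaded for each data set.
    """
    size_a = sum(counts_a.values())
    size_b = sum(counts_b.values())
    scored = []
    for word in set(counts_a) | set(counts_b):
        a = counts_a.get(word, 0)
        b = counts_b.get(word, 0)
        if a / size_a > b / size_b:  # keep words over-represented in A
            scored.append((log_likelihood(a, size_a, b, size_b), word))
    scored.sort(reverse=True)
    return [(word, round(score, 2)) for score, word in scored[:top]]


# Toy frequency lists standing in for two discipline-specific data sets.
linguistics = Counter({"the": 100, "corpus": 30, "syntax": 20})
biology = Counter({"the": 100, "cell": 40, "corpus": 1})

result = key_words(linguistics, biology)
print(result)
```

Function words such as "the" drop out because they are about equally frequent in both sets, while discipline-specific vocabulary rises to the top. Real frequency lists would be far larger, which is exactly why the slides stress the need for large samples.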

Visualizing the Data
Use the visualization capabilities already in SEASR to display results:
- Tables
- Graphs
- Clustering
- Dendrograms

Progress
- Defining what we wanted to do
- Looking at what is already available
- Discussions with SEASR folks
- Producing a shared area for work at UIUC
- Work on making the JSTOR API accessible
- Re-defining what we want to do!

Experience
- SEASR staff very knowledgeable, helpful and responsive
- Learning curve
  - Easy to do the simple stuff
  - Can see the benefits of building our own components but cannot find the time to learn the skills
- Difficult to assign time; really need to build it into another project

Any questions / feedback?
Contact details