JSTOR Tools for Linguists, 22nd June 2009. Michael Krot, Clare Llewellyn, Matt O’Donnell


Tools for Linguists
Aim: To create a set of workflows that extract data from JSTOR, then process or visualize that data in ways that are useful for linguists.
Participants:
- JSTOR: Michael Krot, Clare Llewellyn
- U. Michigan: Matthew Brook O’Donnell

Data for Research Service
The JSTOR archive:
- 4.8M journal articles
  - 2.4M research articles
  - 1.6M review articles
- ~14 billion words
- 31M+ pages of OCR’d text
- Multidisciplinary: content is organized into 50 disciplines
- High-quality bibliographic and structural metadata, including 40M+ parsed reference citations
The Data for Research service brings much of this content into easy reach of researchers:
- Powerful search tools
- Convenient data retrieval options

Data for Research Service
A self-serve tool for obtaining research data from the JSTOR archive
- Provided through a web interface that lets researchers identify content of interest in the JSTOR archive and retrieve associated datasets for research purposes
A researcher-oriented exploration tool complementing the search and browse capabilities of the JSTOR main site
- Exposes additional fields for enhanced searching and results filtering
- Provides data visualizations for viewing aggregate and document-level data
Links to the JSTOR main site are provided for documents in search results
- Authentication and authorization are required to view article contents

Data for Research – Explore Tool

Data for Research Service
Application Programming Interface (API)
- Provides support for programmatic searching and data retrieval
- Uses RESTful protocols for ease of use: plain URL requests, XML responses
Standards-based search protocol
- SRU (Search/Retrieve via URL): lightweight successor to the Z39.50 protocol
- CQL (Contextual Query Language): formal language defining the search syntax
Data retrieval using a simple REST protocol
- Provides access to the back-end content repository
- Resource Oriented Architecture (ROA)
- Stateless: requests contain all relevant information
- Uses HTTP methods (GET, POST) for operations
- http://dfr.jstor.org/resource/ ?view=
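As a rough illustration of the "plain URL requests" idea, the sketch below builds an SRU searchRetrieve URL carrying a CQL query. The parameter names (operation, version, query, startRecord, maximumRecords) come from the SRU 1.1 standard. The base URL is a placeholder, since the slides do not give the search endpoint, and the `dc.title` and `dc.date` indexes are assumed for illustration; DfR may expose different ones.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Placeholder endpoint (assumption): the slides do not state the SRU base URL.
SRU_BASE = "http://dfr.example.org/sru"


def build_sru_url(cql_query, start_record=1, maximum_records=10, base=SRU_BASE):
    """Return a searchRetrieve URL using standard SRU 1.1 parameter names."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,            # CQL expression, URL-encoded below
        "startRecord": start_record,   # 1-based offset into the result set
        "maximumRecords": maximum_records,
    }
    return base + "?" + urlencode(params)


# The index names here are assumptions, not confirmed DfR indexes.
url = build_sru_url('dc.title = "corpus linguistics" and dc.date > 1990')
print(url)

# Round trip: decoding the query string recovers the original CQL expression.
decoded = parse_qs(urlsplit(url).query)
print(decoded["query"][0])
```

Because every request carries its full state in the URL, paging through results only requires changing `start_record`; no session needs to be kept between calls.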

Data for Research Service
Data views available in DfR Beta3
- Bibliographic metadata: Dublin Core
- Word frequencies: list of distinct words and their occurrence counts
- N-grams (specifically, word n-grams)
  - An n-gram is a sub-sequence of n items from a given sequence
  - Bigrams, trigrams, and quadgrams are provided by DfR
- Keywords: auto-extracted keywords based on their TF*IDF weight
  - TF*IDF (Term Frequency * Inverse Document Frequency) is a statistical measure used to evaluate how important a word is to a document in a collection or corpus
- References (citations out): raw text for identified references
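The word-frequency, n-gram, and keyword views can be illustrated with a small sketch. It uses whitespace tokenization and the textbook weighting tf * log(N / df); both are assumptions made for illustration, since the slides do not say how DfR tokenizes text or which TF*IDF variant it applies. The three sample documents are made up.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a real pipeline would do more."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return every contiguous sub-sequence of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """Score each word in each document as tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each word appears.
    df = Counter(word for toks in tokenized for word in set(toks))
    scores = []
    for toks in tokenized:
        counts = Counter(toks)
        scores.append({
            w: (c / len(toks)) * math.log(n_docs / df[w])
            for w, c in counts.items()
        })
    return scores


# Made-up sample documents, for illustration only.
docs = [
    "the corpus contains academic text",
    "the parser tags academic text",
    "the corpus informs the parser",
]

tokens = tokenize(docs[0])
word_freq = Counter(tokens)      # word-frequency view
bigrams = ngrams(tokens, 2)      # bigram view
trigrams = ngrams(tokens, 3)     # trigram view
scores = tf_idf(docs)            # keyword weights per document

print(bigrams)
print(sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True))
```

A word that occurs in every document (here "the") gets a weight of zero, which is why TF*IDF favours terms that are frequent in one document but rare across the collection.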

Components for API Interaction
** Need to clarify the stuff from Bernie
1) Primary component – JSTOR API interface
2) Persistent SEASR web service
   1) HTTP Listener
   2) HTTP Responder

Tools and resulting data most likely to be of interest to:
Computational Linguists
- For use in a range of NLP applications; large discipline-specific datasets open up incredible options in computational semantics, tagging, parsing, text mining, etc.
- Numerous applications for a JSTOR-derived academic n-gram set (the 1-million-word Brown corpus from the 1960s is still used as a source of frequency information!)
Corpus and Applied Linguists
- The study of distinctive vocabulary and phraseology (lexical patterns of 2+ grams) in and across academic disciplines is currently limited by the lack and size of available data
- Finding words and phrases distinctive to or strongly associated with specific disciplines (statistically identified ‘key words’) requires frequency information from large samples
- Need for discipline-specific frequency lists in teaching and testing of English for Academic Purposes (EAP)

Workflow
- Define the search terms to create the data set(s)
- Submit a query to the JSTOR API and receive a response
- Download the data set(s) for one or more of the data views
- Conduct analysis using SEASR components
- Create visualizations using SEASR components

Comparing the Data
Different data views:
- Word counts
- Bigrams
- Trigrams
- Quadgrams
- Key terms
- References
Different data sets:
- Different searches in JSTOR: different journal, discipline, dates
- Compare your own data set with one from JSTOR
Use components to analyze or compare the data:
- Calculate differences in sets
- Extract specific entities, for example proper nouns
- Extract key differences
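One way to "extract key differences" between two word-count data sets is a keyness ranking. The sketch below uses the log-likelihood statistic that is common in corpus linguistics; the slides do not name a statistic, so this choice is an assumption, and the counts are made up rather than taken from JSTOR.

```python
import math
from collections import Counter


def log_likelihood(a, b, total_a, total_b):
    """Log-likelihood for a word seen a times in corpus A and b times in corpus B."""
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / expected_a)
    if b > 0:
        ll += b * math.log(b / expected_b)
    return 2 * ll


def key_words(freq_a, freq_b):
    """Rank words that are relatively more frequent in A than in B."""
    total_a = sum(freq_a.values())
    total_b = sum(freq_b.values())
    ranked = []
    for word, a in freq_a.items():
        b = freq_b.get(word, 0)
        if a / total_a > b / total_b:
            ranked.append((word, log_likelihood(a, b, total_a, total_b)))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Made-up word counts standing in for two discipline-specific data sets.
linguistics = Counter({"the": 500, "corpus": 40, "syntax": 30, "cell": 1})
biology = Counter({"the": 520, "corpus": 2, "syntax": 1, "cell": 60})

ranking = key_words(linguistics, biology)
for word, score in ranking:
    print(word, round(score, 2))
```

Swapping the two arguments gives the key words of the second data set instead. The same function applies to bigram or trigram counts, since it only needs a mapping from items to frequencies.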

Visualizing the Data
Use the visualization capabilities already in SEASR to display results:
- Tables
- Graphs
- Clustering
- Dendrograms

Progress
- Defining what we wanted to do
- Looking at what is already available
- Discussions with SEASR folks
- Producing a shared area for work at UIUC
- Work on making the JSTOR API accessible
- Re-defining what we want to do!

Experience
- SEASR staff very knowledgeable, helpful and responsive
- Learning curve
  - Easy to do the simple stuff
  - Can see the benefits of building our own components but cannot find the time to learn the skills
- Difficult to assign time – really need to build it into another project

Any questions / feedback?
Contact details:
- michael.krot@jstor.org
- clare.llewellyn@jstor.org
- mbod@umich.edu
