Querying GrAF data in linguistic analysis

Presentation transcript:

Querying GrAF data in linguistic analysis
Peter Bouda
Centro Interdisciplinar de Documentação Linguística e Social
pbouda@cidles.eu

Overview
- Existing infrastructure and workflows
- GrAF
- GrAF and TEI
- Poio API
- Queries in Poio API
- Queries in GrAF API

Fieldwork photos

Existing Infrastructure

LD tools and standards
- ELAN: EAF, MPEG, WAV
- Toolbox: TXT, XML, WAV
- Arbil: IMDI/CMDI ("Component MetaData Infrastructure")
- Praat: XML, WAV
- ...
- No standards for tier hierarchies, tier names or annotation schemes
- Efforts in ISOcat

Interlinear Glossed Text

GrAF
- GrAF: Graph Annotation Framework
- ISO 24612: Language resource management - Linguistic annotation framework (LAF)
- Started as a stand-off version of XCES
- An API and a representation as data structures, not a file format
- GrAF/XML as the XML representation
- Used for the MASC (Manually Annotated Sub-Corpus) of the ANC (American National Corpus)
- Nodes, edges, regions, annotations, feature structures

GrAF entities

GrAF structure

GrAF-XML

    <node xml:id="words..W-Words..na23">
      <link targets="words..W-Words..ra23"/>
    </node>
    <region anchors="780 1340" xml:id="words..W-Words..ra23"/>
    <edge from="utterance..W-Spch..n8" to="words..W-Words..na23" xml:id="ea23"/>
    <a as="words" label="words" ref="words..W-Words..na23" xml:id="a23">
      <fs>
        <f name="annotation_value">so</f>
      </fs>
    </a>
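To make the roles of the four element types concrete, the following sketch reads the fragment above with Python's standard library and resolves the annotation to its character anchors. The wrapping `<graph>` root, the `read_fragment` helper and the omission of the GrAF namespace are assumptions made so that the fragment parses on its own; this is not how the GrAF or Poio API parsers work.

```python
import xml.etree.ElementTree as ET

# The slide's fragment, wrapped in a hypothetical <graph> root so that it
# parses as a single document. No namespace is declared here (assumption).
GRAF_XML = """<graph>
<node xml:id="words..W-Words..na23">
  <link targets="words..W-Words..ra23"/>
</node>
<region anchors="780 1340" xml:id="words..W-Words..ra23"/>
<edge from="utterance..W-Spch..n8" to="words..W-Words..na23" xml:id="ea23"/>
<a as="words" label="words" ref="words..W-Words..na23" xml:id="a23">
  <fs>
    <f name="annotation_value">so</f>
  </fs>
</a>
</graph>"""

# ElementTree expands the predefined xml: prefix to this qualified name.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def read_fragment(xml_text):
    """Return (nodes, regions, edges, annotations) as plain Python structures."""
    root = ET.fromstring(xml_text)
    # region id -> tuple of integer anchors
    regions = {
        r.get(XML_ID): tuple(int(a) for a in r.get("anchors").split())
        for r in root.iter("region")
    }
    # node id -> list of linked region ids
    nodes = {}
    for n in root.iter("node"):
        link = n.find("link")
        nodes[n.get(XML_ID)] = link.get("targets").split() if link is not None else []
    # (from node id, to node id)
    edges = [(e.get("from"), e.get("to")) for e in root.iter("edge")]
    # (annotated node id, label, feature dict)
    annotations = []
    for a in root.iter("a"):
        features = {f.get("name"): f.text for f in a.iter("f")}
        annotations.append((a.get("ref"), a.get("label"), features))
    return nodes, regions, edges, annotations


nodes, regions, edges, annotations = read_fragment(GRAF_XML)
for ref, label, features in annotations:
    for region_id in nodes[ref]:
        start, end = regions[region_id]
        print(label, features["annotation_value"], start, end)  # words so 780 1340
```

The annotation points at a node, the node links to a region, and only the region carries offsets into the primary data, which is what keeps the markup fully stand-off.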

TEI and GrAF
- Schemata for GrAF created with TEI Roma
- Customized version of the TEI P5 schema
- ODD: "One Document Does it all"
- GrAF is not TEI compliant
- Both share data types and the feature structures of annotations
- TEI has a "stand-off" variant, which uses XPointer/XLink
- Primary data has to be XML

Why we use GrAF
- No inline markup
- Radical stand-off approach
- Easier to share and manage data
- Preferred solution to archive cultural heritage
- Ideal for sparse annotations
- Existing code: Java and Python API vs. XQuery
- The beauty of annotation graphs

Poio API
- Think of GrAF as an assembly language for linguistic annotation; then Poio API is a library to map from and to higher-level languages
- Subset of GrAF to represent tier-based annotation
- Filters and filter chains for search
- Plugin mechanism for file formats
- Mapping semantics: tiers and annotations to nodes and edges
- Efforts to map between TEI and GrAF
- Retro-digitized dictionary data at the University of Marburg are published as GrAF files
- We want to publish as TEI
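The idea of mapping tiers and annotations to nodes and edges can be sketched in a few lines. Everything here is an illustrative assumption: the two-tier hierarchy (utterance over words), the simplified node ids, the sample utterances and the `map_tiers` helper are invented, and the real Poio API mapping is more involved.

```python
def map_tiers(utterances):
    """Map a two-tier annotation (utterance > words) to nodes and edges.

    `utterances` is a list of (utterance_text, [word, ...]) pairs.
    Returns a dict of node id -> annotation value and a list of
    (parent id, child id) edges.
    """
    nodes = {}
    edges = []
    word_count = 0
    for u_index, (text, words) in enumerate(utterances):
        # One node per annotation on the parent tier (hypothetical id scheme)
        u_id = "utterance..n{}".format(u_index)
        nodes[u_id] = text
        for word in words:
            # One node per annotation on the child tier, plus an edge
            # that records the tier hierarchy
            w_id = "words..n{}".format(word_count)
            word_count += 1
            nodes[w_id] = word
            edges.append((u_id, w_id))
    return nodes, edges


nodes, edges = map_tiers([
    ("so you go", ["so", "you", "go"]),
    ("yes", ["yes"]),
])
print(len(nodes), len(edges))
```

Each annotation becomes a node carrying its value, and the parent-child relation between tiers becomes an edge, so a tier hierarchy of any depth reduces to the same two primitives.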

Queries in GrAF API
- All queries are in-memory
- Users can load parts of the full graph
- Annotation graph to network conversion
- Python library networkx
- Example: Semantic similarity
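The conversion step can be pictured with a small sketch. It uses plain dictionaries instead of networkx so that it runs without dependencies. The node ids, the edge list and the shared-neighbour measure are all invented for illustration; the slides do not say how semantic similarity was computed.

```python
from collections import defaultdict

# Hypothetical (from_node, to_node) pairs, as they might be read off the
# edges of an in-memory annotation graph of dictionary entries.
EDGES = [
    ("entry1", "head..casa"),
    ("entry1", "translation..house"),
    ("entry2", "head..hogar"),
    ("entry2", "translation..house"),
    ("entry2", "translation..home"),
    ("entry3", "head..perro"),
    ("entry3", "translation..dog"),
]


def to_undirected(edges):
    """Drop edge direction; return an adjacency map node -> set of neighbours."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


def shared_neighbours(adjacency, a, b):
    """Nodes connected to both a and b (a crude relatedness signal)."""
    return adjacency[a] & adjacency[b]


network = to_undirected(EDGES)
print(sorted(shared_neighbours(network, "entry1", "entry2")))
```

With networkx installed, the same edge list could be handed to `networkx.Graph(EDGES)` to get an undirected graph on which the library's own algorithms can run.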

Queries in GrAF API

    for (node_id, node) in graf_graph.nodes.items():
        if node_id.endswith("entry"):
            for e in node.out_edges:
                if e.annotations.get_first().label == "head" or \
                        e.annotations.get_first().label == "translation":
                    features = e.to_node.annotations.get_first().features
                    substr = features.get_value("substring")
                    [...]

Queries in Poio API Example: Word order in Hinuq

Queries in Poio API

    import collections

    # from_excel is used as on the slide; its import is not shown there
    ag = from_excel("data/Hinuq2.csv")
    clause_unit_nodes = ag.nodes_for_tier("clause_id")

    verbs = ['COP', 'cop', 'SAY', 'say', 'v.tr', 'v.intr', 'v.aff']
    others = ['A', 'S', 'P', 'EXP', 'STIM']
    search_terms = verbs + others

    # Maps a word-order tuple to the number of clauses that show it
    word_orders = collections.defaultdict(int)
    for parent_node in clause_unit_nodes:
        word_order = []
        for word_n in parent_node.iter_children():
            a_list = ag.annotations_for_tier("grammatical_relation", word_n)
            if len(a_list) > 0:
                a_value = ag.annotation_value_for_annotation(a_list[0])
                if a_value in search_terms:
                    # Collapse all verb labels into a single 'V' slot
                    if a_value in verbs:
                        word_order.append('V')
                    else:
                        word_order.append(a_value)
        word_orders[tuple(word_order)] += 1
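To see what the counting loop produces without a Hinuq file at hand, here is a self-contained toy version. The clauses are invented lists of grammatical-relation labels (with `None` and `'ADV'` standing in for words that carry no relevant label), `count_word_orders` is a hypothetical helper, and a `Counter` replaces the `defaultdict` so the result can be ranked directly.

```python
import collections

VERBS = {'COP', 'cop', 'SAY', 'say', 'v.tr', 'v.intr', 'v.aff'}
OTHERS = {'A', 'S', 'P', 'EXP', 'STIM'}

# Invented clauses: one grammatical-relation label per word.
CLAUSES = [
    ['A', 'P', 'v.tr'],
    ['S', None, 'v.intr'],
    ['A', 'P', 'v.tr'],
    ['P', 'A', 'v.tr'],
    ['S', 'ADV', 'COP'],
]


def count_word_orders(clauses):
    """Count how often each order of arguments and verb occurs."""
    word_orders = collections.Counter()
    for clause in clauses:
        order = []
        for label in clause:
            if label in VERBS:
                order.append('V')      # all verb labels collapse to 'V'
            elif label in OTHERS:
                order.append(label)    # keep argument labels as they are
            # any other label (or None) is skipped
        word_orders[tuple(order)] += 1
    return word_orders


word_orders = count_word_orders(CLAUSES)
for order, count in word_orders.most_common():
    print(' '.join(order), count)
```

Using tuples as keys is what makes the orders countable: lists are not hashable, so each clause's order is frozen into a tuple before it is tallied.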

Filters and filter chains

    import poioapi.annotationgraph
    import poioapi.data

    ag = poioapi.annotationgraph.AnnotationGraph()
    ag.from_elan("elan-example3.eaf")
    ag.structure_type_handler = poioapi.data.DataStructureType(
        ag.tier_hierarchies[0])

    # Build a filter tier by tier and append it to the graph's filter chain
    af = poioapi.annotationgraph.AnnotationGraphFilter(ag)
    af.set_filter_for_tier("words..W-Words", "follow")
    af.set_filter_for_tier("part_of_speech..W-POS", r"\bpro\b")
    ag.append_filter(af)

    print("Filtered root nodes:")
    print(ag.filtered_node_ids)

    # The same search terms given as a dictionary
    search_terms = {
        "words..W-Words": "follow",
        "part_of_speech..W-POS": r"\bpro\b"
    }
    af = ag.create_filter_for_dict(search_terms)
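The second search term is a raw string with word boundaries, which suggests that filter values are treated as regular expressions. Assuming that, and assuming all tier patterns must match for a hit, the sketch below applies the same two patterns to invented rows with `re.search`. `matches_all` is a hypothetical helper and not part of Poio API.

```python
import re

SEARCH_TERMS = {
    "words..W-Words": "follow",
    "part_of_speech..W-POS": r"\bpro\b",
}

# Invented rows: for each tier, the annotation values joined into one string.
ROWS = [
    {"words..W-Words": "they follow him", "part_of_speech..W-POS": "pro v pro"},
    {"words..W-Words": "dogs follow cats", "part_of_speech..W-POS": "n v n"},
    {"words..W-Words": "she left", "part_of_speech..W-POS": "pro v"},
    {"words..W-Words": "follow the prophet", "part_of_speech..W-POS": "v det prop"},
]


def matches_all(row, search_terms):
    """True if every tier's pattern is found in that tier's text."""
    return all(
        re.search(pattern, row.get(tier, ""))
        for tier, pattern in search_terms.items()
    )


hits = [index for index, row in enumerate(ROWS) if matches_all(row, SEARCH_TERMS)]
print(hits)  # [0]
```

The last row shows why the word boundaries matter: without `\b`, the pattern `pro` would also match the tag `prop`.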

Poio Analyzer
- Developed for and with Prof. Johannes Helmbrecht, University of Regensburg
- How to query the corpus in order to write a descriptive grammar?
- Started with a list of requirements
- Need to publish and archive queries and results

Poio Analyzer

Thank you for your attention! pbouda@cidles.eu

Links
- CLARIN curation project: http://de.clarin.eu/en/discipline-specific-working-groups/wg-3-linguistic-fieldwork-anthropology-language-typology/curation-project-1.html
- Poio: http://media.cidles.eu/poio/
- GrAF: http://www.xces.org/ns/GrAF/1.0/