Querying GrAF data in linguistic analysis
Peter Bouda
Centro Interdisciplinar de Documentação Linguística e Social
pbouda@cidles.eu

Overview
- Existing infrastructure and workflows
- GrAF
- GrAF and TEI
- Poio API
- Queries in Poio API
- Queries in GrAF API
Fieldwork Photos
Existing Infrastructure
LD tools and standards
- Elan: EAF, MPEG, WAV
- Toolbox: TXT, XML, WAV
- Arbil: IMDI/CMDI ("Component MetaData Infrastructure")
- Praat: XML, WAV
- ...
- No standards for tier hierarchies, tier names or annotation schemes
- Efforts in ISOcat
Interlinear Glossed Text
GrAF
- GrAF: Graph Annotation Framework
- ISO 24612: Language resource management - Linguistic annotation framework (LAF)
- Started as stand-off version of XCES
- API and representation as data structures, not a file format
- GrAF/XML as XML representation
- Used for the MASC of the ANC
- Nodes, edges, regions, annotations, feature structures
GrAF entities
GrAF structure
GrAF-XML

    <node xml:id="words..W-Words..na23">
      <link targets="words..W-Words..ra23"/>
    </node>
    <region anchors="780 1340" xml:id="words..W-Words..ra23"/>
    <edge from="utterance..W-Spch..n8" to="words..W-Words..na23" xml:id="ea23"/>
    <a as="words" label="words" ref="words..W-Words..na23" xml:id="a23">
      <fs>
        <f name="annotation_value">so</f>
      </fs>
    </a>
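To make the four entity types in the fragment concrete, here is a minimal sketch that reads it with Python's standard `xml.etree.ElementTree`. It is not the GrAF API itself. The `<graph>` wrapper element is added only so the fragment parses as one document, and the sketch assumes un-namespaced element names as shown on the slide; a real GrAF/XML file may declare a default namespace, in which case the tag names would need to be qualified.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is bound to this namespace by the XML specification.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Fragment from the slide, wrapped in a hypothetical <graph> root element.
FRAGMENT = """
<graph>
  <node xml:id="words..W-Words..na23">
    <link targets="words..W-Words..ra23"/>
  </node>
  <region anchors="780 1340" xml:id="words..W-Words..ra23"/>
  <edge from="utterance..W-Spch..n8" to="words..W-Words..na23" xml:id="ea23"/>
  <a as="words" label="words" ref="words..W-Words..na23" xml:id="a23">
    <fs>
      <f name="annotation_value">so</f>
    </fs>
  </a>
</graph>
"""

root = ET.fromstring(FRAGMENT)

# Regions: region id -> (start, end) anchors into the primary data.
regions = {
    r.get(XML_ID): tuple(int(x) for x in r.get("anchors").split())
    for r in root.iter("region")
}

# Nodes: node id -> list of region ids the node links to.
nodes = {
    n.get(XML_ID): [t for link in n.iter("link") for t in link.get("targets").split()]
    for n in root.iter("node")
}

# Edges: (from node id, to node id) pairs.
edges = [(e.get("from"), e.get("to")) for e in root.iter("edge")]

# Annotations: id of the annotated node -> {feature name: feature value}.
annotations = {
    a.get("ref"): {f.get("name"): f.text for f in a.iter("f")}
    for a in root.iter("a")
}

print(regions, nodes, edges, annotations, sep="\n")
```

The word node points to a region of the primary data, hangs below an utterance node via the edge, and carries its value "so" in a feature structure.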
TEI and GrAF
- Schemata for GrAF created with TEI Roma
- Customized version of TEI P5 schema
- ODD: "One Document Does it all"
- GrAF is not TEI compliant
- Share data types and feature structures of annotations
- TEI has "stand-off" variant, uses XPointer/XLink
- Primary data has to be XML
Why we use GrAF
- No inline markup
- Radical stand-off approach
- Easier to share and manage data
- Preferred solution to archive cultural heritage
- Ideal for sparse annotations
- Existing code: Java and Python
- API vs. XQuery
- The beauty of annotation graphs
Poio API
- Think of GrAF as an assembly language for linguistic annotation; then Poio API is a library to map from and to higher-level languages
- Subset of GrAF to represent tier-based annotation
- Filters and filter chains for search
- Plugin mechanism for file formats
- Mapping semantics: tiers and annotations to nodes and edges
- Efforts to map between TEI and GrAF
- Retro-digitized dictionary data at University of Marburg are published as GrAF files
- We want to publish as TEI
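The mapping from tiers and annotations to nodes and edges can be illustrated with a small sketch. This is not Poio API code: the function `tiers_to_graph`, the two-level hierarchy (utterance above words) and the node id scheme are assumptions made for illustration, loosely modelled on the ids in the GrAF-XML example.

```python
from itertools import count

def tiers_to_graph(utterances):
    """Map a two-level tier hierarchy (utterance > words) to nodes and edges.

    `utterances` is a list of (utterance text, list of words) pairs.
    The id scheme "<tier>..n<number>" is hypothetical.
    """
    nodes = {}   # node id -> annotation value
    edges = []   # (parent node id, child node id)
    ids = count(1)
    for utterance, words in utterances:
        u_id = "utterance..n{}".format(next(ids))
        nodes[u_id] = utterance
        for word in words:
            w_id = "words..n{}".format(next(ids))
            nodes[w_id] = word
            # The tier hierarchy becomes an edge from parent to child.
            edges.append((u_id, w_id))
    return nodes, edges

nodes, edges = tiers_to_graph([("so you go", ["so", "you", "go"])])
for parent, child in edges:
    print(parent, "->", child, repr(nodes[child]))
```

Each annotation becomes a node, and each parent-child relation between tiers becomes an edge, so the tier hierarchy can be recovered by walking the edges.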
Queries in GrAF API
- All queries are in-memory
- Users can load parts of the full graph
- Annotation graph to network conversion
- Python library networkx
- Example: Semantic similarity
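As a rough sketch of the network idea, the following builds a small undirected network from head and translation pairs and looks up heads that share a translation. It deliberately uses plain Python dictionaries instead of networkx so that it runs without extra dependencies, and it is not the similarity measure used in the talk. The word pairs and the function `related_heads` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical (head, translation) pairs, as they might be read from
# dictionary entries in an annotation graph.
pairs = [
    ("casa", "house"),
    ("lar", "house"),
    ("lar", "home"),
    ("fogo", "fire"),
]

# Undirected network: heads and translations are both vertices,
# tagged with their kind so the two vocabularies cannot collide.
network = defaultdict(set)
for head, translation in pairs:
    network[("head", head)].add(("translation", translation))
    network[("translation", translation)].add(("head", head))

def related_heads(head):
    """Return the heads that share at least one translation with `head`."""
    found = set()
    for translation in network[("head", head)]:
        for _kind, other in network[translation]:
            if other != head:
                found.add(other)
    return found

print(related_heads("casa"))
```

With networkx installed, the same edges could instead be added to a `networkx.Graph`, which would make shortest paths and other graph measures available for the comparison.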
Queries in GrAF API

    for (node_id, node) in graf_graph.nodes.items():
        if node_id.endswith("entry"):
            for e in node.out_edges:
                if e.annotations.get_first().label == "head" or \
                        e.annotations.get_first().label == "translation":
                    features = e.to_node.annotations.get_first().features
                    substr = features.get_value("substring")
                    [...]
Queries in Poio API
- Example: Word order in Hinuq
Queries in Poio API

    import collections

    # from_excel is not defined on this slide; it is assumed to load the
    # Hinuq data into an annotation graph.
    ag = from_excel("data/Hinuq2.csv")
    clause_unit_nodes = ag.nodes_for_tier("clause_id")

    verbs = ['COP', 'cop', 'SAY', 'say', 'v.tr', 'v.intr', 'v.aff']
    others = ['A', 'S', 'P', 'EXP', 'STIM']
    search_terms = verbs + others

    # Count how often each sequence of constituents occurs.
    word_orders = collections.defaultdict(int)
    for parent_node in clause_unit_nodes:
        word_order = []
        for word_n in parent_node.iter_children():
            a_list = ag.annotations_for_tier("grammatical_relation", word_n)
            if len(a_list) > 0:
                a_value = ag.annotation_value_for_annotation(a_list[0])
                if a_value in search_terms:
                    # All verb labels are collapsed into 'V'.
                    if a_value in verbs:
                        word_order.append('V')
                    else:
                        word_order.append(a_value)
        word_orders[tuple(word_order)] += 1
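The counting logic can be tried without the Hinuq data or Poio API. The sketch below applies the same idea to a hand-made list of clauses; the clause data and the helper `word_order` are invented for illustration, and only the label sets are taken from the slide.

```python
import collections

VERBS = {"COP", "cop", "SAY", "say", "v.tr", "v.intr", "v.aff"}
OTHERS = {"A", "S", "P", "EXP", "STIM"}

# Hypothetical clause units: each is the sequence of grammatical-relation
# labels of its words, with None for words without such an annotation.
clauses = [
    ["A", "P", "v.tr"],
    ["S", None, "v.intr"],
    ["A", "P", "v.tr"],
    ["adv", "S", "COP"],
]

def word_order(labels):
    """Reduce a clause to its word order, collapsing all verbs to 'V'."""
    order = []
    for label in labels:
        if label in VERBS:
            order.append("V")
        elif label in OTHERS:
            order.append(label)
        # Any other label (or None) is ignored.
    return tuple(order)

word_orders = collections.Counter(word_order(c) for c in clauses)
for order, n in word_orders.most_common():
    print(" ".join(order), n)
```

Using tuples as keys keeps each distinct order as one entry, so the result can be read directly as a frequency table of word orders.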
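Filters and filter chains

    import poioapi.annotationgraph
    import poioapi.data

    # Load an Elan file and select the first tier hierarchy.
    ag = poioapi.annotationgraph.AnnotationGraph()
    ag.from_elan("elan-example3.eaf")
    ag.structure_type_handler = poioapi.data.DataStructureType(ag.tier_hierarchies[0])

    # Build a filter with one regular expression per tier and append it
    # to the filter chain of the annotation graph.
    af = poioapi.annotationgraph.AnnotationGraphFilter(ag)
    af.set_filter_for_tier("words..W-Words", "follow")
    af.set_filter_for_tier("part_of_speech..W-POS", r"\bpro\b")
    ag.append_filter(af)

    print("Filtered root nodes:")
    print(ag.filtered_node_ids)

    # Alternatively, create the same filter from a dictionary.
    search_terms = {
        "words..W-Words": "follow",
        "part_of_speech..W-POS": r"\bpro\b"
    }
    af = ag.create_filter_for_dict(search_terms)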
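The idea of a filter chain can be sketched in plain Python. This is not how Poio API implements filters; it only assumes that a filter holds one regular expression per tier, that an element passes a filter when every expression matches, and that an element survives the chain when it passes every filter. The sample elements and the helpers `make_filter` and `apply_chain` are hypothetical.

```python
import re

# Hypothetical root elements: element id -> {tier name: text on that tier}.
elements = {
    "n1": {"words..W-Words": "they follow him", "part_of_speech..W-POS": "pro v pro"},
    "n2": {"words..W-Words": "follow the road", "part_of_speech..W-POS": "v det n"},
    "n3": {"words..W-Words": "she sleeps", "part_of_speech..W-POS": "pro v"},
}

def make_filter(search_terms):
    """Build a filter from a {tier name: regular expression} dictionary."""
    compiled = {tier: re.compile(pattern) for tier, pattern in search_terms.items()}

    def passes(tiers):
        # Every tier expression has to match somewhere in the tier's text.
        return all(rx.search(tiers.get(tier, "")) for tier, rx in compiled.items())

    return passes

def apply_chain(elements, filters):
    """Return the ids of the elements that pass every filter in the chain."""
    return [eid for eid, tiers in elements.items() if all(f(tiers) for f in filters)]

chain = [make_filter({"words..W-Words": "follow", "part_of_speech..W-POS": r"\bpro\b"})]
filtered = apply_chain(elements, chain)
print(filtered)
```

Appending a further filter to `chain` narrows the result, while an empty chain leaves all elements in place.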
Poio Analyzer
- Developed for and with Prof. Johannes Helmbrecht, University of Regensburg
- How to query the corpus in order to write a descriptive grammar?
- Started with a list of requirements
- Need to publish and archive queries and results
Poio Analyzer
Thank you for your attention!
pbouda@cidles.eu
Links
- Clarin curation project: http://de.clarin.eu/en/discipline-specific-working-groups/wg-3-linguistic-fieldwork-anthropology-language-typology/curation-project-1.html
- Poio: http://media.cidles.eu/poio/
- GrAF: http://www.xces.org/ns/GrAF/1.0/