Querying GrAF data in linguistic analysis


1 Querying GrAF data in linguistic analysis
Peter Bouda Centro Interdisciplinar de Documentação Linguística e Social

2 Overview
Existing infrastructure and workflows
GrAF
GrAF and TEI
Poio API
Queries in Poio API
Queries in GrAF API

3 Fieldwork Photos

4 Existing Infrastructure

5 LD tools and standards
Elan: EAF, MPEG, WAV
Toolbox: TXT, XML, WAV
Arbil: IMDI/CMDI ("Component MetaData Infrastructure")
Praat: XML, WAV
...
No standards for tier hierarchies, tier names or annotation schemes
Efforts in ISOcat

6 Interlinear Glossed Text

7 GrAF
GrAF: Graph Annotation Framework
ISO 24612: Language resource management - Linguistic annotation framework (LAF)
Started as stand-off version of XCES
API and representation as data structures, not a file format
GrAF/XML as XML representation
Used for the MASC of the ANC
Nodes, edges, regions, annotations, feature structures

8 GrAF entities

9 GrAF structure

10 GrAF-XML
    <node xml:id="words..W-Words..na23">
      <link targets="words..W-Words..ra23"/>
    </node>
    <region anchors=" " xml:id="words..W-Words..ra23"/>
    <edge from="utterance..W-Spch..n8" to="words..W-Words..na23" xml:id="ea23"/>
    <a as="words" label="words" ref="words..W-Words..na23" xml:id="a23">
      <fs>
        <f name="annotation_value">so</f>
      </fs>
    </a>
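The fragment above can be read with nothing more than the Python standard library. The following is a minimal sketch, not the GrAF API itself: the `<graph>` wrapper element and the function name `read_fragment` are assumptions made for illustration, the namespace that real GrAF/XML files declare is omitted, and the region anchors stay blank because the transcript does not preserve them.

```python
import xml.etree.ElementTree as ET

# xml:id belongs to the predefined XML namespace; ElementTree expands it to this key.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The slide's fragment, wrapped in a hypothetical <graph> root so it parses as one document.
FRAGMENT = """
<graph>
  <node xml:id="words..W-Words..na23">
    <link targets="words..W-Words..ra23"/>
  </node>
  <region anchors=" " xml:id="words..W-Words..ra23"/>
  <edge from="utterance..W-Spch..n8" to="words..W-Words..na23" xml:id="ea23"/>
  <a as="words" label="words" ref="words..W-Words..na23" xml:id="a23">
    <fs>
      <f name="annotation_value">so</f>
    </fs>
  </a>
</graph>
"""


def read_fragment(xml_text):
    """Collect node ids, (from, to) edge pairs and annotations keyed by node id."""
    root = ET.fromstring(xml_text)
    nodes = [n.get(XML_ID) for n in root.iter("node")]
    edges = [(e.get("from"), e.get("to")) for e in root.iter("edge")]
    annotations = {}
    for a in root.iter("a"):
        # Flatten the feature structure <fs> into a name -> value dictionary.
        features = {f.get("name"): f.text for f in a.iter("f")}
        annotations[a.get("ref")] = (a.get("label"), features)
    return nodes, edges, annotations


nodes, edges, annotations = read_fragment(FRAGMENT)
print(nodes)
print(edges)
print(annotations)
```

The annotation is attached to its node only through the `ref` attribute, which is what makes the representation stand-off: the word "so" lives in a feature structure, not inside the node element.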

11 TEI and GrAF
Schemata for GrAF created with TEI Roma
Customized version of TEI P5 schema
ODD: "One Document Does it all"
GrAF is not TEI compliant
Share data types and feature structures of annotations
TEI has "stand-off" variant, uses XPointer/XLink
Primary data has to be XML

12 Why we use GrAF
No inline markup
Radical stand-off approach
Easier to share and manage data
Preferred solution to archive cultural heritage
Ideal for sparse annotations
Existing code: Java and Python API vs. XQuery
The beauty of annotation graphs

13 Poio API
Think of GrAF as an assembly language for linguistic annotation; then Poio API is a library to map from and to higher-level languages
Subset of GrAF to represent tier-based annotation
Filters and filter chains for search
Plugin mechanism for file formats
Mapping semantics: tiers and annotations to nodes and edges
Efforts to map between TEI and GrAF
Retro-digitized dictionary data at University of Marburg are published as GrAF files
We want to publish as TEI
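The "tiers and annotations to nodes and edges" mapping can be pictured with a small sketch. This is not Poio API code: the `Graph` class and the helpers `node_id` and `add_annotation` are hypothetical, the utterance text is toy data, and the id pattern is only inferred from ids such as "words..W-Words..na23" on the GrAF-XML slide.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Hypothetical stand-in for an annotation graph."""
    nodes: dict = field(default_factory=dict)  # node id -> annotation value
    edges: list = field(default_factory=list)  # (parent node id, child node id)


def node_id(tier_type, tier_name, local_id):
    # Assumed pattern: <tier type>..<tier name>..n<local id>
    return f"{tier_type}..{tier_name}..n{local_id}"


def add_annotation(graph, tier_type, tier_name, local_id, value, parent=None):
    """Store one annotation as a node; link it to its parent tier's node, if any."""
    nid = node_id(tier_type, tier_name, local_id)
    graph.nodes[nid] = value
    if parent is not None:
        graph.edges.append((parent, nid))
    return nid


g = Graph()
# Toy data: one utterance annotation with one word annotation below it.
utt = add_annotation(g, "utterance", "W-Spch", "8", "so you go")
word = add_annotation(g, "words", "W-Words", "a23", "so", parent=utt)
print(g.edges)
```

Under these assumptions the tier hierarchy never has to be stored separately: it is recoverable from the edges, and the tier of each annotation is recoverable from its node id.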

14 Queries in GrAF API
All queries are in-memory
Users can load parts of the full graph
Annotation graph to network conversion
Python library networkx
Example: Semantic similarity
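To illustrate what the graph-to-network conversion buys, here is a dependency-free sketch. It deliberately swaps networkx for a plain adjacency dictionary and a breadth-first search, and the entries, translations and the idea of using path length as a crude similarity signal are all assumptions for illustration, not the method used in the talk.

```python
from collections import defaultdict, deque

# Toy edges (hypothetical): dictionary entries linked to their translations.
EDGES = [
    ("entry:house", "translation:casa"),
    ("entry:home", "translation:casa"),
    ("entry:home", "translation:hogar"),
    ("entry:tree", "translation:arbol"),
]


def to_network(edges):
    """Turn directed annotation-graph edges into an undirected adjacency map."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


def distance(adjacency, start, goal):
    """Breadth-first search: edges on the shortest path, or None if unconnected."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        current, steps = queue.popleft()
        if current == goal:
            return steps
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, steps + 1))
    return None


network = to_network(EDGES)
print(distance(network, "entry:house", "entry:home"))
print(distance(network, "entry:house", "entry:tree"))
```

Two entries that share a translation end up two steps apart, while unrelated entries are not connected at all; a network library offers the same kind of query plus many ready-made graph measures.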

15 Queries in GrAF API
    for (node_id, node) in graf_graph.nodes.items():
        if node_id.endswith("entry"):
            for e in node.out_edges:
                if e.annotations.get_first().label == "head" or \
                        e.annotations.get_first().label == "translation":
                    features = e.to_node.annotations.get_first().features
                    substr = features.get_value("substring")
                    [...]

16 Queries in Poio API Example: Word order in Hinuq

17 Queries in Poio API
    import collections

    ag = from_excel("data/Hinuq2.csv")
    clause_unit_nodes = ag.nodes_for_tier("clause_id")

    verbs = [ 'COP', 'cop', 'SAY', 'say', 'v.tr', 'v.intr', 'v.aff' ]
    others = [ 'A', 'S', 'P', 'EXP', 'STIM' ]
    search_terms = verbs + others

    # Count how often each order of grammatical relations occurs per clause.
    word_orders = collections.defaultdict(int)
    for parent_node in clause_unit_nodes:
        word_order = []
        for word_n in parent_node.iter_children():
            a_list = ag.annotations_for_tier("grammatical_relation", word_n)
            if len(a_list) > 0:
                a_value = ag.annotation_value_for_annotation(a_list[0])
                if a_value in search_terms:
                    if a_value in verbs:
                        # All verb labels are collapsed into a single 'V'.
                        word_order.append('V')
                    else:
                        word_order.append(a_value)
        word_orders[tuple(word_order)] += 1
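The counting logic on this slide can be tried without Poio API or the Hinuq data. The sketch below keeps the slide's label sets but runs on invented clauses; the function name `word_order` and the contents of `CLAUSES` are assumptions for illustration only.

```python
from collections import Counter

# Label sets taken from the slide.
VERBS = {"COP", "cop", "SAY", "say", "v.tr", "v.intr", "v.aff"}
OTHERS = {"A", "S", "P", "EXP", "STIM"}


def word_order(relations):
    """Reduce one clause's grammatical-relation labels to a word-order tuple."""
    order = []
    for value in relations:
        if value in VERBS:
            order.append("V")      # collapse every verb label into 'V'
        elif value in OTHERS:
            order.append(value)    # keep argument labels as they are
        # any other label, or a missing annotation (None), is skipped
    return tuple(order)


# Hypothetical clauses: one list of labels per clause, one label per word.
CLAUSES = [
    ["A", "P", "v.tr"],
    ["S", None, "v.intr"],
    ["A", "adv", "P", "v.tr"],
    ["v.intr", "S"],
]

counts = Counter(word_order(clause) for clause in CLAUSES)
for order, n in counts.most_common():
    print(order, n)
```

Using tuples as keys is what lets whole orders such as A-P-V be counted directly, which is the same trick as `word_orders[tuple(word_order)] += 1` on the slide.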

18 Filters and filter chains
    import poioapi.annotationgraph
    import poioapi.data

    ag = poioapi.annotationgraph.AnnotationGraph()
    ag.from_elan("elan-example3.eaf")
    ag.structure_type_handler = poioapi.data.DataStructureType(ag.tier_hierarchies[0])

    # Build a filter tier by tier and append it to the graph's filter chain.
    af = poioapi.annotationgraph.AnnotationGraphFilter(ag)
    af.set_filter_for_tier("words..W-Words", "follow")
    af.set_filter_for_tier("part_of_speech..W-POS", r"\bpro\b")
    ag.append_filter(af)

    print("Filtered root nodes:")
    print(ag.filtered_node_ids)

    # The same search terms, given as a dictionary.
    search_terms = {
        "words..W-Words": "follow",
        "part_of_speech..W-POS": r"\bpro\b"
    }
    af = ag.create_filter_for_dict(search_terms)
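What a per-tier regular-expression filter does can be shown on toy data. This sketch does not use Poio API: the `ROWS` table and the function `matching_roots` are hypothetical, and it assumes that a root node passes only when every tier pattern is found somewhere in that tier's text.

```python
import re

# Toy data (hypothetical): root node id -> text of each tier below it.
ROWS = {
    "n1": {"words..W-Words": "I follow", "part_of_speech..W-POS": "pro v"},
    "n2": {"words..W-Words": "dogs followed", "part_of_speech..W-POS": "n v"},
    "n3": {"words..W-Words": "you go", "part_of_speech..W-POS": "pro v"},
    "n4": {"words..W-Words": "we follow", "part_of_speech..W-POS": "pron v"},
}

# The search terms from the slide.
SEARCH_TERMS = {
    "words..W-Words": "follow",
    "part_of_speech..W-POS": r"\bpro\b",
}


def matching_roots(rows, search_terms):
    """Return ids of root nodes whose tiers match all patterns (assumed AND logic)."""
    result = []
    for root_id, tiers in rows.items():
        if all(re.search(pattern, tiers.get(tier, ""))
               for tier, pattern in search_terms.items()):
            result.append(root_id)
    return result


print(matching_roots(ROWS, SEARCH_TERMS))
```

The word boundaries in `\bpro\b` matter here: the tag "pron" contains "pro" but is not matched, whereas the bare pattern "follow" also matches "followed".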

19 Poio Analyzer
Developed for and with Prof. Johannes Helmbrecht, University of Regensburg
How to query the corpus in order to write a descriptive grammar?
Started with a list of requirements
Need to publish and archive queries and results

20 Poio Analyzer

21 Thank you for your attention! pbouda@cidles.eu

22 Links
Clarin curation project: linguistic-fieldwork-anthropology-language-typology/curation-project-1.html
Poio:
GrAF:

