Search in Token-annotated Corpora Search in Treebanks


1 LLOD Use Case: Syntactic Search. Jan Odijk, CLARIAH-CORE. LD4LR Workshop, Utrecht, 2017-02-06/07

2 Overview: Search in Token-annotated Corpora; Search in Treebanks

3 Overview: Search in Token-annotated Corpora; Search in Treebanks

4 Corpus Search: Token-Annotated
SONAR (535 million tokens, Dutch): POS-tagged, encoded in FoLiA.
Search: OpenSONAR, with 4 search interfaces of increasing complexity. Expert: CQP queries.
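As a rough illustration of what an expert-level CQP query expresses, the sketch below matches a sequence of per-token attribute constraints against POS-tagged tokens, in the spirit of a CQP pattern such as `[pos="ADJ"] [pos="N"]`. The token list, the simplified tag names, and the `find_matches` helper are assumptions made for illustration; this is not how OpenSONAR is implemented.

```python
import re

# Hypothetical POS-tagged tokens; the tag set is simplified for illustration.
TOKENS = [
    {"word": "de", "pos": "DET"},
    {"word": "oude", "pos": "ADJ"},
    {"word": "man", "pos": "N"},
    {"word": "leest", "pos": "V"},
    {"word": "een", "pos": "DET"},
    {"word": "boek", "pos": "N"},
]


def token_matches(token, constraint):
    """True if every attribute regex in the constraint fully matches the token."""
    return all(
        re.fullmatch(pattern, token.get(attr, "")) is not None
        for attr, pattern in constraint.items()
    )


def find_matches(tokens, pattern):
    """Return the start indices where the sequence of constraints matches."""
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(token_matches(t, c) for t, c in zip(window, pattern)):
            hits.append(start)
    return hits


# Roughly corresponds to the CQP pattern: [pos="ADJ"] [pos="N"]
ADJ_NOUN = [{"pos": "ADJ"}, {"pos": "N"}]
```

Calling `find_matches(TOKENS, ADJ_NOUN)` returns the positions where an adjective is directly followed by a noun. Each constraint is a regular expression over one token's description, which is the property of CQP that the later LOD question is concerned with.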

5 Other Corpora: Token-Annotated
Own corpora (AutoSearch); BNC (Lancaster); Contemporary Dutch Corpus.
Search: all use CQP.

6 LOD? Token-Annotated Will it bring advantages? If so, which ones?
Does it retain the power and simple notation of CQP? SPARQL queries? REs over token descriptions?
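To make the notation question concrete: in a linked-data encoding each token becomes a node described by triples, and a token-sequence query becomes a join over those triples. The sketch below imitates that with plain Python tuples rather than a real RDF store or SPARQL engine. The predicate names (`word`, `pos`, `next`) are hypothetical and not taken from any published vocabulary.

```python
# Hypothetical (subject, predicate, object) triples describing three tokens.
TRIPLES = [
    ("t1", "word", "oude"), ("t1", "pos", "ADJ"), ("t1", "next", "t2"),
    ("t2", "word", "man"), ("t2", "pos", "N"), ("t2", "next", "t3"),
    ("t3", "word", "leest"), ("t3", "pos", "V"),
]


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the triple list."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def adjective_noun_pairs(triples):
    """Join over three triple patterns: ?a pos ADJ . ?a next ?n . ?n pos N ."""
    pairs = []
    for s, p, o in triples:
        if p == "pos" and o == "ADJ":
            for nxt in objects(triples, s, "next"):
                if "N" in objects(triples, nxt, "pos"):
                    pairs.append((s, nxt))
    return pairs
```

The join needs three triple patterns and explicit adjacency links to express what a CQP pattern writes as two bracketed token descriptions, which gives a feel for the notational cost the slide asks about.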

7 Overview: Search in Token-annotated Corpora; Search in Treebanks

8 Treebanks
Treebank = text corpus in which each sentence has been assigned a syntactic structure.
I use: CGN, LASSY, CHILDES for Dutch; LINDAT/CLARIN for many different languages; Tündra for (mainly) German; INESS treebanks for multiple languages.
Query languages: CGN, LASSY, CHILDES: XPath/XQuery; LINDAT/CLARIN: PML-TQ; Tündra: Tiger; INESS: Tiger.
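A small sketch of what an XPath-style treebank query looks like, using Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath. The sentence below is a hand-written, simplified tree loosely modelled on nested `<node>` elements with `rel`, `cat`, `pos` and `word` attributes; it is an assumption for illustration and not an actual LASSY or CGN file.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written tree (illustrative only, not real treebank data).
SENTENCE = """
<alpino_ds>
  <node cat="top" rel="top">
    <node cat="smain" rel="--">
      <node cat="np" rel="su">
        <node pos="det" rel="det" word="de"/>
        <node pos="adj" rel="mod" word="oude"/>
        <node pos="noun" rel="hd" word="man"/>
      </node>
      <node pos="verb" rel="hd" word="leest"/>
      <node cat="np" rel="obj1">
        <node pos="det" rel="det" word="een"/>
        <node pos="noun" rel="hd" word="boek"/>
      </node>
    </node>
  </node>
</alpino_ds>
"""

root = ET.fromstring(SENTENCE)

# All subject NPs, selected with chained attribute predicates.
subject_nps = root.findall(".//node[@cat='np'][@rel='su']")

# Keep only the subject NPs that have an adjectival modifier as a child.
hits = [
    np for np in subject_nps
    if np.find("node[@rel='mod'][@pos='adj']") is not None
]

# Words of the head nouns of the matching NPs.
heads = [np.find("node[@rel='hd']").get("word") for np in hits]
```

Even this small query presupposes knowing which attribute carries the category, which carries the grammatical relation, and how heads and modifiers are nested.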

9 Treebanks
Dedicated search applications: GrETEL (example-based search & XPath); PaQU (dedicated search for dependencies & XPath).
Performance: OK for 65k sentence / 1 M token corpora. Too slow for 7 M sentence corpora (and getting slower every 18 months).

10 Treebanks
LOD: Could it be used to overcome the many different query languages in use?
Query language? Same potential, transparent notation?
Query language syntax: NO problem. Queries get very complex very quickly. Must know the structure of the syntactic structures in every fine detail.

11 Treebanks
LOD: Linking to other resources. Combined syntactic/morphological/semantic search.
Wordnet for checking semantic properties (mass/count, human/nonhuman). CELEX for morphological/phonological properties.
Performance?
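The combined search described here can be pictured as a join between treebank hits and a lexical resource. The sketch below filters hypothetical subject-head lemmas by semantic properties looked up in a small dictionary. Both the hits and the lexicon entries are invented stand-ins, not data from Wordnet or CELEX, and the property names are assumptions.

```python
# Hypothetical treebank hits: (sentence id, lemma of the subject head).
SUBJECT_HEADS = [("s1", "man"), ("s2", "boek"), ("s3", "water"), ("s4", "vrouw")]

# Hypothetical lexicon standing in for a linked lexical resource.
# Property names and values are invented for illustration.
LEXICON = {
    "man": {"human": True, "countability": "count"},
    "vrouw": {"human": True, "countability": "count"},
    "boek": {"human": False, "countability": "count"},
    "water": {"human": False, "countability": "mass"},
}


def filter_hits(hits, lexicon, **required):
    """Keep hits whose lemma has all the required lexical property values."""
    kept = []
    for sent_id, lemma in hits:
        entry = lexicon.get(lemma)
        if entry is not None and all(
            entry.get(key) == value for key, value in required.items()
        ):
            kept.append((sent_id, lemma))
    return kept
```

For example, `filter_hits(SUBJECT_HEADS, LEXICON, human=True)` keeps only sentences with a human subject head. With a real linked resource the dictionary lookup would be a link traversal per hit, which is where the performance question arises.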

12 Thanks for your attention

