Corpus and Computational Linguistic Methods and Tools beyond corpus linguistics in CLARIAH
Jan Odijk, Birmingham, 2017-07-24

Overview
Introduction
CLARIN-NL: NLP Tools, Dedicated Applications, 5 Example Cases
CLARIAH-CORE: Text -> Structured Data, CLARIAH-eScience Projects, Research Pilots
Conclusions

Introduction
CLARIN: European research infrastructure for researchers who work with language resources
DARIAH: European research infrastructure for researchers in the Arts and Humanities

Introduction
CLARIN-NL (2009-2015)
CLARIN-TCC Talk of Europe: 3 international creative camps around the European Parliament data, curated as Linked Open Data
CLARIAH: contributes to CLARIN + DARIAH
  CLARIAH-SEED
  CLARIAH-CORE: 3 core disciplines (linguistics, social economic history, media studies)
  CLARIAH-PLUS (submitted)

Introduction: independent but related projects
CKCC (Circulation of Knowledge and Learned Practices in the 17th-century Dutch Republic)
  Developed ePistolarium, a web-based Humanities’ Collaboratory on Correspondences
  Main funding from NWO; CLARIN-NL funded a small part
Nederlab
  Develops a web application for the longitudinal study of Dutch language, literature and culture
  Main funding from NWO; CLARIAH funds a small part

Introduction
Before 2009: use of corpora and of NLP tools was limited to computational and corpus linguists
CLARIN-NL and CLARIAH: make these accessible to and usable by other humanities researchers

NLP Tools: TTNWW project (together with Flanders)
Predefined workflows of NLP tools as web services
In an easy-to-use web application: upload data, push a button, download results
All based on the CLAM web service mediator
And on the FoLiA format for linguistically annotated corpora (de facto standard in NL)

NLP Tools: TTNWW predefined workflows
Orthographic normalisation (spelling & OCR correction) through TICCLops
Tokenisation, lemmatisation, POS tagging, named entity recognition (NER), limited multiword unit recognition, limited dependency relations (Frog)
Tokenisation, lemmatisation, POS tagging, limited multiword unit recognition, full syntactic parses, limited NER (Alpino)
Semantic role labelling
Co-reference assignment
Speech conversion and transcription
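
As an illustration of what downstream use of such token-level annotations can look like, here is a minimal sketch that reads words, POS tags and lemmas from a FoLiA-like document with Python's standard library. The sample fragment is hand-written and heavily simplified (it is not real Frog output), the tag values are illustrative, and `read_tokens` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# XML namespace used by FoLiA documents.
FOLIA_NS = {"f": "http://ilk.uvt.nl/folia"}

# Hand-written, simplified FoLiA-like fragment (an assumption for illustration,
# not the output of any TTNWW workflow).
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example">
  <text xml:id="example.text">
    <s xml:id="example.s.1">
      <w xml:id="example.s.1.w.1"><t>De</t><pos class="LID"/><lemma class="de"/></w>
      <w xml:id="example.s.1.w.2"><t>schepen</t><pos class="N"/><lemma class="schip"/></w>
      <w xml:id="example.s.1.w.3"><t>varen</t><pos class="WW"/><lemma class="varen"/></w>
    </s>
  </text>
</FoLiA>"""


def read_tokens(folia_xml):
    """Return (word, pos, lemma) triples for every <w> element in the document."""
    root = ET.fromstring(folia_xml)
    tokens = []
    for w in root.iterfind(".//f:w", FOLIA_NS):
        text = w.findtext("f:t", default="", namespaces=FOLIA_NS)
        pos = w.find("f:pos", FOLIA_NS)
        lemma = w.find("f:lemma", FOLIA_NS)
        tokens.append((
            text,
            pos.get("class") if pos is not None else None,
            lemma.get("class") if lemma is not None else None,
        ))
    return tokens


for word, pos, lemma in read_tokens(SAMPLE):
    print(word, pos, lemma)
```

For real corpora a dedicated FoLiA library would be the more robust choice; the sketch only shows that the annotations are plain, queryable XML.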

NLP Tools: other NLP tools
Adelheid: POS tagger for 13th-century Dutch
INPOLDER: (experimental) parser for 13th-century Dutch
FrogGen: user-trainable Frog (language-independent)
  Tested on Classical Greek
  Soon to be tested on 17th-century Dutch
word2vec and GloVe semantic spaces based on the Dutch SoNaR corpus

Dedicated Applications
Dedicated (search) applications, or (meta)data standardised:
Literary studies: COBWWWEB, BNM-I, Arthurian Fiction, C-DSD Song Database (Liederenbank), EMIT-X
History: INTER-VIEWS / IPNV, Oral History (-> CLARIAH Media Suite), Verrijkt Koninkrijk (VK), Dutch Ships and Sailors (DSS), CKCC
Political Science: War in Parliament (WIP), CLARIN-TCC Talk of Europe
Media Studies: Polimedia, AVResearcherXL & TrOve (-> Media Suite), NISV Academia collection
Religion Studies: PILNAR

Case 1: Linguists. Parse and Query (PaQu) and GrETEL
Access to the LASSY and Spoken Dutch Corpus treebanks
Upload one’s own (Dutch) corpus, have it parsed and made searchable
Search:
  Via a dedicated interface for grammatical dependencies (PaQu)
  Via an example-based interface (GrETEL)
  Via XPath queries (both)
Extensive analysis options on data and metadata
Support for multiple formats (FoLiA, TEI, plain text, CHAT, …)
Led to a lot of research: Augustinus Van Eynde et al. (2016), … Bloem Odijk 2015, Van Noort & Odijk 2016, Odijk 2017, Odijk et al. 2017
Lectures in Utrecht -> incorporated in the regular linguistics curriculum
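
To give an idea of what an XPath query over a dependency treebank looks like, the following sketch runs two simple queries on a tiny hand-written tree. The tree is an assumption modelled loosely on the Alpino XML layout (nested `node` elements with `rel`, `cat`, `pos` and `word` attributes); `find_words` is a hypothetical helper, and Python's `xml.etree.ElementTree` supports only a subset of XPath, so this is not how PaQu or GrETEL evaluate queries.

```python
import xml.etree.ElementTree as ET

# Hand-written toy parse of "Jan leest een boek" (an illustrative assumption,
# not taken from LASSY or the Spoken Dutch Corpus).
TREE = """<alpino_ds>
  <node cat="top" rel="top">
    <node cat="smain" rel="--">
      <node rel="su" pos="noun" word="Jan" lemma="Jan"/>
      <node rel="hd" pos="verb" word="leest" lemma="lezen"/>
      <node cat="np" rel="obj1">
        <node rel="det" pos="det" word="een" lemma="een"/>
        <node rel="hd" pos="noun" word="boek" lemma="boek"/>
      </node>
    </node>
  </node>
</alpino_ds>"""


def find_words(tree_xml, xpath):
    """Return the word forms of all leaf nodes matched by the XPath expression."""
    root = ET.fromstring(tree_xml)
    return [n.get("word") for n in root.iterfind(xpath) if n.get("word")]


# All subjects in the tree.
subjects = find_words(TREE, ".//node[@rel='su']")
# Heads of direct objects: a node with rel 'hd' directly under a node with rel 'obj1'.
object_heads = find_words(TREE, ".//node[@rel='obj1']/node[@rel='hd']")

print(subjects, object_heads)
```

The same pattern (a path of relation constraints) is what makes dependency treebanks searchable for constructions rather than for word strings.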

Case 1: Linguists. OpenSoNaR & AutoSearch
Access to:
  The token-annotated SoNaR corpus of written Dutch (540m)
  One’s own token-annotated corpus (AutoSearch)
Exploration interface
Multiple search interfaces of varying complexity
Upgrade + access to the full Spoken Dutch Corpus in OpenSoNaR+ (to be released in autumn 2017)

Case 2: Philosophers
@PHILOSTEI: philosopher & computational linguist
OCR correction and conversion to TEI for (non-Dutch) philosophical works
Based on TICCLops, extended and made language-independent
Basis for the VICI project by Arianna Betti (UvA): Ideas at scale – Towards a computational history of ideas (e-Ideas)
Imagine if there was a ‘Google Concepts’, a tool that allowed you to trace how ideas such as tolerance, evolution, or science have changed throughout history in all digital texts available to you. In this project, Betti will establish the proper methodological foundations to make such a tool possible in the future.

Case 3: Historians
WAHSP / BILAND: CLARIN-NL text mining applications, replaced by Texcavator
Basis for the NWO Horizon Translantis project by the same research team, which uses digital humanities tools to analyze how the United States has served as a cultural model for the Netherlands in the long twentieth century
And for the ShiCo project (with the NL eScience Centre): mining shifting concepts through time
Texcavator allows you to use full-text search on the newspaper archive of the Dutch Royal Library within the date range. On top of that, it offers visualizations like word clouds, time lines and heat maps. It also provides services to enhance your search experience, like filtering, stop word removal, normalization and stemming.
ShiCo: The scientific goal of this project is to develop a tool that enables humanities researchers to mine the historical development of concepts and the vocabulary with which they are expressed in big textual data repositories. Recent research suggests that vector representations derived by neural network language models offer new possibilities for obtaining high quality semantic representations from huge data sets.
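
The idea of tracing a concept through vector representations can be sketched as follows: train one semantic space per time period, then compare the nearest neighbours of a term across periods. This is a minimal sketch under strong assumptions, not ShiCo's implementation: the three-dimensional vectors and the vocabulary are invented toy data, and `cosine` and `nearest` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(term, space, k=2):
    """Return the k words whose vectors are closest to the vector of `term`."""
    target = space[term]
    scored = [(cosine(target, vec), word) for word, vec in space.items() if word != term]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]


# Invented toy spaces for two periods; real spaces would come from models
# trained on period-specific newspaper text.
SPACE_EARLY = {
    "efficiency": [1.0, 0.1, 0.0],
    "factory": [0.9, 0.2, 0.0],
    "machine": [0.8, 0.0, 0.1],
    "household": [0.0, 1.0, 0.1],
    "kitchen": [0.1, 0.9, 0.0],
}
SPACE_LATE = dict(SPACE_EARLY, efficiency=[0.5, 0.9, 0.0])

early = nearest("efficiency", SPACE_EARLY)
late = nearest("efficiency", SPACE_LATE)
print("early neighbours:", early)
print("late neighbours:", late)
print("new associations:", sorted(set(late) - set(early)))
```

A change in the neighbour set between periods is only a signal to inspect; interpreting it as conceptual change still requires reading the underlying texts.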

Case 4: Literary Scholars
NameScape
Search and visualise named entities in modern Dutch novels
NE recognition in one’s own corpora
Through a web application with a dedicated interface for literary scholars

Case 5: Linguists + Literary Scholars
Language Dynamics of the Dutch Golden Age
Language innovations partly driven by migration, literary innovations and standardisation processes
Variation within authors and genres
Closely collaborates with Nederlab
Uses CLARIN standards and tools: FoLiA, FrogGen, …, AutoSearch

Other Examples
More linguistic search / analysis applications:
  MIMORE: search / analysis in multiple dialectal databases / corpora
  FESLI: search in enriched Specific Language Impairment (SLI) corpora
  COAVA: combined search in dialect lexicons and CHILDES corpora
  Stylene: system for stylometry and readability research
Religion Studies:
  SHEBANQ: a web application to perform linguistic queries on the WIVU Hebrew Text Database

CLARIAH-CORE
Core disciplines: linguistics, social economic history, media studies
Cross-discipline information extraction from text (text -> structured data)
Research Pilots
Projects with eScience Centre

Text  Structured Data If buildings could talk Distilling careers
we explore linking the enriched buildings dataset to information extracted from newspapers, aiming to build towards a rich and varied source on the history of buildings. Distilling careers augmenting biographies with occupational information based on HISCO by an occupation tagger Experiments in fine-grained entity typing for Dutch Fine-grained tagger (59 / 269 NE types) Can identify mentions in text and link them to concepts in a resource Currently set up to detect mentions of occupations from HISCO
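
Identifying mentions in text and linking them to concepts in a resource can be illustrated with a simple dictionary lookup. This is a deliberately naive sketch, not the project's tagger: the gazetteer entries and the concept identifiers are invented placeholders (they are not HISCO codes), and `tag_occupations` is a hypothetical function name.

```python
import re

# Hypothetical gazetteer: lower-cased surface form -> invented concept identifier.
# A real setup would map to entries of the target resource instead.
GAZETTEER = {
    "bakker": "occupation:baker",
    "timmerman": "occupation:carpenter",
    "schoolmeester": "occupation:schoolteacher",
}


def tag_occupations(text, gazetteer):
    """Return (start, end, surface form, concept id) for every gazetteer match."""
    mentions = []
    for match in re.finditer(r"\w+", text):
        concept = gazetteer.get(match.group().lower())
        if concept:
            mentions.append((match.start(), match.end(), match.group(), concept))
    return mentions


BIO = "Hij werkte eerst als Bakker en later als schoolmeester in Leiden."

for start, end, surface, concept in tag_occupations(BIO, GAZETTEER):
    print(start, end, surface, concept)
```

Note that this lookup cannot tell the occupation "Bakker" from the surname "Bakker"; resolving that kind of ambiguity is exactly where a trained fine-grained tagger is needed.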

CLARIAH-eScience: ADAH Call (Accelerating Discovery in the Arts and Humanities)
Bridging the gap: Digital Humanities and the Arabic-Islamic corpus
Seeks to develop a web-based application that will:
  enable easy access to existing Arabic corpora on online repositories and offer the opportunity for researchers to upload their own corpus
  offer a set of tools for Arabic text mining and computational analysis
  provide opportunities to link search results to other datasets in Islamic and Middle Eastern Studies
of Brill Publishers, Europe’s leading publisher in this area

CLARIAH-eScience: TICCLAT (Text-Induced Corpus Correction and Lexical Assessment Tool)
Builds on TICCL
Extends TICCL's correction capabilities with classification facilities based on Nederlab corpus data: word statistics, document and time references, and linguistic annotations, i.e. part-of-speech and named-entity labels

CLARIAH-eScience: EViDENse (Ego Documents Events modelliNg): how individuals recall mass violence
New ways of analysing and contextualising historical sources by applying state-of-the-art entity and event modelling and semantic web technologies
Tested in two case studies:
  a synchronic analysis of WW2 events, centered around the oral history collection ‘Getuigenverhalen’ [1] and using the WW2 thesaurus [2]
  a diachronic analysis of ego-documents ( ) from Nederlab [3]
In both cases, we use content-related contextual sources from Nederlab [4].

CLARIAH-eScience: NewsGac (News Genres: Advancing Media History by Transparent Automatic Genre Classification)
Automatic genre detection in newspapers and television news using machine learning
Revises our current understanding of the interrelated development of genre conventions in print and television journalism
Metrics and guidelines for evaluating the bias and error of the different preprocessing and machine learning approaches and off-the-shelf software packages
A dashboard that integrates, compares and visualises different algorithms and underlying machine learning approaches, which can be integrated in the CLARIAH Media Suite
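
As a rough illustration of automatic genre classification, the sketch below trains a multinomial Naive Bayes classifier with add-one smoothing on a handful of labelled snippets. Naive Bayes is used here only as a simple, transparent stand-in; the source does not say which algorithms NewsGac uses. The training snippets and the two genre labels are invented toy data.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case whitespace tokenisation; real preprocessing would be richer."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter(labels)      # label -> number of documents
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = tokenize(doc)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, doc):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokenize(doc):
                if tok in self.vocab:  # words never seen in training are ignored
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy training data.
DOCS = [
    "minister announces new budget plan today",
    "police report fire in city centre today",
    "we believe the minister should resign",
    "in our opinion the budget plan fails",
]
LABELS = ["news", "news", "opinion", "opinion"]

model = NaiveBayes().fit(DOCS, LABELS)
print(model.predict("fire in city centre"))
print(model.predict("we believe the plan fails"))
```

Because every word's contribution to the score can be inspected, a model of this kind is easy to audit, which fits the project's emphasis on transparency and on evaluating bias and error.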

Research Pilots
DB-CCC: Diamonds in Borneo: Commodities as Concepts in Context
  Detect the diamond mining, manufacturing and trading places and people in Borneo, based on a selection of texts from Delpher, using entity recognition, classification & linking and Ontotagger
HHUCAP: The History of Human Capital
  Robust semantic parsing and Linked Data conversion tools to automatically derive career patterns from 35,000 biographies in the Biography Portal in the period

Research Pilots
LinkSyr: Linking Syriac Data
  How do the Biblical heritage and Hellenistic culture interact in the oldest documents of Syriac Christianity?
  Compare the Hebrew Bible and its ancient Syriac translation (the Peshitta) with the Syriac Book of the Laws of the Countries (ca. 200 AD) using linguistic data processing, especially topic modelling
Contribute to our understanding of processes of remediation between television and print journalism, which are often hypothesized or taken for granted, but rarely studied empirically
Test the functionalities of AVResearcherXL to systematically compare the development of newspaper and television content
Develop new functionalities for AVResearcherXL that will allow for better search and visualization, and will enhance the feasibility of a systematic comparative analysis of newspaper and television coverage

Research Pilots
SERPENS: Contextual search and analysis of pest and nuisance species through time in the KB newspaper collection
SERPENS aims to study the historical impact of pest and nuisance species on human practices and changes in the public perception of these animals. The KB newspaper collection will be the primary source of information for this study.
Problems: spelling variations, vernacular vs. Latin names, ambiguity
To remedy this, the WP2-3 diachronic lexicons will be used for query expansion, in combination with topic modelling to filter out irrelevant results.
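
Query expansion with a lexicon can be sketched in a few lines: each search term is replaced by the term plus its known variants, joined into one disjunctive query. Everything specific here is an assumption for illustration: the lexicon is a toy stand-in for the WP2-3 diachronic lexicons, the historical spellings in it are invented, and the quoted `OR` syntax is a generic boolean notation rather than the KB search interface's actual query language.

```python
# Toy lexicon: modern term -> variants (invented historical spellings plus the
# Latin species name). Not taken from the project's lexicons.
LEXICON = {
    "rat": ["ratte", "Rattus norvegicus"],
    "mol": ["molle", "Talpa europaea"],
}


def expand_query(term, lexicon):
    """Return the term followed by its lexicon variants, without duplicates."""
    return [term] + [v for v in lexicon.get(term, []) if v != term]


def to_or_query(variants):
    """Join variants into a quoted OR query (generic, hypothetical syntax)."""
    return " OR ".join(f'"{v}"' for v in variants)


print(to_or_query(expand_query("rat", LEXICON)))
print(to_or_query(expand_query("wesp", LEXICON)))  # no entry: term is kept as is
```

Expansion raises recall but also pulls in ambiguous hits, which is why the project pairs it with topic modelling to filter out irrelevant results.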

Conclusions
CLARIN-NL & CLARIAH projects enabled and stimulated the use of corpus and computational linguistic methods and tools in other humanities disciplines
Many projects successfully finished
Many still ongoing or about to start

More information
http://portal.clarin.nl
http://www.clariah.nl
Odijk & Van Hessen (eds.), to appear. CLARIN in the Low Countries. London: Ubiquity Press. (Open Access)
Spyns & Odijk (eds.). Essential Speech and Language Technology for Dutch. Berlin: Springer. Open Access. DOI: /

Thanks for your attention
