Darja Fišer DARIAH UZH Zurich, 18 December 2017


1 Darja Fišer DARIAH Day @ UZH Zurich, 18 December 2017
Veni, vidi, CLARIN!
Darja Fišer
DARIAH Day @ UZH, Zurich, 18 December 2017
CC-BY 4.0

2 Overview
- Intro to CLARIN
- CLARIN data architecture
- CLARIN for data science

3 Intro to CLARIN

4 CLARIN in seven bullets
- CLARIN is the Common Language Resources and Technology Infrastructure
- ESFRI ERIC status since 2012, Landmark since 2016
- that provides easy and sustainable access for scholars in the humanities and social sciences and beyond
- to digital language data (in written, spoken, video or multimodal form)
- and advanced tools to discover, explore, exploit, annotate, analyse or combine them, wherever they are located
- through a single sign-on environment
- and that serves as an ecosystem for knowledge sharing.

5 CLARIN ERIC in members and centres
A consortium of:
- 19 members: AT, BG, CZ, DE, DK, DLU, EE, FI, GR, HU, IT, LT, LV, NL, NO, PL, PT, SE, SI
- 2 observers: FR, UK
- >40 centres

6 What CLARIN Centres offer
- Repository
  - library of linguistic data and tools
  - search for data and tools and easily use them online or download them
  - deposit your data and be sure it is safely stored, everyone can find it, and correctly cite it
- Federated single sign-on
  - log in once with your existing institutional credentials
  - get access to protected resources
- Metadata
  - describe content, provenance and formats of linguistic data and tools
  - facilitate preservation and dissemination of linguistic data and tools
- Persistent Identifier (PID or handle)
  - a special permanent URL that provides a permanent link to linguistic data and tools
  - will resolve correctly even if in some distant future the data is moved
  - should be used as URL in citations
- Licensing
  - Public
  - Academic
  - Restricted
- Preservation (Data Seal of Approval)
  - committed to long-term care of items in the repository
  - ensure the archived data can be found, understood and used in the future
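The Persistent Identifier item above says a handle should be used as the URL in citations. A minimal Python sketch of that step follows, assuming the handle is resolved through the public Handle System proxy at hdl.handle.net. The function names `handle_to_url` and `cite`, the citation layout, and the handle `00000/example-corpus` are hypothetical and serve only as illustration; they are not taken from the slides or from any CLARIN tool.

```python
from urllib.parse import quote

# Public proxy of the Handle System; assumed here as the resolver.
HANDLE_PROXY = "https://hdl.handle.net/"


def handle_to_url(handle: str) -> str:
    """Turn a bare handle ('prefix/suffix') into a resolvable URL."""
    handle = handle.strip()
    # Accept the common 'hdl:' notation as well.
    if handle.lower().startswith("hdl:"):
        handle = handle[4:]
    prefix, sep, suffix = handle.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a handle: {handle!r}")
    # Percent-encode anything that is not safe inside a URL path.
    return HANDLE_PROXY + quote(prefix, safe=".") + "/" + quote(suffix, safe="/-._~@")


def cite(author: str, year: int, title: str, handle: str) -> str:
    """Build a simple citation string that ends in the PID URL (hypothetical layout)."""
    return f"{author} ({year}). {title}. {handle_to_url(handle)}"


# Hypothetical handle, for illustration only.
example = cite("A. Author", 2017, "Example corpus", "hdl:00000/example-corpus")
```

Because the citation stores the handle rather than the repository's current web address, it keeps working if the data is later moved, which is the point the slide makes.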

7 CLARIN data types and user communities
Data types:
- Newspaper archives
- Literary texts
- Parliamentary records
- Historical letters
- Broadcast archives
- Oral History data
- Social Media data

User communities:
- Digital humanities
- Linguistics and Philology
- Translation and Lexicography
- Literary Studies
- History
- Political and Social Sciences
- Media Studies
- Culture, Folklore, Anthropology
- Speech therapy
- Teachers
- General Public

8 CLARIN data architecture

9 Repositories * slides by Dieter Van Uytvanck

10 Harvesting

11 Processing

12 Content search

13 Workflows

14 CLARIN for data science

15 CLARIN and data science (1)
- Text and speech as social and cultural data
- Contribution to the development of new methodological frameworks for the integrated processing of multiple datatypes, and multidisciplinary research agendas
- Europe’s multilinguality as a basis for comparative research of societal and cultural phenomena that are reflected in language use:
  - Migration patterns
  - Intellectual history
  - Language variation across period and region
  - Dynamics in mental health conditions
  - Parliamentary discourse

16 Parliamentary records
- great potential for reuse and re-purposing within many fields of study in the humanities and social sciences (and beyond):
  - suited for both close reading and distant reading
  - Humanities: history, language change, discourse analysis ...
  - Social sciences: social and cultural dynamics, political sciences, economics ...
- considered a rich data type
  - apart from linguistic content, rich in metadata (speaker, party affiliation, age, sex, education, origin, duration of speech)
  - apart from linguistic content, rich in extralinguistic clues (interruptions, voting results)
- made easily available under the Freedom of Information acts in over 100 countries all around the world to enable informed participation by the public and improve effective functioning of democratic systems
- but also
  - often presenting itself as messy or noisy data
  - calling for links with data in other modalities than text and speech
  - created under specific circumstances that need to be well understood before strong conclusions can be drawn
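To make the "rich in metadata" point above concrete, here is a minimal Python sketch of how one speech and a few of the listed metadata fields (speaker, party affiliation, duration of speech, interruptions) could be held in a record and aggregated. The `Speech` class, its field names, the helper `speaking_time_by_party`, and the sample values are all invented for illustration; they do not reflect the schema of any actual CLARIN corpus.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Speech:
    """One parliamentary speech with a few metadata fields (hypothetical schema)."""
    speaker: str
    party: str
    duration_s: int          # duration of the speech in seconds
    text: str
    interruptions: int = 0   # extralinguistic clue: number of interruptions


def speaking_time_by_party(speeches):
    """Sum speech duration per party, a simple distant-reading style aggregate."""
    totals = defaultdict(int)
    for speech in speeches:
        totals[speech.party] += speech.duration_s
    return dict(totals)


# Invented sample data, for illustration only.
sample = [
    Speech("Speaker 1", "Party A", 120, "..."),
    Speech("Speaker 2", "Party B", 90, "...", interruptions=2),
    Speech("Speaker 3", "Party A", 180, "..."),
]
totals = speaking_time_by_party(sample)
```

The same records could feed close reading (the `text` field) and aggregate analysis (the metadata fields), which is the dual use the slide describes.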

17 Corpora of parliamentary records
- Coverage: corpora exist for 18 countries
- Size (in tokens)
  - largest: UK (1.6 billion)
  - smallest: Portuguese (1 million)
- Periods covered by the corpora
  - mostly 2nd half of 20th century and 21st century
  - Dutch and British corpora from early 19th century
- Availability
  - For download (7): at, cz [CPM], dk, de [sample only], no [ToN], pt, lv
  - For on-line searching (7)
    - Finnish (KORP)
    - CzechParl (SketchEngine)
    - Latvian (noSketchEngine)
    - Bulgarian (CLaRK)
    - Hungarian (HNC, registration required)
    - Proceedings of Norwegian Parliamentary Debates (Corpuscle)
  - Both for download and on-line searching (5)
    - Dutch (Political Mashup)
    - Estonian (Keeleveeb)
    - Swedish (KORP)
    - Slovenian (noSketchEngine)
    - Polish (NKJP)
- Full overview available here

18 CLARIN’s Parliamentary data for many disciplines
Perspective of curators and researchers:
- Historical perspective: the specifics of the diachronic perspective; time dynamics per topic, etc.
- Political science perspective: political activity of parties and politicians; the role of the various public political bodies; policy comparison; language differences as indicators of differing political views, etc.
- Sociological perspective: conflicts in parliament; attitudes of politicians to critical issues; trending topics; patterns of language use reflecting societal dynamics; models of parliamentary communication, control, commissions, etc.
- Psychological and language perspective: language portraits of politicians; semantic differences of political terms; gestures; behavior in parliament, etc.

Developers' perspective:
- Design of parliamentary speech corpora: annotations, visualization, etc.
- Text analytics, semantic processing and linking of parliamentary data
- Searches and information extraction from parliamentary corpora
- Multilinguality issues in parliamentary data

19

20 ParlaCLARIN @ LREC 2018
- Background: Need for better harmonization, interoperability and comparability of the resources and tools relevant for the study of parliamentary discussions and decisions, not only in Europe but worldwide
- Aim: Bring together researchers interested in compiling, annotating, structuring, linking and visualising parliamentary records that are suitable for research in a wide range of disciplines in the Humanities and Social Sciences
- Paper submission deadline: 10 January 2018
- More info

21 Veni, vidi, CLARIN! darja.fiser@ff.uni-lj.si

