Veni, vidi, CLARIN!
Darja Fišer
DARIAH Day @ UZH, Zurich, 18 December 2017
CC-BY 4.0
Overview
  Intro to CLARIN
  CLARIN data architecture
  CLARIN for data science
Intro to CLARIN
CLARIN in seven bullets
  CLARIN is the Common Language Resources and Technology Infrastructure
  ESFRI ERIC status since 2012, Landmark since 2016
  that provides easy and sustainable access for scholars in the humanities and social sciences and beyond
  to digital language data (in written, spoken, video or multimodal form)
  and advanced tools to discover, explore, exploit, annotate, analyse or combine them, wherever they are located
  through a single sign-on environment
  and that serves as an ecosystem for knowledge sharing.
CLARIN ERIC in members and centres
  A consortium of:
    19 members: AT, BG, CZ, DE, DK, DLU, EE, FI, GR, HU, IT, LT, LV, NL, NO, PL, PT, SE, SI
    2 observers: FR, UK
    >40 centres
What CLARIN Centres offer
  Repository
    library of linguistic data and tools
    search for data and tools, then use them online or download them
    deposit your data and be sure that it is safely stored and that everyone can find it and cite it correctly
  Federated single sign-on
    log in once with your existing institutional credentials
    get access to protected resources
  Metadata
    describe the content, provenance and formats of linguistic data and tools
    facilitate preservation and dissemination of linguistic data and tools
  Persistent Identifier (PID or handle)
    a special URL that provides a permanent link to linguistic data and tools
    will resolve correctly even if the data is moved at some point in the future
    should be used as the URL in citations
  Licensing
    Public
    Academic
    Restricted
  Preservation (Data Seal of Approval)
    committed to long-term care of items in the repository
    ensure the archived data can be found, understood and used in the future
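The advice to cite data through its persistent identifier can be illustrated with a short sketch. It assumes the PID is a handle of the form prefix/suffix and that it is resolved through the public Handle proxy at hdl.handle.net. The function name and the handle value are hypothetical and do not refer to a real CLARIN resource.

```python
# Minimal sketch: turn a handle into a citable URL.
# Assumption: the PID is a Handle System identifier ("prefix/suffix")
# resolved through the public proxy at https://hdl.handle.net/.
# `handle_to_url` and the handle used below are hypothetical examples.

HANDLE_PROXY = "https://hdl.handle.net/"


def handle_to_url(handle: str) -> str:
    """Return the proxy URL for a handle given as 'hdl:prefix/suffix',
    'prefix/suffix', or an already complete proxy URL."""
    handle = handle.strip()
    # Drop a leading scheme or proxy address if one is present.
    for lead in ("hdl:", "http://hdl.handle.net/", "https://hdl.handle.net/"):
        if handle.lower().startswith(lead):
            handle = handle[len(lead):]
            break
    prefix, sep, suffix = handle.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError("expected a handle of the form 'prefix/suffix'")
    return HANDLE_PROXY + prefix + "/" + suffix


if __name__ == "__main__":
    # Hypothetical handle, for illustration only.
    print(handle_to_url("hdl:12345/example-corpus"))
```

Citing the proxy URL rather than the repository's current web address is what keeps the citation valid if the data is later moved.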
CLARIN data types and user communities
  Data types
    Newspaper archives
    Literary texts
    Parliamentary records
    Historical letters
    Broadcast archives
    Oral History data
    Social Media data
    …
  User communities
    Digital humanities
    Linguistics and Philology
    Translation and Lexicography
    Literary Studies
    History
    Political and Social Sciences
    Media Studies
    Culture, Folklore, Anthropology
    Speech therapy
    Teachers
    General Public
CLARIN data architecture
Repositories * slides by Dieter Van Uytvanck
Harvesting
Processing
Content search
Workflows
CLARIN for data science
CLARIN and data science (1)
  Text and speech as social and cultural data
  Contribution to the development of new methodological frameworks for the integrated processing of multiple data types, and of multidisciplinary research agendas
  Europe’s multilinguality as a basis for comparative research on societal and cultural phenomena that are reflected in language use:
    Migration patterns
    Intellectual history
    Language variation across period and region
    Dynamics in mental health conditions
    Parliamentary discourse
Parliamentary records
  great potential for reuse and re-purposing within many fields of study in the humanities and social sciences (and beyond):
    suited for both close reading and distant reading
    Humanities: history, language change, discourse analysis …
    Social sciences: social and cultural dynamics, political sciences, economics ...
  considered a rich data type
    apart from linguistic content, rich in metadata (speaker, party affiliation, age, sex, education, origin, duration of speech)
    apart from linguistic content, rich in extralinguistic clues (interruptions, voting results)
  made easily available under Freedom of Information acts in over 100 countries around the world
    to enable informed participation by the public and improve the effective functioning of democratic systems
  but also
    often presenting itself as messy or noisy data
    calling for links with data in modalities other than text and speech
    created under specific circumstances that need to be well understood before strong conclusions can be drawn
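As a rough illustration of why this metadata matters for reuse, the sketch below models a single speech with a few of the fields named on the slide (speaker, party affiliation, duration of speech, interruptions) and sums speaking time per party. The class, the field names and the sample values are all hypothetical; they are not taken from any actual CLARIN corpus schema.

```python
# Minimal sketch: a parliamentary speech as a record with metadata.
# Assumption: field names and sample values are invented for illustration
# and do not reflect the schema of any particular corpus.
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Speech:
    speaker: str
    party: str
    duration_seconds: float
    text: str
    # Extralinguistic clues such as interruptions, kept alongside the text.
    interruptions: List[str] = field(default_factory=list)


def speaking_time_by_party(speeches: Iterable[Speech]) -> Dict[str, float]:
    """Sum the duration of all speeches for each party."""
    totals: Dict[str, float] = defaultdict(float)
    for speech in speeches:
        totals[speech.party] += speech.duration_seconds
    return dict(totals)


if __name__ == "__main__":
    sample = [
        Speech("Speaker A", "Party X", 120.0, "First speech."),
        Speech("Speaker B", "Party Y", 90.0, "A reply.", ["applause"]),
        Speech("Speaker A", "Party X", 30.0, "A short follow-up."),
    ]
    print(speaking_time_by_party(sample))
```

The same pattern extends to the other metadata on the slide, for example grouping by age, sex or origin, provided the corpus actually records those fields.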
Corpora of parliamentary records
  Coverage
    corpora exist for 18 countries
  Size (in tokens)
    largest: UK (1.6 billion)
    smallest: Portuguese (1 million)
  Periods covered by the corpora
    mostly 2nd half of the 20th century and the 21st century; Dutch and British corpora from the early 19th century
  Availability
    For download (7)
      AT, CZ [CPM], DK, DE [sample only], NO [ToN], PT, LV
    For on-line searching (7)
      Finnish (KORP)
      CzechParl (SketchEngine)
      Latvian (noSketchEngine)
      Bulgarian (CLaRK)
      Hungarian (HNC, registration required)
      Proceedings of Norwegian Parliamentary Debates (Corpuscle)
    Both for download and on-line searching (5)
      Dutch (Political Mashup)
      Estonian (Keeleveeb)
      Swedish (KORP)
      Slovenian (noSketchEngine)
      Polish (NKJP)
  Full overview available here
CLARIN’s Parliamentary data for many disciplines
  Perspective of curators and researchers:
    Historical perspective: the specifics of the diachronic perspective; time dynamics per topic, etc.
    Political science perspective: political activity of parties and politicians; the role of the various public political bodies; policy comparison; language differences as indicators of differing political views, etc.
    Sociological perspective: conflicts in parliament; attitudes of politicians to critical issues; trending topics; patterns of language use reflecting societal dynamics; models of parliamentary communication, control, commissions, etc.
    Psychological and language perspective: language portraits of politicians; semantic differences of political terms; gestures; behavior in parliament, etc.
  Developers' perspective:
    Design of parliamentary speech corpora: annotations, visualization, etc.
    Text analytics, semantic processing and linking of parliamentary data
    Searches and information extraction from parliamentary corpora
    Multilinguality issues in parliamentary data
ParlaCLARIN @ LREC 2018
  Background
    Need for better harmonization, interoperability and comparability of the resources and tools relevant for the study of parliamentary discussions and decisions, not only in Europe but worldwide
  Aim
    Bring together researchers interested in compiling, annotating, structuring, linking and visualising parliamentary records that are suitable for research in a wide range of disciplines in the Humanities and Social Sciences
  Paper submission deadline
    10 January 2018
  More info
    https://www.clarin.eu/ParlaCLARIN
Veni, vidi, CLARIN! darja.fiser@ff.uni-lj.si