1
WebLicht Application and “Workspaces”
Erhard Hinrichs & Thomas Zastrow, University of Tübingen
Munich, September 2010
www.d-spin.org
2
Outline
- Web-based Linguistic Chaining Tool (WebLicht) for incremental filtering of and access to language corpus data
- WebLicht – Motivation
- WebLicht – Architecture
- WebLicht – Future Requirements
- Test Case – Gutenberg Corpus
3
CLARIN Mission
- CLARIN (Common Language Resource and Technology Infrastructure Network) is committed to establishing an integrated and interoperable research infrastructure (RI) supporting easy access to and use of language resources and technology
- It aims to overcome the current fragmentation and offer a stable, persistent and extendable infrastructure
- It will offer its services to researchers and scholars across a wide spectrum of domains, in particular in the humanities and social sciences
- ESFRI roadmap project; implementation phase starts in 2011
4
Typical CLARIN user scenario
- Scenario: A PhD student investigates regional differences in vocabulary and in word collocations in different variants of German.
- Data: large text corpora available at BBAW in Berlin, at the Austrian Academy of Sciences in Vienna, at the Swiss Text Corpus Project in Basel, and at EURAC in Bolzano.
- Tools for targeted data access: WebLicht offers customizable chains of web services for filtering and analyzing the data.
5
WebLicht - Motivation
- Many linguistic resources (corpora, dictionaries, …) and tools (tokenizers, taggers, parsers, …) are available
- Most of them are implemented to run on local machines, which can be inconvenient and error-prone
- Requirement: go beyond “do-it-yourself” and “download-first” strategies
- The CLARIN solution: make tools and resources available as web services
6
WebLicht - Architecture
- WebLicht is a service-oriented architecture (SOA) for accessing and processing text corpora
- Development started in October 2008
- WebLicht consists of the following components:
  - Distributed services: offer functionality (resources and tools) over the (inter-)net; implemented as web services (ca. 90 at the moment)
  - Repository: stores metadata and technical information about the services
  - Web 2.0 based user interface: interacts with the user and combines services and information from the repository
- Access is still possible via scripts or programming code
7
WebLicht - Architecture
[Architecture diagram. Labels: Web 2.0 Application for Tool Chaining and Execution; Repository; Standard-conformant Text Corpus Encoding. Sites shown: Stuttgart, Tübingen, Berlin, Leipzig, Finland, Romania, Iceland, UK.]
8
WebLicht – Architecture
- Services are implemented as REST-style web services
- The HTTP POST method is used to send data from the UI to the services
- Any client that can speak HTTP can be used:
  - Browsers
  - Command-line tools (wget, curl)
  - Programming languages
- Anyone can implement their own interface to WebLicht
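As an illustration of the "any HTTP client" point, here is a minimal sketch of a script-based client using only Python's standard library. The endpoint URL and the plain-text content type are assumptions made for the example; the slides do not give concrete service addresses or payload formats.

```python
import urllib.request

# Hypothetical endpoint: the slides do not list concrete service URLs.
TOKENIZER_URL = "https://example.org/weblicht/tokenizer"


def build_request(url: str, text: str,
                  content_type: str = "text/plain; charset=utf-8") -> urllib.request.Request:
    """Prepare an HTTP POST request that carries the text in its body."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": content_type},  # assumed payload type
        method="POST",
    )


def call_service(url: str, text: str) -> str:
    """Send the text to one service and return the response body."""
    with urllib.request.urlopen(build_request(url, text)) as response:
        return response.read().decode("utf-8")
```

A chain could then be run by feeding the output of one `call_service` call into the next. With curl, the same request would look roughly like `curl --data-binary @input.txt <service URL>`.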
9
WebLicht - Processing Chains
10
WebLicht - Results
11
WebLicht - Results
12
WebLicht - Features
- With REST-style web services, everyone can implement a web service for WebLicht (4-page tutorial)
- The SOA infrastructure is independent of programming languages and operating systems
- The chaining algorithm is independent of the data format used
- From a legal point of view, the web services remain located at the institute where they were created
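One way to picture a chaining step that does not depend on the data format: the chain builder looks only at metadata describing what each service needs and what it adds, never at the documents themselves. The sketch below assumes such metadata exists in the repository; the `ServiceInfo` fields and the layer names are hypothetical, not the actual WebLicht schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceInfo:
    """Hypothetical repository entry: what a service needs and what it adds."""
    name: str
    requires: frozenset
    produces: frozenset


def next_services(available: frozenset, services: list) -> list:
    """Return services whose requirements are met and that add something new."""
    return [
        s for s in services
        if s.requires <= available and not s.produces <= available
    ]


def validate_chain(initial: frozenset, chain: list) -> frozenset:
    """Check a chain step by step; return the layers available at the end."""
    available = initial
    for service in chain:
        missing = service.requires - available
        if missing:
            raise ValueError(f"{service.name} is missing input: {sorted(missing)}")
        available = available | service.produces
    return available


# Illustrative entries only; real services and layer names may differ.
tokenizer = ServiceInfo("tokenizer", frozenset({"text"}), frozenset({"tokens"}))
tagger = ServiceInfo("tagger", frozenset({"tokens"}), frozenset({"pos"}))
parser = ServiceInfo("parser", frozenset({"tokens", "pos"}), frozenset({"parse"}))
```

Starting from plain text, `next_services` would offer only the tokenizer; once tokens exist, the tagger becomes available, and so on.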
13
WebLicht – Future Requirements
- Web services are synchronous, but some linguistic annotation processes are very time-consuming
  - Asynchronous behavior of these services would be desirable
- The processing power is limited by local computing resources
  - Scalability is only possible with strong centers
- The current architecture is not sufficiently parallelized and therefore does not scale up
  - Accommodate a large number of simultaneous users
  - Parallelization of processes
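The asynchronous behavior asked for above is often realized as a submit-and-poll pattern: the client receives a job id immediately and asks for the result later, while several jobs run in parallel. The following is a sketch of that general pattern with a thread pool, not a description of any planned WebLicht design; `JobQueue` and `slow_tagger` are hypothetical names.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, Optional


class JobQueue:
    """Hypothetical asynchronous front end for slow annotation services."""

    def __init__(self, max_workers: int = 4) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs: Dict[str, Future] = {}

    def submit(self, service: Callable[[str], str], text: str) -> str:
        """Start the service in the background and return a job id at once."""
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(service, text)
        return job_id

    def status(self, job_id: str) -> str:
        """Report 'running', 'done' or 'failed' without blocking."""
        future = self._jobs[job_id]
        if not future.done():
            return "running"
        return "failed" if future.exception() else "done"

    def result(self, job_id: str, timeout: Optional[float] = None) -> str:
        """Block until the job has finished and return its output."""
        return self._jobs[job_id].result(timeout=timeout)

    def shutdown(self) -> None:
        """Wait for outstanding jobs and release the worker threads."""
        self._pool.shutdown(wait=True)


def slow_tagger(text: str) -> str:
    """Stand-in for a time-consuming annotation service."""
    return " ".join(f"{token}/X" for token in text.split())
```

In a web setting, `submit` and `status` would correspond to separate HTTP requests, so that no connection has to stay open for the duration of a long annotation run.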
14
WebLicht – Future Requirements
- Currently, users have to store the input data and their results on their local machines
  - Online storage in the form of personal workspaces with reliable backup solutions
- Linguistic tools are typically developed in a variety of heterogeneous software environments and programming languages (Java, Perl, Python, C/C++, Prolog, Lisp, …)
  - Encapsulation of individual services with common APIs for interoperability
- Currently, WebLicht services are limited to processing text corpora
  - Extending web services to spoken language and multimodal datasets (MPI is already working on this)
15
Test Case: Gutenberg Corpus
- On the basis of these structures, part of the freely available Gutenberg Project was annotated in Tübingen
- Ca. 20,000 texts from 800 authors
- Runtime: ca. 3.5 weeks
- Result: 217 million tokens (words), 533 million constituents, 110 GB of data
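To put these figures in perspective, a few derived rates can be computed directly from them. The calculation assumes that the 3.5 weeks were continuous wall-clock time and that 1 GB means 10^9 bytes; neither is stated on the slide.

```python
# Figures from the slide (all approximate, "ca.").
TEXTS = 20_000
TOKENS = 217_000_000
CONSTITUENTS = 533_000_000
RUNTIME_WEEKS = 3.5
DATA_BYTES = 110 * 10**9  # assumption: 1 GB = 10^9 bytes

# Assumption: the run was continuous, so weeks convert directly to seconds.
runtime_seconds = RUNTIME_WEEKS * 7 * 24 * 60 * 60

tokens_per_second = TOKENS / runtime_seconds
tokens_per_text = TOKENS / TEXTS
bytes_per_token = DATA_BYTES / TOKENS
constituents_per_token = CONSTITUENTS / TOKENS

print(f"{tokens_per_second:.1f} tokens/s")
print(f"{tokens_per_text:.0f} tokens per text")
print(f"{bytes_per_token:.0f} bytes per token")
print(f"{constituents_per_token:.2f} constituents per token")
```

Under these assumptions the throughput is roughly 100 tokens per second, with about 10,850 tokens per text and about 500 bytes of annotated data per token.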
16
Gutenberg Corpus – Analyzing
- Full-text index (Lucene)
- Database for the linear part of the data
- Tree-like structures can be analyzed with XML-based techniques (XPath, XQuery)
- DOM-based techniques are slow and resource-hungry
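The contrast between in-memory tree queries and lighter-weight processing can be sketched with Python's standard library, which supports a subset of XPath (XQuery would need a dedicated engine). The XML layout below is invented for the example and is not the actual corpus encoding.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical encoding of one parsed sentence; the real corpus format may differ.
SAMPLE = """<corpus>
  <sentence id="s1">
    <constituent cat="S">
      <constituent cat="NP"><token>Der</token><token>Hund</token></constituent>
      <constituent cat="VP"><token>bellt</token></constituent>
    </constituent>
  </sentence>
</corpus>"""


def noun_phrases_xpath(xml_text: str) -> list:
    """Load the whole tree, then select NP constituents with an XPath expression."""
    root = ET.fromstring(xml_text)
    return [
        " ".join(t.text for t in np.iter("token"))
        for np in root.findall(".//constituent[@cat='NP']")
    ]


def count_constituents_streaming(source) -> int:
    """Count constituents sentence by sentence without keeping the full tree."""
    count = 0
    for _event, element in ET.iterparse(source, events=("end",)):
        if element.tag == "sentence":
            count += sum(1 for _ in element.iter("constituent"))
            element.clear()  # drop the finished sentence before reading on
    return count
```

The first function is convenient but holds the entire document in memory, which is the cost the slide attributes to DOM-based techniques; the second processes one sentence at a time and is closer to what a corpus of this size would need.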
17
Links etc.
- CLARIN homepage: http://www.clarin.eu
- The D-Spin homepage: http://www.d-spin.org
- WebLicht (login via DFN AAI): https://weblicht.sfs.uni-tuebingen.de/

Erhard Hinrichs, Thomas Zastrow
Seminar für Sprachwissenschaft
Universität Tübingen
Wilhelmstr. 19
D-72074 Tübingen
thomas.zastrow@uni-tuebingen.de
Erhard.hinrichs@uni-tuebingen.de
18
WebLicht - Combinations