A Unicode-based Environment for the Creation and use of LRs

Presentation transcript:

A Unicode-based Environment for the Creation and use of LRs
Valentin Tablan, Cristian Ursu, Kalina Bontcheva, Hamish Cunningham, Diana Maynard, Oana Hamza, Tony McEnery (1), Paul Baker (1), Mark Leisher (2)
Department of Computer Science, University of Sheffield
(1) University of Lancaster
(2) New Mexico State University
GATE (a General Architecture for Text Engineering) and ML LRs
1. Motivation (history of men’s underwear)
2. Short definition of GATE
3. GATE, Unicode and Java
4. EMILLE

Motivation for Software Infrastructure for Language Engineering
Analogy with the recent history of men’s underwear, which is also supportive infrastructure:
- The bad old days: Y-fronts: supportive, yes, but tended to be too constrictive
- The brave new world: boxer shorts: still supportive, but less constraining
The purpose of our work (the boxer shorts ideal): freedom within a supportive environment

GATE is:
- An architecture: a macro-level organisational picture for LE software systems.
- A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
- A development environment: for language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction.
- Some free components...
- ...and wrappers for other people's components
- Tools for: evaluation; visualisation/editing; persistence; IR; IE; dialogue; ontologies; etc.
- Free software (LGPL), available for download.

Architectural principles
- Non-prescriptive, theory neutral (strength and weakness)
- Re-use, interoperation, not reimplementation (e.g. v1 used LT-NSL for SGML input; v2 talks to other XML-based systems, APIs and standards)
- (Almost) everything is a component, and component sets are user-extendable
Component-based development
- An OO way of chunking software: Java Beans
- GATE components: CREOLE (Collection of REusable Objects for Language Engineering) = modified Java Beans
- The minimal component = 10 lines of Java, 10 lines of XML, 1 URL.
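To make the "bean with a small amount of Java" idea concrete, here is a minimal sketch of what a component in this style could look like. The interface and class names (`SketchProcessingResource`, `WhitespaceSplitter`) are hypothetical and are not GATE's actual API; the XML descriptor and URL mentioned on the slide are not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface, NOT GATE's real API: a component is a bean with
// a no-argument constructor, property setters and an execute method.
interface SketchProcessingResource {
    void setText(String text);
    void execute();
    List<String> getResults();
}

// A toy component that splits its input text on whitespace.
class WhitespaceSplitter implements SketchProcessingResource {
    private String text = "";
    private final List<String> results = new ArrayList<>();

    public WhitespaceSplitter() { }

    @Override
    public void setText(String text) { this.text = text; }

    @Override
    public void execute() {
        results.clear();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                results.add(piece);
            }
        }
    }

    @Override
    public List<String> getResults() { return results; }

    public static void main(String[] args) {
        WhitespaceSplitter splitter = new WhitespaceSplitter();
        splitter.setText("freedom within a supportive environment");
        splitter.execute();
        System.out.println(splitter.getResults());
    }
}
```

Because the component only exposes setters and an `execute` method, a host framework could configure and run it without knowing anything about what it does internally.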

GATE Language Resources
- GATE LRs are documents, ontologies, corpora, lexicons.
- Documents / corpora: GATE documents are loaded from local files or the web...
- Diverse document formats: text, HTML, XML, RTF, SGML.
- Multilinguality: new internationalised versions of the JVM support >100 different encodings.
- Other encodings: developing a system for user entry of mapping tables.
- LR persistence through XML, file datastore or databases (Oracle, PostgreSQL).
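The encoding support mentioned above comes from the Java platform itself. The following is a small sketch, using only standard `java.nio.charset` calls, of how bytes in a legacy encoding become a Unicode `String` and how to ask the running JVM how many encodings it offers. The class name `EncodingSketch` is made up for illustration, and the count printed depends on the JVM in use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingSketch {
    // Decode raw bytes into a Java (UTF-16) String using a named encoding.
    static String decode(byte[] raw, String encodingName) {
        return new String(raw, Charset.forName(encodingName));
    }

    // Number of encodings the running JVM reports as available.
    static int availableEncodingCount() {
        return Charset.availableCharsets().size();
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xE9 };               // U+00E9 in ISO-8859-1
        String text = decode(latin1, "ISO-8859-1");
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length + " UTF-8 byte(s) after re-encoding");
        System.out.println(availableEncodingCount() + " encodings available");
    }
}
```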

Processing Resources
- Algorithmic components known as PRs: beans with execute methods.
- All PRs can handle Unicode data by default.
- Clear distinction between code and data (simple repurposing).
- Freebies with GATE.
Unicode Tokeniser
- Splits text into typed tokens based on an FSM dynamically constructed from a set of rules based on the character categories defined by the Unicode standard.
  UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)* > Token;orth=upperInitial;kind=word;
- Output can be localised by a later module (e.g. “don’t” becomes “do” “n’t”).
- Current status: 23 rules seem able to handle Indo-European languages without changes.
- The English tokeniser: Unicode tokeniser + pattern grammar FST.
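The example rule can be read as: one character in the Unicode category UPPERCASE_LETTER, followed by any number of characters that are LOWERCASE_LETTER or DASH_PUNCTUATION. The sketch below checks a single string against that one pattern using Java's built-in `Character.getType` categories. It is a hand-written check for illustration only, not the rule-compiled FSM the slide describes, and the class name `UpperInitialRule` is an assumption.

```java
class UpperInitialRule {
    // True if s matches UPPERCASE_LETTER (LOWERCASE_LETTER|DASH_PUNCTUATION)*
    static boolean matches(String s) {
        if (s.isEmpty()) {
            return false;
        }
        int first = s.codePointAt(0);
        if (Character.getType(first) != Character.UPPERCASE_LETTER) {
            return false;
        }
        int i = Character.charCount(first);
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            int type = Character.getType(cp);
            if (type != Character.LOWERCASE_LETTER
                    && type != Character.DASH_PUNCTUATION) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = { "Sheffield", "Well-known", "gate", "GATE" };
        for (String sample : samples) {
            System.out.println(sample + " : " + matches(sample));
        }
    }
}
```

Because the test relies on Unicode character categories rather than on ASCII ranges, the same check applies unchanged to words in other scripts that distinguish case, such as Greek or Cyrillic.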

Displaying Multilingual Data (1)
GATE uses the standard (and imperfect) Java rendering engine for displaying text.

Displaying Multilingual Data (2)
All the visualisation and editing tools for ML LRs use the same facilities.

Editing Multilingual Data
- Java provides no special support for text input (this may change)
- GATE Unicode Kit (GUK) plugs this hole
- Support for defining additional Input Methods (IMs); currently 30 IMs for 17 languages
- Pluggable into other applications (e.g. MPI’s EUDICO)
- Can use a virtual keyboard or standard layouts over QWERTY
- IMs defined in plain text files
- GUK comes with a standalone Unicode editor
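As a rough illustration of an input method defined in a plain text file, the sketch below parses a small key-to-character table and applies it to typed keys. The file format used here (one `<key> <hex code point>` pair per line, with `#` comments) is invented for this example and is not GUK's actual format; the class name `InputMethodSketch` is likewise hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class InputMethodSketch {
    // Parse lines of the hypothetical form "<key> <hex code point>".
    // Blank lines and lines starting with '#' are ignored.
    static Map<Character, String> parse(String definition) {
        Map<Character, String> table = new HashMap<>();
        for (String line : definition.split("\n")) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] parts = line.split("\\s+");
            int codePoint = Integer.parseInt(parts[1], 16);
            table.put(parts[0].charAt(0),
                      new String(Character.toChars(codePoint)));
        }
        return table;
    }

    // Map each typed key through the table; unmapped keys pass through.
    static String type(Map<Character, String> table, String keys) {
        StringBuilder out = new StringBuilder();
        for (char key : keys.toCharArray()) {
            out.append(table.getOrDefault(key, String.valueOf(key)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String definition = "# toy Devanagari layout\nk 0915\na 0905\n";
        Map<Character, String> table = parse(definition);
        System.out.println(type(table, "ka"));
    }
}
```

Keeping the table in a text file rather than in code means a new layout can be added without recompiling anything.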

EMILLE: Enabling Minority LE
3-year EPSRC project at Lancaster University and Sheffield University.
Corpus development:
- written language corpora of at least 9,000,000 words for Bengali, Gujarati, Hindi, Panjabi, Singhalese, Tamil and Urdu.
- spoken corpora of at least 500,000 words per language.
Unicode developments for GATE:
- Indic keyboard layouts.
- encodings for Indic languages.
Development of basic LE tools:
- POS tagging.
- alignment tools for parallel corpora.

Encore
Other GATE-related stuff at LREC:
- Saggion et al.: Extraction Information for MM Indexing [Weds, 19.05]
- Baker et al.: EMILLE [Thurs, 10.25]
- Demo and poster [Thurs, session D1]
- Pastra et al.: Reuse of NE pattern grammars [Thurs, 16.20]
- Fliers