©2012 Paula Matuszek CSC 9010: Text Mining Applications Fall, 2012 Introduction to GATE Dr. Paula Matuszek Taken partially from a presentation by Lin Lin.


©2012 Paula Matuszek
CSC 9010: Text Mining Applications, Fall 2012
Introduction to GATE
Dr. Paula Matuszek
Taken partially from a presentation by Lin Lin. ure/Presentation/GATE.ppt

What is GATE?
• Stands for General Architecture for Text Engineering.
• Developed at the University of Sheffield.
• Component-based architecture with data separated from applications; many discrete capabilities are included as plugins.

Who Uses GATE?
• Scientists performing experiments that involve processing human language
• Developers developing applications with language processing components
• Teachers and students of courses about language and language computation
• Us :-)

How Can GATE Help?
• Specifies an architecture, or organizational structure, for language processing software
• Provides a framework that implements the architecture and can be used to embed language processing capabilities in applications
• Provides a development environment built on top of the framework, made up of convenient tools for developing components (plugins)

Really?
• Yeah, really.
• It’s been under development for 15 years and is still under very active development.
• Open-source, with dozens of developers, some of whom have been involved since the beginning.
• Active community that provides good support:
  – Mailing list: lists.sourceforge.net/lists/listinfo/gate-users
  – Twitter: twitter.com/#!/GateAcUk
  – LinkedIn:
• Many other text mining capabilities have been integrated with it.
• An almost overwhelming amount of documentation.

GATE Architecture Overview

GATE Product Family
• GATE Developer: IDE for language processing, with information extraction and other plugins.
• GATE Embedded: object library which can be included in applications.
• GATE Teamware: collaborative annotation environment.
• GATE Mimir: a “multiparadigm index” which supports semantic indexing and search.
• GATE Wiki: “controllable wiki” based on Grails and Subversion.
• GATE Cloud: GATE Embedded running on supercomputer hardware.

GATE Components
• We will deal primarily with GATE Developer.
• It has four components:
  – Applications: groups of processes to be run on a document or corpus.
  – LanguageResources (LRs): entities such as lexicons, documents, corpora, annotation schemas, ontologies.
  – ProcessingResources (PRs): tools that operate on unstructured text, such as parsers and tokenizers. These are mostly plugins.
  – DataStores: saved processed documents and resources.

Overview of GATE Developer
• GATE Developer
• Resources Pane
  – applications: groups of processes to run on a document or corpus
  – language resources: corpus, ontologies, schemas
  – processing resources: tools that operate on unstructured text
  – datastores: saved documents and resources
• Display Pane: whatever you’re currently working with.

Setup Options
• Configuration
  – Appearance: font, skin
  – Advanced:
    – add space on markup (to make HTML and XML more readable)
    – save options and session on exit
    – insert append or prepend (for annotations)
    – default browser (for user guide)
• Input (?)
  – default language

Language Resources
• Language Resources can be of four kinds:
  – Documents are modeled as content plus annotations plus features.
  – A Corpus is a Java Set whose members are Documents.
  – Annotations are organized in graphs, which are modeled as Java sets of Annotation.
  – Schemas are XML schemas describing allowable annotations and features.
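The document model described above can be pictured with a small sketch. This is not GATE's Java API; it is a minimal Python illustration, and the class names (Annotation, Document), the annotate and text_of helpers, and the sample sentence are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """A typed span over the document text, with optional features."""
    start: int
    end: int
    type: str
    features: tuple = ()  # (name, value) pairs, kept hashable so annotations fit in a set


@dataclass(eq=False)  # identity-based hashing, so documents can be members of a set
class Document:
    """Content plus annotations plus features (illustrative model only)."""
    content: str
    annotations: set = field(default_factory=set)
    features: dict = field(default_factory=dict)

    def annotate(self, start, end, type_, **features):
        ann = Annotation(start, end, type_, tuple(sorted(features.items())))
        self.annotations.add(ann)
        return ann

    def text_of(self, ann):
        return self.content[ann.start:ann.end]


text = "GATE was developed in Sheffield."
doc = Document(text, features={"source": "example"})
start = text.index("Sheffield")
loc = doc.annotate(start, start + len("Sheffield"), "Location", kind="city")

# A corpus is simply a set whose members are documents.
corpus = {doc}
```

Annotations point at character offsets rather than modifying the text, which is why the same content can carry several independent sets of annotations.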

Document Processing in GATE
• Document:
  – Formats including XML, RTF, , HTML, SGML, and plain text.
  – Identified and converted into GATE annotation format.
  – Processed by Processing Resources.
  – Results stored in a serial data store (based on Java serialization) or indexed in a Lucene database.
  – Can also be exported as XML.

New Document
• Documents are converted to GATE format; they can be saved for future use or exported.
• Language Resources --> New --> Document
• Name: can be left blank, in which case it is created automatically (no spaces) from filename+UniqueID.
• Checkmarks: required.
  – just leave defaults
  – sourceURL: can be a file (click the folder icon to browse)
  – or an actual URL (GATE will fetch it)
  – or set to stringContent to put content in directly.
• Encoding will probably be UTF-8.
• markupAware: process XML and HTML tags

Document Display
• Double-click the document
  – Text (minus annotations if you chose markupAware)
  – Annotation Sets
    – from XML, HTML, previous annotation work
    – different colors for different categories
  – Annotations list
    – annotations chosen in Sets pane

Creating a Corpus
• To import new documents we name the corpus and create it without any documents.
• Language Resources --> New --> Corpus
• Right-click and populate
  – choose directory, extensions, encoding
This will create the corpus and show the corpus and the individual documents in the Resources Pane.

GATE Corpus
• Corpus Display Pane:
  – Add documents to a corpus with the + button, which appears when a corpus is displayed.
  – Remove with -. (Note: this removes them from the corpus, not from Developer.)
• Documents can be included in multiple corpora.
• A corpus can be created from a single concatenated file by specifying the documentRootElement. This makes sense for XML documents, for instance.
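The idea of populating a corpus from one concatenated file can be sketched as follows. This is only an illustration of splitting on a root element, not what GATE does internally; the function name split_documents, the element name doc, and the sample XML are assumptions.

```python
import xml.etree.ElementTree as ET


def split_documents(xml_text, document_root_element):
    """Return the text of each element named document_root_element, one per document."""
    root = ET.fromstring(xml_text)
    return ["".join(el.itertext()) for el in root.iter(document_root_element)]


# Hypothetical concatenated file holding two documents.
concatenated = '<docs><doc id="1">First.</doc><doc id="2">Second.</doc></docs>'
documents = split_documents(concatenated, "doc")
```

Each returned string would become one document in the corpus.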

CREOLE
• A Collection of REusable Objects for Language Engineering
• The set of resources integrated with GATE
• All the resources are packaged as Java Archive (or ‘JAR’) files, plus some XML configuration data.
• Managed in the CREOLE Plugin Manager

Processing Resources: ANNIE
• A family of Processing Resources for language analysis included with GATE
• Stands for A Nearly-New Information Extraction system.
• Uses finite-state techniques to implement various tasks: tokenization, semantic tagging, verb phrase chunking, and so on.

ANNIE IE Modules

Some ANNIE Components
• Tokenizer
• Gazetteer: lists of entities
• Sentence Splitter
• Part of Speech Tagger
  – produces a part-of-speech tag as an annotation on each word or symbol.
• Semantic Tagger

ANNIE Component: Tokenizer
• Token Types
  – word, number, symbol, punctuation, and spaceToken.
• A tokenizer rule has a left hand side and a right hand side.

Tokenizer Rule
• Operations used on the LHS:
  – | (or)
  – * (0 or more occurrences)
  – ? (0 or 1 occurrences)
  – + (1 or more occurrences)
• The RHS uses ’;’ as a separator, and has the following format:
  {LHS} > {Annotation type};{attribute1}={value 1};...;{attribute n}={value n}

Example Tokenizer Rule
  "UPPERCASE_LETTER" "LOWERCASE_LETTER"* > Token;orth=upperInitial;kind=word;
• The sequence must begin with an uppercase letter, followed by zero or more lowercase letters. This sequence will then be annotated as type “Token”. The attribute “orth” (orthography) has the value “upperInitial”; the attribute “kind” has the value “word”.
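To make the rule concrete, here is a rough Python sketch that mimics its effect with a regular expression. It is not how the ANNIE tokenizer is implemented; the regex, the helper names parse_rhs and apply_rule, and the sample sentence are assumptions for illustration. The right hand side is parsed from the same ';'-separated format shown above.

```python
import re

# Approximates: "UPPERCASE_LETTER" "LOWERCASE_LETTER"* > Token;orth=upperInitial;kind=word;
# The word boundaries keep the pattern from firing inside all-caps words such as "GATE".
UPPER_INITIAL = re.compile(r"\b[A-Z][a-z]*\b")


def parse_rhs(rhs):
    """Split 'Type;attr=value;...' into an annotation type and an attribute dict."""
    parts = [p for p in rhs.split(";") if p]
    return parts[0], dict(p.split("=", 1) for p in parts[1:])


def apply_rule(text, pattern, rhs):
    """Annotate every match of the pattern with the type and attributes from the RHS."""
    ann_type, attrs = parse_rhs(rhs)
    return [
        {"type": ann_type, "start": m.start(), "end": m.end(), "string": m.group(), **attrs}
        for m in pattern.finditer(text)
    ]


tokens = apply_rule(
    "Paula teaches Text mining",
    UPPER_INITIAL,
    "Token;orth=upperInitial;kind=word;",
)
```

Only "Paula" and "Text" match here; lowercase words would need a separate rule with its own orth value.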

ANNIE Component: Gazetteer
• The gazetteer lists used are plain text files, with one entry per line.
• Each list represents a set of names, such as names of cities, organizations, days of the week, etc.

Example Gazetteer List
• A small section of the list for units of currency:
  ……
  Ecu
  European Currency Units
  FFr
  Fr
  German mark
  German marks
  New Taiwan dollar
  New Taiwan dollars
  NT dollar
  NT dollars
  ……
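A gazetteer lookup can be sketched in a few lines. The entries below are the ones from the currency list above; everything else (the helper names build_matcher and lookup, the "currency_unit" label, the sample sentence) is an assumption, and the real ANNIE gazetteer works differently under the hood.

```python
import re

# Entries as they would appear in the list file, one per line.
CURRENCY_UNITS = [
    "Ecu", "European Currency Units", "FFr", "Fr", "German mark", "German marks",
    "New Taiwan dollar", "New Taiwan dollars", "NT dollar", "NT dollars",
]


def build_matcher(entries):
    """Compile one pattern for all entries, longest first so longer names win."""
    alternatives = sorted(entries, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(re.escape(e) for e in alternatives) + r")\b")


def lookup(text, matcher, label):
    """Return (start, end, matched string, label) for every list entry found in the text."""
    return [(m.start(), m.end(), m.group(), label) for m in matcher.finditer(text)]


text = "She paid 20 German marks and 5 NT dollars."
matcher = build_matcher(CURRENCY_UNITS)
hits = lookup(text, matcher, "currency_unit")
```

The lookup only says that a string is on a list; deciding what the match means in context is left to later components.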

ANNIE Component: Semantic Tagger
• Based on the JAPE language, which contains rules that act on annotations assigned in earlier phases.
• Produces outputs of annotated entities.

ANNIE Component: Sentence Splitter
• Segments the text into sentences.
• This module is required for the tagger.
• The splitter uses a gazetteer list of abbreviations to help distinguish sentence-marking full stops from other kinds.
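The role of the abbreviation list can be shown with a deliberately simple sketch. The abbreviation set, the function name split_sentences, and the sample text are assumptions; the actual splitter is rule-based and considerably more careful.

```python
import re

# Assumed sample of an abbreviation list; a real list would be much longer.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Prof.", "etc.", "e.g.", "i.e."}


def split_sentences(text):
    """Split after '.', '!' or '?' unless the word is a known abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"\S+", text):
        word = m.group()
        if word[-1] in ".!?" and word not in ABBREVIATIONS:
            sentences.append(text[start:m.end()].strip())
            start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sentences = split_sentences("Dr. Matuszek teaches text mining. GATE is used in the lab.")
```

Without the abbreviation check, the full stop in "Dr." would wrongly end a sentence after the first word.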

Example Using ANNIE
• More next week.

Viewing and Editing Annotations
• We have looked at annotations, both added by ANNIE and extracted from tags in the document.
• It is sometimes useful to examine closely and edit these annotations:
  – you are using a small corpus and want them correct before you proceed with other tools
  – you have a sample set that will be used for training or for quality assurance and they need to be accurate
  – you are still developing the resources being used to tag documents.

Unrestricted Annotation Editing
• We can change to an arbitrary different annotation type.
• The process is:
  – choose text to be annotated
  – hover over it or right-click. The annotation editor pops up.
  – if you’re changing it, delete the existing annotation
  – add the new annotation, by choosing or typing it in

Restricted Annotation Editing
• Typically we want better consistency and control for our editing.
• Use a schema to specify allowable annotation types and features.
• GATE includes many predefined schemas.
• Located at /plugins/ANNIE/resources/schema
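What a schema buys us can be sketched as a simple validity check. GATE's schemas are XML files; the dictionary below is only a stand-in, and its type names, feature names, and allowed values, along with the validate function, are assumptions made for illustration.

```python
# Hypothetical schema: annotation type -> feature name -> allowed values.
SCHEMA = {
    "Person": {"gender": {"male", "female", "unknown"}},
    "Location": {"locType": {"city", "country", "region"}},
}


def validate(ann_type, features, schema=SCHEMA):
    """Return a list of problems; an empty list means the annotation is allowed."""
    if ann_type not in schema:
        return [f"type {ann_type!r} is not allowed"]
    allowed = schema[ann_type]
    problems = []
    for name, value in features.items():
        if name not in allowed:
            problems.append(f"feature {name!r} is not allowed on {ann_type}")
        elif value not in allowed[name]:
            problems.append(f"value {value!r} is not allowed for {ann_type}.{name}")
    return problems


ok = validate("Location", {"locType": "city"})
bad_type = validate("Vehicle", {})
bad_feature = validate("Person", {"age": "40"})
```

Restricting editors to the choices in a schema is what keeps annotations from different people comparable.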

Schema Annotation Editor
• CREOLE resource to let us use the schema for annotation editing
• Enable in Manage CREOLE Plugins window (under File menu)
• Select an annotation, hover or right-click
• Different editor window, specifying allowable types and features
• Choose new type or feature.

More on Schemas and Editing
• You can also initiate editing by right-clicking on an annotation in the annotations list.
• You can use multiple schemata in processing one document.

Create an Application with Processing Resources (PRs)
• Applications model a control strategy for the execution of PRs.
• Simple pipelines: group a set of PRs together in order and execute them in turn.
• Corpus pipelines: open each document in the corpus in turn, set that document as a runtime parameter on each PR, run all the PRs on the corpus, then close the document.
• We will do this during lab.
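The control strategy of a corpus pipeline is easy to sketch: an outer loop over documents and an inner loop over PRs, in order. The code below is a plain-Python illustration, not GATE Embedded; new_document, tokenize, count_tokens, and run_corpus_pipeline are hypothetical names, and documents are modeled as simple dictionaries.

```python
import re


def new_document(text):
    """A stand-in document: text plus annotations plus features."""
    return {"text": text, "annotations": [], "features": {}}


def tokenize(document):
    """Toy PR: add one Token annotation per run of word characters."""
    document["annotations"] += [
        {"type": "Token", "start": m.start(), "end": m.end()}
        for m in re.finditer(r"\w+", document["text"])
    ]


def count_tokens(document):
    """Toy PR that depends on the tokenizer having run first."""
    document["features"]["token_count"] = sum(
        1 for a in document["annotations"] if a["type"] == "Token"
    )


def run_corpus_pipeline(corpus, processing_resources):
    """For each document in turn, run every PR in order."""
    for document in corpus:
        for pr in processing_resources:
            pr(document)
    return corpus


corpus = [new_document("GATE runs pipelines."), new_document("ANNIE is one of them.")]
run_corpus_pipeline(corpus, [tokenize, count_tokens])
```

Order matters: swapping the two PRs would leave every token_count at zero, because the counter would run before any Token annotations exist.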

Saving GATE Language Resources and Applications
• Data Stores:
  – save processed documents for additional use
  – specialized folder on a hard drive
  – Lucene database
  – improve processing times for large collections of documents

Types of Data Store
• Serial Data Store:
  – based on Java’s serialization system.
  – stored in a directory
• Lucene Data Store (Lucene is an open-source indexing and search tool.)
  – searchable repository
  – Lucene-based indexing
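The serial data store idea, serialized objects kept in a directory, can be imitated in Python with pickle. This is an analogy only: GATE uses Java serialization and its own directory layout, and the SerialDataStore class, the .pkl file naming, and the sample document here are assumptions.

```python
import pickle
import tempfile
from pathlib import Path


class SerialDataStore:
    """Toy data store: one serialized file per saved resource, all in one directory."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def save(self, name, resource):
        with open(self.directory / f"{name}.pkl", "wb") as f:
            pickle.dump(resource, f)

    def load(self, name):
        with open(self.directory / f"{name}.pkl", "rb") as f:
            return pickle.load(f)

    def names(self):
        return sorted(p.stem for p in self.directory.glob("*.pkl"))


document = {"text": "GATE was developed in Sheffield.",
            "annotations": [{"type": "Location", "start": 22, "end": 31}]}

with tempfile.TemporaryDirectory() as folder:
    store = SerialDataStore(folder)
    store.save("doc1", document)
    listed = store.names()
    restored = store.load("doc1")
```

The point of saving processed documents is that annotations survive between sessions, so earlier processing does not have to be rerun.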

Saving in a datastore
• Create a folder.
• Right-click to get Create Datastore menu.
• This only creates the store. Save corpora or documents in the Language Resources pane.
• Once saved, they can be

Saving as XML
• Individual documents can also be saved directly.
  – Special GATE XML format
    – annotations are appended to the document; locations for tags are embedded in the body
  – Preserve original format
    – use for XML or HTML.
    – will save all original tags and everything selected in the annotations
    – For a plain text file, embeds inline tags.
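Embedding inline tags in a plain text file amounts to wrapping each annotated span in an element named after the annotation type. The sketch below shows the idea for non-overlapping annotations; the function to_inline_xml and the sample data are assumptions, and GATE's own export handles overlaps, features, and original markup that this ignores.

```python
from xml.sax.saxutils import escape


def to_inline_xml(text, annotations):
    """Wrap each (start, end, type) span in <type>...</type>; spans must not overlap."""
    out, pos = [], 0
    for start, end, ann_type in sorted(annotations):
        out.append(escape(text[pos:start]))
        out.append(f"<{ann_type}>{escape(text[start:end])}</{ann_type}>")
        pos = end
    out.append(escape(text[pos:]))
    return "".join(out)


text = "GATE was developed in Sheffield."
start = text.index("Sheffield")
inline = to_inline_xml(text, [(start, start + len("Sheffield"), "Location")])
```

Here inline holds the sentence with the place name wrapped in Location tags, which is the kind of output a plain text document gets when saved with its original format preserved.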

Saving Applications
• Save a set of processing resources and their parameters.
  – Right-click, save application state.
  – Append .xgapp to the name.
• To export as a standalone, export as Teamware
  – bundles all needed files
  – intended for Teamware but can be used for sharing directly.

And LOTS more
• GATE is an extraordinarily rich system. Some of the other CREOLE resources included in the standard distribution:
  – Annotation Merging, Quality assurance summarizer for comparing annotations
  – Web crawler, Information Retrieval, Key Phrase Extraction
  – Machine learning
  – Domain-specific taggers (e.g., chemistry)
  – Resources for many languages
• CREOLE plugins for integrating with many other systems, e.g.:
  – UIMA
  – WordNet
  – Penn BioTagger
  – OpenCalais
  – OpenNLP
  – LingPipe
• More details at

Some Links
• Home page is
• Some good short tutorial videos for getting started: These are only a few minutes each, so they’re fast. Version 6, but they don’t seem to be very different.
• User Guide: This is apparently for version 7.1, which is a development build, but again it seems to be fine.
• Lots of documentation (“acres” of it):
• The wiki:
• Some very nice course materials, with a lot more detail than we will cover, including a unit on sentiment analysis:

What Next?
• In lab we will create a simple application and use it.
• Next week we will go into a lot more detail on using ANNIE for information extraction.
• Homework. (You knew that was coming...)
• I’m not going to get into programming in GATE or the more advanced applications. This might be the best tool for some of your projects, though.