Presentation is loading. Please wait.

Presentation is loading. Please wait.

©2012 Paula Matuszek CSC 9010: Text Mining Applications Fall, 2012 Introduction to GATE Dr. Paula Matuszek Taken partially from.

Similar presentations


Presentation on theme: "©2012 Paula Matuszek CSC 9010: Text Mining Applications Fall, 2012 Introduction to GATE Dr. Paula Matuszek Taken partially from."— Presentation transcript:

1 ©2012 Paula Matuszek CSC 9010: Text Mining Applications Fall, 2012 Introduction to GATE Dr. Paula Matuszek Paula.Matuszek@gmail.com Taken partially from a presentation by Lin Lin. http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuct ure/Presentation/GATE.ppt http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuct ure/Presentation/GATE.ppt

2 ©2012 Paula Matuszek Taken partially from a presentation by Lin Lin. http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt What is GATE? l Stands for General Architecture for Text Engineering. l Developed at the University of Sheffield l Component-based architecture with data separated from applications, many discrete capabilities included as plugins.

3 ©2012 Paula Matuszek Taken partially from a presentation by Lin Lin. http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt Who Uses GATE? l Scientists performing experiments that involve processing human language l Developers developing applications with language processing components l Teachers and students of courses about language and language computation l Us :-)

4 ©2012 Paula Matuszek Taken partially from a presentation by Lin Lin. http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt How GATE can Help? l Specify an architecture, or organizational structure, for language processing software l Provide a framework that implements the architecture and can be used to embed language processing capabilities in applications l Provide a development environment built on top of the framework made up of convenient tools for developing components (plugins)

5 ©2012 Paula Matuszek Taken partially from a presentation by Lin Lin. http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt Really? l Yeah, really. l It’s been under development for 15 years and is still under very active development l Open-source, with dozens of developers, some of whom have been involved since the beginning l Active community that provides good support –Mailing list: lists.sourceforge.net/lists/listinfo/gate-users –twitter: twitter.com/#!/GateAcUk –LinkedIn: http://www.linkedin.com/groups/GATE-2230077http://www.linkedin.com/groups/GATE-2230077 l Many other text mining capabilities have been integrated with it. l An almost overwhelming amount of documentation

6 ©2012 Paula Matuszek GATE Architecture Overview http://gate.ac.uk/overview.html

7 ©2012 Paula Matuszek GATE Product Family l GATE Developer: IDE for language processing, with information extraction and other plugins. l GATE Embedded: object library which can be included in applications l GATE Teamware: collaborative annotation environment l GATE Mimir: a “multiparadigm index” which supports semantic indexing and search l GATE Wiki: “controllable wiki” based on Grails and Subversion l GATE Cloud: GATE embedded running on supercomputer hardware

8 ©2012 Paula Matuszek GATE Components l We will deal primarily with GATE Developer: l It has four components: –Applications: groups of processes to be run on a document or corpus. –LanguageResources (LRs): entities such as lexicons, documents, corpora, annotation schemas, ontologies. –ProcessingResources (PRs): tools that operate on unstructured text, such as parsers and tokenizers. These are mostly plugins. –DataStores: saved processed documents and resources.

9 ©2012 Paula Matuszek Overview of Gate Developer l GATE Developer l Resources Pane –applications: groups of processes to run on a document or corpus –language resources: corpus, ontologies, schemas –processing resources: tools that operate on unstructured text –datastores: saved documents and resources l Display Pane: whatever you’re currently working with.

10 ©2012 Paula Matuszek Setup Options l Configuration –Appearance: font, skin –Advanced: –add space on markup (to make html and xml more readable) –Save options and session on exit –Insert append or prepend (for annotations) –default browser (for user guide) l Input (?) –default language

11 ©2012 Paula Matuszek Taken partially from a presentation by Lin Lin. http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt Language Resources l Language Resources can be of four kinds: –Documents are modeled as content plus annotations plus features. –A Corpus is a Java Set whose members are Documents. –Annotations are organized in graphs, which are modeled as Java sets of Annotation. –Schemas are XML schemas describing allowable annotations and features

12 ©2012 Paula Matuszek Taken partially from a presentation by Lin Lin. http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt Documents Processing in GATE l Document: –Formats including XML, RTF, email, HTML, SGML, and plain text. –Identified and converted into GATE annotation format. –Processed by Processing Resources. –Results stored in a serial data store (based on Java serialization) or indexed in a Lucene database. –Can also be exported as XML.

13 ©2012 Paula Matuszek New Document l Documents are converted to GATE format; can be saved for future use or exported. l Language Resources --> New --> Document l Name: can leave blank and it will be created automatically (no spaces) from filename+UniqueID l Checkmarks: required. –just leave defaults –sourceURL – can be a file (click the folder icon for browse) –or actual URL (GATE will fetch it) –or set to stringContent to put content in directly. l Encoding will probably be utf-8. l markupAware: process XML and HTML tags

14 ©2012 Paula Matuszek Document Display l Double-click document –Text (minus annotations if you chose markupAware) –Annotation Sets –from XML, HTML, previous annotation work –different colors for different categories –Annotations list –annotations chosen in Sets pane

15 ©2012 Paula Matuszek Creating a Corpus l To import new documents we name the corpus and create it without any documents. l Language Resources --> New --> Corpus l Right-click and populate –choose directory, extensions, encoding This will create the corpus and show the corpus and the individual documents in the Resources Pane.

16 ©2012 Paula Matuszek GATE Corpus l Corpus Display Pane: –Add documents to a corpus with + button which appears when a corpus is displayed. –Remove with -. (Note: this removes them from corpus, not from Developer) l Documents can be included in multiple corpora. l A corpus can be created from a single concatenated file, by specifying the documentRootElement. This makes sense for, for instance, XML documents.

17 ©2012 Paula Matuszek Taken partially from a presentation by Lin Lin. http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt CREOLE l A Collection of REusable Objects for Language Engineering l The set of resources integrated with GATE l All the resources are packaged as Java Archive (or ‘JAR’) files, plus some XML configuration data. l Managed in the Creole Plugin Manager

18 ©2012 Paula Matuszek Taken partially from a presentation by Lin Lin. http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt Processing Resources: ANNIE l A family of Processing Resources for language analysis included with GATE l Stands for A Nearly-New Information Extraction system. l Using finite state techniques to implement various tasks: tokenization, semantic tagging, verb phrase chunking, and so on.

19 ©2012 Paula Matuszek ANNIE IE Modules http://gate.ac.uk/sale/tao/splitch6.html#http://gate.ac.uk/sale/tao/splitch6.html#chap:annie

20 ©2012 Paula Matuszek Taken partially from a presentation by Lin Lin. http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt Some ANNIE Components l Tokenizer l Gazetteer: lists of entities l Sentence Splitter l Part of Speech Tagger –produces a part-of-speech tag as an annotation on each word or symbol. l Semantic Tagger

21 ©2012 Paula Matuszek Taken partially from a presentation by Lin Lin. http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt ANNIE Component: Tokenizer l Token Types –word, number, symbol, punctuation, and spaceToken. l A tokenizer rule has a left hand side and a right hand side.

22 ©2012 Paula Matuszek Taken partially from a presentation by Lin Lin. http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt Tokenizer Rule l Operations used on the LHS: – | (or) – * (0 or more occurrences) – ? (0 or 1 occurrences) – + (1 or more occurrences) l The RHS uses ’;’ as a separator, and has the following format: {LHS} > {Annotation type};{attribute1}={value 1};...;{attribute n}={value n}

23 ©2012 Paula Matuszek Taken partially from a presentation by Lin Lin. http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt Example Tokenizer Rule –"UPPERCASE_LETTER" "LOWERCASE_LET TER"* –> –Token;orth=upperInitial;kind=word; –The sequence must begin with an uppercase letter, followed by zero or more lowercase letters. This sequence will then be annotated as type “Token”. The attribute “orth” (orthography) has the value “upperInitial”; the attribute “kind” has the value “word”.

24 ©2012 Paula Matuszek Taken partially from a presentation by Lin Lin. http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt ANNIE Component: Gazetteer l The gazetteer lists used are plain text files, with one entry per line. l Each list represents a set of names, such as names of cities, organizations, days of the week, etc.

25 ©2012 Paula Matuszek Taken partially from a presentation by Lin Lin. http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt Example Gazetteer List l A small section of the list for units of currency: l …… l Ecu European Currency Units FFr Fr German mark German marks New Taiwan dollar New Taiwan dollars NT dollar NT dollars l ……

26 ©2012 Paula Matuszek Taken partially from a presentation by Lin Lin. http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt ANNIE Component: Semantic Tagger l Based on JAPE language, which contains rules that act on annotations assigned in earlier phases. l Produce outputs of annotated entities.

27 ©2012 Paula Matuszek Taken partially from a presentation by Lin Lin. http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt ANNIE Component: Sentence Splitter l Segments the text into sentences. l This module is required for the tagger. l The splitter uses a gazetteer list of abbreviations to help distinguish sentence-marking full stops from other kinds.

28 ©2012 Paula Matuszek Taken partially from a presentation by Lin Lin. http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt Example Using ANNIE l http://services.gate.ac.uk/annie/ http://services.gate.ac.uk/annie/ l More next week.

29 ©2012 Paula Matuszek Viewing and Editing Annotations l We have looked at annotations, both added by ANNIE and extracted from tags in the document. l It is sometimes useful to examine closely and edit these annotations –you are using a small corpus and want them correct before you proceed with other tools –you have a sample set that will be used for training or for quality assurance and they need to be accurate –you are still developing the resources being used to tag documents.

30 ©2012 Paula Matuszek Unrestricted Annotation Editing l We can change to an arbitrary different annotation type. l The process is: –choose text to be annotated –hover over it or right click. The annotation editor pops up. –if you’re changing it, delete existing annotation –add new annotation, by choosing or typing it in

31 ©2012 Paula Matuszek Restricted Annotation Editing l Typically we want better consistency and control for our editing. l Use a schema to specify allowable annotation types and features. l GATE includes many predefined schemas l Located at /plugins/ANNIE/resources/schem a

32 ©2012 Paula Matuszek Schema Annotation Editor l CREOLE resource to let us use the schema for annotation editing l Enable in Manage CREOLE Plugins window (under File menu) l Select an annotation, hover or right-click l Different editor window, specifying allowable types and features l Choose new type or feature.

33 ©2012 Paula Matuszek More on Schemas and Editing l You can also initiate editing by right- clicking on an annotation in the annotations list. l You can use multiple schemata in processing one document.

34 ©2012 Paula Matuszek Taken partially from a presentation by Lin Lin. http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt Create an Application with Processing Resources (PRs) l Applications model a control strategy for the execution of PRs. l Simple pipelines: group a set of PRs together in order and execute them in turn. l Corpus pipelines: open each document in the corpus in turn, set that document as a runtime parameter on each PR, run all the PRs on the corpus, then close the document l We will do this during lab.

35 ©2012 Paula Matuszek Taken partially from a presentation by Lin Lin. http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt Saving GATE Language Resources and Applications l Data Stores: –save processed documents for additional use –specialized folder on a hard drive –Lucene database –improve processing times for large collections of documents

36 ©2012 Paula Matuszek Taken partially from a presentation by Lin Lin. http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt Types of Data Store l Serial Data Store: –based on java’s serialization system. –store in a directory l Lucene Data Store (Lucene is an open- source indexing and search tool.) –searchable repository –Lucene-based indexing

37 ©2012 Paula Matuszek Taken partially from a presentation by Lin Lin. http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt Saving in a datastore l Create a folder. l Right-click to get Create Datastore menu l This only creates the store. Save corpora or documents in the Language Resources pane. l Once saved, they can be

38 ©2012 Paula Matuszek Taken partially from a presentation by Lin Lin. http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt Saving as XML l Individual documents can also be saved directly. –Special GATE XML format –annotations are appended to the document, locations for tags are embedded in body –Preserve original format –use for XML or html. –will save all original tags and everything selected in the annotations –For a plain text file, embeds inline tags.

39 ©2012 Paula Matuszek Taken partially from a presentation by Lin Lin. http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt http://iwayan.info/Research/Interoperability/Tutor_Workshop/AmitShethGlobalInfInfrastuctur e/Presentation/GATE.ppt Saving Applications l Save a set of processing resources and their parameters. –Right-click, save application state. –Append.xgapp for name l To export as a standalone, export as teamware –bundles all needed files –intended for teamware but can be used for sharing directly.

40 ©2012 Paula Matuszek And LOTS more l GATE is an extraordinarily rich system. Some of the other CREOLE resources included in the standard distribution: –Annotation Merging, Quality assurance summarizer for comparing annotations –Web crawler, Information Retrieval, Key Phrase Extraction –Machine learning –Domain-specific taggers (e.g., chemistry) –Resources for many languages l CREOLE plugins for integrating with many other systems. E.g. –UIMA –Wordnet –Penn BioTagger –OpenCalais –OpenNLP –LingPipe l More details at http://gate.ac.uk/gate/doc/plugins.htmlhttp://gate.ac.uk/gate/doc/plugins.html

41 ©2012 Paula Matuszek Some Links l Home page is http://gate.ac.uk/http://gate.ac.uk/ l Some good short tutorial videos for getting started: http://gate.ac.uk/demos/developer-videos/. These are only a few minutes each, so they’re fast. Version 6, but they don’t seem to be very different. http://gate.ac.uk/demos/developer-videos/ l User Guide: http://gate.ac.uk/sale/tao/index.html. This is apparently for version 7.1, which is a development build, but again it seems to be fine.http://gate.ac.uk/sale/tao/index.html l Lots of documentation (“acres” of it): http://gate.ac.uk/documentation.html http://gate.ac.uk/documentation.html l The wiki: http://gate.ac.uk/wiki/http://gate.ac.uk/wiki/ l Some very nice course materials, with a lot more detail than we will cover, including a unit on sentiment analysis: http://gate.ac.uk/wiki/training-materials-2011.html http://gate.ac.uk/wiki/training-materials-2011.html

42 ©2012 Paula Matuszek What Next? l In lab we will create a simple application and use it. l Next week we will go into a lot more detail on using Annie for information extraction l Homework. (You knew that was coming...) l I’m not going to get into programming in GATE or the more advanced applications. This might be the best tool for some of your projects, though.


Download ppt "©2012 Paula Matuszek CSC 9010: Text Mining Applications Fall, 2012 Introduction to GATE Dr. Paula Matuszek Taken partially from."

Similar presentations


Ads by Google