The Simple Corpus Tool
Martin Weisser
Research Center for Linguistics & Applied Linguistics, Guangdong University of Foreign Studies
weissermar@gmail.com

Outline
- Genesis of the Tool
- Feature Overview
- Illustration of Individual Features
  - Annotation
  - Concordancing
  - N-gram Analysis
  - Feature Extraction

Genesis of the Tool
- 2001–2002: SPAAC (A Speech-Act Annotated Corpus of Dialogues) Project
  - semi-automated annotation of 1,200+ transactional dialogues
  - majority of data ‘unpublishable’, due to restrictions imposed by BT
- 2013: release of SPAADIA corpus (version 1)
  - user query about best viewing option
  - SPAADIA Concordancer
  - further development into Simple Corpus Tool, including extended options for analysis, feature extraction & annotation
- v. 1 released Oct 2013
- current version: 1.5

Feature Overview (1)
- corpus editing & analysis tool
- includes:
  - annotation editor
  - concordancer
  - n-gram analysis
  - feature counting
- flexible & configurable options
- supports full Perl regular expressions

Feature Overview (2)
[Annotated screenshot. Callouts: feature counting options/definitions; concordancer, with results hyperlinked to editor; n-gram analysis tool; corpus files editable; extension filter; input files workspace]

Annotation (1)
- editor linked to various analysis features
  - cyclical refinement of annotations
  - convenient extraction of annotated features
- file encoding assumed to be UTF-8 (e.g. allows insertion of phonetic characters)
- XML/pseudo-SGML annotation for XML & text files
- annotation resources fully configurable
  - containing elements (block & inline)
  - empty elements
  - optional default attributes
  - categorised cascading menus for values
  - colour-coding for tags
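
The distinction between containing and empty elements with optional default attributes can be illustrated with a minimal Python sketch. This is not the tool's own code: the helper names `containing` and `empty`, and the tag and attribute names in the usage lines, are hypothetical and chosen only for illustration.

```python
from xml.sax.saxutils import escape, quoteattr


def _attributes(attrs):
    """Serialise keyword arguments as XML attributes, e.g. ' type="yes-no"'."""
    return "".join(f" {name}={quoteattr(value)}" for name, value in attrs.items())


def containing(tag, text, **attrs):
    """Wrap text in a containing element (hypothetical helper)."""
    return f"<{tag}{_attributes(attrs)}>{escape(text)}</{tag}>"


def empty(tag, **attrs):
    """Build an empty element (hypothetical helper)."""
    return f"<{tag}{_attributes(attrs)} />"


# Hypothetical tag and attribute names, for illustration only.
question = containing("q", "is it open", type="yes-no")
pause = empty("pause", length="short")
```

In this sketch a "default attribute" would simply be a keyword argument pre-filled by the caller; how the tool itself stores its annotation resources is not stated in the slides.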

Annotation (2)
[Annotated screenshot. Callouts: containing elements; empty elements; attributes; values (sub-categorised); colour coding: syntactic class, empty elements]

Concordancing (1)
- line-based concordancer
  - assumes that main structural units & text are separate
  - context set to n lines before or after
- concordancing on tags or textual content (2 potential search terms)
- displays dispersion
- full Perl regex support
  - option for storing commonly used regexes
- SPAADIA/DART features
  - colour coding
  - pre-defined unit tags and speech-act attributes
- hits hyperlinked to editor for
  - adding annotations
  - modifying existing annotations
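
A line-based concordancer with n lines of context can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's implementation: the function name `concordance`, the sample markup, and the use of Python's `re` module (whose syntax is close to, but not identical with, Perl regular expressions) are all assumptions.

```python
import re


def concordance(lines, pattern, context=1):
    """Return (line_number, before, hit, after) for every line matching pattern.

    `before` and `after` hold up to `context` neighbouring lines each.
    """
    regex = re.compile(pattern)
    hits = []
    for index, line in enumerate(lines):
        if regex.search(line):
            before = lines[max(0, index - context):index]
            after = lines[index + 1:index + 1 + context]
            hits.append((index + 1, before, line, after))
    return hits


# Hypothetical sample: tags and text sit on separate lines, as the
# line-based approach assumes. The tag names are invented for illustration.
sample = [
    '<turn speaker="A">',
    '<q type="wh">',
    "what time does the train leave",
    "</q>",
    "</turn>",
]

text_hits = concordance(sample, r"\btrain\b", context=1)
tag_hits = concordance(sample, r"<q\b[^>]*>", context=1)
```

Because tags and text occupy separate lines, the same function serves for searching either textual content or tags; only the pattern changes.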

Concordancing (2)
[Annotated screenshot. Callouts: search term 1; search term 2; dispersion; context settings; hyperlink to editor; hits]

N-gram Analysis (1)
- hyperlinked to concordancer
- includes relative frequencies & dispersion
- ‘optimised’ for spoken language:
  - option for excluding fillers
  - re-interpolating into concordances
- efficient regex filtering
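
Counting n-grams with fillers excluded, and reporting relative frequencies, might look like the following Python sketch. It is only an approximation of the idea: the filler list, the tokenisation pattern, and the function name `ngram_frequencies` are assumptions, and dispersion is left out because the slides do not define how the tool computes it.

```python
import re
from collections import Counter

# Hypothetical filler list; the tool's exclusion options are customisable.
FILLERS = {"um", "uh", "er", "erm"}


def ngram_frequencies(text, n=2, exclude=FILLERS):
    """Map each n-gram to (absolute count, relative frequency)."""
    tokens = [
        token
        for token in re.findall(r"[a-z']+", text.lower())
        if token not in exclude
    ]
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {gram: (count, count / total) for gram, count in counts.items()}


bigrams = ngram_frequencies("um I would like er I would like a ticket")
```

Removing fillers before building the n-grams means that sequences interrupted by hesitation markers, such as "like er I", are counted as if the filler were absent.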

N-gram Analysis (2)
[Annotated screenshot. Callouts: case handling; output filter; sorting options; customisable exclusion options for producing cleaned n-grams, which can be re-interpolated into the concordancer; n-gram length; relative frequencies & dispersion; n-gram counter; hyperlinked n-grams, which prime the concordancer]

Feature Extraction (1)
- basic feature: word count per file
  - can be filtered
  - annotations automatically removed
  - exceptions (e.g. anonymised names) can be specified
- advanced
  - ‘feature label :: pattern’ pairings
  - ad hoc definitions in ‘Feature definitions’ window
  - can be loaded from & saved to files
  - built-in regex pattern evaluation & error reporting
- convenient ‘export’ to Excel/Calc for further analysis (e.g. frequency norming)
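
The ‘feature label :: pattern’ pairings, the pattern checking, and the per-file counts can be sketched in Python as follows. This is a hedged illustration only: the function names, the sample definitions, and the sample file contents are hypothetical, and Python's `re` stands in for the Perl regular expressions the tool supports.

```python
import re


def parse_definitions(text):
    """Parse 'label :: pattern' lines; return (features, errors).

    Invalid lines are reported with their line number instead of raising.
    """
    features, errors = {}, []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        label, separator, pattern = line.partition("::")
        if not separator:
            errors.append((line_no, "missing '::' separator"))
            continue
        try:
            features[label.strip()] = re.compile(pattern.strip())
        except re.error as exc:
            errors.append((line_no, str(exc)))
    return features, errors


def word_count(text):
    """Count whitespace-separated words after stripping anything tag-like."""
    return len(re.sub(r"<[^>]+>", " ", text).split())


def count_features(files, features):
    """Rows are file names, columns are feature labels."""
    return {
        name: {
            label: sum(1 for _ in regex.finditer(text))
            for label, regex in features.items()
        }
        for name, text in files.items()
    }


# Hypothetical definitions; the third one is deliberately malformed.
definitions = r"""questions :: \?
yes_no :: \b(?:yes|no)\b
broken :: (unclosed"""

# Hypothetical file contents.
files = {
    "a.xml": "<turn>yes is it open?</turn>",
    "b.xml": "<turn>no</turn>",
}

features, errors = parse_definitions(definitions)
table = count_features(files, features)
```

The nested dictionary mirrors a spreadsheet layout, so writing it out with the standard `csv` module would give a file that Excel or Calc can open for further analysis such as frequency norming.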

Feature Extraction (2)
[Annotated screenshot. Callouts: feature counts per file; feature labels → column headings; file names → row headings; feature definition patterns]

Future Extensions
- concordancing on text within specified tags
- n-gram list comparison
- collocations?
- exposing more customisation options
- user requests