The Sketch Engine as Infrastructure for Large Scale Text Collections for Humanities Research Adam Kilgarriff Lexical Computing Ltd. & Univ of Leeds, UK.

Slides:

Advertisements

Similar presentations

Terminology-finding in the Sketch Engine Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vit Suchomel Lexical Computing Ltd., Brighton,

Advertisements

The Cambridge Learner Corpus, English Profile, the Sketch Engine and the Kelly Project Adam Kilgarriff Lexical Computing Ltd

Data Mining and Text Analytics GATE, by Joel Bywater.

Feed Corpus : An Ever Growing Up to Date Corpus Akshay Minocha, Siva Reddy, Adam Kilgarriff Lexical Computing Ltd.

Providing collections, tools and services for digital humanities A national library perspective Clément Oury Head of Digital Legal Deposit Bibliothèque.

WebBootCaT usage Adam Kilgarriff Lexical Computing Ltd.

1 Corpora for all Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds and Sussex.

Linking Dictionary and Corpus Adam Kilgarriff Lexicography MasterClass Ltd Lexical Computing Ltd University of Sussex UK.

1 Corpora for the coming decade Adam Kilgarriff. Dublin June 2009 Kilgarriff: Corpora for the coming decade2 How should they be different?  Bigger 

L EARNERS ’ D ICTIONARY Deny A. Kwary

Augmenting online dictionary entries with corpus data for Search Engine Optimisation Holger Hvelplund, 1 Adam Kilgarriff, 2 Vincent Lannoy, 1 Patrick White.

Corpus Creation for Lexicography Adam Kilgarriff, Michael Rundell Lexicography MasterClass, UK Elaine Ui Dhonnchadha ITE (Linguistics Institute of Ireland)

FLAX Weaving with Oxford Open Educational Resources Alannah Fitzgerald

Using Corpora for Teaching Chinese Dr. Adam Kilgarriff Lexical Computing Ltd Leeds University UK.

True IT Solutions For You 1 IT Solutions Software Development and Web Design.

Making useful wordlists for ELT Topical vocabulary from the WWW Simon Smith & Scott Sommers Ming Chuan University, Taipei Adam Kilgarriff, Lexical Computing.

Oxford Scholarship Online Lund Online, 18 th October 2007, Lund Sweden Adina Teusan, OUP Online Sales Executive – Scandinavia.

Talking about your homework News story? –What made you choose…? One of your words? –What made you choose…? (Give your vocabulary books to another student.

1 Corpora for the coming decade Adam Kilgarriff Lexical Computing Ltd.

Feed Corpus : An Ever Growing Up to Date Corpus Akshay Minocha, Siva Reddy, Adam Kilgarriff Lexical Computing Ltd.

Research Methods & Data AD140Brendan Rapple 2 March, 2005.

Tools for Historical corpus research, and a corpus of Latin Barbara McGillivray Oxford University Press Adam Kilgarriff Lexical Computing Ltd.

Simple Maths for Keywords Adam Kilgarriff Lexical Computing Ltd.

1 Evaluating word sketches Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds and Sussex.

Using Corpora for Teaching Chinese Dr. Adam Kilgarriff Lexical Computing Ltd Leeds University UK.

INTERNET CHAPTER 12 Information Available The INTERNET contains a huge amount of information a huge amount of information information on any topic you.

1 Corpora, Language Technology and Maltese Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd University of Sussex.

Search Engines Meta Engines People Directories Subject Directories Domains explained URLs explained Hypertext Language Contents.

Word senses Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds, Sussex.

GDEX: Automatically finding good dictionary examples in a corpus Adam Kilgarriff, Miloš Husák, Katy McAdam, Michael Rundell, Pavel Rychlý Lexical Computing.

1 Corpora, Dictionaries, and points in between in the age of the web Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of.

1 Corpora, Language Technology and Maltese Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd University of Sussex.

Why We Need Corpora and the Sketch Engine Adam Kilgarriff Lexical Computing Ltd, UK Universities of Leeds and Sussex.

University of North Texas Libraries Building Search Systems for Digital Library Collections Mark E. Phillips Texas Conference on Digital Libraries May.

Without data, nothing Adam Kilgarriff Lexical Computing Ltd University of Leeds.

Corpora by Web Services Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds and Sussex.

ECM in SaaS mode to maximize information sharing in a globalized world Business case: bwin.

Internet tool to find answers to poorly defined questions SmartNet © ITC Software,

Comparable Corpora BootCaT (CCBC) (or: In Praise of BootCaT) Adam Kilgarriff, Jan Pomikalek, Avinesh PVS Lexical Computing Ltd. Work Supported by EU FP7.

COMM 170 USING THE WEB. WHEN I NEED TO FIND RESOURCES FOR AN ASSIGNMENT THE FIRST THING I DO IS... 1.Go to the Web & Google it 2.Ask my friends on Facebook.

Corpora and Concordancers in ESL/EFL Class: Truly Authentic Language for Language Learning. and opening.

Changing the way the world learns English 1. Intellectual leadership A few years from now, anyone wanting to know about teaching or learning English.

© Cambridge University Press 2013 Thomson_alphaem.

© Cambridge University Press 2013 Thomson_Fig

arTenTen A new, vast corpus for Arabic

Using Corpora in Language Research Adam Kilgarriff Lexical Computing Ltd Universities of Leeds January 2013Adam Kilgarriff.

Malta, May 2010Kilgarriff: Corpora by Web Services1 Corpora by Web Services Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities.

Terminology-finding in the Sketch Engine Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vit Suchomel Lexical Computing Ltd., Brighton,

CL 2005, Birmingham Web as Corpus Workshop Intro: Adam Kilgarriff 1 Web as Corpus Workshop Co-chairs: Marco Baroni Adam Kilgarriff Sebastian Hoffman.

© copyright 2014 Semantic Insights™ “A New Natural Language Understanding Technology for Research of Large Information Corpora." By Chuck Rehberg, CTO.

Scientific information  Resources  Search & Access  Evaluation  Intellectual Property  E- learning E.Kvasnak, 3 rd Medical Faculty, Charles University.

英 3B 戴偲婷. WConcord is a fast and easy to use concordancer for unlimited amounts of text. It allows the user to load multiple plain text files (.txt)

Future Web Trends Brian Kelly UK Web Focus UKOLN University of Bath UKOLN is funded by Resource: The Council for Museums, Archives.

Grammar is to Meaning as the Law if to Good Behaviour Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds and Sussex.

Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd., UK Supported by EU Project PRESEMT.

© Cambridge University Press 2013 Thomson_Fig

What is a Corpus? What is not a corpus?  the Web  collection of citations  a text Definition of a corpus “A corpus is a collection of pieces of language.

Databases. Databases – objective Define database Show differences between databases and search engine Show pathfinder Demonstrate databases Search databases.

© Cambridge University Press 2011

Making useful wordlists for ELT

Building Search Systems for Digital Library Collections

Handling Data Using Databases

using the internet for research

Thomson_eeWWtgc © Cambridge University Press 2013.

Thomson_atlascmsEventsAlt

A Latin corpus for Sketch Engine

Thomson_CandP © Cambridge University Press 2013.

Simulation And Modeling

Corpora, Language Technology and Maltese

Thomson_AFBCartoon © Cambridge University Press 2013.

Presentation transcript:

The Sketch Engine as Infrastructure for Large Scale Text Collections for Humanities Research Adam Kilgarriff Lexical Computing Ltd. & Univ of Leeds, UK

Requirements  Fast  Stable and reliable  Handle collections of any size  Even billions of words  Support complex markup  Wide range of query-types, reports  Live on the web  With access management

Requirements  One infrastructure, many resources  Ten-year-plus timescale  With long term:  Support and maintenance  Ongoing development  Engagement with resource development  University research projects not designed that way  Commercial: advantages

Everything or just text

or

You can’t please all the people all the time

Everything or just text  Vast  Indexing – how?  what search terms?  Solve the world  Small  Indexing  Easy  Divide and rule

Sketch Engine  Text only  Meets all criteria  Ten years  Users  Dictionary-making  Oxford Univ Press, Cambridge Univ Press, Collins, Macmillan, le Robert, Cornelsen  INL and eight other national research institutes  Universities  Research, teaching, language teaching

Linguistics  Text database = corpus (pl: corpora)

Languages  Around sixty  Main world languages:  “tenten” corpora, order of 10b words  Web scale

Where now  Core technology  In place  Front end for linguists  In place  Front end for other humanities scholars  Good prospect  Links to other resources  Preliminary work with British Library  Proposals welcome

Thank you