WebBootCaT usage 2010-2013 Adam Kilgarriff Lexical Computing Ltd.



History
BootCaT publication 2004
Exciting but
  ▫ Classes of students with no Unix skills
  ▫ Permissions
Sketch Engine: already running web service, so
  ▫ 2006: WebBootCaT
  ▫ All on our server
  ▫ Load corpora into Sketch Engine
BootCaT Front End (2011?)

WBC usage
,199 runs to build 8,832 corpora
  ▫ Ave: 1.38 iterations per corpus
  ▫ User selected keywords to iterate 673 times
Users:
  ▫ 1131 people used it once
  ▫ 1590 people: 2-10 times
  ▫ 177 people: times
  ▫ 18 people: over 50 times
Sizes of corpora (in words)
  ▫ Still-existing corpora only
    ▪ Under 25k: 663
    ▪ k: 945
    ▪ 100k-1m: 889
    ▪ Over 1m: 33
NB
  ▫ a paying service
  ▫ default quota is 1m
    ▪ pay more for more

BootCaT Front End
Stats from Eros Zanchetta
Columns: Including Bologna / Excluding Bologna
  ▫ Total number of known BootCaT installations (since August 5, 2011)
  ▫ Number of times each instance was used: Zipfian distribution
  ▫ BootCaT installations used at least once since January 1,

Search engines
Achilles heel of BootCaT
WBC
  ▫ Was Yahoo
    ▪ Changes to API
    ▪ Costs
  ▫ 2011: change to Bing
    ▪ Free up to 5000 queries / month
    ▪ We make /month
    ▪ We pay a few Euros a month for up to 10,000
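BootCaT-style tools typically turn seed terms into search-engine queries by combining them into random tuples, so the number of queries issued per run bears directly on a monthly quota such as the 5000 free Bing queries noted on this slide. The following is a minimal sketch of that step, not WebBootCaT's actual code: the function names, the tuple size of 3, the cap of 10 tuples per run, and the fixed random seed are all assumptions chosen for illustration.

```python
import itertools
import random


def make_query_tuples(seeds, tuple_size=3, max_tuples=10, rng_seed=0):
    """Return up to max_tuples distinct random combinations of seed terms.

    tuple_size, max_tuples and rng_seed are illustrative defaults,
    not settings taken from the slides.
    """
    combos = list(itertools.combinations(seeds, tuple_size))
    rng = random.Random(rng_seed)
    rng.shuffle(combos)
    return combos[:max_tuples]


def runs_within_quota(monthly_quota, queries_per_run):
    """Number of complete runs that fit inside a monthly query quota."""
    return monthly_quota // queries_per_run


# A few of the English seeds from the "Going multilingual" slide.
seeds = ["volcanology", "tephra", "magma", '"volcanic ash"', "rhyolitic"]
query_tuples = make_query_tuples(seeds)
queries = [" ".join(t) for t in query_tuples]

for q in queries:
    print(q)

# With 5 seeds there are exactly 10 three-term combinations.
print(runs_within_quota(5000, len(queries)))  # prints 500
```

Under these assumptions a free quota of 5000 queries covers 500 single-iteration runs per month; each further iteration of a corpus spends another batch of queries.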

How big a corpus do we get?

Observation
Specialist domain, L1
Specialist domain, L2
Matching terminology

Going multilingual
Translate seeds
  ▫ English: volcanology volcanologist "volcanic eruption" seismographs Eyjafjallajokull geodic "deformation monitoring" tephra magma stratigraphic tephrochronology geochronological "volcanic ash" ablation rhyolitic
  ▫ French: vulcanologue volcanologie "éruption volcanique" sismographes Eyjafjallajokull "surveillance de la déformation" géodiques tephra magma téphrochronologie stratigraphique géochronologiques "de cendres volcaniques" ablation rhyolitiques
BootCaT for English
BootCaT for French

CCBC
Input: L1, L1 seeds, L2
Bilingual dictionary
BootCaT
2 corpora
Bilingual word sketches
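The first CCBC step, turning L1 seeds into L2 seeds with a bilingual dictionary, could be sketched roughly as follows. This is an illustration only and not the CCBC implementation: `translate_seeds` and `EN_FR` are hypothetical names, and the toy dictionary is assembled from the parallel English and French seed lists on the "Going multilingual" slide.

```python
def translate_seeds(seeds, bilingual_dict):
    """Map L1 seeds to L2 seeds; return (translated, untranslated).

    Multi-word seeds are written in double quotes, as on the slides,
    so the quotes are stripped for lookup and restored on output.
    """
    translated, missing = [], []
    for seed in seeds:
        key = seed.strip('"')
        if key in bilingual_dict:
            target = bilingual_dict[key]
            translated.append(f'"{target}"' if " " in target else target)
        else:
            missing.append(seed)
    return translated, missing


# Toy English-French dictionary (illustrative, not a real resource).
EN_FR = {
    "volcanology": "volcanologie",
    "volcanologist": "vulcanologue",
    "volcanic eruption": "éruption volcanique",
    "seismographs": "sismographes",
    "magma": "magma",
}

l1_seeds = ["volcanology", '"volcanic eruption"', "magma", "tephrochronology"]
l2_seeds, untranslated = translate_seeds(l1_seeds, EN_FR)

print(l2_seeds)
print(untranslated)
```

Seeds that the dictionary does not cover are returned separately rather than dropped silently, since these are the terms a user would have to translate by hand before running BootCaT for the second language.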


Matching seeds – how?
User translates
  ▫ Yes but limited
Bilingual dictionary
  ▫ Yes but finding them??
  ▫ Induced dictionary from EUROPARL
Wikipedia
  ▫ Matching articles
Measuring comparability
  ▫ Li and Gaussier, Serge
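One simple way to put a number on comparability is dictionary coverage: of the words in each corpus that a bilingual dictionary covers, what share have a translation that occurs in the other corpus? The sketch below is a simplified score of that kind, loosely in the spirit of the work cited on this slide. The slides do not give a formula, so it should not be read as the cited authors' exact measure, and all names and the toy data are assumptions.

```python
def translated_hits(source_vocab, target_vocab, bilingual_dict):
    """Count dictionary-covered source words and how many have a
    translation present in the target vocabulary."""
    covered = [w for w in source_vocab if w in bilingual_dict]
    hits = sum(1 for w in covered if bilingual_dict[w] & target_vocab)
    return hits, len(covered)


def comparability(vocab_l1, vocab_l2, dict_l1_l2, dict_l2_l1):
    """Pooled share of covered words, in both directions, whose
    translation appears in the other corpus (0.0 if nothing is covered)."""
    hits_12, n_12 = translated_hits(vocab_l1, vocab_l2, dict_l1_l2)
    hits_21, n_21 = translated_hits(vocab_l2, vocab_l1, dict_l2_l1)
    total = n_12 + n_21
    return (hits_12 + hits_21) / total if total else 0.0


# Toy vocabularies and dictionaries (word -> set of translations).
en_vocab = {"magma", "tephra", "eruption", "glacier"}
fr_vocab = {"magma", "tephra", "éruption", "cendres"}
en_fr = {"magma": {"magma"}, "eruption": {"éruption"}, "glacier": {"glacier"}}
fr_en = {"magma": {"magma"}, "éruption": {"eruption"}, "cendres": {"ash", "ashes"}}

score = comparability(en_vocab, fr_vocab, en_fr, fr_en)
print(round(score, 3))
```

In the toy data, two of the three covered words find a translation in each direction, so the score is 4/6. A score near 1 would suggest the two corpora cover much the same vocabulary, as far as the dictionary can tell.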

Corpus Architect
Part of SkE web service
Building/managing corpora
  ▫ WBC is one way of adding text
  ▫ Others
    ▪ Upload from your computer
    ▪ Point to specified URLs
    ▪ (recent request: whole site)
  ▫ One corpus can be multiple data sets
  ▫ Other services
    ▪ Cleaning, de-duping, lemmatising, tagging
+ explore in SkE

Survey
41 people
  ▫ Original command line: 8
  ▫ Bologna Front End: 16
  ▫ WebBootCaT: 27
  ▫ Other: 1
How often?
  ▫ Once a week or more: 2
  ▫ Most months: 7
  ▫ Occasionally: 32
What for?
  ▫ Academic research: 33
  ▫ Translation work: 5
  ▫ Tr teaching/learning: 8
  ▫ Lg teaching/learning: 9
Size
  ▫ < 100 pages: 13
  ▫ (ca 1m wds): 18
  ▫ Bigger: 11
Iterations etc
  ▫ Basic, defaults: 8
  ▫ One round, change params: 15
  ▫ Iterations: 22


Suggestions/comments
Some seed wds: not possible to get corpus
Sources' reliability needs to be improved
Less important now there is SpiderLing
Webinars please
Better support for languages/character-encoding
  ▫ Japanese, Greek (3/12 comments)
Apply over large static collection: replicability
More data with more relevant content please