Corpora Linguistics 23.08.04 The Corpógrafo Belinda Maia & Luís Sarmento PoloFLUP LINGUATECA


Corpora Linguistics The Corpógrafo Belinda Maia & Luís Sarmento PoloFLUP LINGUATECA

A bit of history
PALC ’97 – 'Do-it-yourself corpora... with a little bit of help from your friends!'
CULT ‘Making corpora – a learning process’
→ Contrastive linguistics
→ Corpora linguistics
→ Translation teaching
→ General > specific language

A bit of history
2000 – 1st Master’s in Terminology and Translation at FLUP
PALC ‘Training Translators in Terminology and Information Retrieval using Comparable and Parallel Corpora’
→ Specialized translation and terminology
→ Contact with domain experts
→ Importance of IT
→ Need for technical help for more ambitious students!

A bit of history
LREC ‘Corpora for terminology extraction – the differing perspectives and objectives of researchers, teachers and language services providers’
2002 – 2nd Master’s in Terminology and Translation at FLUP
→ Plea for help to Diana Santos
→ October 2002
→ LINGUATECA - Polo FLUP

LINGUATECA
See
Leader > Diana Santos (SINTEF – Oslo)
Objective - to create resources and tools for the computational processing of Portuguese
Nodes at Oslo, Lisbon, Braga and Porto
Porto - Polo CLUP/FLUP

Polo CLUP/FLUP
General focus
See
On constructing resources specific to the needs of FLUP/CLUP
– For researchers, teachers and students
– For teaching methodology at FLUP
→ BNC & Reuters corpora on intranet
→ A small ‘chat’ corpus
→ Comparable corpora

More history
2003 – Poster of the GC – at CL – ‘What are comparable corpora?’ CL – Experimentation with evaluation of Machine Translation
2003 – Experimentation with GC
2003 – 3rd Master’s in Terminology and Translation at FLUP

Polo CLUP/FLUP
Research focus
See
On-line suite of corpora tools to work with comparable corpora with emphasis on bilingual research
– Focus on special domains
– Construction of terminology databases, ontologies and domain models
→ Corpógrafo

And...
Evaluation of Machine Translation
– Experimentation with evaluation
– Teaching + research focus
– Tools for collecting empirical data
Results:
– TrAva – MT evaluation tool
– CorTA – Corpus of 1 EN input + 4 MT output sentences

The Corpógrafo results from:
Terminology, translation and language study and research (Belinda)
Computational linguistics research and production of resources (Diana)
Information retrieval and artificial intelligence (Luís)
Terminology data (Domain experts)
= Discussions on priorities!

GC – Integrated Web Environment for Corpora Linguistics

Motivation
– Lack of comprehensive, wide-scope corpora tools
– Commercial packages are usually difficult to integrate/customize
– Tools are not prepared to support cooperative work
– Linguistic knowledge is not usually integrated in tools

What is GC?
GC is a Web tool being developed at Linguateca/CLUP that aims to provide a comprehensive work environment for Corpora-Based Linguistic Research. GC allows users to:
– access several Corpora tools from a single entry point using a regular web browser
– access and query generic Corpora (BNC, Reuters, COMPARA, CETEMPúblico)
– build personal simple, parallel and comparable Corpora from text files (PDF, PS, Word, HTML, TXT)
– use several (on-line/off-line) tools with their personal Corpora (statistics, POS-taggers, Filters, etc.)
– communicate and exchange results with other users

Internet Integration
GC provides seamless integration with the World Wide Web allowing users to:
– search specific Corpora resources on the Internet
– query the web for concordances
– use available translation-engines in parallel

[Architecture diagram. Labels: input formats DOC, HTML, TXT, PS, PDF, RTF; generic corpora BNC, CETEMPúblico, COMPARA, Others; Personal Corpora; Custom Interface for USER, ADM and DEV; Inter-user Communication; Virtual Desktop; Tool Pool with Concordance Engine, Taggers, Aligner (Semi-Auto), Corpora Bot, Statistics, Custom Tools, Terminology Extraction Tool (Auto/Semi-Auto); Internet; Terminology DB]

Administrator’s Tasks:
– Users, Groups and Disk Quotas
– Corpora Taxonomy (see box)
– Documentation Organization
– Access Service Statistics

Corpora Taxonomy
– Medium: written, spoken, multimedia
– Domain: Engineering, medicine, etc.
– Genre: scientific, technical, informative, etc.

Developer’s Tasks:
– Integrate Existing Tools/Resources
– Develop Additional Generic Tools
– Interact with Users/Administrator
– Develop Custom Tools for particular research needs

Inter-User Communication
– Tagging and Aligning Cooperatively
– Messaging Service
– Exchange of Corpora Resources

Teacher’s Tasks:
– Provide on-line tutorials
– Provide links to: on-line teaching material, bibliography and other resources

Working with the Corpógrafo
Corpógrafo is a suite of integrated tools for INDIVIDUAL or GROUP research
All research done ONLINE
Each username/password = separate space on our server
At present > anyone can work with it using 10 MB space for FREE
BUT - you get an empty space + tools + tutorial!

Corpora and Terminology
Special Domain Corpora
Terminology extraction
Terminology databases
Structuring of domain knowledge
Further corpora and information retrieval

[Diagram. Labels: Corpora Analysis, Terminology Database, Internet, Text details]

Terminology
Prescription or Description?
Prescriptive > descriptive
Paper > digital form
Static > dynamic resources
‘Democratization’ of terminology
ISO standards > socioterminology
Knowledge structures increasingly recognized as structured but dynamic

Perspectives of terminology users
Domain experts and vested interests
Translators
Information retrieval
Knowledge engineering
→ Standardized terminology
→ The ‘right word’
→ Finding information
→ Perfecting Google
→ Structuring knowledge
→ Finding it fast

Bridging the Gap
General linguists
Translation teachers
Translation students
Corpus linguists
Computational linguists
Computer engineers
Computer-phobia
Computer-worship

Focus of Corpógrafo
Design priorities are to:
– See the Big Picture
– Create the Overall Framework
– Get feedback from users
– Develop according to real research needs
– Fill in details and improve techniques as needed


File Manager
Area where each individual or group can:
– Upload texts to space on server
– Convert various text formats to .txt
– ‘Clean’ them of unnecessary material
– Check tokenization and sentence divisions
– Register full information on source, domain and text type
– Group – and re-group – texts into corpora

General corpus analysis
Concordancing tools allowing for
– Concordancing at sentence level
– KWIC concordancing
– Collocations
N-gram tool
– Case-sensitive
– Alphabetical or frequency ordering
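
The KWIC concordancing and n-gram options listed on this slide can be illustrated with a minimal Python sketch. This is not the Corpógrafo's own code: the function names (`tokenize`, `kwic`, `ngrams`, `ordered`), the regex tokenizer and the default context width are all assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text, case_sensitive=True):
    """Split text into word tokens; lowercase them unless case_sensitive."""
    tokens = re.findall(r"\w+", text)
    return tokens if case_sensitive else [t.lower() for t in tokens]

def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every exact hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

def ngrams(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ordered(counts, by="frequency"):
    """Order n-grams by descending frequency, or alphabetically otherwise."""
    if by == "frequency":
        return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return sorted(counts.items())
```

A call such as `kwic(tokenize(text, case_sensitive=False), "corpus", width=2)` lists each occurrence of the keyword with two words of context on either side, and `ordered(ngrams(tokens, 2))` gives the bigram list in frequency order.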

Corpora + TDB
Choose corpus
Choose related TDB
= All terms, examples, definitions extracted (semi-)automatically from corpus and transferred to TDB
= All metadata on texts providing data can be automatically transferred to TDB

Term extraction
N-grams
– Unfiltered
– Filtered with restrictions on term in PT EN FR IT ES DE
– Filtered with restrictions on term and context in PT EN FR IT ES DE
– Singular + plural terms can be combined
– Existing terms in TDB need not appear
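
As a rough illustration of filtered n-gram extraction, the sketch below applies one possible restriction on the term (no stopword at either edge), merges singular and plural forms naively, and skips terms already in the database. The stopword list, the plural rule and the function name `candidate_terms` are assumptions for this example; the slides do not specify the actual filters the Corpógrafo uses.

```python
from collections import Counter

# Hypothetical English stopword list; a real filter would be language-specific.
STOPWORDS = {"the", "of", "a", "an", "in", "and", "is", "to"}

def candidate_terms(tokens, n, known_terms=frozenset()):
    """Count n-grams (n >= 1) that pass the filters and are not yet known."""
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tuple(t.lower() for t in tokens[i:i + n])
        # Restriction on the term: it may not start or end with a stopword.
        if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
            continue
        # Naive singular/plural merge (assumption): strip a final "s"
        # from the last word, which is wrong for words like "analysis".
        head = gram[-1]
        if head.endswith("s") and len(head) > 3:
            gram = gram[:-1] + (head[:-1],)
        term = " ".join(gram)
        # Terms already in the terminology database need not appear.
        if term not in known_terms:
            counts[term] += 1
    return counts
```

Passing the set of terms already stored as `known_terms` keeps the candidate list limited to new material, which matches the last option on the slide.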

Term selection from n-grams
Consultation of list of n-grams
Check term status of each n-gram via underlying concordances
Check sources
Send to TDB

Search for Candidates for Definitions and/or Semantic Relations
Already possible via TDB
Under development
Research areas for Master’s dissertations and research assistants
– Expressions that find definitions
– Expressions that find semantic relations
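
One way to picture "expressions that find definitions" is as a small set of regular expressions run over the sentences of a corpus. The two English patterns below and the name `definition_candidates` are hypothetical examples, not the expressions developed in the project, and real definitional contexts are far more varied.

```python
import re

# Hypothetical English patterns; each captures a term and a definition candidate.
DEFINITION_PATTERNS = [
    re.compile(r"^(?P<term>[A-Z][\w -]*?) (?:is|are) defined as (?P<definition>.+?)\.?$"),
    re.compile(r"^(?P<term>[A-Z][\w -]*?) (?:is|are) (?P<definition>(?:a|an|the) .+?)\.?$"),
]

def definition_candidates(sentences):
    """Return (term, definition) pairs for sentences matching a pattern."""
    found = []
    for sentence in sentences:
        for pattern in DEFINITION_PATTERNS:
            match = pattern.match(sentence.strip())
            if match:
                found.append((match.group("term"), match.group("definition")))
                break  # keep only the first matching pattern per sentence
    return found
```

The output is a list of candidates only: as with term selection, each hit would still need to be checked by a person before being stored.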

TDB - Terminology database
Databases are designed to be multilingual
– Terms listed alphabetically + language tag
– General data
– Morphological data
– Source metadata: Authors, texts etc
– Definitions + search for candidates
– Translation equivalents
– Semantic relations
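
The fields listed on this slide suggest a record shape along the following lines. This is only a sketch: the class name `TermEntry`, the field names and the types are assumptions, and the actual database schema is not described in the slides.

```python
from dataclasses import dataclass, field

@dataclass
class TermEntry:
    """Hypothetical multilingual terminology record."""
    term: str
    language: str                                     # language tag, e.g. "EN" or "PT"
    part_of_speech: str = ""                          # morphological data
    sources: list = field(default_factory=list)       # source metadata: authors, texts
    definitions: list = field(default_factory=list)
    equivalents: dict = field(default_factory=dict)   # language tag -> translation equivalent
    relations: list = field(default_factory=list)     # (relation name, related term) pairs

def alphabetical(entries):
    """List entries alphabetically by term, then by language tag."""
    return sorted(entries, key=lambda e: (e.term.lower(), e.language))
```

For example, `TermEntry("kidney", "EN", equivalents={"PT": "rim"})` stores an English term with its Portuguese equivalent, and `alphabetical(entries)` reproduces the "terms listed alphabetically + language tag" view.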

Future developments
General testing and improvement
Development of new ideas or functions
Isomorphic relationship between:
– Research possibilities
– Researchers’ needs
– Our skills
Coordination of individual corpus projects into bigger projects, when possible or necessary

Theoretical questions / problems
How large is a good domain corpus?
Comparable corpora v. Parallel corpora?
How much information does a database need – for information retrieval and knowledge engineering?
How much does the user of a database need – for translation, teaching etc.?

Corpógrafo and special domains
Master’s in Terminology and Translation
Terminology projects with the support of domain specialists in:
– Engineering – Electronics, Mechanical Engineering
– Geography – Population Geography, Natural Hazards – Fire, Floods, Earthquakes, Coastal Erosion
– Medicine – Kidney support machines, Neurology
– Science – Genetics
– Technology – GPS – Global Positioning System

Corpógrafo and terminology/translation research
Ongoing dissertations on aspects of:
– Terminology – neologisms, definition searches, semantic relations, conceptual analysis
– Corpora – text analysis, corpora construction
– Technical writing > Electrical Appliances
– Localization
– Terminology in documentaries
– Translation of Multimedia

Linguateca
Linguateca’s policy - all resources and tools freely available online
Primary users - Portuguese and Brazilian
Other users also welcome

Polo CLUP/FLUP
Bi- or multi-lingual in interest
Corpógrafo available for experiments on a small scale to the general public
Possibilities of future work on projects with users from other universities and other countries

Corpógrafo team
Belinda Maia – FLUP – Associate Professor
Luís Sarmento – Linguateca, FCCN – Computer Engineer – Researcher-in-charge
Luís Miguel Cabral – Linguateca, FCCN – Computer Engineer, Research assistant
Débora Oliveira – Linguateca, FCCN – Research assistant
Ana Sofia Pinto – FLUP – technical assistant

Contacts
If you are interested in finding out more, please contact me: Belinda Maia at
Or Luís Sarmento at
The Corpógrafo can be used (with a username and password) at: and
