Corpora Linguistics
The Corpógrafo
Belinda Maia & Luís Sarmento
PoloFLUP LINGUATECA
A bit of history
- PALC ’97 – 'Do-it-yourself corpora... with a little bit of help from your friends!'
- CULT ‘Making corpora – a learning process’
- Contrastive linguistics
- Corpora linguistics
- Translation teaching
- General > specific language

A bit of history
- 2000 – 1st Master’s in Terminology and Translation at FLUP
- PALC ‘Training Translators in Terminology and Information Retrieval using Comparable and Parallel Corpora’
- Specialized translation and terminology
- Contact with domain experts
- Importance of IT
- Need for technical help for more ambitious students!

A bit of history
- LREC ‘Corpora for terminology extraction – the differing perspectives and objectives of researchers, teachers and language services providers’
- 2002 – 2nd Master’s in Terminology and Translation at FLUP
- Plea for help to Diana Santos
- October 2002 – LINGUATECA - Polo FLUP
LINGUATECA
- See
- Leader > Diana Santos (SINTEF – Oslo)
- Objective - to create resources and tools for the computational processing of Portuguese
- Nodes at Oslo, Lisbon, Braga and Porto
- Porto - Polo CLUP/FLUP
Polo CLUP/FLUP General focus
- See
- On constructing resources specific to the needs of FLUP/CLUP
  – For researchers, teachers and students
  – For teaching methodology at FLUP
- BNC & Reuters corpora on intranet
- A small ‘chat’ corpus
- Comparable corpora
More history
- 2003 – Poster of the GC – at CL – ‘What are comparable corpora?’
- CL – Experimentation with evaluation of Machine Translation
- 2003 – Experimentation with GC
- 2003 – 3rd Master’s in Terminology and Translation at FLUP
Polo CLUP/FLUP Research focus
- See
- On-line suite of corpora tools to work with comparable corpora with emphasis on bilingual research
  – Focus on special domains
  – Construction of terminology databases, ontologies and domain models
- Corpógrafo
And...
- Evaluation of Machine Translation
  – Experimentation with evaluation
  – Teaching + research focus
  – Tools for collecting empirical data
- Results:
  – TrAva – MT evaluation tool
  – CorTA – Corpus of 1 EN input + 4 MT output sentences
The Corpógrafo results from:
- Terminology, translation and language study and research (Belinda)
- Computational linguistics research and production of resources (Diana)
- Information retrieval and artificial intelligence (Luís)
- Terminology data (Domain experts)
- = Discussions on priorities!
GC – Integrated Web Environment for Corpora Linguistics

Motivation
- Lack of comprehensive, wide-scope corpora tools
- Commercial packages are usually difficult to integrate/customize
- Tools are not prepared to support cooperative work
- Linguistic knowledge is not usually integrated in tools

What is GC?
GC is a Web tool being developed at Linguateca/CLUP that aims to provide a comprehensive work environment for corpora-based linguistic research. GC allows users to:
- access several corpora tools from a single entry point using a regular web browser
- access and query generic corpora (BNC, Reuters, COMPARA, CETEMPúblico)
- build personal simple, parallel and comparable corpora from text files (PDF, PS, Word, HTML, TXT)
- use several (on-line/off-line) tools with their personal corpora (statistics, POS-taggers, filters, etc.)
- communicate and exchange results with other users

Internet Integration
GC provides seamless integration with the World Wide Web, allowing users to:
- search specific corpora resources on the Internet
- query the web for concordances
- use available translation engines in parallel

[Architecture diagram. Labels: input formats DOC, HTML, TXT, PS, PDF, RTF; generic corpora BNC, CETEMPúblico, COMPARA, Others; Personal Corpora; Custom Interface; roles DEV, ADM, USER; Inter-user Communication; Virtual Desktop; Internet; Terminology DB; Tool Pool: Concordance Engine, Taggers, Aligner (Semi-Auto), Corpora Bot, Statistics, Custom Tools, Terminology Extraction Tool (Auto/Semi-Auto)]

Corpora Taxonomy
- Medium: written, spoken, multimedia
- Domain: engineering, medicine, etc.
- Genre: scientific, technical, informative, etc.

Administrator’s Tasks:
- Users, Groups and Disk Quotas
- Corpora Taxonomy (see box)
- Documentation Organization
- Access Service Statistics

Developer’s Tasks:
- Integrate Existing Tools/Resources
- Develop Additional Generic Tools
- Interact with Users/Administrator
- Develop Custom Tools for particular research needs

Inter-User Communication
- Tagging and Aligning Cooperatively
- Messaging Service
- Exchange of Corpora Resources

Teacher’s Tasks:
- Provide on-line tutorials
- Provide links to:
  – on-line teaching material
  – bibliography and other resources
Working with the Corpógrafo
- Corpógrafo is a suite of integrated tools for INDIVIDUAL or GROUP research
- All research done ONLINE
- Each username/password = separate space on our server
- At present > anyone can work with it using 10 MB space for FREE
- BUT - you get an empty space + tools + tutorial!
Corpora and Terminology
- Special Domain Corpora
- Terminology extraction
- Terminology databases
- Structuring of domain knowledge
- Further corpora and information retrieval
[Diagram; labels: Corpora Analysis Terminology Database Internet Text details]
Terminology: Prescription or Description?
- Prescriptive > descriptive
- Paper > digital form
- Static > dynamic resources
- ‘Democratization’ of terminology
- ISO standards > socioterminology
- Knowledge structures increasingly recognized as structured but dynamic
Perspectives of terminology users
- Domain experts and vested interests
- Translators
- Information retrieval
- Knowledge engineering

- Standardized terminology
- The ‘right word’
- Finding information
- Perfecting Google
- Structuring knowledge
- Finding it fast
Bridging the Gap
- General linguists
- Translation teachers
- Translation students
- Corpus linguists
- Computational linguists
- Computer engineers
- Computer-phobia
- Computer-worship
Focus of Corpógrafo
- Design priorities are to:
  – See the Big Picture
  – Create the Overall Framework
  – Get feedback from users
  – Develop according to real research needs
  – Fill in details and improve techniques as needed
File Manager
- Area where each individual or group can:
  – Upload texts to space on server
  – Convert various text formats to .txt
  – ‘Clean’ them of unnecessary material
  – Check tokenization and sentence divisions
  – Register full information on source, domain and text type
  – Group, and re-group, texts into corpora
General corpus analysis
- Concordancing tools allowing for
  – Concordancing at sentence level
  – KWIC concordancing
  – Collocations
- N-gram tool
  – Case-sensitive
  – Alphabetical or frequency ordering
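To make KWIC (key word in context) concordancing concrete, here is a minimal Python sketch. It is not the Corpógrafo's implementation: the regex tokenizer, the function names `tokenize` and `kwic`, and the `width` parameter are all assumptions chosen for illustration.

```python
import re

def tokenize(text):
    """Split text into word tokens (a deliberately simple regex tokenizer)."""
    return re.findall(r"\w+(?:[-']\w+)*", text)

def kwic(text, keyword, width=3, case_sensitive=False):
    """Return (left context, node, right context) for each hit of keyword."""
    tokens = tokenize(text)
    norm = (lambda s: s) if case_sensitive else str.lower
    target = norm(keyword)
    lines = []
    for i, tok in enumerate(tokens):
        if norm(tok) == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

sample = ("The corpus is small. A comparable corpus pairs texts. "
          "Each Corpus has metadata.")

# Print the hits with the node word aligned in a column.
for left, node, right in kwic(sample, "corpus", width=2):
    print(f"{left:>20}  {node}  {right}")
```

Setting `case_sensitive=True` restricts the hits to exact-case matches, which mirrors the case-sensitive option listed for the n-gram tool.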
Corpora + TDB
- Choose corpus
- Choose related TDB
- = All terms, examples, definitions extracted (semi) automatically from corpus and transferred to TDB
- = All metadata on texts providing data can be automatically transferred to TDB
Term extraction
- N-grams
  – Unfiltered
  – Filtered with restrictions on term in PT EN FR IT ES DE
  – Filtered with restrictions on term and context in PT EN FR IT ES DE
  – Singular + plural terms can be combined
  – Existing terms in TDB need not appear
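One way to picture "filtered with restrictions on term" is an n-gram counter that discards candidates beginning or ending with a function word. The sketch below assumes that reading; the slide does not say which restrictions the Corpógrafo applies. The tiny `STOPWORDS` lists, the crude sentence split, and the names `candidate_terms` and `ordered` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword lists; a real filter would need much fuller ones.
STOPWORDS = {
    "en": {"the", "of", "a", "an", "and", "in", "is", "to", "for"},
    "pt": {"o", "a", "de", "do", "da", "e", "em", "um", "uma", "para"},
}

def candidate_terms(text, lang="en", max_n=3, case_sensitive=False):
    """Count 1..max_n-grams that neither start nor end with a stopword."""
    stop = STOPWORDS[lang]
    counts = Counter()
    # Crude sentence split so that n-grams do not cross sentence boundaries.
    for sentence in re.split(r"[.!?]+", text):
        tokens = re.findall(r"\w+", sentence if case_sensitive else sentence.lower())
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if gram[0].lower() in stop or gram[-1].lower() in stop:
                    continue
                counts[" ".join(gram)] += 1
    return counts

def ordered(counts, by="frequency"):
    """Order candidates alphabetically, or by descending frequency."""
    if by == "alphabetical":
        return sorted(counts.items())
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

sample = "The erosion of the coast is severe. Coastal erosion threatens the coast."
for term, freq in ordered(candidate_terms(sample)):
    print(freq, term)
```

Stopwords inside a candidate are allowed, so multi-word terms with internal prepositions can survive. The output is only a list of candidates, which still has to be checked against the underlying concordances before anything is sent to the TDB.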
Term selection from n-grams
- Consultation of list of n-grams
- Check term status of each n-gram via underlying concordances
- Check sources
- Send to TDB
Search for Candidates for Definitions and/or Semantic Relations
- Already possible via TDB
- Under development
- Research areas for Master’s dissertations and research assistants
  – Expressions that find definitions
  – Expressions that find semantic relations
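"Expressions that find definitions" can be illustrated with a few lexical patterns applied sentence by sentence. This is a sketch under stated assumptions: the two English patterns, the sentence splitter, and the function names are invented for illustration and are not the expressions used in the Corpógrafo.

```python
import re

# Hypothetical lexical patterns; real projects would tune these per
# language and domain.
PATTERNS = [
    ("defined_as", re.compile(
        r"^(?P<term>.+?) (?:is|are) defined as (?P<gloss>.+)$", re.I)),
    ("is_a", re.compile(
        r"^(?P<term>.+?) (?:is|are) (?:a|an|the) (?P<gloss>.+)$", re.I)),
]

def split_sentences(text):
    """Naive split on sentence-final punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def definition_candidates(text):
    """Return sentences that match a pattern, with the term and gloss split out."""
    found = []
    for sentence in split_sentences(text):
        body = sentence.rstrip(".!?")
        for name, pattern in PATTERNS:
            m = pattern.match(body)
            if m:
                found.append({"pattern": name,
                              "term": m.group("term"),
                              "gloss": m.group("gloss"),
                              "sentence": sentence})
                break  # keep only the first matching pattern per sentence
    return found

sample = ("A concordance is a list of the occurrences of a word in context. "
          "Erosion threatens the coast. "
          "Terminology is defined as the vocabulary of a special domain.")
for cand in definition_candidates(sample):
    print(cand["pattern"], "|", cand["term"], "|", cand["gloss"])
```

Patterns this simple over-generate, so the matches are candidates for a human to accept or reject, in line with the semi-automatic workflow on the slides.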
TDB - Terminology database
- Databases are designed to be multilingual
  – Terms listed alphabetically + language tag
  – General data
  – Morphological data
  – Source metadata: Authors, texts etc
  – Definitions + search for candidates
  – Translation equivalents
  – Semantic relations
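The fields listed for the TDB suggest a record along the following lines. This is a sketch only: the slides do not give the actual database schema, so every class and field name here is a hypothetical choice.

```python
from dataclasses import dataclass, field

@dataclass
class Source:
    """Source metadata for a text that supplied data (hypothetical fields)."""
    author: str
    title: str

@dataclass
class TermEntry:
    """One term in a multilingual terminology database (hypothetical schema)."""
    term: str
    language: str                                     # language tag, e.g. "en", "pt"
    part_of_speech: str = ""                          # morphological data
    definitions: list = field(default_factory=list)
    sources: list = field(default_factory=list)       # Source objects
    equivalents: dict = field(default_factory=dict)   # language tag -> list of terms
    relations: list = field(default_factory=list)     # (relation, other term) pairs

def listing(entries):
    """Alphabetical listing of terms, each followed by its language tag."""
    ranked = sorted(entries, key=lambda e: (e.term.lower(), e.language))
    return [f"{e.term} [{e.language}]" for e in ranked]

entries = [
    TermEntry("coastal erosion", "en", part_of_speech="noun",
              equivalents={"pt": ["erosão costeira"]},
              sources=[Source("(unknown)", "(sample text)")]),
    TermEntry("erosão costeira", "pt", equivalents={"en": ["coastal erosion"]}),
    TermEntry("aquifer", "en", relations=[("is_a", "geological formation")]),
]
print(listing(entries))
```

Keeping translation equivalents in a dictionary keyed by language tag allows one entry to point to several languages, which fits the stated multilingual design.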
Future developments
- General testing and improvement
- Development of new ideas or functions
- Isomorphic relationship between:
  – Research possibilities
  – Researchers’ needs
  – Our skills
- Coordination of individual corpus projects into bigger projects, when possible or necessary
Theoretical questions / problems
- How large is a good domain corpus?
- Comparable corpora v. Parallel corpora?
- How much information does a database need – for information retrieval and knowledge engineering?
- How much does the user of a database need – for translation, teaching etc.?
Corpógrafo and special domains
- Master’s in Terminology and Translation
- Terminology projects with the support of domain specialists in:
  – Engineering – Electronics, Mechanical Engineering
  – Geography – Population Geography, Natural Hazards (Fire, Floods, Earthquakes, Coastal Erosion)
  – Medicine – Kidney support machines, Neurology
  – Science – Genetics
  – Technology – GPS (Global Positioning System)
Corpógrafo and terminology/translation research
- Ongoing dissertations on aspects of:
  – Terminology – neologisms, definition searches, semantic relations, conceptual analysis
  – Corpora – text analysis, corpora construction
  – Technical writing > Electrical Appliances
  – Localization
  – Terminology in documentaries
  – Translation of Multimedia
Linguateca
- Linguateca’s policy - all resources and tools freely available online
- Primary users - Portuguese and Brazilian
- Other users also welcome
Polo CLUP/FLUP
- Bi- or multi-lingual in interest
- Corpógrafo available for experiments on a small scale to the general public
- Possibilities of future work on projects with users from other universities and other countries
Corpógrafo team
- Belinda Maia - FLUP - Associate Professor
- Luís Sarmento - Linguateca, FCCN - Computer Engineer, Researcher-in-charge
- Luís Miguel Cabral - Linguateca, FCCN - Computer Engineer, Research assistant
- Débora Oliveira - Linguateca, FCCN - Research assistant
- Ana Sofia Pinto - FLUP - Technical assistant
Contacts
- If you are interested in finding out more, please contact me:
  – Belinda Maia at
  – Or Luís Sarmento at
- The Corpógrafo can be used (with a username and password) at: and