1
ABRAPT Mini-curso 30.08.04
The Corpógrafo: Theory and Practice
Belinda Maia & Luís Sarmento
PoloFLUP, LINGUATECA
2
A bit of history
PALC ’97 – ‘Do-it-yourself corpora... with a little bit of help from your friends!’
CULT 1998 – ‘Making corpora – a learning process’
Contrastive linguistics
Corpus linguistics
Translation teaching
General > specific language
3
A bit of history
2000 – First Master’s in Terminology and Translation at FLUP
PALC 2001 – ‘Training Translators in Terminology and Information Retrieval using Comparable and Parallel Corpora’
Specialized translation and terminology
Contact with domain experts
Importance of IT
Need for technical help for more ambitious students!
4
A bit of history
LREC 2002 – ‘Corpora for terminology extraction – the differing perspectives and objectives of researchers, teachers and language services providers’
2002 – Second Master’s in Terminology and Translation at FLUP
Plea for help to Diana Santos
October 2002 – LINGUATECA – Polo FLUP
5
LINGUATECA
See http://www.linguateca.pt
Leader > Diana Santos (SINTEF – Oslo)
Objective – to create resources and tools for the computational processing of Portuguese
Poles at Oslo, Lisbon, Braga and Porto
Porto – Polo CLUP/FLUP
6
Polo CLUP/FLUP
See http://www.linguateca.pt/poloclup/
On-line suite of corpora tools to work with comparable corpora, with emphasis on bilingual research
–Focus on special domains
–Construction of terminology databases, ontologies and domain models
Corpógrafo
7
Polo CLUP/FLUP
See http://www.linguateca.pt/poloclup/
General help in constructing resources specific to the needs of FLUP/CLUP
–For researchers, teachers and students
–For teaching methodology at FLUP
BNC & Reuters corpora on intranet
A small ‘chat’ corpus
8
More history
2003 – Poster of the GC at CL2003
2003 – ‘What are comparable corpora?’ at CL2003
2003 – Experimentation with evaluation of Machine Translation
2003 – Experimentation with GC
2003 – Third Master’s in Terminology and Translation at FLUP
9
GC – Integrated Web Environment for Corpora Linguistics
Motivation
–Lack of comprehensive, wide-scope corpora tools
–Commercial packages are usually difficult to integrate/customize
–Tools are not prepared to support cooperative work
–Linguistic knowledge is not usually integrated in tools
What is GC?
GC is a Web tool being developed at Linguateca/CLUP that aims to provide a comprehensive work environment for corpora-based linguistic research.
GC allows users to:
–access several corpora tools from a single entry point using a regular web browser
–access and query generic corpora (BNC, Reuters, COMPARA, CETEMPúblico)
–build personal simple, parallel and comparable corpora from text files (PDF, PS, Word, HTML, TXT)
–use several (on-line/off-line) tools with their personal corpora (statistics, POS-taggers, filters, etc.)
–communicate and exchange results with other users
Internet Integration
GC provides seamless integration with the World Wide Web, allowing users to:
–search specific corpora resources on the Internet
–query the web for concordances
–use available translation engines in parallel
[Architecture diagram. Labels: DOC, HTML, TXT, PS, PDF, RTF; BNC, CETEMPúblico, COMPARA, Others; Personal Corpora; Virtual Desktop; Custom Interface; ADM, DEV, USER; Inter-user Communication; Internet; Terminology DB]
Tool Pool
–Concordance Engine
–Taggers
–Aligner (Semi-Auto)
–Corpora Bot
–Statistics
–Custom Tools
–Terminology Extraction Tool (Auto/Semi-Auto)
Corpora Taxonomy
–Medium: written, spoken, multimedia
–Domain: engineering, medicine, etc.
–Genre: scientific, technical, informative, etc.
Administrator’s Tasks:
–Users, Groups and Disk Quotas
–Corpora Taxonomy (see box)
–Documentation Organization
–Access Service Statistics
Developer’s Tasks:
–Integrate Existing Tools/Resources
–Develop Additional Generic Tools
–Interact with Users/Administrator
–Develop Custom Tools for particular research needs
Inter-User Communication
–Tagging and Aligning Cooperatively
–Messaging Service
–Exchange of Corpora Resources
Teacher’s Tasks:
–Provide on-line tutorials
–Provide links to on-line teaching material, bibliography and other resources
10
And then...
PoloCLUP’s 3rd function: Evaluation of Machine Translation
–Experimentation with evaluation
–Teaching + research focus
Results:
–TrAva – MT evaluation tool
–CorTA – Corpus of 1 EN input + 4 MT output sentences
11
Prescriptive v descriptive terminology
Paper > digital form
Static > dynamic resources
‘Democratization’ of terminology
ISO standards > socioterminology
Knowledge structures increasingly recognized as structured but dynamic – ask Gerhard Budin to explain this to you…
12
Perspectives of terminology users
Domain experts and vested interests
Translators
Information retrieval
Knowledge engineering
Standardized terminology
Getting the right word
Finding information
Perfecting Google
Structuring knowledge
Finding it fast
13
Bridging the Gap
General linguists
Translation teachers
Translation students
Corpus linguists
Computational linguists
Computer engineers
Computer-phobia
Computer-worship
14
The Corpógrafo combines:
Terminology, translation and language study and research (Belinda)
Terminology databases (Domain experts)
Computational linguistics research and production of resources (Diana)
Information retrieval and artificial intelligence (Luís)
= Discussions on priorities!
15
Corpora and Terminology
Corpora as input
Terminology extraction
Terminology databases
Structuring of domain knowledge
Further corpora
16
[Diagram. Labels: Corpora Analysis; Terminology Database; Internet; Text details]
17
Working with the Corpógrafo
Corpógrafo is a suite of integrated tools for INDIVIDUAL or GROUP research
All research done ONLINE
Each username/password = separate space on our server
At present > anyone can work with it using 10 MB space for FREE
BUT – you get an empty space + tools + tutorial!
18
Terminology old v new
Prescriptive > descriptive
Paper > digital form
Static > dynamic resources
‘Democratization’ of terminology
ISO standards > socioterminology
Knowledge structures increasingly recognized as structured but dynamic – ask Gerhard Budin to explain this to you…
21
Focus of Corpógrafo
Design priorities are to:
–See the Big Picture
–Create the Overall Framework
–Get feedback from users to see their needs
–Develop according to real research needs
–Fill in the details and improve techniques as needed
22
Corpógrafo and special domains
Master’s in Terminology and Translation
Terminology projects with the support of domain specialists in:
–Engineering – Electronics, Mechanical Engineering
–Geography – Population Geography, Natural Hazards (Fire, Floods, Earthquakes, Coastal Erosion)
–Medicine – Kidney support machines, Neurology
–Science – Genetics
–Technology – GPS (Global Positioning System)
23
Corpógrafo and terminology/translation research
Ongoing dissertations on aspects of:
–Terminology – databases for different uses, neologisms, definition searches, semantic relations, conceptual analysis
–Corpora – text analysis, corpora construction
–Technical writing > Electrical Appliances
–Localization
–Terminology in documentaries
–Translation of Multimedia
24
Linguateca
Linguateca’s policy – all resources and tools freely available online
Primary users – Portuguese and Brazilian
25
Polo CLUP/FLUP
Bi- or multi-lingual in interest
Corpógrafo available for experiments on a small scale to the general public
Possibilities of future work on projects with users from other universities and other countries
26
Contacts
If you are interested in finding out more, please contact me:
Belinda Maia bmaia@mail.telepac.pt
The Corpógrafo can be used (with a username and password) at:
http://www.linguateca.pt and http://poloclup.linguateca.pt/ferramentas/gc
28
Corpógrafo
1. File Manager – area where each individual or group can:
–convert various text formats to .txt
–upload texts to their space on the server
–‘clean’ them of unnecessary material
–check tokenization and sentence divisions
–consult wordlists – alphabetical, frequency, etc.
–group texts into corpora
–register full information on source, domain and text type
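The wordlist step in the File Manager can be illustrated with a minimal sketch. This is not the Corpógrafo's own code: the function names `tokenize` and `wordlists`, the tokenization rule and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return word tokens.

    Assumed rule: a token is a run of letters/digits, optionally joined
    by internal hyphens or apostrophes (so 'on-line' stays one token).
    """
    return re.findall(r"[^\W_]+(?:[-'][^\W_]+)*", text.lower())


def wordlists(text):
    """Return (alphabetical list of word types, frequency-sorted list)."""
    counts = Counter(tokenize(text))
    alphabetical = sorted(counts)
    # Most frequent first; ties are broken alphabetically.
    by_frequency = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return alphabetical, by_frequency


# Hypothetical sample text, standing in for an uploaded .txt file.
sample = "The corpus is small. The corpus grows as texts are added."
alphabetical, by_frequency = wordlists(sample)
print(alphabetical)
print(by_frequency[:3])
```

A real tool would also need sentence splitting and language-specific handling, which this sketch leaves out.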
29
Corpógrafo
2. Corpora analysis area:
–Concordancing tools allowing for
  KWIC concordancing
  KWIC concordancing sorted according to the word to the left or right
–N-gram tool
  N-grams
  Term-candidates
–With filters for PT
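As a rough illustration of what KWIC concordancing and N-gram counting involve, here is a minimal sketch. It is not the Corpógrafo implementation: the function names, the `window` and `sort_by` parameters and the sample text are assumptions, and the Portuguese filters mentioned on the slide are not modelled.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return word tokens (assumed rule)."""
    return re.findall(r"[^\W_]+(?:[-'][^\W_]+)*", text.lower())


def kwic(tokens, keyword, window=3, sort_by=None):
    """Return (left context, keyword, right context) for every hit.

    sort_by=None keeps text order; "left" sorts on the word immediately
    to the left of the keyword, "right" on the word immediately to the right.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((left, tok, right))
    if sort_by == "left":
        hits.sort(key=lambda h: h[0][-1] if h[0] else "")
    elif sort_by == "right":
        hits.sort(key=lambda h: h[2][0] if h[2] else "")
    return hits


def ngrams(tokens, n=2):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical sample text.
text = ("Coastal erosion threatens dunes. Severe erosion follows storms. "
        "Coastal erosion is monitored.")
tokens = tokenize(text)

for left, kw, right in kwic(tokens, "erosion", window=2, sort_by="right"):
    print(f"{' '.join(left):>20} [{kw}] {' '.join(right)}")

print(ngrams(tokens, 2).most_common(1))
```

Frequent N-grams such as "coastal erosion" are the kind of sequence a terminologist might then inspect as a term candidate. Note that this sketch ignores sentence boundaries, so contexts and N-grams can run across them.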
30
Corpógrafo
3. Terminology database
–Terms
–Definitions
–Examples
–Morphology
–Multilingual equivalents
–Sources and text details of corpora used
–Semantic relations – further complexity
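The fields listed above suggest what one database entry might hold. The sketch below is only an assumption about how such a record could be modelled: the class name `TermEntry`, its field names, the relation label `is_a` and the sample values are all hypothetical and do not reflect the Corpógrafo's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class TermEntry:
    """One hypothetical terminology record, mirroring the slide's fields."""
    term: str
    language: str
    domain: str = ""
    definitions: list[str] = field(default_factory=list)
    examples: list[str] = field(default_factory=list)
    morphology: str = ""
    # Multilingual equivalents: language code -> list of terms.
    equivalents: dict[str, list[str]] = field(default_factory=dict)
    # Sources and text details of the corpora the entry was drawn from.
    sources: list[str] = field(default_factory=list)
    # Semantic relations as (relation label, target term) pairs.
    relations: list[tuple[str, str]] = field(default_factory=list)

    def add_equivalent(self, language, term):
        self.equivalents.setdefault(language, []).append(term)

    def add_relation(self, relation, target):
        self.relations.append((relation, target))


# Illustrative values only.
entry = TermEntry(term="coastal erosion", language="en", domain="Geography")
entry.add_equivalent("pt", "erosão costeira")
entry.add_relation("is_a", "natural hazard")
print(entry)
```

Keeping relations as labelled pairs is one simple design choice; a richer model of semantic relations would need more structure than this.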
32
Future developments – general policy
General testing and improvement of the Corpógrafo
Experimentation with ideas from other projects, e.g. WordNet, FrameNet
Experimentation with theories of semantic primitives, human universals, etc.
Development of new ideas or functions – using isomorphic relationships between researchers’ needs and our possibilities
33
Future developments – File Manager
Creation of overall framework – perhaps UDC based – for:
–consultation of research available to public
–information on ongoing research
Coordination of individual corpus projects into bigger projects, when possible or necessary
34
File Manager – theoretical questions
Domain organization – UDC or ?
Categorization of text by genre – how many genres?
Reliability of texts from Internet – how does one guarantee quality?
Is a translator or linguist able to distinguish a ‘good text’?
Should the domain specialist choose the texts?
35
Corpora construction – theoretical questions / problems
How large is a good domain corpus?
No domain corpus will produce EVERY term in the area
Comparable corpora v. parallel corpora
Aligning comparable corpora at term level
36
Future developments – Corpora analysis
Development of finer-grained concordancing
Experimentation with finding definitions in context
Semi-automatic creation of keyword shortlists for further text retrieval
37
Corpora Analysis – theoretical questions
How far can one rely on the computational linguist or computer engineer to produce analyses of corpora?
If (semi-)automated processes produce 80% of the possible results, should the linguist / translator rubbish these processes?
Can we leave it all to the computer engineer?
38
Future developments – terminology databases
Refinement of terminology fields
Development of further multi-lingual functions
Development of an organized and robust set of semantic relations
Semi-automatic visualizing of semantic relations
39
Terminology databases – theory
How much information does a database need?
How much does the user of a database need?
Is it reasonable to hope that all our databases could one day communicate with each other and help us with translation / information retrieval – or whatever?
40
How is the Corpógrafo being used at present?
Master’s in Terminology and Translation
Terminology projects with the support of domain specialists in:
–Engineering – Electronics, Mechanical Engineering
–Geography – Population Geography, Natural Hazards (Fire, Floods, Earthquakes, Coastal Erosion)
–Medicine – Kidney support machines, Neurology
–Science – Genetics
–Translation and Localization
41
How is the Corpógrafo being used at present?
Dissertations completed on:
–Definitions for different purposes + pedagogical glossary for Corrosion, Electrical engineering: http://www.fe.up.pt/~cdm/QAE/QAE_gloss_b.htm
–Socioterminology – in the area of Composite Materials
–Graphical representation of Conceptual systems
–Terminology and Metaphors
–Football Metaphors
42
How is the Corpógrafo being used at present?
Ongoing dissertations on aspects of:
–Terminology – databases for different uses, neologisms, conceptual analysis
–Corpora – text analysis, corpora construction
–Translation and localization terminology
–Technical writing > Electrical Appliances
–Terminology in documentaries
43
Pedagogical applications of the Corpógrafo
Undergraduate courses – only possible if both teachers and students are trained to use it
Postgraduate research
–Terminology and translation (Belinda + domain experts)
–Computational linguistics (Diana)
–Information retrieval (Luís)
Long live team work!
44
To what extent is the Corpógrafo available to others?
Linguateca’s policy is to make all resources and tools available online
Primary users are expected to be Portuguese and Brazilian, as most of the resources and tools are for Portuguese
PoloFLUP’s main objective – comparable corpora and terminology tools
45
To what extent is the Corpógrafo available to others?
PoloFLUP is, by definition, bi- or multi-lingual in interest
The Corpógrafo is therefore available for experiments on a small scale to the general public
In the future – we hope to be able to work on projects with users from other universities and other countries
46
Contacts
If you are interested in finding out more, please contact me:
Belinda Maia bmaia@mail.telepac.pt
The Corpógrafo can be used (with a username and password) at:
http://www.linguateca.pt and http://poloclup.linguateca.pt/ferramentas/gc