ABRAPT Mini-curso 30.08.04 The Corpógrafo Theory and Practice Belinda Maia & Luís Sarmento PoloFLUP LINGUATECA.


A bit of history
PALC ’97 – 'Do-it-yourself corpora... with a little bit of help from your friends!'
CULT ‘Making corpora – a learning process’
→ Contrastive linguistics
→ Corpora linguistics
→ Translation teaching
→ General > specific language

A bit of history
2000 – First Master’s in Terminology and Translation at FLUP
PALC ‘Training Translators in Terminology and Information Retrieval using Comparable and Parallel Corpora’
→ Specialized translation and terminology
→ Contact with domain experts
→ Importance of IT
→ Need for technical help for more ambitious students!

A bit of history
LREC ‘Corpora for terminology extraction – the differing perspectives and objectives of researchers, teachers and language services providers’
2002 – Second Master’s in Terminology and Translation at FLUP
→ Plea for help to Diana Santos
→ October 2002
→ LINGUATECA - Polo FLUP

LINGUATECA
See
Leader > Diana Santos (SINTEF – Oslo)
Objective - to create resources and tools for the computational processing of Portuguese
Poles at Oslo, Lisbon, Braga and Porto
Porto – Polo CLUP/FLUP

Polo CLUP/FLUP
See
On-line suite of corpora tools to work with comparable corpora with emphasis on bilingual research
– Focus on special domains
– Construction of terminology databases, ontologies and domain models
→ Corpógrafo

Polo CLUP/FLUP
See
General help in constructing resources specific to the needs of FLUP/CLUP
– For researchers, teachers and students
– For teaching methodology at FLUP
→ BNC & Reuter’s corpora on intranet
→ A small ‘chat’ corpus

More history
2003 – Poster of the GC – at CL – ‘What are comparable corpora?’ CL – Experimentation with evaluation of Machine Translation
2003 – Experimentation with GC
2003 – Third Master’s in Terminology and Translation at FLUP

GC – Integrated Web Environment for Corpora Linguistics

Motivation
– Lack of comprehensive, wide-scope Corpora tools
– Commercial packages are usually difficult to integrate/customize
– Tools are not prepared to support cooperative work
– Linguistic knowledge is not usually integrated in tools

What is GC?
GC is a Web tool being developed at Linguateca/CLUP that aims to provide a comprehensive work environment for Corpora-Based Linguistic Research.
GC allows users to:
– access several Corpora tools from a single entry point using a regular web browser
– access and query generic Corpora (BNC, Reuter’s, COMPARA, CETEMPúblico)
– build personal simple, parallel and comparable Corpora from text files (PDF, PS, Word, HTML, TXT)
– use several (on-line/off-line) tools with their personal Corpora (statistics, POS-taggers, filters, etc.)
– communicate and exchange results with other users

Internet Integration
GC provides seamless integration with the World Wide Web allowing users to:
– search specific Corpora resources on the Internet
– query the web for concordances
– use available translation-engines in parallel

[Diagram labels: input formats DOC, HTML, TXT, PS, PDF, RTF; generic corpora BNC, CETEMPúblico, COMPARA, Others; Personal Corpora; Virtual Desktop; Custom Interface; Inter-user Communication; roles DEV, ADM, USER; Internet; Terminology DB]

Administrator’s Tasks:
– Users, Groups and Disk Quotas
– Corpora Taxonomy (see box)
– Documentation Organization
– Access Service Statistics

Tool Pool
– Concordance Engine
– Taggers
– Aligner (Semi-Auto)
– Corpora Bot
– Statistics
– Custom Tools
– Terminology Extraction Tool (Auto/Semi-Auto)

Corpora Taxonomy
– Medium: written, spoken, multimedia
– Domain: Engineering, medicine, etc.
– Genre: scientific, technical, informative, etc.

Developer’s Tasks:
– Integrate Existing Tools/Resources
– Develop Additional Generic Tools
– Interact with Users/Administrator
– Develop Custom Tools for particular research needs

Inter-User Communication
– Tagging and Aligning Cooperatively
– Messaging Service
– Exchange of Corpora Resources

Teacher’s Tasks:
– Provide on-line tutorials
– Provide links to: on-line teaching material, bibliography and other resources

And then...
PoloCLUP’s 3rd function: Evaluation of Machine Translation
– Experimentation with evaluation
– Teaching + research focus
Results:
– TrAva – MT evaluation tool
– CorTA – Corpus of 1 EN input + 4 MT output sentences

Prescriptive v descriptive terminology
Paper > digital form
Static > dynamic resources
‘Democratization’ of terminology
ISO standards > socioterminology
Knowledge structures increasingly recognized as structured but dynamic - ask Gerhard Budin to explain this to you ….

Perspectives of terminology users
Domain experts and vested interests
Translators
Information retrieval
Knowledge engineering
→ Standardized terminology
→ Getting the right word
→ Finding information
→ Perfecting Google
→ Structuring knowledge
→ Finding it fast

Bridging the Gap
General linguists
Translation teachers
Translation students
Corpus linguists
Computational linguists
Computer engineers
Computer-phobia
Computer-worship

The Corpógrafo combines:
Terminology, translation and language study and research (Belinda)
Terminology databases (Domain experts)
Computational linguistics research and production of resources (Diana)
Information retrieval and artificial intelligence (Luís)
= Discussions on priorities!

Corpora and Terminology
Corpora as input
Terminology extraction
Terminology databases
Structuring of domain knowledge
Further corpora

Corpora Analysis
Terminology Database
Internet
Text details

Working with the Corpógrafo
Corpógrafo is a suite of integrated tools for INDIVIDUAL or GROUP research
All research done ONLINE
Each username/password = separate space on our server
At present > anyone can work with it using 10 MB space for FREE
BUT - you get an empty space + tools + tutorial!

Terminology old v new
Prescriptive > descriptive
Paper > digital form
Static > dynamic resources
‘Democratization’ of terminology
ISO standards > socioterminology
Knowledge structures increasingly recognized as structured but dynamic - ask Gerhard Budin to explain this to you ….

Focus of Corpógrafo
Design priorities are to:
– See the Big Picture
– Create the Overall Framework
– Get feedback from users to see their needs
– Develop according to real research needs
– Fill in the details and improve techniques as needed

Corpógrafo and special domains
Master’s in Terminology and Translation
Terminology projects with the support of domain specialists in:
– Engineering – Electronics, Mechanical Engineering
– Geography - Population Geography, Natural Hazards – Fire, Floods, Earthquakes, Coastal Erosion
– Medicine - Kidney support machines, Neurology
– Science – Genetics
– Technology – GPS – Geographical Positioning Systems

Corpógrafo and terminology/translation research
Ongoing dissertations on aspects of:
– Terminology – databases for different uses, neologisms, definition searches, semantic relations, conceptual analysis
– Corpora – text analysis, corpora construction
– Technical writing > Electrical Appliances
– Localization
– Terminology in documentaries
– Translation of Multimedia

Linguateca
Linguateca’s policy - all resources and tools freely available online
Primary users - Portuguese and Brazilian

Polo CLUP/FLUP
Bi- or multi-lingual in interest
Corpógrafo available for experiments on a small scale to the general public
Possibilities of future work on projects with users from other universities and other countries

Contacts
If you are interested in finding out more, please contact me: Belinda Maia
The Corpógrafo can be used (with a username and password) at: and

Corpógrafo
1. File Manager - area where each individual or group can:
– convert various text formats to .txt
– upload texts to their space on server
– ‘clean’ them of unnecessary material
– check tokenization and sentence divisions
– consult wordlists – alphabetical, frequency etc
– group texts into corpora
– register full information on source, domain and text type
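
The tokenization, sentence-division and wordlist steps listed for the File Manager can be illustrated with a minimal sketch. This is not the Corpógrafo's own code: the function names, the regular expressions and the naive sentence splitter are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (letters only,
    accented letters included; internal hyphens/apostrophes kept)."""
    return re.findall(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*", text.lower())


def split_sentences(text):
    """Naive sentence division: break after '.', '!' or '?' followed by
    whitespace. Abbreviations are NOT handled (illustrative assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def wordlists(tokens):
    """Return (alphabetical wordlist, frequency wordlist) for the tokens."""
    counts = Counter(tokens)
    alphabetical = sorted(counts)
    # Most frequent first; ties broken alphabetically.
    by_frequency = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return alphabetical, by_frequency


if __name__ == "__main__":
    sample = "O corpus é grande. O corpus cresce!"
    print(split_sentences(sample))
    alphabetical, by_frequency = wordlists(tokenize(sample))
    print(alphabetical)
    print(by_frequency)
```

A user checking tokenization would compare the output of something like `tokenize` against the text by eye, which is why the slide lists it as a manual checking step.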

Corpógrafo
2. Corpora analysis area:
– Concordancing tools allowing for
  KWIC concordancing
  KWIC concordancing sorted according to the word to the left or right
– N-gram tool
  N-grams
  Term-candidates
– With filters for PT
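
As a rough illustration of the tools named on this slide, the sketch below shows KWIC concordancing with left/right sorting and n-gram counting with a simple term-candidate filter. It is a hypothetical reconstruction, not the Corpógrafo implementation: the function names are invented, and the stopword-based filter is only an assumption about what "filters for PT" might involve.

```python
from collections import Counter


def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, tok, right))
    return hits


def sort_kwic(hits, side="right"):
    """Sort hits by the word immediately to the right or left of the keyword."""
    if side == "right":
        return sorted(hits, key=lambda h: h[2][0] if h[2] else "")
    return sorted(hits, key=lambda h: h[0][-1] if h[0] else "")


def ngrams(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def term_candidates(tokens, n, stopwords, min_freq=2):
    """Keep n-grams seen at least min_freq times that neither start nor
    end with a stopword (assumed filter, for illustration only)."""
    return [
        (" ".join(gram), count)
        for gram, count in ngrams(tokens, n).most_common()
        if count >= min_freq
        and gram[0] not in stopwords
        and gram[-1] not in stopwords
    ]


if __name__ == "__main__":
    text = "a erosão costeira afeta a costa e a erosão costeira continua"
    tokens = text.split()
    for left, kw, right in sort_kwic(kwic(tokens, "a", width=2), side="right"):
        print(" ".join(left).rjust(20), f"[{kw}]", " ".join(right))
    print(term_candidates(tokens, 2, stopwords={"a", "e"}))
```

Sorting on the neighbouring word groups recurring collocations together, which is what makes a sorted concordance useful for spotting term candidates by eye.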

Corpógrafo
3. Terminology database
– Terms
– Definitions
– Examples
– Morphology
– Multilingual equivalents
– Sources and text details of corpora used
– Semantic relations – further complexity
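
The fields listed above suggest the shape of a single database record. The sketch below models one such record as a Python dataclass. The class name, attribute names and the relation label `is_a` are hypothetical, chosen only to mirror the slide, and say nothing about how the Corpógrafo actually stores its data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TermEntry:
    """One terminology record, with one attribute per field on the slide."""
    term: str
    language: str
    definitions: List[str] = field(default_factory=list)
    examples: List[str] = field(default_factory=list)
    morphology: str = ""
    # Multilingual equivalents: language code -> list of equivalent terms.
    equivalents: Dict[str, List[str]] = field(default_factory=dict)
    # Sources and text details of the corpora the term was found in.
    sources: List[str] = field(default_factory=list)
    # Semantic relations as (relation type, target term) pairs.
    relations: List[Tuple[str, str]] = field(default_factory=list)

    def add_equivalent(self, language, term):
        """Register an equivalent term in another language."""
        self.equivalents.setdefault(language, []).append(term)

    def add_relation(self, relation, target):
        """Link this term to another term by a named semantic relation."""
        self.relations.append((relation, target))


if __name__ == "__main__":
    entry = TermEntry(term="erosão costeira", language="pt")
    entry.add_equivalent("en", "coastal erosion")
    entry.add_relation("is_a", "erosão")
    print(entry)
```

Keeping relations as plain pairs is the simplest possible choice; the "further complexity" the slide mentions would need a richer structure.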

Future developments – general policy
General testing and improvement of the Corpógrafo
Experimentation with ideas from other projects, e.g. Wordnet, Framenet
Experimentation with theories of semantic primitives, human universals etc
Development of new ideas or functions – using isomorphic relationships between researchers’ needs and our possibilities

Future developments - File Manager
Creation of overall framework – perhaps UDC based – for:
– consultation of research available to public
– information on ongoing research
Coordination of individual corpus projects into bigger projects, when possible or necessary

File Manager
Theoretical questions
Domain organization – UDC or ?
Categorization of text by genre – how many genres?
Reliability of texts from Internet – how does one guarantee quality?
Is a translator or linguist able to distinguish a ‘good text’?
Should the domain specialist choose the texts?

Corpora construction
theoretical questions / problems
How large is a good domain corpus?
No domain corpus will produce EVERY term in the area
Comparable corpora v. Parallel corpora
Aligning comparable corpora at term level

Future developments - Corpora analysis
Development of finer-grained concordancing
Experimentation with finding definitions in context
Semi-automatic creation of keyword shortlists for further text retrieval

Corpora Analysis
Theoretical questions
How far can one rely on the computational linguist or computer engineer to produce analyses of corpora?
If (semi-) automated processes produce 80% possible results, should the linguist / translator rubbish these processes?
Can we leave it all to the computer engineer?

Future developments - terminology databases
Refinement of terminology fields
Development of further multi-lingual functions
Development of organized and robust set of semantic relations
Semi-automatic visualizing of semantic relations

Terminology databases
Theory
How much information does a database need?
How much does the user of a database need?
Is it reasonable to hope that all our databases could one day communicate with each other and help us with translation / information retrieval – or whatever?

How is the Corpógrafo being used at present?
Master’s in Terminology and Translation
Terminology projects with the support of domain specialists in:
– Engineering – Electronics, Mechanical Engineering
– Geography - Population Geography, Natural Hazards – Fire, Floods, Earthquakes, Coastal Erosion
– Medicine - Kidney support machines, Neurology
– Science – Genetics
– Translation and Localization

How is the Corpógrafo being used at present?
Dissertations completed on:
– Definitions for different purposes + pedagogical glossary for Corrosion, Electrical engineering
– Socioterminology – in the area of Composite Materials
– Graphical representation of Conceptual systems
– Terminology and Metaphors
– Football Metaphors

How is the Corpógrafo being used at present?
Ongoing dissertations on aspects of:
– Terminology – databases for different uses, neologisms, conceptual analysis
– Corpora – text analysis, corpora construction
– Translation and localization terminology
– Technical writing > Electrical Appliances
– Terminology in documentaries

Pedagogical applications of the Corpógrafo
Undergraduate courses – only possible if both teachers and students are trained to use it
Postgraduate research
– Terminology and translation (Belinda + domain experts)
– Computational linguistics (Diana)
– Information retrieval (Luís)
Long live team work!

To what extent is the Corpógrafo available to others?
Linguateca’s policy is to make all resources and tools available online
Primary users are expected to be Portuguese and Brazilian, as most of the resources and tools are for Portuguese
PoloFLUP’s main objective – comparable corpora and terminology tools

To what extent is the Corpógrafo available to others?
PoloFLUP is, by definition, bi- or multi-lingual in interest
The Corpógrafo is therefore available for experiments on a small scale to the general public
In the future – we hope to be able to work on projects with users from other universities and other countries