LINGUATECA FLUP/CLUP The Corpógrafo – a Web-based environment for corpora research

Presentation transcript:

Motivation

Build an environment that helps users through the entire process of corpora research. The tool should not require advanced computer skills and should be easy to use for all types of users, from students to researchers.

Functionalities required:

Web access: use anywhere, anytime, from any computer, with no software installation.
Collect texts: text extraction from structured files; downloading texts from the Web.
Text pre-processing: "cleaning" text, segmentation, text annotation, and encoding text in a searchable or exchangeable format.
Corpus search: regular expression concordances, collocation extraction, frequency-based statistics (N-gram counts).
Information extraction: terminology, semantic relations, conceptual maps.
Knowledge-resource building: domain-specific glossaries, thesauri, terminological databases and ontologies; categorized word lists.
Comparable corpora studies: compilation of, and search over, comparable corpora.
Exporting results to other formats and applications: standard terminological databases, translation memories, etc.

Linguateca – Our mission!

Improving processing and research of the Portuguese language.
Fostering collaboration among researchers.
Providing public and free-of-charge tools to the community.

The future…

Corpógrafo under GPL soon.
Multiple Corpógrafos installed in several university departments and countries: the "Corpógrafo Community".
Centralized database to collect terminology / conceptual maps from the Corpógrafo Community.
Large-scale terminology / knowledge resources for specialized search engines, technical writing, translation, etc.

Workflow diagram labels:

Collect Texts (Web, DOC, TXT, PS, PDF, HTML); Text Extraction; Corpora (several languages)
General Corpora Studies; Regexp Concordance; KWIC / Window; N-Grams
Terminology Extraction; extract Term Candidates; Term Candidates list; store terminological entries, examples and Meta-Data in DB; create and manage multilingual Terminology DBs
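The corpus search functions named on the poster (regular expression concordances shown as KWIC windows, and N-gram counts) can be illustrated with a minimal sketch. This is not Corpógrafo's implementation, which the poster describes as running on a Perl-based stack; Python is used here only for illustration, and the function names `tokenize`, `kwic` and `ngram_counts`, the character-based context window, and the sample text are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


def kwic(text, pattern, window=30):
    """Return (left, match, right) tuples for each regex match.

    `window` is the number of context characters kept on each side
    (an assumption; a real tool might count words instead).
    """
    lines = []
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - window):m.start()]
        right = text[m.end():m.end() + window]
        lines.append((left, m.group(0), right))
    return lines


def ngram_counts(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical sample text, not taken from any Corpógrafo corpus.
sample = (
    "A corpus is a collection of texts. "
    "A parallel corpus aligns texts in two languages. "
    "Comparable corpora collect similar texts."
)

# Keyword-in-context lines for a regular expression query.
for left, hit, right in kwic(sample, r"\bcorp(us|ora)\b", window=20):
    print(f"{left:>20} [{hit}] {right}")

# Frequency-based statistics: the most common bigrams.
print(ngram_counts(tokenize(sample), 2).most_common(3))
```

Keeping the match and its two context strings as a tuple leaves the display format (aligned columns, HTML, export) to the caller, which is one simple way to separate search from presentation.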
Corpógrafo's workflow overview:

Text Pre-Processing and Categorization (Meta-Data)
Corpora search
Term Definitions and Semantic Relations

1. edit term meta-data (source, authors, morphology, etc.)
2. match bilingual equivalents
3. obtain statistical information from corpora about each term

1. query DB, navigate DB
2. export DB to XML file
3. automatic generation of documentation (HTML)

Media file repository (DCR, JPEG, WAV, QT, WMF). Associate:
1. explanation videos / pictures
2. sound file (pronunciation)

Corpógrafo V3: two years after…

Two years after its debut at CL2003, Corpógrafo reaches version 3.
Corpógrafo is now a mature environment, ready to be further expanded.
More than 100 regular users. More than 400 user accounts.
Many lessons learned from practice: usability, technology, linguistics.
A corpus linguistics research community has grown along with Corpógrafo.
Large Terminology / Knowledge Engineering projects are now possible.

Where to find Corpógrafo?

Have a look at (version 3 will be on-line in August 2005):

Under the hood

Corpógrafo is built over SAGI, a web operating system developed by Linguateca.
SAGI uses "LAMP": Linux OS, Apache Web Server, MySQL RDBMS, Perl.
SAGI allows complete control over CGI processes and helps programmers build web interfaces.

Luís Sarmento, Belinda Maia, Diana Santos, Luís Cabral, Ana Sofia Pinto