A PROPOSAL FOR CREATION OF A LINGUISTIC DATA CONSORTIUM FOR INDIA (LDC-IL)
Focus: linguistic data



What is ‘Linguistic Data’?
- Printed words in different scripts, fonts, platforms and environments
- Domain-specific texts (e.g. the 90-odd domains in the current Indian languages corpora)
- Samples of spoken corpus: telephone talk, public lectures, formal discussions, in-group conversations, radio talks, natural language queries, etc.
- Hand-written samples
- Ritualistic use of languages: scriptures, chanting, etc.
- Language of performance: reading, recitations, enactment
But this data is of use only if it comes with linguistic analysis, because it must be tagged and aligned to be usable.
THAT’S WHAT CREATES AN IMPORTANT ROLE FOR LINGUISTS IN THIS ENTERPRISE

How did the idea of an Indian LDC come about?
Background:
- The Brown University text corpus was adopted to build statistical language models.
- The TI-46 and TIDIGITS databases of Texas Instruments (early 80s) were distributed by NIST.
- The LDC at U-Penn was established in 1992.
- CIIL houses a 45-million-word corpus in 15 Indian languages, built with DoE-TDIL support, and has been distributing it to R&D groups the world over.
- This corpus has now been converted into Unicode jointly with the University of Lancaster. Together with another 45-million-word corpus from five Indian languages coming in under the EMILLE project, it was released in early [year missing].
- CIIL is now working with Uppsala University on corpora of lesser-known languages of India; see spokencorpus.net.
SO WHAT made us PROPOSE an LDC-IL?
- The giant strides in IT that India has made.
- Demands made by several software and telecom giants: Reliance, IBM, HP Labs, Modular Systems and Infosys.
- Suggestions of the Hindi Committee.
- The decision taken in the 1st ILPC meeting, 2004.

RECOLLECTING THE EVOLUTION OF THE PROPOSAL
The proposal evolved through discussions held with many institutions in India and abroad.
- August 13, 2003: 1st presentation at the MHRD, with the then ES in the chair, attended by the FA, AS, J.S.(L), Director (L) and experts from C-DAC and IIT-Kanpur.
- August 17 and 18, 2003: An International Workshop on LDC was held at the CIIL, Mysore, in collaboration with IIIT-Hyderabad and HP Labs India. It was inaugurated by Smt. Kumud Bansal (the then AS and now Secretary, Elementary Ed) and attended by the J.S.(L). Those who created the LDC in the USA participated.
- August 19, 2003: A follow-up meeting of a smaller group was held at the Indian Institute of Science to thrash out further details. A Project Committee was set up.

The Project Drafting Committee had top NLP specialists and linguists, with the Director, CIIL as the Coordinator: five experts from IIT-B, IIT-M, IISc, IIIT-Hyd and CIIL, with inputs from the industry.
- Sept-Oct 2003: All changes were made through chats and exchanges, and after four teleconferences.
- Nov 18, 2003: Modified proposal submitted.
- Dec 19, 2003: During the 2nd ICON, representatives of the lead institutes met in Mysore to discuss the draft sent to the Ministry. Prof. Aravind Joshi also participated.
- January 2004: With additional inputs, the proposal was modified.
- Feb 24, 2004: A number of suggestions were made (see minutes) during the 2nd presentation for the ES, AS, JS(L) and IFD.
- April 16, 2004: After the presentation before the TDIL Advisory Committee, DoE offered full support.

Why LDC-IL?
- The importance of creating a large data archive of Indian languages is undeniable. In fact, it is this realization that resulted in the government’s plan for corpora development in the early ’90s.
- Indian languages often pose a difficult challenge for specialists in AI/NLP.
- Technology developers building mass-application tools and products have long been calling for linguistic data to be available on a large scale.
- However, the data should be collected, organized and stored in a manner that suits different groups of technology developers. This requires involving a number of disciplines, such as linguistics, statistics and CS.
- Further, this data must be of high quality, with defined standards.
- Resources must be shared, so that all R&D groups benefit.
All these are possible with a data consortium.

Spoken language data & importance of phoneticians
- India has numerous languages, each with many sound patterns that phoneticians have identified and studied for centuries.
- The IPA inventory is invaluable for a spoken language corpus, but identifying its sounds in speech data requires finesse.
- For speech technology, we have to create both phonetic and acoustic models of languages.
- Even though this work is now aided and eased by Visual Phonetics technology, as available in the CIIL or TIFR labs, what we need in addition is trained phoneticians.

THE MODEL
An ideal model of a consortium is the Linguistic Data Consortium (LDC) hosted by the University of Pennsylvania.
- LDC (USA) is an open consortium of universities, companies and government R&D labs that creates, collects and distributes speech and text databases, lexicons, and other resources for R&D.
- This ‘LDC’ has 100-plus agencies as its active users and members.
- It includes some non-western languages: Arabic, Chinese, Korean.
- The core operations of LDC are self-supporting after ten years.
- The activities include maintaining the data archives, producing and distributing CD-ROMs, arranging networked data distribution, etc.
- All these have provided a great impetus to R&D in the field of language technology for English and other European languages.
It is proposed to adopt a similar approach in the Indian context.

Who funded LDC in the US?
- LDC was supported initially by US Govt grant IRI [number missing] from the Information and Intelligent Systems division.
- Also by a grant from the Human Computer Interaction Program of the National Science Foundation.
- Powered in part by Academic Equipment Grant [number missing] from Sun Microsystems.
- No member institution could afford to produce this individually.
Who managed?
1. Govt
2. Industry
3. University

Who will set up LDC-IL in India? What will it actually do?
- The Ministry of HRD, through the Central Institute of Indian Languages (CIIL), Mysore, along with other institutions working on Indian language technology, such as the Indian Institute of Science, Bangalore, the Indian Institutes of Technology at Mumbai and Chennai, and the International Institute of Information Technology, Hyderabad, proposes to set up this LDC-IL.
- It is proposed that they will be the Lead Institutions in this initiative, with CIIL as the coordinating body.
- LDC-IL will be an archive plus. Besides data, tools and standards of data representation and analysis must be developed.
- It will create, analyze, segment, tag, align, and upload different kinds of linguistic resources.
- It will accept electronic resources from authors, newspapers, publishers, film, TV and radio, and process them for use by the community.

Potential Participants / Institutions in India
All academic institutes, research organizations and corporate R&D groups from India and abroad working on Indian languages will be encouraged to participate in LDC-IL. The following have already shown interest:
- IISc Bangalore
- All Indian Institutes of Technology
- IIITs at Hyderabad and elsewhere
- ISI Calcutta/Hyderabad/Bangalore
- C-DAC, Pune
- TIFR Mumbai
- Universities like U of Hyderabad, DU, JNU, NEHU
- HP Labs India; IBM; Infosys; Reliance Infocom
- Language institutions like CIEFL, KHS, NCPUL & RSKS

Major areas of Linguistic Resource Development as proposed
- Speech Recognition and Synthesis
- Character Recognition
- Creation of different kinds of Corpora
- NLP
By-products: word finders, lexicons of different kinds, thesauri, usage compilations, etc.

Other possible applications
- Collocational restrictions for OCR building
- TTS: statistical probability models
- Building a speech recognition model
- Auto-summarization
- Developing tree-bank tools
- Skeletal parses
- Forming a basis for MAT or MT systems
IN A WAY, ALL THESE WILL BE COMPLEMENTARY TO WHAT IS BEING PLANNED / ENCOURAGED BY TDIL of MCIT

Funding & Management
- The core funding will come from the Government of India. It will span two plan periods.
- All activities will be in project mode and run through CIIL’s PL account. All staff will be on contract.
- All receipts and payments, whether through internet gateways or through conventional means, will go to this special bank account.
- The project will attempt to leverage expertise already available, to cut avoidable cost and delay.
- As the nodal agency, CIIL will further distribute the relevant funding for specific sub-components of the scheme to other academic institutions.
- An annual progress report will be submitted to the government.

Arrangements

PAC of LDC-IL

Membership
Differential rates of annual fee
India:
1. Individual researchers: Rs. 2,000/- per annum
2. Educational institutions: Rs. 20,000/- per annum
3. Software and related industry: Rs. 2,00,000/- per annum
Other countries:
1. Individual researchers: $ 2,000/- per annum
2. Educational institutions: $ 20,000/- per annum
3. Software and related industry: $ 50,000/- per annum
IT GOES WITHOUT SAYING THAT THIS WOULD REQUIRE CONSTANT UPDATION AND UPGRADATION AS WELL AS EXPANSION OF OUR DATA / TOOLS / PRODUCTS

Estimation
- It is estimated that by the third year, LDC-IL will have 50 institutional members from India and 200 Indian scholars as individual members, contributing Rs. 12 lakh annually.
- In addition, it is estimated to have at least 20 researchers from abroad as individual members, contributing $ 40,000, or Rs. 20 lakhs more.
- The attempt will be to secure industrial support from the IT sector internationally, to raise at least 10 institutional memberships initially, creating a corpus of $ 200,000 annually by the third year.
- Should that happen, it will generate a substantial amount for LDC-IL.
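The estimate above can be checked against the fee schedule on the Membership slide. The sketch below is only a worked calculation under stated assumptions: that the 50 Indian institutional members pay the educational-institution rate, and that the 10 international institutional memberships pay the $ 20,000 rate (which is the rate that reproduces the $ 200,000 figure). The variable names are illustrative.

```python
# Worked check of the membership estimate, using the fee schedule
# from the Membership slide. All names here are illustrative.
LAKH = 100_000

FEES_INR = {"individual": 2_000, "educational": 20_000, "industry": 200_000}
FEES_USD = {"individual": 2_000, "educational": 20_000, "industry": 50_000}

# Assumption: the 50 Indian institutional members pay the educational rate.
domestic_inr = 50 * FEES_INR["educational"] + 200 * FEES_INR["individual"]

# 20 individual researchers from abroad.
foreign_individual_usd = 20 * FEES_USD["individual"]

# Exchange rate implied by the slide's "$ 40,000 or Rs. 20 lakhs".
implied_inr_per_usd = (20 * LAKH) / foreign_individual_usd

# Assumption: the 10 international institutional memberships pay $ 20,000 each.
foreign_institutional_usd = 10 * FEES_USD["educational"]

print(f"Domestic fees: Rs. {domestic_inr / LAKH:.0f} lakh")
print(f"Foreign individual fees: $ {foreign_individual_usd:,}")
print(f"Implied exchange rate: Rs. {implied_inr_per_usd:.0f} per $")
print(f"Foreign institutional fees: $ {foreign_institutional_usd:,}")
```

Under these assumptions the domestic fees come to Rs. 14 lakh rather than the Rs. 12 lakh quoted, so the slide may be assuming a slightly different membership mix. The foreign figures match the slide at an implied rate of Rs. 50 per dollar.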

Budget: A broad indication*
Rs. 221.6 lakhs per year. Total: Rupees [figure missing] lakhs for the next 8 years.
1. Human Resources: Rs. 69,84,000
2. Tasks: Rs. 64,76,000
3. Events (meetings, workshops, seminars & training programs): Rs. 50,00,000
4. Equipment & maintenance: Rs. 27,00,000
5. IPR costs & publications: Rs. 10,00,000
Total: Rs. 2,21,60,000
NB: The Director, CIIL, on the advice of the Project Advisory Committee of the LDC-IL, may be authorized to re-appropriate funds among the heads indicated here, without exceeding the overall budget. If people serving in the Government or in Autonomous Institutions in a substantive capacity are selected, their service and salary will be protected.

Resource Generation: Details
- The first 2 years of the project are incubation years. It will take time to set up and test-run the tools and deliverables, and to advertise.
- It is estimated that from the third year onwards, the annual revenue may be 8% to 10% of the annual investment, i.e. Rs. [figure missing] lakhs to Rs. [figure missing] lakhs, contributing to the Corpus Fund.
- From the 6th year on, it will be around 25% to 35% of the amount invested, i.e. Rs. 55.4 lakhs to Rs. [figure missing] lakhs annually.
- At the end of eight years, there will be at least Rs. [figure missing] lakhs to Rs. [figure missing] lakhs, plus interest, in the corpus funds.
- Hopefully, there will be new lead institutions to contribute further to the corpus fund once LDC-IL works in full swing.
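The percentage ranges above can be worked through as a short calculation. This is a sketch only, assuming the annual investment is the Rs. 2,21,60,000 (Rs. 221.6 lakhs) total shown on the budget slide; that assumption is consistent with the one surviving figure, since 25% of 221.6 is 55.4. The function and constant names are illustrative.

```python
from decimal import Decimal

# Assumption: annual investment equals the budget total of Rs. 2,21,60,000.
ANNUAL_INVESTMENT_LAKHS = Decimal("221.6")


def revenue_range(low_pct, high_pct, base=ANNUAL_INVESTMENT_LAKHS):
    """Return (low, high) annual revenue in lakhs for a percentage range."""
    return (base * Decimal(low_pct) / 100, base * Decimal(high_pct) / 100)


# Third year onwards: 8% to 10% of the annual investment.
year3_low, year3_high = revenue_range(8, 10)

# Sixth year onwards: 25% to 35% of the amount invested.
year6_low, year6_high = revenue_range(25, 35)

print(f"Year 3 onwards: Rs. {year3_low:.2f} to {year3_high:.2f} lakhs")
print(f"Year 6 onwards: Rs. {year6_low:.2f} to {year6_high:.2f} lakhs")
```

These are derived values, not figures recovered from the slide; the original may have rounded them differently.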

Core Operations to be self-supporting
- Beyond eight years, the Govt may support only events (Rs. 50 lakhs from CIIL’s OC-Plan), tasks of software development (Rs. [figure missing] lakhs from our OE-Plan), and maintenance of equipment (Rs. [figure missing] lakhs from OE-Non-Plan), i.e. Rs. 130 lakhs a year.
- The services of the personnel and the IPR costs will be paid from the 6% interest on the corpus funds (Rs. [figure missing] lakhs) plus the anticipated annual income (Rs. [figure missing] lakhs), i.e. Rs. [figure missing] lakhs generated annually.
- With the Rs. 130 lakhs as above, the total comes to Rs. [figure missing] lakhs (approx).

Thank you

Speech Recognition and Synthesis: Objectives
1. Primarily, to build speech recognition and synthesis systems.
2. Although there are ASR & TTS systems for many western languages, commercially viable speech systems are unavailable for Indian languages.
3. Voice User Interfaces for IT applications and services are useful especially in telephony-based applications.
4. If such technology is available in Indian languages, people in various semi-urban and rural parts of India will be able to use telephones and the Internet to access a wide range of services and information on health, agriculture, travel, etc.
5. However, for this a computer has to be able to accept speech input in the user’s language and provide natural speech output.
6. In India, speech technology can also be coupled with translation systems between the various Indian languages.
7. The main obstacle to customizing this technology for various Indian languages is the lack of appropriate annotated speech databases.
8. Focus: (i) to collect data that can be used for building speech-enabled systems in Indian languages, and (ii) to develop tools that facilitate the collection of high-quality speech data.

Goals – long & short term

Methodology

Possible Applications:
- Speech-to-speech translation for a pair of Indian languages, namely Hindi and Telugu.
- Command and control applications.
- Multimodal interfaces to the computer in Indian languages.
- [word missing] readers over the telephone.
- Readers for the visually disadvantaged.
- Speech-enabled Office Suite.
The effort for both Speech Recognition and Speech Synthesis will be repeated across all 22 Scheduled languages.
- For Speech Recognition, spontaneous speech data will be collected along with read speech.
- For Speech Synthesis, data will be collected from professional speakers with very good voice quality.
- Additional speech data will be collected to build models of prosody (intonation, duration, etc.) to improve the naturalness of synthesized speech.
- A database (lexicon) of proper names of Indian origin will be created, with the equivalent phonetic representation for each name.

Character Recognition
Character Recognition refers to the conversion of printed or handwritten characters to a machine-interpretable form.
- “Online” handwriting recognition, or Online HWR, refers to the interpretation of handwriting captured dynamically using a handheld or tablet device. It allows the creation of more natural handwriting-based alternatives to keyboards for data entry in Indian scripts, and also the teaching of handwriting skills using computers.
- “Offline” handwriting recognition, or Offline HWR, refers to the interpretation of handwriting captured statically as an image.
- Optical character recognition, or OCR, refers to the interpretation of printed text captured as an image. It can be used to convert printed or typewritten material, such as books and documents, into electronic form.
These different areas of language technology require different algorithms and linguistic resources. They are all hard research problems because of the variety of writing styles and fonts encountered. Of these, OCR has seen some research in a few Indian scripts because of support from the TDIL program. However, the technology is not yet mature and there is only one commercial offering.

Possible Applications

Natural Language Processing
Electronic dictionaries: Electronic dictionaries are a primary requisite for developing any software in NLP.
ED 1. Monolingual/bilingual dictionaries: 25,000 words per year (per language)
ED 2. Transfer Lexicon and Grammar (TransLexGram) (per language)
The Transfer Lexicon and Grammar involves developing a language resource which would contain:
- English headwords
- Their grammatical category
- Their various senses in Hindi
- The corresponding sense in the other Indian language
- An example sentence in English for each sense of a word
- The corresponding translation in the concerned Indian language
- In the case of verbs, parallel verb frames from English to the Indian language
As is obvious from the above, TransLexGram will be a rich lexicon which will contain not only word-level information but also the crucial information of verb-argument structure and the vibhaktis with specific senses of a verb. The resource, once created, will be a parallel resource not only between English and Indian languages but also across all Indian languages.
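The fields listed for TransLexGram suggest a simple record structure. The sketch below is one possible representation, not the project's actual schema: the class names, field names, category labels and the sample values (romanized, for a hypothetical English-Telugu entry) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sense:
    """One sense of an English headword, with its parallel information."""
    hindi_sense: str          # the sense expressed in Hindi
    target_sense: str         # corresponding sense in the other Indian language
    english_example: str      # example sentence in English for this sense
    target_translation: str   # its translation in the Indian language
    verb_frames: List[str] = field(default_factory=list)  # verbs only


@dataclass
class TransLexGramEntry:
    """A hypothetical TransLexGram record for one English headword."""
    headword: str
    category: str             # grammatical category, e.g. "noun" or "verb"
    senses: List[Sense] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the entry is complete."""
        problems = []
        if not self.senses:
            problems.append("entry has no senses")
        for n, sense in enumerate(self.senses, start=1):
            if not sense.english_example:
                problems.append(f"sense {n}: missing English example")
            if self.category == "verb" and not sense.verb_frames:
                problems.append(f"sense {n}: verb without parallel verb frame")
        return problems


# Illustrative values only; a real entry would be written by lexicographers.
water = TransLexGramEntry(
    headword="water",
    category="noun",
    senses=[Sense(
        hindi_sense="paanii",
        target_sense="niiru",
        english_example="She drank a glass of water.",
        target_translation="(to be supplied by a translator)",
    )],
)
incomplete_verb = TransLexGramEntry(
    headword="give",
    category="verb",
    senses=[Sense("denaa", "(target sense)", "Give me the book.", "(translation)")],
)

print(water.validate())
print(incomplete_verb.validate())
```

Keeping the verb frames on the sense rather than on the headword reflects the slide's point that verb-argument structure is tied to specific senses of a verb.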

Creation of Corpora
Domain-Specific Corpora: Apart from the creation of these basic text corpora, an attempt will be made to create domain-specific corpora in the following areas:
a. Newspaper corpora
b. Child language corpus
c. Pathological speech/language data
d. Speech error data
e. Historical/inscriptional databases of Indian languages, which are among the most important, both as living documents of Indian history and for the historical linguistics of Indian languages
f. Comparative, descriptive and reference grammars, which need to be treated as corpus databases
g. Morphological analyzers and morphological generators

POS tagged corpora
Part-of-speech (or POS) tagged corpora are collections of texts in which the part-of-speech category of each word is marked. They are to be developed in a bootstrapping manner:
1. First, manual tagging will be done on some amount of text.
2. Then, a POS tagger which uses learning techniques will be trained on the tagged data.
3. After the training, the tool will automatically tag another set of the raw corpus.
4. The automatically tagged corpus will then be manually validated, and used as additional training data to enhance the performance of the tool.
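The bootstrapping cycle described above can be sketched as a short loop. This is a minimal illustration, not the project's tagger: a most-frequent-tag (unigram) model stands in for the "POS tagger which uses learning techniques", the `validate` callback stands in for the human annotator, and the tag names and toy sentences are assumptions.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences, default_tag="NN"):
    """Learn the most frequent tag per word; unknown words get default_tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    model = {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

    def tag(words):
        return [(word, model.get(word, default_tag)) for word in words]

    return tag


def bootstrap(seed_tagged, raw_batches, validate):
    """Train on manual data, auto-tag each raw batch, validate, retrain."""
    training = list(seed_tagged)              # step 1: manually tagged seed
    tagger = train_unigram_tagger(training)   # step 2: train the tagger
    for batch in raw_batches:
        auto_tagged = [tagger(sentence) for sentence in batch]   # step 3
        corrected = [validate(sentence) for sentence in auto_tagged]  # step 4
        training.extend(corrected)
        tagger = train_unigram_tagger(training)  # retrain on the larger set
    return tagger, training


# Toy data: one manually tagged sentence and one raw batch.
seed = [[("the", "DT"), ("dog", "NN"), ("barks", "VB")]]
raw = [[["the", "cat", "sleeps"]]]

# Stand-in for manual validation: corrections a human annotator would make.
corrections = {"sleeps": "VB"}


def validate(tagged_sentence):
    return [(word, corrections.get(word, tag)) for word, tag in tagged_sentence]


final_tagger, training_data = bootstrap(seed, raw, validate)
print(final_tagger(["cat", "sleeps"]))
```

Each pass enlarges the validated training set, so the share of manual effort per sentence should fall as the tagger improves.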

Other kinds of Corpora
Chunked corpora: The chunked corpora will also be prepared in a manner similar to the POS tagging. Here too, the initial training set will be a completely manual effort. Thereafter, it will be a man-machine effort. That is why the target for the first year is lower, and doubles in successive years. Chunked corpora are a useful resource for various applications.
Semantically tagged corpora: The real challenge in any NLP and text information processing application is the task of disambiguating senses. In spite of long years of R&D in this area, fully automatic WSD with 100% accuracy has remained an elusive goal. One of the reasons for this shortcoming is understood to be the lack of appropriate and adequate lexical resources and tools. One such resource is the "semantically tagged corpora".

Syntactic tree bank: Preparation of this resource requires a higher level of linguistic expertise and more human effort. First, experts will manually tag the data for syntactic parsing. A crucial point in this task is to arrive at a consensus regarding the tags, the degree of fineness in analysis and the methodology to be followed. This calls for discussions among scholars from varying fields, such as Sanskritists, linguists and computer scientists. It will be achieved through workshops and meetings.
Parallel aligned corpora: A text available in multiple languages through translation constitutes a parallel corpus. NBT and Sahitya Akademi are some of the official agencies that develop parallel texts in different languages through translation. Such institutions have given CIIL permission to use their works to create electronic versions as parallel corpora. Literary magazines and newspaper houses with multiple language editions will have to be approached for parallel corpora. Computer programmes have to be written for creating [I] aligned texts; [II] aligned sentences; and [III] aligned chunks.
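As an illustration of step [II], sentence alignment, the sketch below pairs sentences with a simple length-based dynamic programme. It is a toy heuristic under stated assumptions, not the method proposed for LDC-IL: it allows only one-to-one pairings and skips, scores a pairing by the difference in character length, and uses an arbitrary `skip_cost`. Real translations across scripts would need length normalization or lexical cues.

```python
def align_sentences(src, tgt, skip_cost=10.0):
    """Align two sentence lists; returns (src, tgt) pairs, None marks a skip."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    def relax(i, j, new_cost, prev):
        # Keep the cheapest way of reaching cell (i, j).
        if new_cost < cost[i][j]:
            cost[i][j] = new_cost
            back[i][j] = prev

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            if i < n and j < m:  # pair src[i] with tgt[j]
                diff = abs(len(src[i]) - len(tgt[j]))
                relax(i + 1, j + 1, cost[i][j] + diff, (i, j))
            if i < n:            # leave src[i] unaligned
                relax(i + 1, j, cost[i][j] + skip_cost, (i, j))
            if j < m:            # leave tgt[j] unaligned
                relax(i, j + 1, cost[i][j] + skip_cost, (i, j))

    # Trace back from the end to recover the alignment.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if pi == i - 1 and pj == j - 1:
            pairs.append((src[pi], tgt[pj]))
        elif pi == i - 1:
            pairs.append((src[pi], None))
        else:
            pairs.append((None, tgt[pj]))
        i, j = pi, pj
    pairs.reverse()
    return pairs


# Toy input: the middle source sentence has no counterpart in the target.
source = ["aaaa", "bb", "cccccc"]
target = ["AAAA", "CCCCCC"]
aligned = align_sentences(source, target, skip_cost=3.0)
for pair in aligned:
    print(pair)
```

The same scheme could in principle be applied one level up (texts) or one level down (chunks), with a different scoring function at each level.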

Corpora Tools
1. Tools for Transfer Lexicon Grammar (including creation of an interface for building the Transfer Lexicon Grammar)
2. Spellchecker and corrector tools
3. Tools for POS tagging (a trainable tagging tool plus an interface for editing POS tagged corpora)
4. Tools for chunking (rule-based, language-independent chunkers)
5. Interface for chunking (an interface for editing and validating the chunked corpora)
6. Tools for the syntactic tree bank, including an interface for developing the syntactic tree bank
7. Tools for semantic tagging, whose basic resources are the Indian language WordNets: a browser with two windows, in which the senses (i.e., synsets) from the WordNet appear in the second window, after which a manual selection of the sense can be done
8. (Semi-)automatic tagger based on statistical NLP (a preliminary version of which is ready at IITB)
9. Tools for text alignment, including a text alignment tool, a sentence alignment tool and a chunk alignment tool, as well as an interface for aligning corpora