Presentation of the CLIA Project by Pushpak Bhattacharyya, IIT Bombay, On behalf of the CLIA Consortium 12 Dec 2008 On the occasion of FIREatKolkata.

Presentation transcript:

2 Motivation

12 Dec 08 FIRE – Kolkata - CLIA Project
3 CLIA is a real need
- Great language diversity in India
- Low comfort level with English: less than 5% of the total population of about 700 million can use English effectively
- Need for critical information in large quantity and high quality, especially in agriculture, health, tourism, education and sectors
- CLIA project started in 2006; domains: tourism and health

4 Geographically speaking
Languages shown: Telugu, Tamil, Bengali, Marathi, Punjabi
World rank in terms of number of speakers: Hindi-Urdu 5th; Bengali 7th; Marathi 14th; ...

5 CLIA: basic information

6 Defining Diagram

7 CLIA Consortium Members
Name of Institute: Assigned Language(s)
- IIT Bombay (Consortium Leader): Marathi, Hindi
- IIT-Kharagpur (Consortium Co-leader): Bengali
- IIIT Hyderabad: Telugu, Hindi
- Anna University-KBC: Tamil
- Anna University-College of Engg: Tamil
- ISI Kol: Bengali
- Jadavpur University Kolkata: Bengali
- CDAC-Pune: Marathi, Hindi, Tamil
- CDAC-Noida: Punjabi
- Utkal University: --

8 Principal Investigators
Name of Institute: Name
- IITB: Prof. Pushpak Bhattacharyya
- IIT-Kgp: Prof. Sudeshna Sarkar
- IIITH: Prof. Vasudeva Varma
- AU-KBC: Prof. Sobha L.
- AU-CEG: Prof. Ranjani Parthasarthy
- ISI Kol: Prof. Mandar Mitra
- JU Kol: Prof. Sivaji Bandyopadhyay
- CDAC-P: Dr. Ajai Kumar
- CDAC-N: Dr. Karunesh Arora
- Utkal University: Prof. Sanghamitra Mohanty

9 Some prominent research members
Name of Institute: Names
- IITB: Manoj, Vishal, Vishaal, Ashish
- IIT-Kgp: Nimesh, Dr. Rajendra
- IIITH: Bhupal, Praneet
- AU-KBC: Pattavi, Vijay, Vijay
- AU-CEG: Kaviha, Subha Lalitha
- ISI Kol: Prasenjt, Deepashri, Ayan
- JU Kol: Asif, Pinaki
- CDAC-P: Swati, Abhishek
- CDAC-N: Gaur Mohan, Ankur
- Utkal University: Balbant Rai

10 Prior expertise brought to the project (horizontal, i.e., language independent)
Name of Institute: Areas of prior expertise/experience
- IITB: NLP (LR, WSD, MT), Semantic Search
- IIT-Kgp: Search and Ranking, Shallow Parsing
- IIITH: Commercial level search engine building, query processing
- AU-KBC: NER, Information Extraction, Summarization, Anaphora
- AU-CEG: Morphology, Interlingua
- ISI Kol: IR Evaluation, large scale IR system building (SMART)
- JU Kol: Example based MT, Summarization, NER
- CDAC-P: Converters, File format processors, MT
- CDAC-N: Parallel corpora, Query processing
- Utkal University: Machine Translation, Lexical Resources

11 Prior expertise brought to the project (vertical, i.e., language specific)
Name of Institute: Areas of prior expertise/experience
- IITB: Hindi Marathi wordnet building, Hindi Marathi shallow parsing
- IIT-Kgp: Bengali shallow parsing including MA
- IIITH: Telugu-Eng CLIR, Telugu query processing
- AU-KBC: Tamil NER, Tamil IE, Tamil Morph
- AU-CEG: Tamil Morph, Eng-Tamil MT
- ISI Kol: Bengali statistical stemming, large scale corpora for Bengali
- JU Kol: Bengali NER, EBMT involving Bengali
- CDAC-P: Various Indian language converters
- CDAC-N: Aligned parallel corpora for Indian languages
- Utkal University: --

12 Horizontal tasks of CLIA and the organizations responsible
- Input Query processing: IIIT Hyderabad
- Crawling, Indexing: IIT KGP, IIITH, IITB
- Searching, Ranking: IIT KGP, IIITH, IITB
- User Interface: CDAC Noida
- File format processing: CDAC Pune

13 Horizontal tasks of CLIA and the organizations responsible (contd.)
- Document Processing (index time NER, IE): AU KBC
- Document Processing (Post Retrieval: Snippet, Summary): Jadavpur University
- Distributed Search: IIT KGP, Utkal, CDACP
- Evaluation, Relevance Judgement: ISI Kolkata
- UNL based semantic search (for Tamil): AU CEG

14 Languages and the organizations responsible
Language: Organization(s)
- Bengali: IIT KGP (c), JU, ISI
- Hindi: IIITH (c), IITB, CDAC Noida
- Marathi: IITB (c), CDAC Pune
- Punjabi: CDAC Noida
- Tamil: AUKBC (c), AUCEG
- Telugu: IIITH

15 CLIA Important Dates
- Project start date: 29th Aug 06 (effectively Jan 2007)
- First meeting of the Project Review and Steering Group (PRSG): 2nd March 2007
- Second PRSG: 30th Aug 2007
- Third PRSG: 8th March 2008
- Fourth PRSG: 15th July 2008
- Alpha version released: 15th July 2008
- Beta version to be released (along with the 5th PRSG): January 2009

16 Related consortium: E-IL MT project
- English to Indian Language MT
- Indian languages: Hindi, Marathi, Bengali, Urdu, Oriya, Telugu, Tamil
- Approaches: Statistical MT, Example Based MT
- Members: CDAC Pune (c), IIT Bombay, JU, UU, IIITH, IIITA

17 Related consortium: IL-IL MT project
- Indian Language to Indian Language MT
- Indian languages: Hindi, Marathi, Bengali, Punjabi, Tamil, Telugu, Kannada
- Approach: Transfer Based
- Members: IIITH (c), CDAC Pune, IIT Bombay, JU, University of Hyderabad, AU KBC

18 All three projects are time bound and result oriented
- 2-year time frame (extension granted for 1 year)
- Strict deliverables
- For each project the budget outlay is about Rs 80 million (USD 2 million)

19 CLIA: Top level technological information

20 Process Flow

22 CLIA: achievements in 2 years (Jan 2007 to Dec 2008) Tools and resources (Copyrightable code and data)

23 Steps towards overall evaluation
- Yet to be completed
- Precision, Recall, MAP, F-score etc.
- Large relevance judgment base under construction
- 50 queries per language (6 languages)
- About 5000 documents per language (6 languages)
- Crawled and indexed document base of English: approx. 600,000 pages
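The slide names precision, recall and F-score without defining them. As a rough illustration only (not the consortium's evaluation code), the standard set-based definitions for a single query can be sketched as follows; the function name and the document IDs are hypothetical.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall and F1 for one query.

    retrieved: document IDs returned by the engine
    relevant:  document IDs judged relevant for the query
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    # F1 is the harmonic mean of precision and recall (0 when there are no hits).
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical example: 4 documents retrieved, 3 judged relevant, 2 in common.
p, r, f = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r, f)
```

Here two of the four retrieved documents are relevant, so precision is 0.5, recall is 2/3 and F1 is 4/7.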

24 Copyright for CLIA (code)
Input Processing:
- Soft Keyboard (Hindi, Bengali, Tamil, Telugu, Punjabi, Marathi languages) (CDAC-P)
- Algorithm for transliteration of Devanagari words to English using Segment Based Transliteration (IIITH, IITB)
- Implementation of Multilingual Sense Dictionary along with API for accessing MSD during lexical substitution (IITB)
- Implementation of automatic multi-word extraction algorithm for populating the multi-word field of index (IITB)
Bengali:
- Bengali stemmer (IITKGP)
- Bengali Hindi transliteration (IITKGP)
Marathi:
- Implementation of Language Analyzers (Morphological Analyzer) for Marathi (IITB)

25 Copyright for CLIA (code) contd.
Punjabi:
- Punjabi Spell Normalizer (CDAC-N)
- Punjabi Stemmer (CDAC-N)
- Font transcoders (Unicode - Proprietary fonts) - map files etc. (CDAC-N)
Tamil:
- Stemmer for Tamil (AUKBC)
- Named Entity Recognition engine (AUKBC)
- Information Extraction (AUKBC)
- Font transcoders (Tamil Proprietary fonts) (AUKBC)
- IE template Translation (AUKBC)

26 Copyright for CLIA (code) contd.
Telugu:
- Language Analyzer for Telugu (IIITH)
- Query Translation for Telugu and Hindi (IIITH)
- Query Transliteration for all languages (IIITH)
- Transcoder (IIITH)
Indexing:
- CML converter (IITKGP)
- Focused Crawler (IIITH)
- Language Identifier (IIITH)
- File Format Processors (CDACP)

27 Copyright for CLIA (code) contd.
Ranking:
- Ranker implementation (IITKGP)
Output Processing:
- Snippet Generation (JU)
- Summary Generation (JU)
- Snippet Translation (JU)
UNL:
- Sentence constituent UNL enconverter (AUCEG)
- UNL indexer (AUCEG)
- UNL Template based Information extractor (AUCEG)
- UNL Template based Summarizer (AUCEG)
- UNL based Search and ranking (ranking module under development) (AUCEG)

28 Copyright for CLIA (data)
Input Processing, Bengali:
- Synset dictionary entries for Bengali (shared with JU and CDAC Pune)
- English to Bengali Transliteration of NE list (shared with JU and IIT KGP)
- NE annotated corpora (IITKGP)
- NE list transliterated (IITKGP)
Input Processing, Telugu:
- Telugu to English Dictionary (IIITH)
- Telugu to English Transliteration list (IIITH)
- NE annotated corpora for Telugu and Hindi (IIITH)
- Telugu corpus developed for IE module (IIITH)

29 Copyright for CLIA (data) contd.
Input Processing, Tamil:
- English - Tamil Parallel Named Entity List (AUKBC)
- Tamil - English Dictionary (AUKBC)
- Synset dictionary entries for Tamil (AUKBC)
- Tamil Named Entity annotated corpus (AUKBC)
- English Named Entity annotated corpus (AUKBC)
- Named Entity Tagset (AUKBC)

30 Copyright for CLIA (data) contd.
Punjabi:
- Punjabi translations (for parallel corpora) (CDAC-N)
- English - Hindi - Punjabi parallel named entity list (CDAC-N)
- Punjabi Named Entity Tagged Corpus (under development) (CDAC-N)
- Database for Punjabi stemmer (prior development) (CDAC-N)
Marathi:
- English to Marathi Transliteration of NE list (IITB and CDAC Pune)
- Marathi-English parallel corpora in tourism domain used for training the snippet translation SMT system (IITB)
- List of Multi-Word Expressions in Marathi and Hindi (IITB)
- English-Marathi parallel list of named entities used for IE Template translation (shared with C-DAC Pune)
Hindi:
- Hindi to English Dictionary (IIITH)
- Hindi to English transliteration list (IIITH)
- Hindi MW list (IITB)

31 Copyright for CLIA (data) contd.
Evaluation of the IR system:
- Set of test topics (general domain, tourism domain) (ISIK)
- Relevance judgments for the above pair (ISIK)
UNL:
- UW list - Tourism domain (AUCEG)

32 Conclusion
- Large scale national level activity
- Large number of tools and resources developed under the consortium
- Alpha release done in July 2008
- Beta release to take place in Jan 2009
- Look forward to more detailed interactions and suggestions from the international audience

33 Introducing people…

34 Principal Investigators
Name of Institute: Name
- IITB: Prof. Pushpak Bhattacharyya
- IIT-Kgp: Prof. Sudeshna Sarkar
- IIITH: Prof. Vasudeva Varma
- AU-KBC: Prof. Sobha Nair
- AU-CEG: Prof. Ranjani Parthasarthy
- ISI Kol: Prof. Mandar Mitra
- JU Kol: Prof. Sivaji Bandyopadhyay
- CDAC-P: Dr. Ajai Kumar
- CDAC-N: Dr. Karunesh Arora
- Utkal University: Prof. Sanghamitra Mohanty

36 Overview
- Technical Status of the Project
- Technical Documentation
- Shared resources
- Testing methodology
- Software Documentation
- Alpha and Beta versions

Technical Summary

38 Work Flow
Diagram stages: Input Query in IL; Input Query Processing; Search; Output Generation; Document Processing; Evaluation

39 Project Status
Diagram stages: Input Query in IL; Input Query Processing; Search; Output Generation; Document Processing; Evaluation

40 Status - Input Processing
Stemmer
- All language stemmers developed
- Integrated with Nutch through plug-ins
- Monolingual retrievals are working
MWE
- Guidelines are under discussion (IITB)
- Marathi: ~2000 MWE
- Bangla: ~600 MWE
- Tamil: ~600 MWE
- Punjabi: ~4000 MWE

41 Status – Input Processing: NER
Columns: Language; NE-tagged corpus size; Accuracy; NE list details
- Hindi (IIITH): 50K words; 68%; 31,177 entries
- English: 50K (AUKBC); 88.5% (Precision), 73.7% (Recall), F-Score 80.44%; 7,500 entries (AUKBC), Gazetteer list size (IITKgp): Health 39,819 entries, Tourism 90,848 entries, General 4,79,427 entries
- Punjabi (CDACN): Not started; NA; Person 10,004, City 500, Company 500, Hospital 20,603
- Marathi (IITB): 50K; 61.43% (F-score); Total 4763, Time 361, Numerical 706, Names
- Bengali (IITKgp): 125K (all domains); ~75-78%; Bangla: 90,000 names (all domains), Gazetteer list is being transliterated to Bangla
- Tamil (AUKBC): 94K; 88.5% (Precision), 73.7% (Recall), F-Score 80.44%; NE 23,000 entries, Dictionary of personal names 70,000 (tagged corpus + dictionary used for NER)
- Telugu (IIITH): 60K; 74%; 38,000 entries
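As a quick sanity check on the figures above, the F-score can be recomputed from the reported precision and recall. This is a small sketch assuming the usual balanced F-measure (harmonic mean); the slide does not say which F-measure variant was used, and the function name is hypothetical.

```python
def f_score(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported on the slide for English and Tamil NER.
f = f_score(0.885, 0.737)
print(f"{f:.4f}")  # prints 0.8042
```

The recomputed value (about 80.4%) agrees with the reported 80.44% to within what rounding of the published precision and recall would plausibly explain.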

42 Status - Input Processing
WSD (IITB)
- 2nd version WSD
- Interface for sense-marking of corpus developed by IITB
Dictionary
- IITB working on E-Hin linkage
- All LVs working on IL-IL linking and E-IL linking
- ~10,000 synsets generated from Tourism corpora

43 Status: Dictionary
Eng-Hin linkage: ~2500 synsets linked (IITB)
IL-IL Dictionary Status (as on 30 Sept 07)
Language: #Synsets linked (without cross-linking)
- Bengali: 2005
- Marathi: 4298 (all cross-linked)
- Punjabi: 559
- Tamil: 1890
- Telugu: 461

44 Sample Input Screen: input screen

45 Sample Input Screen: advanced search option

47 Status – Search
Size of indexed corpus
Language: No. of pages; No. of URLs
- English: 10,
- Hindi: 21,000; 25
- Bangla: 3,000; 25
- Tamil: 20,000; 25
- Punjabi: 17,000; 25
- Marathi: 3,300; 42

48 Status – Search
cML-Text Converter (IIT-Kgp)
- First version of the engine is ready
- Software extracts the fields and body, but does not identify paragraphs and blocks in this version
- Has been tested for Bengali
- Ready to be integrated with Nutch

50 Status – Document Processing
- Basic IE Engine and eleven IE Templates are ready (AUKBC)
- Has been tested with sample documents (EILMT corpus)
- First template, "How to reach the place", is being translated to Tamil and Telugu
- For other languages, the inflectional markers are being provided

52 Sample Output Screen: output screen if input language is Hindi

53 Sample Output Screen: output screen if input language is Hindi, and English tab is selected

54 Sample Output Screen: output screen of translation of snippet (English to Bengali)

55 Sample Output Screen: advanced output screen with Hindi summary

56 Sample Output Screen: advanced output screen with Hindi summary

57 Sample Output Screen: sample screen with Information Extraction

58 Status – Output Generation
Snippet Generation (JU)
- Working for monolingual retrieval
- Integrated with Nutch
- Has been tested for Bengali

60 Status - Evaluation
Corpora
- Tourism and Health corpora are being collected for all languages
- News corpora are also being collected; the period of the news corpora ranges from 2002 to 2007
- For the news corpora, ISI Kol is in dialogue with TOI and Hindustan Times for permission to use their multilingual corpora

61 Details of Corpora (crawled)
Assumption in SRS: each language corpus has at least 50,000 documents from General / News, plus all available documents in Tourism and Health

62 Evaluation: Topics
Topics (ISI Kol)
- A set of 95 topics is ready for evaluation: 30 topics for training, 50 topics for testing and 15 topics as stand-by
- Each topic = Title + Narration + Description
- Translation of these 95 topics has been completed by all six language verticals
Sample topic
- Title: Euro Inflation
- Description: Find documents about rises in prices after the introduction of the Euro
- Narration: Any document is relevant that provides information on the rise of prices in any country that introduced the common European currency.
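The topic structure and the 30/50/15 split described on this slide could be represented roughly as below. This is only a sketch: the `Topic` class, its field names, and the choice to split by position in the list are assumptions, since the slide does not say how topics were stored or assigned to each group.

```python
from dataclasses import dataclass


@dataclass
class Topic:
    """One evaluation topic: Title + Description + Narration (field names assumed)."""
    topic_id: int
    title: str
    description: str
    narration: str


def split_topics(topics, n_train=30, n_test=50):
    """Split topics into training, testing and stand-by sets by position."""
    train = topics[:n_train]
    test = topics[n_train:n_train + n_test]
    standby = topics[n_train + n_test:]
    return train, test, standby


sample = Topic(
    topic_id=1,  # hypothetical ID
    title="Euro Inflation",
    description="Find documents about rises in prices after the introduction of the Euro",
    narration=("Any document is relevant that provides information on the rise of "
               "prices in any country that introduced the common European currency."),
)

# 95 placeholder topics, just to exercise the split sizes from the slide.
topics = [Topic(i, f"Topic {i}", "", "") for i in range(1, 96)]
train, test, standby = split_topics(topics)
print(len(train), len(test), len(standby))  # prints 30 50 15
```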

63 Evaluation Methodology
Benchmark data creation (diagram labels): Human judges; Corpus; Queries; IR engine 1; IR engine 2; IR engine n; Pool; Relevance Judgements

64 Evaluation Methodology
Benchmark data creation
- Sample documents (corpus)
- Sample queries / topics (95)
- Relevance judgement
  - No. of relevance-judged Bangla documents: ~4,500
  - Independently judged against 23 topics by each of two judges
- Pooling
  - Pooling strategies adopted by TREC
  - Lists of the top ~100 documents are taken
  - Pool = union of these
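The pooling step ("Pool = union of these") can be sketched in a few lines. This is an illustration of TREC-style pooling in general, not the project's code; the function name, engine names and document IDs are hypothetical, and a depth of 2 is used only to keep the example small (the slide mentions roughly 100).

```python
def build_pool(runs, depth=100):
    """Union of the top-`depth` documents from each engine's ranked list.

    runs: mapping from engine name to its ranked list of document IDs
          for a single topic.
    """
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool


# Hypothetical ranked lists from three engines for one topic.
runs = {
    "engine_1": ["d3", "d1", "d7"],
    "engine_2": ["d1", "d9", "d2"],
    "engine_3": ["d7", "d4", "d1"],
}
pool = build_pool(runs, depth=2)
print(sorted(pool))  # prints ['d1', 'd3', 'd4', 'd7', 'd9']
```

Only documents in the pool are shown to the human judges, which keeps the judging effort bounded regardless of corpus size.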

65 Evaluation Methodology
Evaluation engine (diagram labels): 30 Topics/Queries; Corpus > 50,000 docs; Retrieval Engine; Top 100 Docs; Evaluation Engine; Relevance Judgments; Metrics
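One way to read the evaluation engine in this diagram is: take the top 100 documents per topic, compare them with the relevance judgments, and emit a metric such as MAP. The sketch below shows the standard (uninterpolated) average precision and its mean over topics; the function names and the toy data are hypothetical, and the actual CLIA evaluation engine may have computed its metrics differently.

```python
def average_precision(ranked, relevant, cutoff=100):
    """Uninterpolated average precision of one ranked list, cut at `cutoff`."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked[:cutoff], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels, cutoff=100):
    """Mean of average precision over all topics present in `runs`."""
    scores = [average_precision(runs[t], qrels.get(t, ()), cutoff) for t in runs]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical retrieval output and relevance judgments for two topics.
runs = {"t1": ["d1", "d2", "d3"], "t2": ["d5", "d4"]}
qrels = {"t1": {"d1", "d3"}, "t2": {"d4"}}
map_score = mean_average_precision(runs, qrels)
print(round(map_score, 4))  # prints 0.6667
```

In the toy data, topic t1 has average precision (1/1 + 2/3) / 2 = 5/6 and topic t2 has 1/2, giving a MAP of 2/3.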

66 UNL
- Monolingual retrieval is working for Tamil documents
- 6500 words in UNL Dictionary
- Words + MWE indexed
- Documents indexed
- No. of documents processed in Tourism
- No. of Concept-Relation-Concept indexed: 11,754
- No. of Concept-Relation indexed: 11,754
- No. of Concepts indexed: 17,650

67 Testing Methodology
- Black box testing based on SRS and design documents
- Unit testing by each sub-system
  - Test cases (format) and test reports
- Integration testing
  - Top down / bottom-up based on dependencies
  - Stubs and drivers
- Sub-system wise testing (module-wise)
  - Input processing
  - Search and Retrieval
  - Document processing
  - Output Generation
  - Evaluation
  - UNL
- System testing
- Performance testing

68 Integration
Use of controlled corpora for integration
- Use of EILMT English and Hindi parallel corpus
- ISI generates the queries for the corpus
- Translation of queries by all LVs
- English and Hindi synsets identified for building the multilingual dictionary by each LV
- Each language vertical will be tested for its respective cross-lingual retrieval
- Information Extraction and output generation will be done on the same corpora
- Integration of each LV into Nutch at IITKgp

69 Test and Integration (contd.)
- Bug tracking system (Bugzilla) to be installed
- Currently planned for installation at IITB on the same server as CVS
Bugzilla
- Web-based general-purpose bug tracker tool
- Tracks not only software bugs but also all other user-submitted tracking tickets
- Eases communication between team members
- Can be integrated with CVS and Wiki

70 Bugzilla
Requirements
- A compatible database management system: MySQL, PostgreSQL
- A suitable release of Perl 5
- A compatible web server
- A suitable mail transfer agent, or any SMTP server
Bugzilla Demo

71 Bugzilla - Design
Bugs can be submitted by anybody, and will be assigned to a particular developer

72 Deployment diagram
Deployment Diagram for Nutch-based Search Subsystem
- The real-life scenario would have four more such index servers, one for every Indian language, and (maybe) more search servers to ensure a greater number of searches per unit time
- Quoted from Mike Cafarella, Doug Cutting, Building Nutch: Open Source Search, Queue, v.2 n.2, April 2004
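The idea of one index server per language with a search front end in front of them can be illustrated with a toy, in-memory model. Nothing here uses Nutch's actual API: the `IndexServer` class, `federated_search`, the postings and the scores are all hypothetical, and merging by raw score assumes the scores from different indexes are comparable.

```python
import heapq


class IndexServer:
    """Toy stand-in for one per-language index server (not Nutch code)."""

    def __init__(self, language, postings):
        self.language = language
        self.postings = postings  # term -> list of (score, doc_id)

    def search(self, term, k):
        """Return this server's k best (score, doc_id) hits for the term."""
        return heapq.nlargest(k, self.postings.get(term, []))


def federated_search(servers, term, k=10):
    """Query every index server and merge the hits by score."""
    hits = []
    for server in servers:
        for score, doc_id in server.search(term, k):
            hits.append((score, server.language, doc_id))
    return heapq.nlargest(k, hits)


# Hypothetical postings for a Hindi and an English index.
servers = [
    IndexServer("hi", {"taj": [(0.9, "hi-1"), (0.4, "hi-2")]}),
    IndexServer("en", {"taj": [(0.7, "en-5")]}),
]
results = federated_search(servers, "taj", k=2)
print(results)  # prints [(0.9, 'hi', 'hi-1'), (0.7, 'en', 'en-5')]
```

Adding a language then amounts to adding another index server to the list, which matches the slide's remark about scaling out to one index server per Indian language.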

73 Hosting of Alpha and Beta versions
Alpha Version
- ~10,000 documents in each language
- Low complexity system, hence a simple hardware configuration is sufficient
- Does not include Summary generation and Output translation
- Planned for Dec 2008
Beta Version
- ~10,00,000 documents in each language
- Hardware configuration being worked out, based on disk space requirements, throughput of system, response times, simultaneous users etc.
- Following details are being worked out: connectivity, where to host, support for hosting
- Planned for July 2008

74 Elitex08: Demo of Alpha Version
Plan to demonstrate the following:
- Cross-lingual information retrieval for all languages
- Information Extraction and translation of at least one template to Tamil / Telugu
- Snippet Generation (monolingual)
Hardware integration: IITKgp
Publicity management / Poster design: JU
Funds: Participation fees to be shared
Demonstrate the same at IJCNLP08 exhibition (in Hyderabad - Jan 2008)

75 Gantt chart (as on Aug 30)

76 Gantt chart (as on Aug 30)

77 Software documentation
- SRS (based on IEEE)
- Design document v2.0 (based on RUP)
- User Requirements Document (Ver 5.0)
- Java docs
- Test cases template
- File naming conventions
- Testing and integration guidelines
- Code review guidelines

78 Software documentation: SRS
SRS
- Introduction
- Overall description
- External interface requirements
- System features (module-wise)
- Advanced Search system for Tamil using UNL

79 Software documentation: DD
Design document (v 2.0): has been simplified to suit project needs
- Introduction
- System Architecture
  - Solution Architecture (brief description of systems, subsystems)
  - Software Architecture (block diagrams)
- System Design
  - Logical Design (Class Diagrams)
  - Component Design (Component Diagrams)
- Appendix - other details

80 Software documentation: URD
URD
- Introduction
- Objective
- Scope of the project
- Product perspective
- Capabilities of the Product
- User Characteristics
- Assumptions and dependencies
- Operational environment
- Input / Output scenarios
- Definitions, acronyms and abbreviations
- References

81 Software documentation: Test
Test case template (for all tests), columns: Test case; Test data; Expected result; Actual result; Remarks

82 Software documentation: File naming
File naming convention captures the following:
- Subject & domain of document
- Content type (ppt / doc / rpt / Tr / etc.)
- Name of Institute (IITB / ISI / IIITH etc.)
- Date of creation of doc (dd-mon-yy)
- Version no.
Format: _ _ _ _.
E.g. PRSG_Pres_IITB_08dec07_v1.ppt
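A small helper can make the naming convention concrete. The field order and separators below are inferred from the one example on the slide (PRSG_Pres_IITB_08dec07_v1.ppt), since the format template itself did not survive in the transcript; the function and parameter names are hypothetical. The date follows the example's compact form (08dec07) rather than the hyphenated "dd-mon-yy" wording.

```python
from datetime import date

# Lower-case month abbreviations, kept explicit to avoid locale-dependent output.
MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")


def clia_file_name(subject, content_type, institute, created, version, extension):
    """Build a file name as subject_type_institute_ddmonyy_vN.ext (inferred layout)."""
    stamp = f"{created.day:02d}{MONTHS[created.month - 1]}{created.year % 100:02d}"
    return f"{subject}_{content_type}_{institute}_{stamp}_v{version}.{extension}"


name = clia_file_name("PRSG", "Pres", "IITB", date(2007, 12, 8), 1, "ppt")
print(name)  # prints PRSG_Pres_IITB_08dec07_v1.ppt
```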

83 Shareable Resources and Tools
Shared resources across projects
- From ILILMT to CLIA: Morph Analyzer, POS Tagger, Chunker, Dictionary Standardization, IL-IL Synsets
- From EILMT to CLIA: Synsets E-IL
- From CLIA to other projects: NER engine, NE list, MWE

84 Collaborative tools used - CLIA
Tool: Purpose
- Google Groups: Group e-mailing
- Wiki: Project documents, member contact details, minutes of meetings, presentations, timelines, progress reports, fund details etc.
- CVS: Source code
- Google Docs: Sharing and editing of documents
- Webex audio conferencing: Weekly teleconferences

85 CLIA Wiki site
CLIA Wiki contents
- Project team contact details
- Project documentation (SRS, Design doc, URD..)
- Meeting minutes and presentations
- Project fund details
- Progress reports and timelines
- Project resources
- Corpus
- Collaborative platform for audio conferences

86 CLIA Wiki site

87 Wiki – Upload notification

Thank You