Presentation of the CLIA Project by Pushpak Bhattacharyya, IIT Bombay, On behalf of the CLIA Consortium 12 Dec 2008 On the occasion of FIREatKolkata.

1 Presentation of the CLIA Project by Pushpak Bhattacharyya, IIT Bombay, On behalf of the CLIA Consortium 12 Dec 2008 On the occasion of FIREatKolkata

2 Motivation

3 CLIA is a real need
Great language diversity in India
Low comfort level with English: less than 5% of the total population of about 700 million can use English effectively
Need for critical information in large quantity and high quality, especially in agriculture, health, tourism, education and other sectors
CLIA project started in 2006; domains: tourism and health
12 Dec 08, FIRE – Kolkata - CLIA Project

4 Geographically speaking
Figure labels: Telugu, Tamil, Bengali, Marathi, Punjabi
World rank in terms of number of speakers: Hindi-Urdu 5th, Bengali 7th, Marathi 14th, ...

5 CLIA: basic information

6 Defining Diagram

7 CLIA Consortium Members
Name of Institute | Assigned Language(s)
IIT Bombay (Consortium Leader) | Marathi, Hindi
IIT-Kharagpur (Consortium Co-leader) | Bengali
IIIT Hyderabad | Telugu, Hindi
Anna University-KBC | Tamil
Anna University-College of Engg | Tamil
ISI Kol | Bengali
Jadavpur University Kolkata | Bengali
CDAC-Pune | Marathi, Hindi, Tamil
CDAC-Noida | Punjabi
Utkal University | --

8 Principal Investigators
Name of Institute | Names
IITB | Prof. Pushpak Bhattacharyya
IIT-Kgp | Prof. Sudeshna Sarkar
IIITH | Prof. Vasudeva Varma
AU-KBC | Prof. Sobha L.
AU-CEG | Prof. Ranjani Parthasarathi
ISI Kol | Prof. Mandar Mitra
JU Kol | Prof. Sivaji Bandyopadhyay
CDAC-P | Dr. Ajai Kumar
CDAC-N | Dr. Karunesh Arora
Utkal University | Prof. Sanghamitra Mohanty

9 Some prominent research members
Name of Institute | Names
IITB | Manoj, Vishal, Vishaal, Ashish
IIT-Kgp | Nimesh, Dr. Rajendra
IIITH | Bhupal, Praneet
AU-KBC | Pattavi, Vijay, Vijay
AU-CEG | Kaviha, Subha Lalitha
ISI Kol | Prasenjt, Deepashri, Ayan
JU Kol | Asif, Pinaki
CDAC-P | Swati, Abhishek
CDAC-N | Gaur Mohan, Ankur
Utkal University | Balbant Rai

10 Prior expertise brought to the project (horizontal, i.e., language independent)
Name of Institute | Areas of prior expertise/experience
IITB | NLP (LR, WSD, MT), Semantic Search
IIT-Kgp | Search and Ranking, Shallow Parsing
IIITH | Commercial level search engine building, query processing
AU-KBC | NER, Information Extraction, Summarization, Anaphora
AU-CEG | Morphology, Interlingua
ISI Kol | IR Evaluation, large scale IR system building (SMART)
JU Kol | Example based MT, Summarization, NER
CDAC-P | Converters, File format processors, MT
CDAC-N | Parallel corpora, Query processing
Utkal University | Machine Translation, Lexical Resources

11 Prior expertise brought to the project (vertical, i.e., language specific)
Name of Institute | Areas of prior expertise/experience
IITB | Hindi Marathi wordnet building, Hindi Marathi shallow parsing
IIT-Kgp | Bengali shallow parsing including MA
IIITH | Telugu-Eng CLIR, Telugu query processing
AU-KBC | Tamil NER, Tamil IE, Tamil Morph
AU-CEG | Tamil Morph, Eng-Tamil MT
ISI Kol | Bengali statistical stemming, large scale corpora for Bengali
JU Kol | Bengali NER, EBMT involving Bengali
CDAC-P | Various Indian language converters
CDAC-N | Aligned parallel corpora for Indian languages
Utkal University | --

12 Horizontal tasks of CLIA and the organizations responsible
Input Query processing | IIIT Hyderabad
Crawling, Indexing | IIT KGP, IIITH, IITB
Searching, Ranking | IIT KGP, IIITH, IITB
User Interface | CDAC Noida
File format processing | CDAC Pune

13 Horizontal tasks of CLIA and the organizations responsible (contd.)
Document Processing (index time NER, IE) | AU KBC
Document Processing (post retrieval: snippet, summary) | Jadavpur University
Distributed Search | IIT KGP, Utkal, CDACP
Evaluation, Relevance Judgement | ISI Kolkata
UNL based semantic search (for Tamil) | AU CEG

14 Languages and the organizations responsible
Language | Organization(s)
Bengali | IIT KGP (c), JU, ISI
Hindi | IIITH (c), IITB, CDAC Noida
Marathi | IITB (c), CDAC Pune
Punjabi | CDAC Noida
Tamil | AUKBC (c), AUCEG
Telugu | IIITH

15 CLIA Important Dates
Project start date: 29th Aug 06 (effectively Jan 2007)
First meeting of the Project Review and Steering Group (PRSG): 2nd March 2007
Second PRSG: 30th Aug 2007
Third PRSG: 8th March 2008
Fourth PRSG: 15th July 2008
Alpha version released: 15th July 2008
Beta version to be released (along with the 5th PRSG): January 2009

16 Related consortium: E-IL MT project
English to Indian Language MT
Indian languages: Hindi, Marathi, Bengali, Urdu, Oriya, Telugu, Tamil
Approaches: Statistical MT, Example Based MT
Members: CDAC Pune (c), IIT Bombay, JU, UU, IIITH, IIITA

17 Related consortium: IL-IL MT project
Indian Language to Indian Language MT
Indian languages: Hindi, Marathi, Bengali, Punjabi, Tamil, Telugu, Kannada
Approach: Transfer Based
Members: IIITH (c), CDAC Pune, IIT Bombay, JU, University of Hyderabad, AU KBC

18 All three projects are time bound and result oriented
2-year time frame (extension granted for 1 year)
Strict deliverables
For each project the budget outlay is about Rs 80 million (USD 2 million)

19 CLIA: Top level technological information

20 Process Flow

22 CLIA: achievements in 2 years (Jan 2007 to Dec 2008)
Tools and resources (copyrightable code and data)

23 Steps towards overall evaluation
Yet to be completed
Metrics: Precision, Recall, MAP, F-score etc.
Large relevance judgment base under construction
  50 queries per language (6 languages)
  About 5000 documents per language (6 languages)
Crawled and indexed document base of English: approx 600,000 pages
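The metrics named on this slide can be computed from a ranked result list and a set of relevance judgments. The following is a minimal sketch using the standard textbook definitions, not the consortium's evaluation engine; the function names and the document and query identifiers are hypothetical, and each ranked list is assumed to contain no duplicate documents.

```python
from typing import Dict, List, Set, Tuple


def precision_recall_f1(retrieved: List[str], relevant: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall and F1 for one query."""
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of precision@k taken at each rank k where a relevant document appears."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]) -> float:
    """MAP: average of the per-query average precision over all queries in `runs`."""
    scores = [average_precision(ranked, qrels.get(q, set())) for q, ranked in runs.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

With 50 queries per language, `runs` would hold 50 ranked lists and `qrels` the judged relevant documents for each query.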

24 Copyright for CLIA (code)
Code | Details
Input Processing:
  Soft Keyboard (Hindi, Bengali, Tamil, Telugu, Punjabi, Marathi languages) (CDAC-P)
  Algorithm for transliteration of Devanagari words to English using Segment Based Transliteration (IIITH, IITB)
  Implementation of Multilingual Sense Dictionary along with API for accessing MSD during lexical substitution (IITB)
  Implementation of automatic multi-word extraction algorithm for populating the multi-word field of index (IITB)
Bengali:
  Bengali stemmer (IITKGP)
  Bengali Hindi transliteration (IITKGP)
Marathi:
  Implementation of Language Analyzers (Morphological Analyzer) for Marathi (IITB)

25 Copyright for CLIA (code) contd.
Code | Details
Punjabi:
  Punjabi Spell Normalizer (CDAC-N)
  Punjabi Stemmer (CDAC-N)
  Font transcoders (Unicode - proprietary fonts), map files etc. (CDAC-N)
Tamil:
  Stemmer for Tamil (AUKBC)
  Named Entity Recognition engine (AUKBC)
  Information Extraction (AUKBC)
  Font transcoders (Tamil proprietary fonts) (AUKBC)
  IE template translation (AUKBC)

26 Copyright for CLIA (code) contd.
Code | Details
Telugu:
  Language Analyzer for Telugu (IIITH)
  Query Translation for Telugu and Hindi (IIITH)
  Query Transliteration for all languages (IIITH)
  Transcoder (IIITH)
Indexing:
  CML converter (IITKGP)
  Focused Crawler (IIITH)
  Language Identifier (IIITH)
  File Format Processors (CDACP)

27 Copyright for CLIA (code) contd.
Code | Details
Ranking:
  Ranker implementation (IITKGP)
Output Processing:
  Snippet Generation (JU)
  Summary Generation (JU)
  Snippet Translation (JU)
UNL:
  Sentence constituent UNL enconverter (AUCEG)
  UNL indexer (AUCEG)
  UNL template based Information Extractor (AUCEG)
  UNL template based Summarizer (AUCEG)
  UNL based search and ranking (ranking module under development) (AUCEG)

28 Copyright for CLIA (data)
Data | Details
Input Processing
Bengali:
  Synset dictionary entries for Bengali (shared with JU and CDAC Pune)
  English to Bengali transliteration of NE list (shared with JU and IIT KGP)
  NE annotated corpora (IITKGP)
  NE list transliterated (IITKGP)
Telugu:
  Telugu to English Dictionary (IIITH)
  Telugu to English transliteration list (IIITH)
  NE annotated corpora for Telugu and Hindi (IIITH)
  Telugu corpus developed for IE module (IIITH)

29 Copyright for CLIA (data) contd.
Data | Details
Input Processing
Tamil:
  English - Tamil parallel Named Entity list (AUKBC)
  Tamil - English Dictionary (AUKBC)
  Synset dictionary entries for Tamil (AUKBC)
  Tamil Named Entity annotated corpus (AUKBC)
  English Named Entity annotated corpus (AUKBC)
  Named Entity tagset (AUKBC)

30 Copyright for CLIA (data) contd.
Data | Details
Punjabi:
  Punjabi translations (for parallel corpora) (CDAC-N)
  English - Hindi - Punjabi parallel named entity list (CDAC-N)
  Punjabi Named Entity tagged corpus (under development) (CDAC-N)
  Database for Punjabi stemmer (prior development) (CDAC-N)
Marathi:
  English to Marathi transliteration of NE list (IITB and CDAC Pune)
  Marathi-English parallel corpora in tourism domain used for training the snippet translation SMT system (IITB)
  List of Multi-Word Expressions in Marathi and Hindi (IITB)
  English-Marathi parallel list of named entities used for IE template translation (shared with C-DAC Pune)
Hindi:
  Hindi to English Dictionary (IIITH)
  Hindi to English transliteration list (IIITH)
  Hindi MW list (IITB)

31 Copyright for CLIA (data) contd.
Data | Details
Evaluation of the IR system:
  Set of test topics (general domain, tourism domain) (ISIK)
  Relevance judgments for the above pair (ISIK)
UNL:
  UW list - Tourism domain (AUCEG)

32 Conclusion
Large scale national level activity
Large number of tools and resources developed under the consortium
Alpha release done in July 2008
Beta release to take place in Jan 2009
Look forward to more detailed interactions and suggestions from the international audience

33 Introducing people…

34 Principal Investigators
Name of Institute | Names
IITB | Prof. Pushpak Bhattacharyya
IIT-Kgp | Prof. Sudeshna Sarkar
IIITH | Prof. Vasudeva Varma
AU-KBC | Prof. Sobha Nair
AU-CEG | Prof. Ranjani Parthasarathi
ISI Kol | Prof. Mandar Mitra
JU Kol | Prof. Sivaji Bandyopadhyay
CDAC-P | Dr. Ajai Kumar
CDAC-N | Dr. Karunesh Arora
Utkal University | Prof. Sanghamitra Mohanty

36 Overview
Technical status of the project
Technical documentation
Shared resources
Testing methodology
Software documentation
Alpha and Beta versions

37 Technical Summary

38 Work Flow
Diagram labels: Input Query in IL, Input Query Processing, Search, Output Generation, Document Processing, Evaluation

39 Project Status
Diagram labels: Input Query in IL, Input Query Processing, Search, Output Generation, Document Processing, Evaluation

40 Status - Input Processing
Stemmer
  All language stemmers developed
  Integrated with Nutch through plug-ins
  Monolingual retrievals are working
MWE
  Guidelines are under discussion (IITB)
  Marathi ~ 2000 MWE
  Bangla ~ 600 MWE
  Tamil ~ 600 MWE
  Punjabi ~ 4000 MWE

41 Status – Input Processing: NER
Language | NE-tagged corpus size | Accuracy | NE list details
Hindi (IIITH) | 50K words | 68% | 31,177 entries
English | 50K (AUKBC) | 88.5% (Precision), 73.7% (Recall), F-score 80.44% | 7,500 entries (AUKBC); Gazetteer list size (IITKgp): Health 39,819 entries, Tourism 90,848 entries, General 4,79,427 entries
Punjabi (CDACN) | Not started | NA | Person 10,004; City 500; Company 500; Hospital 20,603
Marathi (IITB) | 50K | 61.43% (F-score) | Total 4763; Time 361; Numerical 706; Names 3666
Bengali (IITKgp) | 125K (all domains) | ~75-78% | Bangla: 90,000 names (all domains); Gazetteer list is being transliterated to Bangla
Tamil (AUKBC) | 94K | 88.5% (Precision), 73.7% (Recall), F-score 80.44% | NE 23,000 entries; Dictionary of personal names 70,000 (tagged corpus + dictionary used for NER)
Telugu (IIITH) | 60K | 74% | 38,000 entries

42 Status - Input Processing
WSD (IITB)
  2nd version WSD
  Interface for sense-marking of corpus developed by IITB
Dictionary
  IITB working on E-Hin linkage
  All LVs working on IL-IL linking and E-IL linking
  ~10,000 synsets generated from Tourism corpora

43 Status: Dictionary
Eng-Hin linkage: ~2500 synsets linked (IITB)
IL-IL Dictionary Status (as on 30 Sept 07)
Language | #Synsets linked (without cross-linking)
Bengali | 2005
Marathi | 4298 (all cross-linked)
Punjabi | 559
Tamil | 1890
Telugu | 461

44 Sample Input screen: Input screen

45 Sample Input screen: Advanced search option

46 Project Status
Diagram labels: Input Query in IL, Input Query Processing, Search, Output Generation, Document Processing, Evaluation

47 Status – Search
Size of indexed corpus
Language | No of pages | No of URLs
English | 10,000 | 115
Hindi | 21,000 | 25
Bangla | 3,000 | 25
Tamil | 20,000 | 25
Punjabi | 17,000 | 25
Marathi | 3,300 | 42

48 Status – Search
cML-Text Converter (IIT-Kgp)
  First version of the engine is ready
  Software extracts the fields and body, but does not identify paragraphs and blocks in this version
  Has been tested for Bengali
  Ready to be integrated with Nutch

49 Project Status
Diagram labels: Input Query in IL, Input Query Processing, Search, Output Generation, Document Processing, Evaluation

50 Status – Document Processing
Basic IE engine and eleven IE templates are ready (AUKBC)
Has been tested with sample documents (EILMT corpus)
First template, "How to reach the place", is being translated to Tamil and Telugu
For other languages, the inflectional markers are being provided

51 Project Status
Diagram labels: Input Query in IL, Input Query Processing, Search, Output Generation, Document Processing, Evaluation

52 Sample Output screen: Output screen if input language is Hindi

53 Sample Output screen: Output screen if input language is Hindi, and English tab is selected

54 Sample Output screen: Output screen of translation of snippet (English to Bengali)

55 Sample Output screen: Advanced output screen with Hindi summary

56 Sample Output screen: Advanced output screen with Hindi summary

57 Sample Output screen: Sample screen with Information Extraction

58 Status – Output Generation
Snippet Generation (JU)
  Working for monolingual retrieval
  Integrated with Nutch
  Has been tested for Bengali

59 Project Status
Diagram labels: Input Query in IL, Input Query Processing, Search, Output Generation, Document Processing, Evaluation

60 Status - Evaluation
Corpora
  Tourism and Health corpora being collected for all languages
  News corpora also being collected; the period of the news corpora ranges from 2002 to 2007
  For news corpora, ISI Kol is in dialogue with TOI and Hindustan Times for permission to use their multilingual corpora

61 Details of Corpora (crawled)
Assumption in SRS: each language corpus has at least 50,000 documents from General / News + all available documents in Tourism and Health

62 Evaluation: Topics
Topics (ISI Kol)
  A set of 95 topics is ready for evaluation
  30 topics for training, 50 topics for testing and 15 topics as stand-by
  Each topic = Title + Narration + Description
  Translation of these 95 topics has been completed by all six language verticals
Sample topic
  Euro Inflation
  Find documents about rises in prices after the introduction of the Euro
  Any document is relevant that provides information on the rise of prices in any country that introduced the common European currency.
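The topic structure and the 30/50/15 split described on this slide can be represented as follows. This is only an illustrative sketch: the `Topic` class and `split_topics` function are hypothetical names, the split is taken in list order (the slides do not say how topics were assigned to each group), and mapping the second and third sentences of the sample topic to description and narration is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Topic:
    title: str
    description: str
    narration: str


def split_topics(
    topics: List[Topic], n_train: int = 30, n_test: int = 50
) -> Tuple[List[Topic], List[Topic], List[Topic]]:
    """Split topics in list order into training, testing and stand-by groups."""
    train = topics[:n_train]
    test = topics[n_train:n_train + n_test]
    standby = topics[n_train + n_test:]
    return train, test, standby


# The sample topic from the slide (field assignment is assumed).
sample = Topic(
    title="Euro Inflation",
    description="Find documents about rises in prices after the introduction of the Euro",
    narration=(
        "Any document is relevant that provides information on the rise of prices "
        "in any country that introduced the common European currency."
    ),
)
```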

63 Evaluation Methodology
Benchmark data creation
Diagram labels: Corpus, Queries, IR engine 1, IR engine 2, IR engine n, Pool, Human judges, Relevance Judgements

64 Evaluation Methodology
Benchmark data creation
  Sample documents (corpus)
  Sample queries / topics (95)
  Relevance judgement
    No. of relevance-judged Bangla documents: ~4,500
    Independently judged against 23 topics by each of two judges
Pooling
  Pooling strategies adopted by TREC
  Lists of the top ~100 documents are taken
  Pool = union of these
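The pooling step above amounts to a set union over truncated ranked lists. A minimal sketch for a single topic follows, assuming each IR engine returns a ranked list of document IDs; `build_pool` and the engine names are hypothetical.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int = 100) -> Set[str]:
    """Union of the top-`depth` documents from every engine's ranked list for one topic."""
    pool: Set[str] = set()
    for ranked in runs.values():
        pool.update(ranked[:depth])
    return pool


# Example with a shallow depth so the effect is visible.
runs_for_topic = {
    "engine1": ["a", "b", "c"],
    "engine2": ["b", "d", "e"],
}
pool = build_pool(runs_for_topic, depth=2)  # documents a, b and d
```

Only the documents in the pool are shown to the human judges; anything outside the pool is left unjudged.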

65 Evaluation methodology
Evaluation engine
Diagram labels: 30 Topics/Queries, Corpus > 50,000 docs, Retrieval Engine, Top 100 Docs, Relevance Judgments, Evaluation Engine, Metrics

66 UNL
Monolingual retrieval is working for Tamil documents
6500 words in UNL Dictionary
Words + MWE indexed
Documents indexed
  No. of documents processed in Tourism: 564
  No. of Concept-Relation-Concept indexed: 11,754
  No. of Concept-Relation indexed: 11,754
  No. of Concepts indexed: 17,650
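The slides do not describe how the AU-CEG indexer is built. As a purely hypothetical illustration of the three index levels counted above (Concept-Relation-Concept, Concept-Relation, Concept), the sketch below builds three inverted indexes from per-document triples and searches them with a back-off from the most specific level to the least. The function names, the back-off strategy and the sample triples are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (concept, relation, concept)


def build_unl_indexes(doc_triples: Dict[str, List[Triple]]):
    """Build three inverted indexes mapping keys to the set of matching document IDs."""
    crc_index: Dict[Triple, Set[str]] = defaultdict(set)
    cr_index: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    c_index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, triples in doc_triples.items():
        for head, relation, tail in triples:
            crc_index[(head, relation, tail)].add(doc_id)
            cr_index[(head, relation)].add(doc_id)
            c_index[head].add(doc_id)
            c_index[tail].add(doc_id)
    return crc_index, cr_index, c_index


def search(query: Triple, crc_index, cr_index, c_index) -> Set[str]:
    """Back off from the full triple to (concept, relation) to the head concept alone."""
    head, relation, tail = query
    return (
        crc_index.get((head, relation, tail))
        or cr_index.get((head, relation))
        or c_index.get(head, set())
    )
```

A query that matches a full triple returns only the documents containing that exact relation; a query with an unseen second concept falls back to broader matches.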

67 Testing Methodology
Black box testing based on SRS and design documents
Unit testing by each sub-system
  Test cases (format) and test reports
Integration testing
  Top down / bottom-up based on dependencies
  Stubs and drivers
Sub-system wise testing (module-wise)
  Input processing
  Search and Retrieval
  Document processing
  Output Generation
  Evaluation
  UNL
System testing
Performance testing
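To illustrate what "stubs and drivers" means in integration testing: a stub is a canned replacement for a module that is not yet integrated, and a driver is a small piece of code that calls the unit under test. The sketch below is generic and not taken from the CLIA code base; `cross_lingual_search`, the stubs and the sample query are hypothetical.

```python
from typing import Callable, List


def cross_lingual_search(
    query: str,
    translate: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
) -> List[str]:
    """Unit under test: translate the query, then retrieve documents for the translation."""
    return retrieve(translate(query))


def translate_stub(query: str) -> str:
    """Stub for a query-translation module: always returns a canned translation."""
    return "taj mahal"


def retrieve_stub(english_query: str) -> List[str]:
    """Stub for the search module: echoes the query it received in a fake document ID."""
    return [f"doc-for:{english_query}"]


def run_driver() -> List[str]:
    """Driver: exercises the unit under test with the stubs plugged in."""
    return cross_lingual_search("ताज महल", translate_stub, retrieve_stub)
```

In a top-down integration the stubs are replaced one by one with the real modules, while the driver and its expected results stay the same.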

68 Integration
Use of controlled corpora for integration
  Use of EILMT English and Hindi parallel corpus
  ISI generates the queries for the corpus
  Translation of queries by all LVs
  English and Hindi synsets identified for building multilingual dictionary by each LV
  Each language vertical will be tested for its respective cross-lingual retrieval
  Information Extraction and output generation will be done on the same corpora
Integration of each LV into Nutch at IITKgp

69 Test and Integration (contd.)
Bug tracking system (Bugzilla) to be installed
  Currently planned for installation at IITB on the same server as CVS
Bugzilla
  Web-based general-purpose bug tracker tool
  Tracks not only software bugs but also all other user-submitted tracking tickets
  Eases communication between team members
  Can be integrated with CVS and Wiki

70 Bugzilla
Requirements
  A compatible database management system: MySQL, PostgreSQL
  A suitable release of Perl 5
  A compatible web server
  A suitable mail transfer agent, or any SMTP server
Bugzilla demo

71 Bugzilla - Design
Bugs can be submitted by anybody, and will be assigned to a particular developer

72 Deployment diagram
Deployment diagram for Nutch-based search subsystem
The real-life scenario would have four more such index servers, one for every Indian language, and (maybe) more search servers to ensure a greater number of searches per unit time
Quoted from Mike Cafarella, Doug Cutting, Building Nutch: Open Source Search, Queue, v.2 n.2, April 2004

73 Hosting of Alpha and Beta versions
Alpha version
  ~10,000 documents in each language
  Low complexity system, hence a simple hardware configuration is sufficient
  Does not include summary generation and output translation
  Planned for Dec 2008
Beta version
  ~10,00,000 documents in each language
  Hardware configuration being worked out, based on disk space requirements, throughput of system, response times, simultaneous users etc.
  Following details are being worked out: connectivity, where to host, support for hosting
  Planned for July 2008

74 Elitex08: Demo of Alpha Version
Plan to demonstrate the following:
  Cross-lingual information retrieval for all languages
  Information Extraction and translation of at least one template to Tamil / Telugu
  Snippet generation (monolingual)
Hardware integration: IITKgp
Publicity management / poster design: JU
Funds: participation fees to be shared
Demonstrate the same at IJCNLP08 exhibition (in Hyderabad, Jan 2008)

75 Gantt chart (as on Aug 30)

76 Gantt chart (as on Aug 30)

77 Software documentation
SRS (based on IEEE)
Design document v2.0 (based on RUP)
User Requirements Document (Ver 5.0)
Java docs
Test cases template
File naming conventions
Testing and integration guidelines
Code review guidelines

78 Software documentation: SRS
SRS
  Introduction
  Overall description
  External interface requirements
  System features (module-wise)
  Advanced search system for Tamil using UNL

79 Software documentation: DD
Design document (v 2.0)
  Has been simplified to suit project needs
  Introduction
  System Architecture
    Solution Architecture (brief description of systems, subsystems)
    Software Architecture (block diagrams)
  System Design
    Logical Design (class diagrams)
    Component Design (component diagrams)
  Appendix - other details

80 Software documentation: URD
URD
  Introduction
  Objective
  Scope of the project
  Product perspective
  Capabilities of the product
  User characteristics
  Assumptions and dependencies
  Operational environment
  Input / output scenarios
  Definitions, acronyms and abbreviations
  References

81 Software documentation: Test
Test case template (for all tests)
Test case | Test data | Expected result | Actual result | Remarks

82 Software documentation: File naming
File naming convention captures the following:
  Subject & domain of document
  Content type (ppt / doc / rpt / Tr / etc.)
  Name of institute (IITB / ISI / IIITH etc.)
  Date of creation of doc (dd-mon-yy)
  Version no.
Format _ _ _ _.
E.g. PRSG_Pres_IITB_08dec07_v1.ppt
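A file name following this convention can be checked and split into its fields mechanically. The sketch below infers the pattern only from the single example `PRSG_Pres_IITB_08dec07_v1.ppt`; the exact field syntax, the function name `parse_name`, and the reading of two-digit years as 20xx are assumptions, since the format line on the slide is incomplete.

```python
import re
from datetime import date
from typing import Optional

# Month abbreviations as they appear in the example ("dec"); lowercase is assumed.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}

# Assumed layout: subject_contenttype_institute_ddmonyy_vN.ext
NAME_PATTERN = re.compile(
    r"^(?P<subject>[^_]+)_(?P<content_type>[^_]+)_(?P<institute>[^_]+)"
    r"_(?P<day>\d{2})(?P<mon>[a-z]{3})(?P<yy>\d{2})_v(?P<version>\d+)\.(?P<ext>\w+)$"
)


def parse_name(filename: str) -> Optional[dict]:
    """Return the fields encoded in a file name, or None if it does not fit the pattern."""
    match = NAME_PATTERN.match(filename)
    if match is None or match["mon"] not in MONTHS:
        return None
    try:
        # Two-digit years are assumed to mean 20xx.
        created = date(2000 + int(match["yy"]), MONTHS[match["mon"]], int(match["day"]))
    except ValueError:  # e.g. day 31 in a 30-day month
        return None
    return {
        "subject": match["subject"],
        "content_type": match["content_type"],
        "institute": match["institute"],
        "created": created,
        "version": int(match["version"]),
        "extension": match["ext"],
    }
```

Applied to the example on the slide, this yields the subject `PRSG`, content type `Pres`, institute `IITB`, creation date 8 December 2007 and version 1.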

83 Shareable Resources and Tools
Shared resources across projects
From ILILMT to CLIA:
  Morph Analyzer
  POS Tagger
  Chunker
  Dictionary Standardization
  IL-IL Synsets
From EILMT to CLIA:
  Synsets E-IL
From CLIA to other projects:
  NER engine
  NE list
  MWE

84 Collaborative tools used - CLIA
Tool | Purpose
Googlegroups | Group e-mailing
Wiki | Project documents, member contact details, minutes of meetings, presentations, timelines, progress reports, fund details etc.
CVS | Source code
Google docs | Sharing and editing of documents
Webex audio conferencing | Weekly teleconferences

85 CLIA Wiki site
CLIA Wiki contents
  Project team contact details
  Project documentation (SRS, Design doc, URD..)
  Meeting minutes and presentations
  Project fund details
  Progress reports and timelines
  Project resources
  Corpus
  Collaborative platform for audio conferences

86 CLIA Wiki site

87 Wiki – Upload notification

88 Thank You
