SANDHAN: Indian language search engine
SANDHAN: CONSORTIUM PROJECT
- IIT Bombay (co-ordinator)
- CDAC Noida (co-ordinator)
- CDAC Pune
- IIT Kharagpur
- Jadavpur University
- ISI Kolkata
- IIIT Hyderabad
- AU-KBC
- AU-CEG
- Gauhati University
- DAIICT Gujarat
- IIIT Bhubaneswar
- TDIL
INTRODUCTION
- Cross Lingual Information Retrieval (CLIR) engine for Indian languages
- Input: query in one of nine Indian languages (Hindi, Marathi, Tamil, Telugu, Bengali, Punjabi, Assamese, Gujarati, Oriya)
- Output: results in Hindi, English and the query language
- Currently in the second phase of the project
- Three new languages added in the second phase: Assamese, Gujarati, Oriya
- Built on top of the Nutch framework
SOFTWARE USED
- Nutch v0.9: framework
- Hadoop: distributed crawling
- Lucene: indexing
- Moses/GIZA++: training models
- Tomcat: deployment
[Figure: system architecture diagram. Component labels: Fetcher, Web, Analyzer, MWE Lookup, NE Lookup, Domain Identifier, Language Identifier, Font Transcoder, Indexer, CMLifier, UNL Index, Snippet Translation, Summary Generation, Snippet Generation, Translation/Transliteration, MWE Lookup, NE Lookup, Analyzer, Query Formulation, Index, Information Extraction]
RESOURCES DEVELOPED
- Language-specific analyzers
- Stop word lists
- Bilingual dictionaries (X-English, X-Hindi)
- NE lists
- MWE lists
- Transliteration models
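How the resources listed above might be combined at query time can be sketched as follows. This is a minimal illustration in Python, not SANDHAN's actual code: the function name `translate_query`, the toy Marathi entries, and the lookup-table transliterator are all hypothetical, standing in for the real stop word lists, bilingual dictionaries and trained transliteration models.

```python
from typing import Callable, Dict, List, Set


def translate_query(
    query: str,
    stop_words: Set[str],
    dictionary: Dict[str, str],
    transliterate: Callable[[str], str],
) -> List[str]:
    """Translate a source-language query term by term (hypothetical sketch).

    1. Split on whitespace and drop stop words.
    2. Look each remaining term up in the bilingual dictionary.
    3. Fall back to transliteration for out-of-dictionary terms
       (typically names).
    """
    terms = [t for t in query.split() if t not in stop_words]
    translated = []
    for term in terms:
        if term in dictionary:
            translated.append(dictionary[term])
        else:
            translated.append(transliterate(term))
    return translated


# Toy resources, assumed for illustration only.
MARATHI_STOP_WORDS = {"मधील", "आणि"}
MARATHI_ENGLISH = {"मंदिर": "temple"}
TOY_TRANSLITERATIONS = {"पुणे": "Pune"}


def toy_transliterate(term: str) -> str:
    # Stand-in for a trained transliteration model: a plain table lookup
    # that returns the term unchanged when it is not in the table.
    return TOY_TRANSLITERATIONS.get(term, term)


english_terms = translate_query(
    "पुणे मधील मंदिर", MARATHI_STOP_WORDS, MARATHI_ENGLISH, toy_transliterate
)
print(english_terms)
```

The same function would cover the X-Hindi direction by passing a different dictionary and transliterator; the per-term dictionary lookup with a transliteration fallback is a common CLIR baseline and is assumed here rather than taken from the slides.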
IIT BOMBAY PARTICIPATION
- Marathi vertical
- Code integration and maintenance
- MWE identification
- Development of tracker
- Error analysis
- Relevance judgement
ACTION PLAN
- April 14, 2012: public release of the monolingual search engine for 5 languages (Bengali, Hindi, Marathi, Tamil, Telugu)
- August 15, 2012: public release of monolingual search for the remaining 4 languages (Assamese, Gujarati, Oriya, Punjabi) and of cross-lingual search for 5 languages (Bengali, Hindi, Marathi, Tamil, Telugu)
HORIZONTAL TASKS DISTRIBUTION

Horizontal Task               Responsible Institute
GUI                           CDAC Pune
Query Formulation             IIIT Hyderabad
Language/Domain Identifier    IIIT Hyderabad
Font-Transcoding              CDAC Pune
Crawling                      CDAC Pune
Information Extraction        AU-KBC
NE Identification             AU-KBC
MWE Identification            IIT Bombay
Ranking                       IIT Kharagpur
Indexing                      IIT Kharagpur
CMLifier                      IIT Kharagpur
Evaluation                    ISI Kolkata
DISTRIBUTION OF VERTICAL TASKS

Language    Responsible Institutes                    Coordinating Institute
Hindi       IIT Bombay, IIIT Hyderabad, CDAC Noida    CDAC Noida
Marathi     IIT Bombay, CDAC Pune                     IIT Bombay
Bengali     IIT Kharagpur, JU, ISI Kolkata            IIT Kharagpur
Punjabi     CDAC Noida
Tamil       AU-KBC, AU-CEG                            AU-KBC
Telugu      IIIT Hyderabad
Assamese    Gauhati University
Oriya       IIIT Bhubaneswar
Gujarati    DAIICT
KEY ACHIEVEMENTS
- Organized the Forum for Information Retrieval Evaluation (FIRE) in 2008 and 2010, and a workshop for CLIR evaluation for Indian languages
- Demonstrated a basic integrated version of the system at IJCNLP 2008 and ELITEX
- Media coverage by 'The Indian Express' newspaper and 'Hindustan Times'
- Development of a strong and connected research community around CLIR in Indian languages
- Publications in top IR and NLP forums