SANDHAN: Indian language search engine. SANDHAN Consortium Project: IIT Bombay (co-ordinator), CDAC Noida (co-coordinator), CDAC Pune, IIT Kharagpur.


1 SANDHAN: Indian language search engine

2 SANDHAN – CONSORTIUM PROJECT
- IIT Bombay (co-ordinator)
- CDAC Noida (co-coordinator)
- CDAC Pune
- IIT Kharagpur
- Jadavpur University
- ISI Kolkata
- IIIT Hyderabad
- AU KBC
- AU CEG
- Gauhati University
- DAIICT Gujarat
- IIIT Bhubaneswar
- TDIL

3 INTRODUCTION
- Cross Lingual Information Retrieval (CLIR) engine for Indian languages
- Input: query in one of nine Indian languages (Hindi, Marathi, Tamil, Telugu, Bengali, Punjabi, Assamese, Gujarati, Oriya)
- Output: in Hindi, English and the query language
- Currently in the second phase of the project
- Three new languages added in the second phase: Assamese, Gujarati, Oriya
- Built on top of the Nutch framework

4 SOFTWARE USED
- Nutch v0.9 – framework
- Hadoop – distributed crawling
- Lucene – indexing
- Moses/GIZA++ – training models
- Tomcat – deployment

5 [System architecture diagram. Component labels, in the order extracted: Fetcher, Web, Analyzer, MWE Lookup, NE Lookup, Domain Identifier, Language Identifier, Font Transcoder, Indexer, CMLifier, UNL Index, Snippet Translation, Summary Generation, Snippet Generation, Translation/Transliteration, MWE Lookup, NE Lookup, Analyzer, Query Formulation, Index, Information Extraction]
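The slides do not describe how the Language Identifier component works. As a rough illustration only, a first-pass step could narrow the candidate languages by Unicode script before any finer-grained model runs. The sketch below is an assumption, not the project's implementation: the names `detect_script` and `candidate_languages` and the script-to-language mapping are hypothetical, while the code point ranges are the standard Unicode blocks for these scripts.

```python
from collections import Counter

# Standard Unicode block ranges for the scripts of the nine project languages.
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Oriya": (0x0B00, 0x0B7F),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
}

# Hypothetical mapping from script to candidate languages. Scripts shared by
# two languages would need a further step (not shown) to pick one.
CANDIDATE_LANGUAGES = {
    "Devanagari": ["Hindi", "Marathi"],
    "Bengali": ["Bengali", "Assamese"],
    "Gurmukhi": ["Punjabi"],
    "Gujarati": ["Gujarati"],
    "Oriya": ["Oriya"],
    "Tamil": ["Tamil"],
    "Telugu": ["Telugu"],
}


def detect_script(text):
    """Return the script with the most characters in text, or None."""
    counts = Counter()
    for ch in text:
        code_point = ord(ch)
        for script, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                counts[script] += 1
                break
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def candidate_languages(text):
    """Return the languages that could be written in the dominant script."""
    return CANDIDATE_LANGUAGES.get(detect_script(text), [])
```

Script detection alone cannot separate Hindi from Marathi or Bengali from Assamese, which is one reason a dedicated identifier component would still be needed.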

6 RESOURCES DEVELOPED
- Language-specific analyzers
- Stop word list
- Bilingual dictionary (X-English, X-Hindi)
- NE list
- MWE list
- Transliteration models
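To show how some of these resources might fit together in cross-lingual query translation, here is a minimal sketch. It assumes a simple pipeline: drop stop words, look each remaining token up in the bilingual dictionary, and fall back to transliteration for tokens the dictionary does not cover. The slides do not confirm this flow. All data and names (`STOP_WORDS`, `BILINGUAL_DICT`, `TRANSLITERATION_TABLE`, `translate_query`) are hypothetical, and the lookup table is only a stand-in for a trained transliteration model.

```python
# Toy resources; every entry is illustrative and not taken from the project.
STOP_WORDS = {"के", "का", "की", "में"}
BILINGUAL_DICT = {"मंदिर": ["temple"], "इतिहास": ["history"]}
TRANSLITERATION_TABLE = {"दिल्ली": "delhi"}  # stand-in for a trained model


def transliterate(token):
    """Return the toy transliteration, or the token unchanged if unknown."""
    return TRANSLITERATION_TABLE.get(token, token)


def translate_query(query):
    """Translate a whitespace-tokenized Hindi query into English terms."""
    translated = []
    for token in query.split():
        if token in STOP_WORDS:
            continue  # stop words carry no retrieval value
        if token in BILINGUAL_DICT:
            translated.extend(BILINGUAL_DICT[token])  # keep all translations
        else:
            translated.append(transliterate(token))  # e.g. proper nouns
    return translated


print(translate_query("दिल्ली के मंदिर"))
```

With the toy data above, the query yields `['delhi', 'temple']`. A fuller version would presumably consult the NE and MWE lists before the token-by-token lookup.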

7 IIT BOMBAY PARTICIPATION
- Marathi vertical
- Code integration and maintenance
- MWE identification
- Development of tracker
- Error analysis
- Relevance judgement

8 ACTION PLAN
- Public release of the monolingual search engine for 5 languages on April 14th 2012: Bengali, Hindi, Marathi, Tamil, Telugu
- Public release of monolingual search for the remaining 4 languages and cross-lingual search for 5 languages on August 15th 2012
  - Monolingual: Assamese, Gujarati, Oriya, Punjabi
  - Cross-lingual: Bengali, Hindi, Marathi, Tamil, Telugu

9 HORIZONTAL TASKS DISTRIBUTION
Horizontal Task | Responsible Institute
GUI | CDAC Pune
Query Formulation; Language/Domain Identifier | IIIT Hyderabad
Font-Transcoding; Crawling | CDAC Pune
Information Extraction; NE Identification | AU-KBC
MWE Identification | IIT Bombay
Ranking; Indexing; CMLifier | IIT Kharagpur
Evaluation | ISI Kolkata

10 DISTRIBUTION OF VERTICAL TASKS
Language | Responsible Institutes | Coordinating Institute
Hindi | IIT Bombay, IIIT Hyderabad, CDAC Noida | CDAC Noida
Marathi | IIT Bombay, CDAC-Pune | IIT Bombay
Bengali | IIT Kharagpur, JU, ISI Kolkata | IIT Kharagpur
Punjabi | CDAC-Noida |
Tamil | AU-KBC, AU-CEG | AU-KBC
Telugu | IIIT Hyderabad |
Assamese | Gauhati University |
Oriya | IIIT Bhubaneswar |
Gujarati | DAIICT |

11 KEY ACHIEVEMENTS
- Organized the Forum for Information Retrieval Evaluation (FIRE) in 2008, 2010 and 2011, a workshop for CLIR evaluation for Indian languages
- Demonstrated a basic integrated version of the system at IJCNLP 2008 and ELITEX 2008
- Media coverage by 'The Indian Express' newspaper and 'Hindustan Times' (http://www.cfilt.iitb.ac.in/pb_1.JPG) (http://www.cfilt.iitb.ac.in/04_04_2009_010_007.jpg)
- Development of a strong and connected research community around CLIR in Indian languages
- Publications in top IR and NLP forums

