Published by Imogen Paul. Modified over 9 years ago.
1
SANDHAN: Indian language search engine
2
SANDHAN: CONSORTIUM PROJECT
- IIT Bombay (co-ordinator)
- CDAC Noida (co-coordinator)
- CDAC Pune
- IIT Kharagpur
- Jadavpur University
- ISI Kolkata
- IIIT Hyderabad
- AU-KBC
- AU-CEG
- Gauhati University
- DAIICT Gujarat
- IIIT Bhubaneswar
- TDIL
3
INTRODUCTION
- Cross Lingual Information Retrieval (CLIR) engine for Indian languages
- Input: query in one of nine Indian languages (Hindi, Marathi, Tamil, Telugu, Bengali, Punjabi, Assamese, Gujarati, Oriya)
- Output: in Hindi, English and the query language
- Currently in the second phase of the project
- Three new languages added in the second phase: Assamese, Gujarati, Oriya
- Built on top of the Nutch framework
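The input/output rule on this slide can be sketched in a few lines of Python. This is only an illustration of the rule as stated (results in Hindi, English and the query language); the function and constant names are hypothetical and are not taken from the Sandhan code base.

```python
# Query languages listed on the slide.
SUPPORTED_QUERY_LANGUAGES = (
    "Hindi", "Marathi", "Tamil", "Telugu", "Bengali",
    "Punjabi", "Assamese", "Gujarati", "Oriya",
)

# Output languages that are always offered, per the slide.
FIXED_OUTPUT_LANGUAGES = ("Hindi", "English")


def output_languages(query_language):
    """Return the languages in which results are offered for a query.

    Hypothetical helper: Hindi and English are always included, and the
    query language is appended when it is not already one of them.
    """
    if query_language not in SUPPORTED_QUERY_LANGUAGES:
        raise ValueError(f"unsupported query language: {query_language}")
    targets = list(FIXED_OUTPUT_LANGUAGES)
    if query_language not in targets:
        targets.append(query_language)
    return targets
```

For a Tamil query this yields Hindi, English and Tamil; for a Hindi query only Hindi and English, since the query language is already covered.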
4
SOFTWARE USED
- Nutch v0.9: framework
- Hadoop: distributed crawling
- Lucene: indexing
- Moses/GIZA++: training models
- Tomcat: deployment
5
[Figure: system component diagram. Component labels, in the order they appear: Fetcher, Web Analyzer, MWE Lookup, NE Lookup, Domain Identifier, Language Identifier, Font Transcoder, Indexer, CMLifier, UNL Index, Snippet Translation, Summary Generation, Snippet Generation, Translation/Transliteration, MWE Lookup, NE Lookup, Analyzer, Query Formulation, Index, Information Extraction]
6
RESOURCES DEVELOPED
- Language-specific analyzers
- Stop word list
- Bilingual dictionary (X-English, X-Hindi)
- NE (named entity) list
- MWE (multiword expression) list
- Transliteration models
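A minimal Python sketch of how resources like these could combine when translating a query: drop stop words, match multiword expressions first, fall back to the bilingual dictionary, and transliterate whatever is left. The ordering of the steps, all names, and the romanized toy entries are assumptions made for illustration; the slides do not describe the actual Sandhan algorithm, and the `transliterate` function is a stand-in for a trained model.

```python
# Toy, hypothetical resources (romanized Hindi to English), for illustration only.
STOP_WORDS = {"ke", "ka", "mein"}
MWE_DICTIONARY = {("paryatan", "sthal"): "tourist place"}
BILINGUAL_DICTIONARY = {"mandir": "temple", "paas": "near"}


def transliterate(token):
    """Placeholder for a trained transliteration model.

    Returns the token unchanged, because the toy input is already romanized.
    """
    return token


def translate_query(query):
    """Translate a query token by token: MWE list, then dictionary, then transliteration."""
    tokens = [t for t in query.lower().split() if t not in STOP_WORDS]
    translated = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in MWE_DICTIONARY:
            # Two-word expression found: translate it as a single unit.
            translated.append(MWE_DICTIONARY[pair])
            i += 2
        elif tokens[i] in BILINGUAL_DICTIONARY:
            translated.append(BILINGUAL_DICTIONARY[tokens[i]])
            i += 1
        else:
            # Unknown word (for example a named entity): transliterate it.
            translated.append(transliterate(tokens[i]))
            i += 1
    return " ".join(translated)
```

Checking the MWE list before the word-level dictionary keeps an expression such as "paryatan sthal" from being translated word by word. Only two-word expressions are handled here; a real MWE list would need longer matches.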
7
IIT BOMBAY PARTICIPATION
- Marathi vertical
- Code integration and maintenance
- MWE identification
- Development of tracker
- Error analysis
- Relevance judgement
8
ACTION PLAN
- Public release of the monolingual search engine for 5 languages on April 14th, 2012: Bengali, Hindi, Marathi, Tamil, Telugu
- Public release of monolingual search for the remaining 4 languages and cross-lingual search for 5 languages on August 15th, 2012:
  - Monolingual: Assamese, Gujarati, Oriya, Punjabi
  - Cross-lingual: Bengali, Hindi, Marathi, Tamil, Telugu
9
HORIZONTAL TASKS DISTRIBUTION
Horizontal Task               Responsible Institute
GUI                           CDAC Pune
Query Formulation             IIIT Hyderabad
Language/Domain Identifier
Font-Transcoding              CDAC Pune
Crawling
Information Extraction        AU-KBC
NE Identification
MWE Identification            IIT Bombay
Ranking                       IIT Kharagpur
Indexing
CMLifier
Evaluation                    ISI Kolkata
10
DISTRIBUTION OF VERTICAL TASKS
Language    Responsible Institutes                    Coordinating Institute
Hindi       IIT Bombay, IIIT Hyderabad, CDAC Noida    CDAC Noida
Marathi     IIT Bombay, CDAC-Pune                     IIT Bombay
Bengali     IIT Kharagpur, JU, ISI Kolkata            IIT Kharagpur
Punjabi     CDAC-Noida
Tamil       AU-KBC, AU-CEG                            AU-KBC
Telugu      IIIT Hyderabad
Assamese    Gauhati University
Oriya       IIIT Bhubaneswar
Gujarati    DAIICT
11
KEY ACHIEVEMENTS
- Organized the Forum for Information Retrieval Evaluation (FIRE) in 2008, 2010 and 2011, a workshop for CLIR evaluation for Indian languages
- Demonstrated a basic integrated version of the system at IJCNLP 2008 and ELITEX 2008
- Media coverage by 'The Indian Express' newspaper and 'Hindustan Times' (http://www.cfilt.iitb.ac.in/pb_1.JPG) (http://www.cfilt.iitb.ac.in/04_04_2009_010_007.jpg)
- Development of a strong and connected research community around CLIR in Indian languages
- Publications in top IR and NLP forums