Sanchay and other NLP Tools Himanshu Sharma, Sambhav Jain

Sanchay
Sanchay ⇔ संचय
- A collection of tools and APIs for language processing
- An open source platform
- Aimed especially at South Asian languages

Sanchay - Installation
- Platform independent: Windows/Linux
- Prerequisite: Sun (now Oracle) JDK 1.6
- Download the binaries
- Extract the .zip or .tgz archive
- Go to the extracted directory
- Ready!

Sanchay - Modules
- Editors
  - text, RTF, HTML
- Tree Creator
- Syntactic Annotation
- Alignment tools
  - Sentence
  - Word

Shallow Parser
- 9 Indian languages: Hindi, Kannada, Malayalam, Marathi, Tamil, Telugu, Bengali, Punjabi, Urdu
- Performs tokenization, morphological analysis, POS tagging, and chunking
- Runs on Linux
- ds/shallow_parser.php

Shallow Parser - Installation
- Dependencies: 'dos2unix' and 'unix2dos' must be installed
- Download and extract
- Install
- If libgdbm.so.2 does not exist in /usr/lib/, run:
  sudo cp /usr/lib/libgdbm.so.3 /usr/lib/libgdbm.so.2
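
The prerequisite checks on this slide can be run before installing. The following is a minimal sketch, not part of the Shallow Parser distribution: the function name `missing_prerequisites` and its `lib_dir` parameter are hypothetical, and the script only reports what is missing without copying anything.

```python
import os
import shutil


def missing_prerequisites(lib_dir="/usr/lib"):
    """Return a list of problems to fix before installing (hypothetical helper)."""
    problems = []
    # The slide names these two converters as required dependencies.
    for tool in ("dos2unix", "unix2dos"):
        if shutil.which(tool) is None:
            problems.append(f"{tool} is not on PATH")
    # The slide's workaround applies only when libgdbm.so.2 is absent.
    old_lib = os.path.join(lib_dir, "libgdbm.so.2")
    new_lib = os.path.join(lib_dir, "libgdbm.so.3")
    if not os.path.exists(old_lib):
        if os.path.exists(new_lib):
            problems.append(f"run: sudo cp {new_lib} {old_lib}")
        else:
            problems.append(f"neither {old_lib} nor {new_lib} exists")
    return problems


if __name__ == "__main__":
    for line in missing_prerequisites() or ["all prerequisites found"]:
        print(line)
```

Passing a different `lib_dir` lets the check be tried on a scratch directory before touching /usr/lib.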

TNT POS Tagger
- Train: tnt-para data.txt
  - Generates data.123 and data.lex
- Tag: tnt data file
- Evaluate: tnt-diff goldfile taggedfile
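
To make the evaluation step concrete, here is a rough sketch of a per-token accuracy computation over a gold file and a tagged file. It is not how tnt-diff is implemented. It assumes one token per line with the token in the first column and the tag in the second, separated by whitespace, and lines starting with `%%` treated as comments. The function names and the sample tokens and tags are illustrative only.

```python
def read_tagged(lines):
    """Parse (token, tag) pairs, skipping blank lines and '%%' comment lines."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("%%"):
            continue
        fields = line.split()
        # Assumed layout: first column is the token, second is its tag.
        pairs.append((fields[0], fields[1]))
    return pairs


def tagging_accuracy(gold_lines, tagged_lines):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    gold = read_tagged(gold_lines)
    tagged = read_tagged(tagged_lines)
    if len(gold) != len(tagged):
        raise ValueError("files contain different numbers of tokens")
    if not gold:
        return 0.0
    correct = sum(1 for (_, g), (_, t) in zip(gold, tagged) if g == t)
    return correct / len(gold)


if __name__ == "__main__":
    gold = ["%% gold sample", "raam NNP", "ghar NN", "gayaa VM", ". SYM"]
    tagged = ["raam NNP", "ghar NN", "gayaa VAUX", ". SYM"]
    print(tagging_accuracy(gold, tagged))  # one of four tags differs
```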

CRF++ - Chunker
- Separate binaries for Linux as well as Windows
- Installation
  - ./configure
  - make
  - make install

CRF++ - Chunker
- Train: ./crf_learn template train_file model
- Tag/Test: ./crf_test -m model testfile
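
The `template` and `train_file` arguments are plain text files. The sketch below shows one way to produce them for chunking, assuming the usual CRF++ layout: one token per line, whitespace-separated columns with the label in the last column, and a blank line between sentences. The helper name `to_crfpp`, the feature choices in `TEMPLATE`, and the B-NP/B-VP labels are illustrative assumptions, not taken from the slides.

```python
def to_crfpp(sentences):
    """Render sentences as CRF++ rows ('word POS chunk'), blank line between sentences."""
    blocks = []
    for sentence in sentences:
        rows = [" ".join(token) for token in sentence]
        blocks.append("\n".join(rows))
    return "\n\n".join(blocks) + "\n"


# A small feature template; %x[row,col] is relative to the current token,
# where column 0 is the word and column 1 is the POS tag.
TEMPLATE = """\
# Unigram features
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[0,1]
U04:%x[-1,1]/%x[0,1]

# Bigram feature over adjacent output labels
B
"""

if __name__ == "__main__":
    sample = [
        [("He", "PRP", "B-NP"), ("runs", "VBZ", "B-VP")],
        [("Dogs", "NNS", "B-NP"), ("bark", "VBP", "B-VP")],
    ]
    print(to_crfpp(sample))
    print(TEMPLATE)
```

Writing the two strings to files named `train_file` and `template` gives inputs in the shape the train command on this slide expects.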

Malt Parser (dependency parsing)
- Train: java -jar malt.jar -c model -i inputfile -m learn
- Test: java -jar malt.jar -c model -i testfile -o output -m parse
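
MaltParser reads treebank data in a tabular format. The sketch below writes a sentence in the 10-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), which is assumed here as the input format. The helper name `to_conllx` and the SBJ/ROOT labels are hypothetical, and unused columns are filled with `_`.

```python
def to_conllx(sentence):
    """Render one sentence as CoNLL-X rows.

    sentence: list of (form, postag, head, deprel) tuples, where head is the
    1-based index of the governing token and 0 marks the root.
    """
    rows = []
    for i, (form, postag, head, deprel) in enumerate(sentence, start=1):
        cols = [
            str(i),      # ID
            form,        # FORM
            "_",         # LEMMA (not provided in this sketch)
            postag,      # CPOSTAG (coarse tag, reused from POSTAG here)
            postag,      # POSTAG
            "_",         # FEATS
            str(head),   # HEAD
            deprel,      # DEPREL
            "_",         # PHEAD
            "_",         # PDEPREL
        ]
        rows.append("\t".join(cols))
    # A blank line terminates each sentence.
    return "\n".join(rows) + "\n\n"


if __name__ == "__main__":
    print(to_conllx([("John", "NNP", 2, "SBJ"), ("sleeps", "VBZ", 0, "ROOT")]), end="")
```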

Other NLP Tools
- Toolkits
  - NLTK (Python)
  - OpenNLP (Java)
  - LingPipe (Java)
- Frameworks
  - GATE
  - Apache UIMA