CAS-IA System Description
Jinhua Du
CNGL
July 23, 2008
Outline
–Hardware in IA
–Pre-process & Data
–MT System Configuration for Evaluation
–Achievements
–Conclusions
Hardware
Machines
  Type        Operating System  Number  CPU                 RAM
  Desktop PC  Windows 2003      9       Pentium 4, 3.0 GHz  2.0 GB
  Server      Linux (Ubuntu)    1       Xeon 2.0 GHz ×4     16.0 GB
Parallel Computing
–Condor
–Grid Computing Module developed by ASR group
Pre-process & Data
Pre-processing
–encoding conversion & filtering
–punctuation and number conversion (full-shaped -> half-shaped, etc.)
–case conversion (only the initial letter of the first word), abbreviation processing
–Chinese word segmentation (ICT or IA tool), English tokenization
Data for NIST
–Parallel: 3.4M (up to 10M if the UN corpus is added)
–Monolingual: 3.4M + 9.6M (Gigaword 1 & 2) + 1.4M (Gigaword 3) = 14.4M
Data for IWSLT
–Parallel: BTEC (20K or 40K); LDC
–Monolingual: BTEC; Gigaword
–Data filter: keep only the data highly correlated with the task; very important for the spoken-language evaluation (more and better data, better performance)
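The full-shaped to half-shaped conversion listed under Pre-processing can be sketched as below. This is a minimal illustration, not the CAS-IA tool itself: the function name `to_halfwidth` is hypothetical, and the sketch assumes that only the full-width ASCII variants (U+FF01 to U+FF5E) and the ideographic space (U+3000) need mapping. CJK-specific punctuation such as "。" is left untouched.

```python
def to_halfwidth(text: str) -> str:
    """Map full-width ASCII variants and the ideographic space to half-width.

    Hypothetical helper for illustration; the real pre-processing may
    cover more characters than this sketch does.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                # ideographic (full-width) space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:    # full-width '!' through '~'
            out.append(chr(code - 0xFEE0))  # fixed offset to the ASCII range
        else:
            out.append(ch)                # everything else is kept as-is
    return "".join(out)


print(to_halfwidth("ＮＩＳＴ　２００８，ＭＴ！"))  # prints: NIST 2008,MT!
```

The fixed offset works because the full-width block mirrors the printable ASCII range in the same order, so no lookup table is needed for these characters.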
System Configuration
Modules
–Pre-processing
–Alignment Post-processing & Models Generation
–Decoding & MER Training
–System Combination & Post-Processing
Achievements (zh-en)
The 3rd MT Symposium in China (rank 3)
–Limited (830K pairs)
–Unlimited (3M pairs)
Achievements (zh-en)
NIST MT Eval
  System                 BLEU-4  IBM BLEU
  Primary (combination)
  HPB
  STTB
  PB
Achievements (zh-en)
IWSLT2008
–More systems to be combined
  2 PB systems developed by CASIA
  Moses
  SAMT (CMU)
  Hierarchical PB
  BTG-based system (Xiong)
–Better performance
            (bleu+meteor)/2  bleu  meteor  (bleu+meteor)/2  bleu  meteor
  tch.CRR
  nlpr.CRR
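The (bleu+meteor)/2 column above is the plain average of the two metrics. A minimal sketch of how systems could be ordered by it follows. The function names and the system names and scores in the usage lines are hypothetical placeholders, not results from the evaluation.

```python
from typing import Dict, List, Tuple


def combined_score(bleu: float, meteor: float) -> float:
    """Average of BLEU and METEOR, i.e. (bleu + meteor) / 2."""
    return (bleu + meteor) / 2.0


def rank_systems(scores: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return system names sorted by combined score, best first.

    `scores` maps a system name to its (bleu, meteor) pair.
    """
    return sorted(scores, key=lambda name: combined_score(*scores[name]),
                  reverse=True)


# Hypothetical placeholder values, for illustration only.
example = {"system_a": (0.40, 0.60), "system_b": (0.45, 0.50)}
print(rank_systems(example))
```

Averaging gives both metrics equal weight, so a system with a slightly lower BLEU can still rank first if its METEOR is clearly higher.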
Conclusions
–More and better data, better performance
–System combination is very helpful for improving performance
–Evaluation is different from theoretical research: empirical methods and tricks are usually more effective
–For a better rank, prepare in advance and build a temporary team for the evaluation
–Evaluation is a horrible thing for students: more time, more energy and no paper (a joke, but true)
–Develop systems for application purposes
Thanks