
1 Korea Maritime and Ocean University NLP Jung Tae LEE inverse90@nate.com

2

3 1. Introduction
For a domain-specific Statistical Machine Translation setting:
- A high-quality SMT system can only be built with large quantities of parallel text.
- Even a modest amount of extracted in-domain data is good enough to boost the performance of a system trained on out-of-domain text.
Proposed method:
- Do not worry about the position of sentences in the text.
- Use a customized metric which combines different similarity criteria.

4 2. What is SMT?
SMT (Statistical Machine Translation):
- One approach to MT (Machine Translation).
- It does not care about hand-written rules of the language.
- It needs little manpower, but there is never enough parallel text for an SMT system.
Representative services:
1. Google Translate: http://translate.google.com
2. Bing Translator: http://www.microsofttranslator.com

5 SMT architecture (diagram): a bilingual parallel corpus goes through statistical alignment and analysis to build the Translation Model; a target-language corpus goes through statistical language modeling to build the Language Model; the Decoder takes a source sentence f1 ... fJ and finds the target sentence e1 ... eI with the best score.
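As a hedged sketch of the decoding criterion implied by this diagram (the standard noisy-channel formulation; the slide itself shows no formula):

$$\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e}\; \underbrace{P(f \mid e)}_{\text{translation model}}\; \underbrace{P(e)}_{\text{language model}}$$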

6

7 IBM Model 1: the basic model in SMT
- The IBM models are an instance of a noisy-channel model.
- Phrase-based statistical MT uses the alignment algorithms only to find the best alignment for a sentence pair (F, E), in order to help extract a set of phrases.
- The parameters of the IBM models are estimated using the expectation-maximization (EM) algorithm.
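For reference, a hedged sketch of the IBM Model 1 likelihood (the standard textbook form, not given on the slide), where $t(f_j \mid e_i)$ is the word translation probability, $m$ and $l$ are the source and target sentence lengths, position $0$ is the NULL word, and $\epsilon$ is a normalization constant:

$$P(f \mid e) = \frac{\epsilon}{(l+1)^{m}} \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j \mid e_i)$$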

8 EM algorithm (general recipe)
(1) Initialization: assign random (or uniform) probabilities.
(2) Expectation step: update each record's weight (expected counts) under the current parameters.
(3) Maximization step: re-estimate the model parameters (e.g., mean and variance in a clustering model).
(4) Stopping criterion: calculate the log-likelihood; if the value has saturated, exit, else go to step (2).
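In symbols, a hedged statement of the two alternating steps (standard EM over latent variables $Z$, observed data $X$, and parameters $\theta$; not specific to this slide):

$$\text{E-step: } Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X,\, \theta^{(t)}}\big[\log p(X, Z \mid \theta)\big] \qquad \text{M-step: } \theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$$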

9 EM algorithm: a toy example
(1) Initialization. Parallel corpus: "green house" ↔ "casa verde", "the house" ↔ "la casa".
Vocabularies: E = {green, house, the}, S = {casa, verde, la}.
Start with uniform translation probabilities t(s | e) = 1/3.

10 EM algorithm (diagram): for each sentence pair ("green house" ↔ "casa verde", "the house" ↔ "la casa") the possible word alignments are enumerated and scored with the current t values.

11 EM algorithm (diagram, continued): the alignment probabilities are normalized per sentence pair and accumulated as fractional counts tcount(s, e).

12 Compute the MLE probability parameters by normalizing the tcounts for each target word to sum to 1: total(green) = 1, total(house) = 2, total(the) = 1.

13 EM algorithm (diagram, after re-estimation). Note that the two correct alignments are now higher in probability than the two incorrect alignments.
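To make the toy example concrete, here is a minimal Python sketch of IBM Model 1 EM training on the two sentence pairs above. It is illustrative only (my own code, not the presenter's), without a NULL word and with a fixed number of iterations:

```python
# Minimal IBM Model 1 EM sketch for the toy corpus of slides 9-13 (illustrative only).
from collections import defaultdict

corpus = [
    (["green", "house"], ["casa", "verde"]),
    (["the", "house"], ["la", "casa"]),
]

e_vocab = {e for e_sent, _ in corpus for e in e_sent}
s_vocab = {s for _, s_sent in corpus for s in s_sent}

# (1) Initialization: uniform translation probabilities t(s | e) = 1/3
t = {(s, e): 1.0 / len(s_vocab) for s in s_vocab for e in e_vocab}

for _ in range(5):
    # (2) E-step: collect expected (fractional) counts tcount(s, e) and total(e)
    tcount = defaultdict(float)
    total = defaultdict(float)
    for e_sent, s_sent in corpus:
        for s in s_sent:
            z = sum(t[(s, e)] for e in e_sent)   # normalization over the target words
            for e in e_sent:
                c = t[(s, e)] / z                # fractional count for this alignment link
                tcount[(s, e)] += c
                total[e] += c
    # (3) M-step: re-estimate t(s | e) by normalizing the counts per target word
    for (s, e), c in tcount.items():
        t[(s, e)] = c / total[e]

# Correct pairs such as t(casa | house) now outweigh the incorrect alternatives.
for s, e in sorted(tcount):
    print(f"t({s} | {e}) = {t[(s, e)]:.3f}")
```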

14 3. Introduced method: architecture of the parallel sentence extraction process (diagram).

15 Finding candidate articles
- Candidate article pairs are found via the interlanguage links provided by Wikipedia itself, so there is no need to worry about document alignment.
- Input queries contain the 100 most frequent mountaineering keywords in the Text+Berg corpus (the training-data corpus); this is done to avoid false positives.
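As a hedged illustration of how such interlanguage links can be retrieved through the MediaWiki API (my own sketch; the slides do not say which tooling was actually used), assuming the German-French article pairs discussed later:

```python
# Illustrative only: fetch the French counterpart of a German Wikipedia article
# through MediaWiki interlanguage links (action=query, prop=langlinks).
import requests

def french_counterpart(german_title):
    resp = requests.get(
        "https://de.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "langlinks",
            "titles": german_title,
            "lllang": "fr",      # restrict to the French interlanguage link
            "format": "json",
        },
        timeout=10,
    )
    pages = resp.json()["query"]["pages"]
    for page in pages.values():
        for link in page.get("langlinks", []):
            return link["*"]     # title of the linked French article
    return None                  # no French counterpart found

print(french_counterpart("Matterhorn"))
```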

16 Finding parallel segments in Wikipedia articles
The general idea is to split each sentence into clauses containing a single full verb.
Similarity score criteria (a combined-score sketch follows below):
1. METEOR score: it can credit pronoun variants (like je and j').
2. Number of aligned content words: it raises the chance of candidates with similar length.
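A minimal, hedged sketch of how the two criteria might be combined into a single clause-similarity score. It assumes NLTK's meteor_score (which expects tokenized input and needs the WordNet data downloaded); the stopword list, the content-word overlap, and the weighting are my own placeholders, not the combination actually used in the presented method:

```python
# Illustrative sketch only: METEOR plus an aligned-content-word bonus.
from nltk.translate.meteor_score import meteor_score  # requires nltk.download("wordnet")

STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "is", "to"}  # placeholder list

def content_words(tokens):
    return {w.lower() for w in tokens if w.lower() not in STOPWORDS}

def clause_similarity(translated_clause, candidate_clause):
    hyp = translated_clause.split()       # automatic translation of the source clause
    ref = candidate_clause.split()        # candidate target-language clause
    meteor = meteor_score([ref], hyp)
    aligned = len(content_words(hyp) & content_words(ref))
    return meteor + 0.05 * aligned        # placeholder weighting of the two criteria

score = clause_similarity("he reached the summit in 1865",
                          "he first reached the summit in 1865")
print(f"similarity = {score:.2f}")        # keep the pair only above a threshold, e.g. 0.25
```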

17 Finding parallel segments in Wikipedia articles (chart): the distribution of the extracted clause pairs at different thresholds.

18 Finding parallel segments in Wikipedia articles (chart): the average sentence length for different score ranges.

19 Experiments and Results
225,000 parallel clause pairs were extracted (from 39,000 parallel articles).
Manual evaluation of a set of 200 automatically aligned clauses with similarity scores above 0.25:
- Perfect translation: 39%
- With extra segment: 26%
- Misalignment: 35%
However, given the high degree of parallelism between the clauses in the middle class ("with extra segment"), counting them as correct yields a precision of 65%.

20 4. MT Evaluation
Settings:
- Only pairs with a similarity score above 0.35 are used.
- The SMT systems are trained with the Moses toolkit.
- The baseline system is the same one used for the automatic translations required in the extraction step.
- Translation performance was measured using the BLEU evaluation metric.
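As a hedged illustration of how BLEU is typically computed on a held-out test set (here with the sacrebleu package; the slides do not name the specific scoring tool):

```python
# Illustrative only: corpus-level BLEU with sacrebleu (pip install sacrebleu).
import sacrebleu

hypotheses = [
    "the expedition reached the summit in 1865",
    "the weather turned bad on the north face",
]
references = [
    "the expedition reached the summit in 1865",
    "the weather worsened on the north face",
]

# sacrebleu takes the system outputs and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")   # score is on the usual 0-100 scale
```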

21 BLEU score
The basic premise of BLEU is that the closer a machine translation is to a translation produced by a professional human translator, the better the machine translation's quality.
(Diagram: professional translation and machine translation output are compared.)
*P-value (significance probability): assuming the null hypothesis is true, the probability of obtaining a value at least as far toward the alternative hypothesis as the observed sample mean. In this paper it is 0.05-0.06.
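For reference, the standard BLEU definition that this premise leads to (not shown on the slide): with modified n-gram precisions $p_n$, uniform weights $w_n = 1/N$ (usually $N = 4$), candidate length $c$, and reference length $r$,

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\Big(\sum_{n=1}^{N} w_n \log p_n\Big), \qquad \mathrm{BP} = \begin{cases} 1 & c > r \\ e^{\,1 - r/c} & c \le r \end{cases}$$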

22 SMT results for German-French (summary table)
- If the data is of good quality, it can improve the performance of the system; otherwise it significantly deteriorates it.
- The quantity of the data is not the decisive factor for the performance change, but rather the quality of the data.

23 Korea Maritime and Ocean University NLP Jung Tae LEE inverse90@nate.com

