Published by Gerald Dawson. Modified over 9 years ago.
1
Extracting Parallel Texts from Massive Web Documents
Chikayama Taura lab., M2, Dai Saito
2007/4/20
2
Purpose
Parallel corpus: a set of parallel texts
Parallel texts: translated pairs of texts
Construct parallel corpora from the Web
Example (English / Japanese):
  One thing was certain, that the WHITE kitten had had nothing to do with it. / 一つ確実なのは、白い子ネコはなんの関係もなかったということ。
  --it was the black kitten's fault entirely. / ――もうなにもかも、黒い子ネコのせいだったのです。
3
Parallel Texts
A useful resource for:
  Statistical machine translation
  Dictionary construction
But existing corpora are not enough:
  Genre: public documents, software manuals
  Language: limited (English-French)
  Amount: small
  Building them requires large human resources
4
Parallel Texts from the Web
Extracting parallel texts from massive Web documents:
  Very large amount of text
  Varied languages
  Small human resources required
5
Problems
To construct a parallel corpus from the Web:
  1. Extract candidate pairs
  2. Judge whether they really are parallel texts
How to detect parallel texts automatically
How to reduce the calculation cost
6
Agenda
Introduction
Related work
Proposal
  Detect parallel texts
  Extract candidate pairs
Experiment
Conclusion
7
STRAND [Resnik et al. 03]
URL matching:
  1. Remove language-specific substrings (LSSs); for Japanese: ja, jp, jpn, euc, sjis, …
  2. Match the LSS-removed URLs
  3. Make a detailed comparison
Example pair:
  http://www.hostname.com/index.html.en
  http://www.hostname.com/index.html.ja
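The first two URL-matching steps can be sketched as follows. This is a minimal illustration, not STRAND's actual implementation: the `LSS` set, the delimiter-based tokenization, and the function names `strip_lss` and `candidate_pairs` are assumptions made for this example. A token-level filter like this is crude (it would also strip `jp` from a `.co.jp` host name), and step 3, the detailed comparison, is not shown.

```python
import re
from collections import defaultdict
from itertools import combinations

# Assumed set of language-specific substrings; the slide lists only a few Japanese ones.
LSS = {"en", "eng", "english", "ja", "jp", "jpn", "euc", "sjis"}


def strip_lss(url):
    """Drop URL tokens that are language-specific substrings (delimiters are kept)."""
    tokens = re.split(r"([/._-])", url)
    return "".join("" if token.lower() in LSS else token for token in tokens)


def candidate_pairs(urls):
    """Pair up URLs that become identical once the LSSs are removed."""
    groups = defaultdict(list)
    for url in urls:
        groups[strip_lss(url)].append(url)
    pairs = []
    for group in groups.values():
        pairs.extend(combinations(group, 2))
    return pairs


urls = [
    "http://www.hostname.com/index.html.en",
    "http://www.hostname.com/index.html.ja",
    "http://www.hostname.com/about.html",
]
for pair in candidate_pairs(urls):
    print(pair)
```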
8
DOM Tree Alignment [Lei et al. 06]
Convert HTML to a DOM tree
Search for linked pages using the "alt" attribute or the link name ("English version", "In English", etc.)
Parallel link: a pair of the same hyperlinks in parallel texts
9
Agenda
Introduction
Related work
Proposal
  Detect parallel texts
  Extract candidate pairs
Experiment
Conclusion
10
Outline
[Diagram labels: Web, Crawler, Extract candidate pairs, Detect parallel texts]
11
Detecting parallel texts [Fukushima et al. 06]
Low comparison cost, without using HTML information
  1. Extract words (nouns)
  2. Convert them to semantic IDs
  3. Compare
12
Semantic ID Conversion
Construct a graph from dictionaries
Treat Japanese and English texts at the same level
Number of semantic IDs: about 10,000
Example:
  ID 1: Sense, 感覚, 意味
  ID 2: Movie, Film, 映画
  ID 3: Hobby, Taste, 趣味, 味
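One way to read "constructing a graph from dictionaries" is that words linked by dictionary entries fall into the same connected component, and each component receives one semantic ID. The sketch below follows that reading with a small union-find. Both the connected-component interpretation and the individual entries in `ENTRIES` are assumptions chosen so that the three groups on the slide come out; the slide does not show the actual edges or the ID assignment rule.

```python
def assign_semantic_ids(entries):
    """Give every connected group of words one shared integer ID (assumed scheme)."""
    parent = {}

    def find(word):
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving
            word = parent[word]
        return word

    # Each dictionary entry links an English word and a Japanese word.
    for english, japanese in entries:
        root_a, root_b = find(english), find(japanese)
        if root_a != root_b:
            parent[root_b] = root_a

    ids_by_root = {}
    semantic_ids = {}
    for word in list(parent):
        root = find(word)
        if root not in ids_by_root:
            ids_by_root[root] = len(ids_by_root) + 1
        semantic_ids[word] = ids_by_root[root]
    return semantic_ids


# Hypothetical dictionary entries, picked to reproduce the groups on the slide.
ENTRIES = [
    ("sense", "感覚"), ("sense", "意味"),
    ("movie", "映画"), ("film", "映画"),
    ("hobby", "趣味"), ("taste", "趣味"), ("taste", "味"),
]
ids = assign_semantic_ids(ENTRIES)
print(ids["movie"] == ids["film"])
```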
13
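Texts to Vector
Example text: 辞書を使ってテキストを数列に変える。 (Convert a text into a number sequence using a dictionary.)
Semantic IDs of the nouns: テキスト (text) → 955, 辞書 (dictionary) → 1704, 数列 (number sequence) → 3173
In text order: 1704, 955, 3173
After sorting: (955, 1704, 3173), plus position information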
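The conversion on this slide can be sketched in a few lines. The function name `text_to_vector` is an assumption; the noun list is taken as given (morphological analysis is not shown), and the position information mentioned on the slide is omitted for brevity.

```python
def text_to_vector(nouns, semantic_ids):
    """Map nouns to semantic IDs and return the IDs sorted; unknown nouns are skipped."""
    return sorted(semantic_ids[noun] for noun in nouns if noun in semantic_ids)


# IDs taken from the slide's example.
SEMANTIC_IDS = {"テキスト": 955, "辞書": 1704, "数列": 3173}

# Nouns in the order they appear in the example sentence.
nouns = ["辞書", "テキスト", "数列"]
print(text_to_vector(nouns, SEMANTIC_IDS))  # [955, 1704, 3173]
```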
14
Comparison
tscore (translation score)
T1: (106, 335, 455, 567, 1704, 3173, 7421)
T2: (335, 567, 567, 1704, 4014, 5449, 7421)
The score counts up 0, 1, 2, 3, 4 as matching IDs are found; final score = 4
tscore = 4/(7+7)
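Because both vectors are sorted, the matching IDs can be counted in a single merge-style pass. The sketch below assumes that a match consumes one element from each side (so the duplicated 567 in T2 matches only once), which is consistent with the slide's result of 4; the function name `tscore` mirrors the slide, but the code itself is illustrative.

```python
def tscore(t1, t2):
    """Count IDs shared by two sorted ID lists and normalize by their total length."""
    i = j = matches = 0
    while i < len(t1) and j < len(t2):
        if t1[i] == t2[j]:
            matches += 1
            i += 1
            j += 1
        elif t1[i] < t2[j]:
            i += 1
        else:
            j += 1
    total = len(t1) + len(t2)
    return matches / total if total else 0.0


T1 = [106, 335, 455, 567, 1704, 3173, 7421]
T2 = [335, 567, 567, 1704, 4014, 5449, 7421]
print(tscore(T1, T2))  # 4 / (7 + 7)
```

The single pass is linear in the length of the two vectors, which fits the low comparison cost the method aims for.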
15
tscore threshold
Dataset: Fry Corpus [Fry 05], 400 pairs
[Figure: F-measure vs. tscore threshold]
tscore threshold: 0.102
Speed: 200,000 pairs/sec
16
Agenda
Introduction
Related work
Proposal
  Detect parallel texts
  Extract candidate pairs
Experiment
Conclusion
17
Extract candidate pairs
Two costs: the calculation cost of each comparison, and the calculation cost of extracting parallel texts
Number of comparisons: n^2
URL matching is too strict
  Japanese and English: 90,000,000 URLs → 4,000 URL pairs → 1,000 real pairs
18
Calculation Cost Reduction
→ Reduce the number of comparisons
Distance score: tscore
Compare only texts that are close to each other
The two texts of a parallel pair should be at equal distance from a sample text
[Diagram labels: English, Japanese, Sample]
19
Calculation Cost Reduction
Flow:
  1. Select sample texts (far fewer than n)
  2. Calculate the distance score between each text and the sample texts
  3. Classify each text by its top m scores
  4. Compare only texts in the same group
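The four-step flow can be sketched as below. This is one possible reading, not the author's implementation: it assumes each text is assigned to the groups of its `top_m` highest-scoring samples, that samples are drawn at random from all texts, and that texts are already sorted semantic-ID lists. The names `candidate_pairs`, `num_samples`, and `top_m` are introduced for this example only.

```python
import random
from collections import defaultdict
from itertools import product


def tscore(t1, t2):
    """Shared-ID count of two sorted ID lists, divided by their total length."""
    i = j = matches = 0
    while i < len(t1) and j < len(t2):
        if t1[i] == t2[j]:
            matches, i, j = matches + 1, i + 1, j + 1
        elif t1[i] < t2[j]:
            i += 1
        else:
            j += 1
    total = len(t1) + len(t2)
    return matches / total if total else 0.0


def candidate_pairs(en_texts, ja_texts, num_samples, top_m, seed=0):
    """Return (English name, Japanese name) pairs that share at least one group."""
    # Step 1: select sample texts (here: at random from all texts).
    pool = list(en_texts.values()) + list(ja_texts.values())
    samples = random.Random(seed).sample(pool, num_samples)

    def top_groups(vector):
        # Steps 2 and 3: score against every sample, keep the top m sample indices.
        ranked = sorted(range(len(samples)),
                        key=lambda s: tscore(vector, samples[s]), reverse=True)
        return ranked[:top_m]

    groups = defaultdict(lambda: ([], []))
    for name, vector in en_texts.items():
        for g in top_groups(vector):
            groups[g][0].append(name)
    for name, vector in ja_texts.items():
        for g in top_groups(vector):
            groups[g][1].append(name)

    # Step 4: only texts in the same group become comparison candidates.
    pairs = set()
    for en_names, ja_names in groups.values():
        pairs.update(product(en_names, ja_names))
    return pairs


en = {"e1": [1, 2, 3], "e2": [10, 11, 12]}
ja = {"j1": [1, 2, 4], "j2": [10, 11, 13]}
print(sorted(candidate_pairs(en, ja, num_samples=4, top_m=2)))
```

With `top_m` equal to the number of samples, every text lands in every group and the method degenerates to the all-to-all comparison; smaller values trade missed pairs for fewer comparisons.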
20
Sampling
Number of samples: a trade-off between
  Calculation cost
  Accuracy (low risk of mislabeling)
Methods to select samples:
  Random
  k-means
21
k-means
  1. Select k samples
  2. Classify all texts
  3. Calculate centers
  4. Re-classify
[Diagram: example with k=2]
22
Calculation of tscore in k-means
Normal (text vs. text):
  Text1: (106, 335, 455, 567, 1704, 3173, 7421)
  Text2: (335, 567, 567, 1704, 4014, 5449, 7421)
  tscore = 4/(7+7)
k-means (text vs. cluster average):
  Text1: (106, 335, 455, 567, 1704, 3173, 7421)
  Average1: ((567, 0.2), (4014, 0.14), (7421, 0.5), …)
  tscore = (0.2+0.5)
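The k-means variant of the score can be read as summing the cluster-average weights of the IDs that also occur in the text. The sketch below shows only that scoring step; how the averages are computed is not given on the slide and is not reproduced here. The name `centroid_tscore` and the choice to count each ID once are assumptions, and `AVERAGE1` contains only the three entries visible on the slide.

```python
def centroid_tscore(text_ids, centroid):
    """Sum the centroid weights of the semantic IDs that occur in the text."""
    # Each distinct ID is counted once (an assumption; the slide's text has no duplicates).
    return sum(centroid.get(semantic_id, 0.0) for semantic_id in set(text_ids))


TEXT1 = [106, 335, 455, 567, 1704, 3173, 7421]
# Only the entries shown on the slide; the remaining weights are elided there.
AVERAGE1 = {567: 0.2, 4014: 0.14, 7421: 0.5}

# 567 and 7421 occur in TEXT1, 4014 does not, so the score is 0.2 + 0.5.
print(centroid_tscore(TEXT1, AVERAGE1))
```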
23
Converting HTML on the Web
Guess the language and encoding (English, SJIS, EUC-JP, UTF-8)
Convert the character code
Remove HTML tags
Morphological analysis → pick up nouns
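The first three steps can be approximated with the Python standard library alone, as sketched below. The trial-decoding order, the "any non-ASCII character means Japanese" rule, and the function name `html_to_text` are simplifying assumptions, not the method used in the talk; trial decoding in particular can misidentify encodings on real pages. Morphological analysis and noun extraction need an external analyzer and are not shown.

```python
from html.parser import HTMLParser

# Candidate encodings, tried in order (an assumed, simplistic guessing strategy).
CANDIDATE_ENCODINGS = ("utf-8", "euc_jp", "shift_jis")


class _TextExtractor(HTMLParser):
    """Collect text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(raw):
    """Decode raw HTML bytes, strip tags, and guess the language ("en" or "ja")."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            html = raw.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    else:
        html = raw.decode("utf-8", errors="replace")

    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    text = " ".join("".join(extractor.parts).split())

    language = "en" if text.isascii() else "ja"
    return language, text


print(html_to_text(b"<p>Hello <b>world</b></p>"))
```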
24
Agenda
Introduction
Related work
Proposal
  Detect parallel texts
  Extract candidate pairs
Experiment
Conclusion
25
Experiment
Calculation cost
Accuracy vs. calculation time
Clustering: k-means
26
Environment
Dataset: Fry Corpus [Fry 05]
  Corpus of Japanese-English news pages
  HTML converted to semantic IDs in advance
Machine:
  CPU: Xeon 2.4GHz Dual
  Memory: 2GB
  OS: Linux (Debian)
27
Calculation Cost
Dataset: Fry Corpus, 200 - 6400 pairs
Methods compared: normal (all-to-all) and random sampling (Top3)
As the number of texts grows, the gap becomes wider
Low cost with n^2 samples
28
Accuracy vs. Calculation time
Dataset: Fry Corpus, 400 pairs; random sampling
As the number of samples grows:
  Misclassification ratio → high
  Execution time → low
Trade-off between misclassification ratio and execution time
29
Sample selection with k-means
Accuracy and execution time with k-means
Flow:
  1. Random sampling (number of samples: √n)
  2. Calculate the centers and re-sample
  3. Measure the misclassification ratio and execution time
30
Evaluation of k-means
Low misclassification ratio → highly biased

pairs  method   misclassification  calculation time [sec]
200    random   21                 0.15
200    k-means  4                  0.32
400    random   51                 0.54
400    k-means  7                  1.18
31
Agenda
Introduction
Related work
Proposal
  Detect parallel texts
  Extract candidate pairs
Experiment
Conclusion
32
Conclusion and Future work
Parallel texts from the Web
  Detecting parallel texts
  Extracting candidate pairs
    Random sampling
    k-means
33
Future work
Better clustering methods
  Hierarchical
Dimension reduction
  About 10,000 dimensions is too many
Processing real HTML texts from the Web
34
Thank you for your attention!