Published by Gordon Pearson. Modified over 9 years ago.
1
Intelligent Database Systems Lab, National Yunlin University of Science and Technology (國立雲林科技大學)
Translation of Web Queries Using Anchor Text Mining
Advisor: Dr. Hsu
Graduate: Wen-Hsiang Hu
Authors: Wen-Hsiang Lu
ACM, June 2002
2
Outline
  Motivation
  Objective
  Introduction
  Anchor Text Mining
  Probabilistic Inference Model
  Query Translation System
  Experiments
  Discussion
  Conclusion
  Personal Opinion
3
Motivation
One of the existing difficulties in cross-language information retrieval (CLIR) and Web search is the lack of appropriate translations of new terminology and proper names.
4
Objective
Automatically extract translations of Web query terms.
5
Introduction
In this paper, we are interested in discovering translations of new terminology and proper names through mining Web anchor texts.
Problems with previous research methods:
  parallel corpora for various subjects and multiple languages
  lack of parallel correlation between word pairs
  short query terms
[Figure: example anchor texts for Yahoo: "Yahoo", 雅虎 (Yahoo), 美國雅虎 (Yahoo US), and 搜尋、雅虎 (search, Yahoo).]
6
Anchor Text Mining
We use a triple of the form $\langle U_i, U_j, D_k \rangle$ to indicate that page $U_j$ points to page $U_i$ with description text $D_k$.
For a Web page (or URL) $U_i$, its anchor-text set $AT(U_i)$ is defined as all of the anchor texts of the links pointing to $U_i$, i.e., $U_i$'s in-links.
If a query term appears in $AT(U_i)$, its corresponding translations are likely to appear there as well.
[Figure: several pages $U_j$ linking to page $U_i$.]
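The anchor-text set defined on this slide can be illustrated with a minimal sketch. It assumes each link is stored as a (target page, source page, anchor text) tuple; the function name `build_anchor_text_sets` and the example URLs are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def build_anchor_text_sets(links):
    """Group anchor texts by the page they point to.

    links: iterable of (target_url, source_url, anchor_text) tuples,
    an assumed storage layout for the link triples on the slide.
    Returns a dict mapping each target URL to its list of anchor texts.
    """
    anchor_sets = defaultdict(list)
    for target_url, _source_url, anchor_text in links:
        anchor_sets[target_url].append(anchor_text)
    return dict(anchor_sets)


# Made-up links for illustration only.
links = [
    ("http://www.yahoo.com", "http://a.example", "Yahoo"),
    ("http://www.yahoo.com", "http://b.example", "雅虎"),
    ("http://www.yahoo.com", "http://c.example", "美國雅虎"),
    ("http://other.example", "http://a.example", "搜尋"),
]
anchor_sets = build_anchor_text_sets(links)
print(anchor_sets["http://www.yahoo.com"])
```

Because the English and Chinese anchor texts for the same page end up in the same set, terms that co-occur across those sets become translation candidates.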
7
Probabilistic Inference Model
An asymmetric similarity estimation model may cause some common terms to become the best translations. We therefore use a symmetric similarity estimation function based on the probabilistic inference model, where $T_s$ is the source term and $T_t$ is the target translation.
Asymmetric model, the inductive rule "if $T_s$ then $T_t$": $P(T_s \rightarrow T_t) = P(T_t \mid T_s) = P(T_s \cap T_t) / P(T_s)$ (1)
Symmetric model, the inductive rules "if $T_s$ then $T_t$" and "if $T_t$ then $T_s$": $P(T_s \leftrightarrow T_t) = P(T_s \cap T_t) / P(T_s \cup T_t)$ (2)
Example: out of 100 anchor texts in total, $T_s$ = Yahoo appears in only one anchor text and $T_t$ = 雅虎 (Yahoo) appears in 10 anchor texts.
  $P(T_t \mid T_s) = 0.01 / 0.01 = 1$
  $P(T_s \leftrightarrow T_t) = 0.01 / [(0.01 + 0.1) - 0.01] = 0.1$
[Figure: a list of 100 anchor texts, including 雅虎 Yahoo, 雅虎 動物 (Yahoo, animals), and 雅虎 企業 (Yahoo, enterprise).]
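The worked example on this slide can be checked numerically. The sketch below assumes the asymmetric score is the co-occurrence count divided by the source-term count, and the symmetric score is the co-occurrence count divided by the size of the union, which is what the slide's arithmetic implies. It uses raw counts rather than probabilities because the total of 100 anchor texts cancels. The function names are illustrative.

```python
def asymmetric_score(n_both, n_source):
    """P(Tt | Ts): share of anchor texts containing Ts that also contain Tt."""
    return n_both / n_source


def symmetric_score(n_both, n_source, n_target):
    """P(Ts <-> Tt): co-occurrences divided by texts containing Ts or Tt."""
    return n_both / (n_source + n_target - n_both)


# Slide example: "Yahoo" occurs in 1 anchor text, 雅虎 in 10, both together in 1.
asym = asymmetric_score(n_both=1, n_source=1)
sym = symmetric_score(n_both=1, n_source=1, n_target=10)
print(asym, sym)
```

The asymmetric score saturates at 1 whenever the rare source term always co-occurs with a frequent target term, whereas the symmetric score also penalises the target for appearing without the source.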
8
Probabilistic Inference Model (cont.)
Let $U = (U_1, U_2, \ldots, U_n)$ be a concept space (Web page space) consisting of a set of pairwise disjoint basic concepts (Web pages), i.e., $U_i \cap U_j = \emptyset$ for $i \neq j$.
Eq. (2) can then be rewritten as a sum over the pages $U_i$, in which $L(U_j)$ denotes the number of in-links of page $U_j$.
[Figure: pages annotated with in-link counts; labels $U_j$, 15, $L(U_i)$.]
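The rewritten equation itself did not survive in this transcript. A plausible reconstruction, offered only as an assumption, applies the law of total probability over the disjoint pages to the numerator and denominator of Eq. (2) and weights each page by its share of in-links:

```latex
P(T_s \leftrightarrow T_t)
  = \frac{\sum_{i=1}^{n} P(T_s \cap T_t \mid U_i)\, P(U_i)}
         {\sum_{i=1}^{n} P(T_s \cup T_t \mid U_i)\, P(U_i)},
\qquad
P(U_i) = \frac{L(U_i)}{\sum_{j=1}^{n} L(U_j)}
```

Under this reading, pages with more in-links contribute more to the similarity estimate.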
9
Probabilistic Inference Model (cont.)
We assume that $T_s$ and $T_t$ are independent given $U_i$; the joint probability $P(T_s \cap T_t \mid U_i)$ is then equal to the product of $P(T_s \mid U_i)$ and $P(T_t \mid U_i)$.
This estimation approach takes into account the link information and the degree of authority among Web pages.
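A minimal sketch of a link-weighted symmetric score under the independence assumption stated on this slide. Several choices here are assumptions rather than details from the paper: each page is weighted by its share of all in-links, a term "occurs" in an anchor text if it is a substring of it, and the union probability is computed by inclusion-exclusion. The function name and the toy data are hypothetical.

```python
def symmetric_link_score(pages, source, target):
    """Estimate P(Ts <-> Tt) over a collection of pages.

    pages: list of anchor-text lists, one list per page U_i; the list
    length stands in for the in-link count L(U_i).
    """
    total_links = sum(len(texts) for texts in pages)
    numerator = 0.0
    denominator = 0.0
    for texts in pages:
        if not texts:
            continue
        p_page = len(texts) / total_links                       # P(U_i)
        p_source = sum(source in t for t in texts) / len(texts)  # P(Ts | U_i)
        p_target = sum(target in t for t in texts) / len(texts)  # P(Tt | U_i)
        p_both = p_source * p_target          # independence given U_i
        p_either = p_source + p_target - p_both
        numerator += p_both * p_page
        denominator += p_either * p_page
    return numerator / denominator if denominator else 0.0


# Toy data: two pages with four and two in-links respectively.
pages = [
    ["Yahoo", "雅虎", "美國雅虎", "Yahoo 雅虎"],
    ["雅虎 動物", "動物"],
]
score_yahoo = symmetric_link_score(pages, "Yahoo", "雅虎")
score_animal = symmetric_link_score(pages, "Yahoo", "動物")
print(score_yahoo, score_animal)
```

In this toy data 雅虎 scores higher than 動物 as a translation of "Yahoo", because 動物 never shares a page with the source term.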
10
Query Translation System
Three different methods are used to extract Chinese terms:
  PAT-tree-based
    1. check whether the strings of candidate terms are complete within a lexical boundary
    2. decide the importance of a term based on its relative frequency
  Query-set-based
    take queries from search engines
    use query sets of different sizes
  Tagger-based
    use the CKIP tagger
    extract unknown words
[Figure: example anchor texts for Yahoo: "Yahoo", 雅虎 (Yahoo), 搜尋、雅虎 (search, Yahoo), and 美國雅虎 (Yahoo US).]
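The query-set-based method can be sketched as a dictionary lookup: every substring of an anchor text that also occurs in a set of known queries is kept as a candidate term. This is only an illustration of the idea under that assumption; the function name, the `max_len` cutoff, and the tiny query set are hypothetical.

```python
def extract_terms_by_query_set(anchor_text, query_set, max_len=8):
    """Return substrings of anchor_text that appear in query_set.

    Candidates are reported in order of first occurrence, without duplicates.
    max_len bounds the substring length that is checked.
    """
    found = []
    for start in range(len(anchor_text)):
        longest_end = min(len(anchor_text), start + max_len)
        for end in range(start + 1, longest_end + 1):
            candidate = anchor_text[start:end]
            if candidate in query_set and candidate not in found:
                found.append(candidate)
    return found


# Made-up query set built from the terms shown on the slide.
query_set = {"雅虎", "美國雅虎", "搜尋"}
terms_a = extract_terms_by_query_set("美國雅虎", query_set)
terms_b = extract_terms_by_query_set("搜尋、雅虎", query_set)
print(terms_a, terms_b)
```

A larger query set recognises more candidates with this approach, at the cost of admitting more unrelated terms.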
11
Experiments
Experimental Environment
We collected popular query terms from the logs of Dreamer and GAIS. These query terms were taken as the major test set in our term translation extraction analysis.
We filtered out the terms that had no corresponding Chinese translations in the anchor-text database and selected 622 English terms as the source query set.
12
Experiments (cont.)
Evaluation Metric
For a set of test query terms, the top-n inclusion rate is defined as the percentage of the query terms whose effective translation(s) can be found in the top n extracted translations.
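The top-n inclusion rate can be computed as in the following minimal sketch. The data structures are assumptions: `ranked` maps each query term to its ranked list of extracted translations, and `gold` maps each term to its set of effective translations. The example data is made up for illustration and returns a fraction rather than a percentage.

```python
def top_n_inclusion_rate(ranked, gold, n):
    """Fraction of query terms with an effective translation in the top n."""
    if not ranked:
        return 0.0
    hits = 0
    for term, candidates in ranked.items():
        accepted = gold.get(term, set())
        if any(candidate in accepted for candidate in candidates[:n]):
            hits += 1
    return hits / len(ranked)


# Toy ranking: "yahoo" is correct at rank 1, "sakura" at rank 2,
# and the hypothetical term "term_x" has no correct candidate.
ranked = {
    "yahoo": ["雅虎", "搜尋"],
    "sakura": ["蜘蛛網", "櫻花"],
    "term_x": ["螢幕保護"],
}
gold = {"yahoo": {"雅虎"}, "sakura": {"櫻花"}, "term_x": set()}
top1 = top_n_inclusion_rate(ranked, gold, 1)
top2 = top_n_inclusion_rate(ranked, gold, 2)
print(top1, top2)
```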
13
Experiments (cont.)
Performance with Various Similarity Estimation Models
  $M_A$: asymmetric model
  $M_{AL}$: asymmetric model with link information
  $M_S$: symmetric model
  $M_{SL}$: symmetric model with link information (the proposed model)
Setup: 622 English query terms and the query-set-based method.
14
Experiments (cont.)
Performance with Various Term Extraction Methods
$M_{SL}$ is used as the similarity estimation model.
Columns: PAT-tree-based, Query-set-based, Tagger-based
  longer translations: ○ ○ X
  short translations: ○ ○
  low-frequency: X ○ ○
15
Experiments (cont.)
Performance with Various Query-Set Sizes
A medium-sized query set achieved the best performance. A larger query set yields more candidate terms but might also produce more noise.
Example: "sakura"
  9,709 terms: 台灣櫻花 (Taiwan Sakura Corporation); 櫻花 (sakura); 蜘蛛網 (spiderweb); 純愛 (pure love); and 螢幕保護 (screen saving)
  228,566 terms: 庫洛魔法使 (Card Captor Sakura); 櫻花建設 (Sakura Development Corporation); 模仿 (imitation); 櫻花大戰 (Sakura Wars); 美夕 (Miyu, name of an actress); 台灣櫻花 (Taiwan Sakura Corporation); 櫻花 (sakura); 蜘蛛網 (spiderweb); 純愛 (pure love); and 螢幕保護 (screen saving)
16
Discussion
  Comparisons with a translation lexicon
  Queries suitable for finding translations
  Extracting domain-specific translations
  Experiments on Simplified Chinese pages
17
Conclusion
The paper proposes a new and effective approach for mining Web link structures and anchor texts for translations of Web query terms.
Future research: combine more in-depth linguistic knowledge to remove noisy terms.
18
Personal Opinion
……..