Mining for Interactive Identification of Users’ Information Needs. Rey-Long Liu and Wan-Jung Lin (劉瑞瓏、林宛蓉), Dept. of Information Management, Chung Hua University
2 Outline: Introduction (Information Need Identification (INI): What & Why; Interactive INI); INEED: Incremental Mining for Interactive INI (the profile miner; the information need identifier); Experiment; Conclusion
3 Introduction. Information Need Identification (INI) for: information portals; online service guidance; Internet search engines; people finding. Interactive INI, which needs to consider: Precision (P); Precision Effectiveness (PE); Recall (R); Recall Effectiveness (RE). [Figure: hierarchy of categories C1 ... Cn and their subcategories]
4 Introduction (Cont.) Main Challenges Each information space has its own content and structure. Each information space is intrinsically dynamic. Users are often unable (or unwilling) to precisely express their information needs (INs). Their queries are often quite short. Users prefer simpler and fewer interactions.
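5 INEED. [Figure: system architecture. Components: Interface; Information Provider; Information Storage; INEED, consisting of the Profile Miner, the IN Identifier, and the Category Profile. Labeled flows: (0) Content & Taxonomy; (1) Interaction; (2) Request; (3) Information; (4) Information Required]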
6 The Profile Miner
Incremental profile mining
Given: The document d to be added to category c.
Effect: Updating the profiles of c and related categories.
Procedure:
(1) While c is not the root of the text hierarchy, do
  (1.1) For each distinct word w in d, do
    (1.1.1) If w is not a profile term for c, add (w, s_{w,c}) to the profile of c (strength s_{w,c} is unknown);
  (1.2) For each pair (w, s_{w,c}) in the profile of c, do
    (1.2.1) s_{w,c} = P(w|c) × (B_c / Σ_i P(w|c_i));
    (1.2.2) For each sibling b of c, update s_{w,b} in the profile of b;
  (1.3) c ← father of c.
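The procedure on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the slide does not define P(w|c), B_c, or the range of the sum, so the sketch assumes that P(w|c) is a relative term frequency, that B_c counts c together with its siblings, and that the sum runs over that same group. The names `add_document`, `term_probability`, and `strength` are hypothetical, and strengths are recomputed on demand instead of being stored in the profiles.

```python
from collections import Counter

def add_document(words, category, parent, term_counts):
    """Count the document's words in `category` and in every non-root ancestor."""
    c = category
    while parent.get(c) is not None:          # the root has no parent
        term_counts.setdefault(c, Counter()).update(words)
        c = parent[c]                         # c <- father of c

def term_probability(word, counts):
    """P(w|c) as a relative term frequency (an assumed estimator)."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

def strength(word, category, parent, children, term_counts):
    """s_{w,c} = P(w|c) * B_c / sum_i P(w|c_i), over c and its siblings (assumed)."""
    group = children[parent[category]]        # c together with its siblings
    probs = {ci: term_probability(word, term_counts.get(ci, Counter()))
             for ci in group}
    denom = sum(probs.values())
    return probs[category] * len(group) / denom if denom else 0.0

# Toy hierarchy: two sibling categories directly under the root.
parent = {"root": None, "qa": "root", "it": "root"}
children = {"root": ["qa", "it"]}
term_counts = {}
add_document(["quality", "quality", "quality", "testing"], "qa", parent, term_counts)
add_document(["system", "system", "system", "quality"], "it", parent, term_counts)

s_qa = strength("quality", "qa", parent, children, term_counts)
s_it = strength("quality", "it", parent, children, term_counts)
```

In this toy data "quality" gets strength 1.5 in `qa` and 0.5 in `it`. Under the assumed definition, a value above 1 means the word is more typical of that category than of its sibling group on average.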
7 The Profile Miner (Cont.) Updating the profiles of related categories once a document is added. [Figure: a new document is added to category f; the s-values of the profile terms of the related categories are updated]
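8 The Profile Miner (Cont.) An example (organizational units and their responsibilities):
Managers: decision making, coordination and integration
Sales Division: market planning, product promotion
Administration Division: internal administration, performance management
R&D Division: integration assessment, process formulation
Marketing Department: marketing copy, advertising
Customer Department: order management, sales analysis
Quality Assurance Department: quality maintenance, product testing
Manufacturing Department: production, design and manufacturing
Administrative Department: operations management
Information Department: system planning, development and maintenance
Personnel Section: staff recruitment, talent development
Accounting Section: account management, budgeting
Cashier Section: payments and receipts
Computer Integration Section: production information, information utilization
Information Management Section: system management, office automation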
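9 The Profile Miner (Cont.)
Query: "Information related to production management?"
Profile terms per category:
Managers
Sales Division: market, planning, sales
Marketing Department: marketing, advertising, promotion
Customer Department: order, management, analysis
Administration Division: internal affairs, administration, management
R&D Division: R&D, production, process
Quality Assurance Department: quality, management, testing
Information Department: information, system, construction
Computer Integration Section: production, integration, utilization
...
A good profile term is representative (P(w|c) is high) and discriminating (P(w|c) × B_c / Σ_i P(w|c_i) is strong): S = P(w|c) × (B_c / Σ_i P(w|c_i)).
Context examples: "construction and maintenance of the production management system"; "production quality maintenance".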
10 The IN Identifier
11 The IN Identifier (Cont.)
(1) For each category c, HitScore_c ← 0;
(2) For each pair (w, c), where w is a word in the query Q and c is a category,
  (2.1) If s_{w,c} > 1 and Support(w, c) ≥ minSupport,
    (2.1.1) ns ← (s_{w,c} - 1) / (number of siblings of c);
    (2.1.2) HitScore_c ← HitScore_c + ns × TF(w, Q);
(3) S ← The set of all categories;
(4) While the target category has not been identified and interaction is still allowed, do
  (4.1) Let p_1 and p_2 be the two pedigrees (in S) with the highest average HitScore;
  (4.2) Let t_1 and t_2 be the categories with the highest HitScore in p_1 and p_2;
  (4.3) Display t_1 and t_2 (and their basic information) for the user to select;
  (4.4) If either t_1 or t_2 is exactly the target, return the space under the target;
  (4.5) Else if neither t_1 nor t_2 is of interest, S ← S - {the categories under t_1 and t_2};
  (4.6) Else if both t_1 and t_2 are of interest, g ← ClimbUp(common ancestor of t_1 and t_2), and return the space under g;
  (4.7) Else
    (4.7.1) Let t be the category that is of interest;
    (4.7.2) If t is a leaf, g ← ClimbUp(father of t), and return the space under g;
    (4.7.3) Else S ← {the categories under t};
(5) Return S;
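Steps (1) and (2) of this procedure, the HitScore computation, might look like the sketch below. It is a minimal illustration under assumptions: the slide does not define Support(w, c), so it is passed in as precomputed numbers, TF(w, Q) is taken to be the count of w in the query, ns is assumed to multiply TF(w, Q), and categories without siblings are skipped to avoid dividing by zero. The function name `hit_scores` and all data values are hypothetical.

```python
from collections import Counter

def hit_scores(query_words, profiles, support, n_siblings, min_support):
    """Accumulate HitScore_c for every category c.

    profiles[c][w] holds s_{w,c}; support[c][w] holds Support(w, c).
    """
    tf = Counter(query_words)                 # TF(w, Q), assumed to be a raw count
    scores = {c: 0.0 for c in profiles}       # step (1)
    for c, profile in profiles.items():       # step (2)
        if n_siblings[c] == 0:
            continue                          # assumption: no siblings, no normalized score
        for w, freq in tf.items():
            s = profile.get(w, 0.0)
            if s > 1 and support[c].get(w, 0) >= min_support:
                ns = (s - 1) / n_siblings[c]  # normalized strength
                scores[c] += ns * freq
    return scores

# Hypothetical profiles for two sibling categories.
profiles = {"qa": {"quality": 1.5}, "it": {"quality": 0.5, "system": 1.8}}
support = {"qa": {"quality": 3}, "it": {"quality": 1, "system": 3}}
n_siblings = {"qa": 1, "it": 1}

scores = hit_scores(["quality", "quality", "system"], profiles, support,
                    n_siblings, min_support=2)
```

Only terms whose strength exceeds 1 contribute, so a word that is merely present in a category's profile (here "quality" in `it`) does not raise that category's score.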
12 The IN Identifier (Cont.) [Figure: finding two candidate categories for interaction, showing pedigrees p_1 and p_2 and candidate categories t_1 and t_2]
13 The IN Identifier (Cont.)
Function ClimbUp(f), where f is a category to start climbing
(1) If f is the root, return f;
(2) While the target category has not been identified and interaction is still allowed,
  (2.1) f_sibling ← A sibling of f;
  (2.2) f_uncle ← A sibling of the father of f;
  (2.3) Display f_sibling and f_uncle (and their basic information) for the user to select;
  (2.4) If either f_sibling or f_uncle is exactly the target, return the target;
  (2.5) Else if neither f_sibling nor f_uncle is of interest, return f;
  (2.6) Else if both f_sibling and f_uncle are of interest,
    (2.6.1) f ← grandfather of f;
    (2.6.2) If f is the root, return f;
  (2.7) Else if f_sibling is of interest, return father of f;
  (2.8) Else return {f, f_uncle};
(3) Return f;
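A rough sketch of ClimbUp follows. It makes several assumptions the slide leaves open: "a sibling" is taken to be the first other child of the same parent, a missing sibling or uncle is represented by `None`, user interaction is modeled by a hypothetical callback `ask(f_sibling, f_uncle)` that returns the confirmed target (or `None`) and the set of candidates the user finds of interest, and the interaction budget defaults to 5.

```python
def first_sibling(node, parent, children):
    """Return the first other child of node's parent, or None (assumed choice rule)."""
    p = parent.get(node)
    if p is None:
        return None
    others = [n for n in children[p] if n != node]
    return others[0] if others else None

def climb_up(f, parent, children, ask, max_interactions=5):
    if parent.get(f) is None:                 # (1) f is the root
        return f
    for _ in range(max_interactions):         # (2) interaction is still allowed
        f_sibling = first_sibling(f, parent, children)
        f_uncle = first_sibling(parent[f], parent, children)
        target, interested = ask(f_sibling, f_uncle)
        if target is not None:                # (2.4) exact target found
            return target
        sib_ok = f_sibling is not None and f_sibling in interested
        unc_ok = f_uncle is not None and f_uncle in interested
        if not sib_ok and not unc_ok:         # (2.5)
            return f
        if sib_ok and unc_ok:                 # (2.6) climb two levels and retry
            f = parent[parent[f]]             # exists because f_uncle exists
            if parent.get(f) is None:
                return f
            continue
        if sib_ok:                            # (2.7)
            return parent[f]
        return {f, f_uncle}                   # (2.8)
    return f                                  # (3)

# Toy hierarchy: root -> A, B; A -> A1, A2; A1 -> x, y.
parent = {"root": None, "A": "root", "B": "root",
          "A1": "A", "A2": "A", "x": "A1", "y": "A1"}
children = {"root": ["A", "B"], "A": ["A1", "A2"], "A1": ["x", "y"]}

only_sibling = climb_up("x", parent, children, lambda s, u: (None, {s}))
only_uncle = climb_up("x", parent, children, lambda s, u: (None, {u}))
```

Returning either a single category or a set mirrors step (2.8) of the slide; a caller would need to handle both shapes.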
14 The IN Identifier (Cont.) Generalization by climbing the hierarchy. [Figure panels: finding two categories for generalization (f, f_sibling, f_uncle); possible results of generalization]
15 Experiment. Experimental data. Source: Yahoo!. Coverage: Computers & Internet, Society and Culture, and Science. Size: 214 categories; depth: 8. Training data: 2216 documents. Test data: 168 queries extracted from another set of site summaries.
16 Experiment (Cont.)
Each system could conduct at most 5 interactions for each query.
System: INEED. Description: as described. Note: two settings for minSupport, giving two INEED variants.
System: BruteForce. Description: as in most search engines, the whole information space is considered (no INI is conducted).
System: RandomCN. Description: the system employs top-down navigation; at each level, two categories are randomly selected for the user to confirm. Note: repeated 10 times.
System: IdealCN. Description: the system employs top-down navigation; at each level, the target is always among the candidates identified by the system.
System: NB. Description: the output category is determined by the conditional probabilities of the query terms occurring in the categories. Note: two feature set sizes, 5000 and 8000 (NB-5000 and NB-8000).
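For readers unfamiliar with the NB baseline, the sketch below shows one common way to pick a category from the conditional probabilities of the query terms. It is not taken from the slides: the uniform category prior, the add-one smoothing, and the name `nb_classify` are all assumptions, and feature selection (the 5000 and 8000 feature sets) is left out.

```python
import math
from collections import Counter

def nb_classify(query_words, term_counts, vocabulary_size):
    """Return the category maximizing the sum of log P(w|c) over the query words.

    Assumes a uniform prior over categories and add-one (Laplace) smoothing.
    """
    best, best_score = None, -math.inf
    for c, counts in term_counts.items():
        total = sum(counts.values())
        score = sum(
            math.log((counts[w] + 1) / (total + vocabulary_size))
            for w in query_words
        )
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical term counts for two categories.
term_counts = {
    "virtual_reality": Counter({"virtual": 3, "graphics": 2}),
    "windows_95": Counter({"windows": 4, "software": 1}),
}
predicted = nb_classify(["virtual", "graphics"], term_counts, vocabulary_size=4)
```

Unlike the interactive systems in the comparison, this classifier commits to a single category from the query terms alone.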
17 Experiment (Cont.) Precision: BruteForce was poor; interaction is good for precision; INEED improved 14%~20% w.r.t. NB. Recall: INEED was good in both precision and recall; BruteForce and CN achieved 100% recall; INEED achieved 100% recall using only 2 interactions.
18 Experiment (Cont.) Precision-effectiveness: BruteForce was excluded; INEED improved more (19%~32%) w.r.t. NB; interactions by INEED were more effective. Recall-effectiveness: INEED performed best; INEED improved 2%~20% w.r.t. NB.
19 Experiment (Cont.) Precision vs. Recall: BruteForce and CN always achieved 100% recall. INEED performed best (its curve lay in the upper right corner). When no interaction is allowed, INEED improved recall by 38% w.r.t. NB. Precision of INEED improved 62% in the first interaction (NB only improved 29%).
20 Experiment (Cont.) An example. Test query: "Virtual world featuring 3-D ray-traced graphics. Wander around, meet other netizens, and try to solve some puzzles. Features animation and sound clips." Correct target identified by INEED: Computers and Internet → Multimedia → Virtual Reality → Exhibits. Erroneous category identified by NB: Computers and Internet → Software → Operating Systems → Windows → Windows 95.
21 Conclusion. Interactive Information Need Identification (interactive INI) as an essential component for: information portals; online service guidance; information retrieval; people finding. Requirements of interactive INI, fulfilled by INEED: exactly identify the information space that may satisfy the user’s information needs; effectively interact with the user; intelligently reduce the user’s load in query formation and result cognition.
22 Thanks