Incremental Mining of Information Interest for Personalized Web Scanning Rey-Long Liu ( 劉瑞瓏 ) Dept. of Medical Informatics Tzu Chi University.


2 Problem Definition
Personalized web scanning
- An environmental scanning routine for users and businesses
- A resource-consuming job (e.g. network bandwidth)
- Key issues: seed finding, information crawling, and information monitoring
- Should be guided by proper information interest, which is both
  - Implicit: the user is unable and/or unwilling to express the interest, and
  - Evolving: the interest may change, although it is relatively long-term

3 [Figure: system overview. Labels: User, Interest designation, Personalized Folder (a folder hierarchy rooted at C_R), Interest Miner, Spec. for user's interest, Scanner, Seed Finding, New Info Gathering & Monitoring, The Web, Info Scanned, New Info]
Our goal: Incremental mining of information interest to guide web scanning

4 Related Fields
- Information gathering: aimed at "one-shot" information needs, rather than relatively long-term needs
- Information monitoring: aimed at the "dynamics" of the information of interest (IOI), rather than the location of the IOI
- Profile building for folders (categories): aimed at information analysis (e.g. information classification and similarity measurement), rather than the derivation of comprehensible specifications

5 Major Challenges
Interest specifications should be both
- Precise: to direct the scanner to suitable info subspaces
- Comprehensible: to allow the user to refine the specifications, and to allow the search engines to find proper seeds for scanning
The specifications should be derived under the common condition that the user's interest is often
- Implicit,
- Evolving, and
- Collectively defined by a hierarchy of folders in which each folder's context of discussion (COD) is implicitly expressed
  Example:
    Root → System Development → Decision Support Systems
    Root → Manufacturing → Decision Support Systems
- A folder's COD is actually indicated by the profiles of its ancestors.

6 IMind
Main contributions
- Incrementally mining interest specifications that are more
  - Precise (by specifying each folder's COD), and
  - Comprehensible (in conjunctive normal form)
- No predefined feature sets

7 Input
- A hierarchy T of folders,
- A set G of folders designated as the goals of web scanning, and
- A set X of documents added to a folder f.
Output
- Update the profile of each related folder of f in T.
- For each folder g in G, if the interest specification of g has changed, send the new specification to the scanner.

8 Example output of IMind
Folder path and profile terms:
- Root
  - Computer & Internet: file, information, window, system, site, server, …
    - Hardware: CPU, bit, instruction, register, processor, chip, …
      - Desktop Computers: card, machine, PC, sound, printer, …
The interest specification for Desktop Computers:
(file OR information OR window OR system OR site OR server OR …) AND (CPU OR bit OR instruction OR register OR processor OR chip OR …) AND (card OR machine OR PC OR sound OR printer OR …)
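The conjunctive normal form shown on this slide can be produced mechanically from per-folder term lists. The following is a minimal sketch, not IMind's actual code: the function name `to_cnf_query` and the input layout (one list of terms per folder on the path) are assumptions made for illustration.

```python
from typing import List


def to_cnf_query(term_groups: List[List[str]]) -> str:
    """Render term groups as a conjunction (AND) of disjunctions (OR).

    Each inner list holds the terms selected from one folder profile.
    This input layout is hypothetical, chosen for the example.
    """
    clauses = []
    for terms in term_groups:
        if not terms:
            continue  # a folder with no selected terms contributes no clause
        clauses.append("(" + " OR ".join(terms) + ")")
    return " AND ".join(clauses)


groups = [
    ["file", "information", "window", "system", "site", "server"],   # Computer & Internet
    ["CPU", "bit", "instruction", "register", "processor", "chip"],  # Hardware
    ["card", "machine", "PC", "sound", "printer"],                   # Desktop Computers
]
print(to_cnf_query(groups))
```

Because each clause comes from one level of the hierarchy, a user can loosen or tighten a single level without touching the others, which is what makes this form comparatively easy to refine by hand.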

9 The algorithm
Incremental update of folder profiles:
(1) W ← {w | w is a word in X, and w is not a stop word};
(2) While (f is not the root of T) do
    (2.1) Construct or update each 3-tuple in the profile of f;
    (2.2) For each sibling b of f, update d_{w,b};
    (2.3) f ← parent of f;
Derivation of interest specifications:
(3) For each goal folder g in G, do
    (3.1) I_g ← Disjunction of the profile terms having higher r_{w,g} × d_{w,g} values (a preset number of profile terms in g are selected);
    (3.2) a ← parent of g;
    (3.3) While (a is not the root of T) do
        (3.3.1) I_g ← Conjunction of I_g and disjunction of the terms having higher r_{w,a} × d_{w,a} values (a preset number of profile terms in both a and g are selected);
        (3.3.2) a ← parent of a;
    (3.4) If I_g ≠ specification of g, send I_g to the scanner to update the specification of g;
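Step (3) of the algorithm can be sketched in Python as follows. This is an illustrative reading rather than the author's implementation: the `Folder` class, the `scores` field (assumed to hold the r × d product maintained by step 2), the parameter name `k`, and the alphabetical tie-breaking are all assumptions. As a simplification, each ancestor clause here is taken from the ancestor's own top-scoring terms.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Spec = Tuple[Tuple[str, ...], ...]  # a conjunction of disjunctions


@dataclass
class Folder:
    name: str
    parent: Optional["Folder"] = None
    # term -> r*d score; assumed to be kept up to date by step (2)
    scores: Dict[str, float] = field(default_factory=dict)
    spec: Optional[Spec] = None  # last specification sent to the scanner


def top_terms(folder: Folder, k: int) -> Tuple[str, ...]:
    """Return the k terms with the highest score (ties broken alphabetically)."""
    ranked = sorted(folder.scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return tuple(term for term, _ in ranked[:k])


def derive_spec(goal: Folder, k: int) -> Spec:
    """Steps (3.1)-(3.3): one clause for the goal, one per non-root ancestor."""
    clauses = [top_terms(goal, k)]
    a = goal.parent
    while a is not None and a.parent is not None:  # stop before the root
        clauses.append(top_terms(a, k))
        a = a.parent
    return tuple(clauses)


def refresh_goals(goals: List[Folder], k: int) -> List[Folder]:
    """Step (3.4): update and return only the goals whose specification changed."""
    changed = []
    for g in goals:
        new_spec = derive_spec(g, k)
        if new_spec != g.spec:
            g.spec = new_spec
            changed.append(g)  # a real system would notify the scanner here
    return changed


root = Folder("Root")
ci = Folder("Computer & Internet", root, {"file": 0.9, "system": 0.8, "zebra": 0.1})
hw = Folder("Hardware", ci, {"CPU": 0.9, "chip": 0.7, "file": 0.2})
print(derive_spec(hw, 2))
```

Comparing the new specification with the stored one before notifying the scanner keeps traffic to the scanner proportional to actual changes in interest, not to the number of documents added.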

10 Measuring how representative and discriminative a term w is in a folder f:
- r_{w,f} = Support(w, f) (= P(w | f))
- d_{w,f} = Support(w, f) / Avg_i Support(w, f_i), where f_i ∈ {f} ∪ {siblings of f}
[Figure: example hierarchy, with terms marked (O) or (X)]
- Systems Development: System, Computer, Analysis, … (O)
  - Decision Support Systems: Decision, simulation, … (O); System, Computer, … (X)
  - Transaction Processing Systems: Accounting, Sales, … (O); System, Computer, … (X)
- Manufacturing: Product, factory, … (O)
  - Decision Support Systems: Decision, simulation, … (O)
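The two measures can be tried on a toy pair of sibling folders. The sketch below assumes that Support(w, f) is the fraction of documents in f that contain w; the slide only equates it with P(w|f), so that reading, the function names, and the sample documents are all assumptions.

```python
from typing import Dict, List, Set

Docs = List[Set[str]]  # each document reduced to its set of non-stop words


def support(word: str, docs: Docs) -> float:
    """Fraction of documents containing `word` (assumed reading of Support)."""
    if not docs:
        return 0.0
    return sum(1 for d in docs if word in d) / len(docs)


def r_value(word: str, folder: str, folders: Dict[str, Docs]) -> float:
    """Representativeness: support of the word inside the folder itself."""
    return support(word, folders[folder])


def d_value(word: str, folder: str, siblings: List[str],
            folders: Dict[str, Docs]) -> float:
    """Discrimination: support in the folder over the average support
    across the folder and its siblings."""
    group = [folder] + siblings
    avg = sum(support(word, folders[f]) for f in group) / len(group)
    return support(word, folders[folder]) / avg if avg > 0 else 0.0


folders = {
    "DSS": [{"decision", "system"}, {"decision", "simulation", "system"},
            {"simulation", "system"}],
    "TPS": [{"accounting", "system"}, {"sales", "system"}],
}
for w in ("system", "decision"):
    r = r_value(w, "DSS", folders)
    d = d_value(w, "DSS", ["TPS"], folders)
    print(w, round(r, 3), round(d, 3), round(r * d, 3))
```

In this toy data "system" appears in every document of both folders, so its d-value stays at the neutral level of 1, while "decision" appears only in DSS and receives a higher d-value and a higher r × d product. This mirrors the figure, where the shared terms are marked (X) in the child folders.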

11 Incremental update of profile terms
[Figure: a folder hierarchy in which X, the set of documents, is added to folder f. Some folders are labelled "Both r-values and d-values of the profile terms are updated"; others are labelled "Only d-values of the terms are updated"]

12 Complexity of Incremental Mining
Space complexity
- O(N × t), where N is the total number of different terms accumulated, and t is the number of folders in the hierarchy
Time complexity
- Profile mining (step 2): the maximum number of updates is Σ_i B_i × N, where B_i is the number of siblings of the level-i ancestor of f (i.e. the ancestor whose level is i) plus one (i.e. including the level-i ancestor)
- Specification derivation (step 3): the maximum number of operations required to update interest specifications is Σ_i Σ_j (number of descendant goal folders of the j-th sibling of the level-i ancestor of f) × N
Note: The above numbers should be much smaller in practice, since each folder is quite unlikely to contain all terms (i.e. N terms)
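As a rough worked example of these bounds, suppose N = 1000 distinct terms and three levels contributing to the sum, with B_1 = 5, B_2 = 3, and B_3 = 4. All of these values are hypothetical and chosen only to make the arithmetic concrete; t = 261 is the folder count of the larger hierarchy used in the evaluation.

```latex
\begin{align*}
\text{profile entries} &\le N \times t = 1000 \times 261 = 261{,}000,\\
\text{updates in step 2} &\le \sum_i B_i \, N = (5 + 3 + 4) \times 1000 = 12{,}000 .
\end{align*}
```

Under these assumed values the per-batch update cost depends only on the folders along the path from f toward the root and their siblings, not on the size of the whole hierarchy.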

13 Empirical Evaluation
Experimental Data
- Source: Yahoo!
- Coverage: Computers & Internet, Society and Culture, and Science
- The larger hierarchy: 261 folders, of which 174 were leaf folders, of which 142 were not duplicates (and were set as goal folders); 2844 documents
- The smaller hierarchy: 169 folders, of which 119 were leaf folders, of which 109 were not duplicates (and were set as goal folders); 3615 documents

14 Evaluation method
- Sending the specifications to Yahoo!
  - Other search engines were tried as well. However, they limited the number of terms in a query and/or did not return the category of the web sites: Google, Lycos, Open Directory Project (ODP), AltaVista, and Netscape
  - Yahoo! returns web sites and their categories
- The top 200 web sites are considered
  - In practice, the web scanner may process only a limited number of seeds
  - Yahoo! states that it ranks web sites by relevance using its own complex, proprietary algorithm

15 Evaluation criteria
- Completeness: average sites found per folder
- Reliability: percentage of folders with sites retrieved
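Both criteria reduce to simple aggregates over per-folder counts. The sketch below assumes that a count of matching sites has already been obtained for each goal folder; the function names and the sample counts are hypothetical and are not results from the talk.

```python
from typing import Dict


def completeness(sites_per_folder: Dict[str, int]) -> float:
    """Average number of sites found per goal folder."""
    if not sites_per_folder:
        return 0.0
    return sum(sites_per_folder.values()) / len(sites_per_folder)


def reliability(sites_per_folder: Dict[str, int]) -> float:
    """Percentage of goal folders for which at least one site was retrieved."""
    if not sites_per_folder:
        return 0.0
    hits = sum(1 for n in sites_per_folder.values() if n > 0)
    return 100.0 * hits / len(sites_per_folder)


# Hypothetical counts for four goal folders.
counts = {"A": 4, "B": 0, "C": 2, "D": 0}
print(completeness(counts), reliability(counts))
```

The two measures are complementary: a high average can hide many folders with no sites at all, which is what the reliability figure exposes.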

16 Systems evaluated
- IMind (with the number of selected profile terms set to 10 and 20)
- Baselines (with the same number of terms as IMind)
  - Vector-based approach
    - Norm-of-the-folder (NOF): the profile of the folder was a vector constructed by averaging the document vectors in the folder
    - Rocchio's method (RO): the profile was a vector constructed by computing a weighted sum of the positive document vectors and the negative document vectors
  - Probability-based approach
    - Naive Bayes (NB): the profile was constructed by estimating the conditional probabilities of the terms in the folder
  - Hierarchical approach
    - Hierarchical Shrinkage (HS): the profile was constructed by employing the hierarchical relationships (e.g. sibling) among folders to refine the estimates of the conditional probabilities produced by NB
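The two vector-based baselines can be illustrated with sparse term-weight dictionaries. This is a generic sketch of centroid and Rocchio-style profiles, not the configuration used in the experiments: the weights `beta` and `gamma`, the raw-count document vectors, and the function names are assumptions.

```python
from collections import Counter
from typing import Dict, List

Vector = Dict[str, float]


def centroid(docs: List[Vector]) -> Vector:
    """NOF-style profile: the average of the document vectors."""
    if not docs:
        return {}
    total: Counter = Counter()
    for doc in docs:
        total.update(doc)  # adds the weights term by term
    return {t: v / len(docs) for t, v in total.items()}


def rocchio(pos: List[Vector], neg: List[Vector],
            beta: float = 1.0, gamma: float = 0.5) -> Vector:
    """Rocchio-style profile: weighted positive centroid minus weighted
    negative centroid (beta and gamma are illustrative values)."""
    p, n = centroid(pos), centroid(neg)
    return {t: beta * p.get(t, 0.0) - gamma * n.get(t, 0.0)
            for t in set(p) | set(n)}


def top_k(profile: Vector, k: int) -> List[str]:
    """The k highest-weighted terms (ties broken alphabetically)."""
    return [t for t, _ in sorted(profile.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]


pos = [{"cpu": 2, "system": 1}, {"cpu": 1, "chip": 1, "system": 1}]
neg = [{"system": 2, "sales": 1}]
print(top_k(centroid(pos), 2))
print(top_k(rocchio(pos, neg), 2))
```

In this toy data the centroid keeps "system" because it is frequent in the folder, whereas the Rocchio profile drops it because it is equally common in the negative documents.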

17 Results Average sites found per folder (the larger hierarchy)

18 Average sites found per folder (the smaller hierarchy)

19 Percentage of folders with sites retrieved (the larger hierarchy)

20 Percentage of folders with sites retrieved (the smaller hierarchy)

21 More specifically, the results showed that
- IMind derived more precise specifications
  - Making seed finding both more complete and more reliable
  - Some specifications derived by the baselines were too vague for Yahoo! to process: Yahoo! did not respond to 2, 3, 19, and 78 queries generated by RO-20, NOF-20, NB-20, and HS-20, respectively
- IMind derived more comprehensible specifications
  - Specifying each level of COD of each folder
- IMind improved more when more training data was given
  - Contributing more significant improvements on the smaller hierarchy, which has more training documents
- IMind does not require feature set tuning
  - Demonstrating more stable performance

22 IMind successfully controlled the time spent to process each document
- The time mainly depends on the number of terms in related folders, and that number should converge to a certain limit
[Figure: time spent on individual documents sequentially added to the larger hierarchy, measured on a PC with a 2.6 GHz CPU and 2 GB of RAM]

23 Conclusion
- Personalized web scanning needs to be guided by the user's information interest, which is both implicit and evolving
- IMind is an incremental text mining system to derive precise and comprehensible interest specifications

24 Extension
- How can the user refine the specifications mined?
  - An intelligent interface to guide the refinement
- How can the length of the specifications be determined more intelligently?
  - Automatic thresholding
  - Manual setting

25 More related extensions
- Information Scanning: Autonomous scanning, Adaptive discovery, Adaptive monitoring, & Adaptive elicitation
- Information Analysis: Exception management, Trend detection, Association detection, Event tracking, & Novelty detection
- Information/Knowledge Classification & Filtering: Semantic context recognition, Integrated filtering and classification, & Incremental context mining
- Environmental Information: Partners, Customers, Competitors, Government, & News providers
- Internal Information: Transaction Data, Knowledge shared, & Information shared
- Information/Knowledge Delivery: Intelligent information retrieval, Adaptive online guidance, Adaptive dissemination, People finding, Knowledge finding, Knowledge map, & Computer-Assisted Instruction

Thanks