Incremental Mining of Information Interest for Personalized Web Scanning Rey-Long Liu ( 劉瑞瓏 ) Dept. of Medical Informatics Tzu Chi University
2 Problem Definition Personalized web scanning An environmental scanning routine for users and businesses A resource-consuming job (e.g. network bandwidth) Key issues Seed finding Information crawling Information monitoring Should be guided by proper information interest, which is both Implicit: The user is unable and/or unwilling to express the interest, and Evolving: The interest may change although it is relatively long-term
3 [Figure: architecture diagram. Components: User, Personalized Folder (a hierarchy of folders), Interest Miner, Scanner (Seed Finding; New Info Gathering & Monitoring), and The Web. Labels: Interest designation, Spec. for user’s interest ({… AND …}), New Info, Info Scanned.] Our goal: Incremental mining of information interest to guide web scanning
4 Related Fields Information gathering Aimed at “one-shot” information needs, rather than relatively long-term needs Information monitoring Aimed at the “dynamics” of information of interest (IOI), rather than the location of the IOI Profile building for folders (categories) Aimed at information analysis (e.g. information classification and similarity measurement), rather than the derivation of comprehensible specifications
5 Major Challenges Interest specifications should be both Precise To direct the scanner to suitable info subspaces Comprehensible To allow the user to refine the specifications, and To allow the search engines to find proper seeds for scanning The specifications should be derived under the common condition that the user’s interest is often Implicit, Evolving, and Collectively defined by a hierarchy of folders in which each folder’s context of discussion (COD) is implicitly expressed Example: Root → Systems Development → Decision Support Systems vs. Root → Manufacturing → Decision Support Systems A folder’s COD is actually indicated by the profiles of its ancestors.
6 IMind Main contributions Incrementally mining interest specifications which are more Precise (by specifying each folder’s COD), and Comprehensible (in conjunctive normal form) No predefined feature sets
7 Input A hierarchy T of folders, A set of folders G designated as the goals of web scanning, and A set X of documents added to a folder f. Output Update the profile of each related folder of f in T, For each folder g in G, if the interest specification of g has changed, send the new specification to the scanner.
8 Example output of IMind
[Figure: folder hierarchy Root > Computer & Internet > Hardware > Desktop Computers, with the profile terms of each folder]
Computer & Internet: file, information, window, system, site, server, …
Hardware: CPU, bit, instruction, register, processor, chip, …
Desktop Computers: card, machine, PC, sound, printer, …
The interest specification for Desktop Computers: (file OR information OR window OR system OR site OR server OR …) AND (CPU OR bit OR instruction OR register OR processor OR chip OR …) AND (card OR machine OR PC OR sound OR printer OR …).
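As a minimal sketch of how such a specification string might be assembled, the snippet below joins the terms of each folder on the path with OR and the resulting clauses with AND. The function name `build_cnf` is hypothetical, and the term lists are the example terms from this slide with the trailing "…" dropped.

```python
def build_cnf(term_groups):
    """Join each group's terms with OR, then join the groups with AND."""
    clauses = ["(" + " OR ".join(terms) + ")" for terms in term_groups if terms]
    return " AND ".join(clauses)


# One term group per folder on the path, top-level folder first
# (example terms from the slide; illustrative only).
groups = [
    ["file", "information", "window", "system", "site", "server"],
    ["CPU", "bit", "instruction", "register", "processor", "chip"],
    ["card", "machine", "PC", "sound", "printer"],
]
spec = build_cnf(groups)
print(spec)
```

Each parenthesized clause corresponds to one level of the folder's context of discussion, which is what keeps the specification readable for the user.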
9 The algorithm
Incremental update of folder profiles:
(1) W ← {w | w is a word in X, and w is not a stop word};
(2) While (f is not the root of T) do
(2.1) Construct or update each 3-tuple in the profile of f;
(2.2) For each sibling b of f, update $d_{w,b}$;
(2.3) f ← parent of f;
Derivation of interest specifications:
(3) For each goal folder g in G, do
(3.1) $I_g$ ← Disjunction of the profile terms having higher $r_{w,g} \times d_{w,g}$ values (a number of profile terms in g are selected);
(3.2) a ← parent of g;
(3.3) While (a is not the root of T) do
(3.3.1) $I_g$ ← Conjunction of $I_g$ and disjunction of the terms having higher $r_{w,a} \times d_{w,a}$ values (a number of profile terms in both a and g are selected);
(3.3.2) a ← parent of a;
(3.4) If $I_g$ ≠ specification of g, send $I_g$ to the scanner to update the specification of g;
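Step 3 can be sketched as follows. This is an illustration under stated assumptions, not the IMind implementation: the `Folder` class, `top_terms`, `derive_spec`, and the parameter `k` are hypothetical names; terms are ranked by the product of their r and d values; and for simplicity the top `k` terms are taken from each ancestor's own profile.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Folder:
    name: str
    parent: Optional["Folder"] = None
    # term -> (r, d); assumed to be maintained by the profile-update step
    profile: Dict[str, Tuple[float, float]] = field(default_factory=dict)


def top_terms(folder: Folder, k: int) -> List[str]:
    """Return the k profile terms with the highest r * d product."""
    ranked = sorted(folder.profile.items(),
                    key=lambda kv: kv[1][0] * kv[1][1], reverse=True)
    return [term for term, _ in ranked[:k]]


def derive_spec(goal: Folder, k: int) -> List[List[str]]:
    """Steps 3.1 to 3.3: one OR-clause per folder from the goal up to,
    but not including, the root. The clauses are meant to be ANDed."""
    clauses = [top_terms(goal, k)]
    a = goal.parent
    while a is not None and a.parent is not None:  # the root has no parent
        clauses.append(top_terms(a, k))
        a = a.parent
    return clauses


# Made-up scores for illustration only.
root = Folder("Root")
ci = Folder("Computers & Internet", root,
            {"system": (0.8, 2.0), "file": (0.6, 1.5), "the": (0.9, 0.5)})
hw = Folder("Hardware", ci,
            {"CPU": (0.7, 3.0), "chip": (0.5, 2.0), "system": (0.9, 0.4)})
desktop = Folder("Desktop Computers", hw,
                 {"PC": (0.6, 3.0), "printer": (0.4, 2.5), "CPU": (0.5, 0.8)})

new_spec = derive_spec(desktop, k=2)
old_spec: List[List[str]] = []      # previously sent specification, if any
should_send = new_spec != old_spec  # step 3.4: notify the scanner only on change
```

Comparing the new specification with the previous one before sending keeps the scanner from restarting seed finding when added documents do not change the top-ranked terms.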
10 Measuring how representative and discriminative a term w is in a folder f:
$r_{w,f} = \mathrm{Support}(w,f) \; (= P(w \mid f))$
$d_{w,f} = \mathrm{Support}(w,f) \,/\, \mathrm{Avg}_i\, \mathrm{Support}(w,f_i)$, where $f_i \in \{f\} \cup \{\text{siblings of } f\}$
[Figure: example hierarchy with candidate profile terms marked (O) or (X)]
Systems Development: System, Computer, Analysis, … (O)
Systems Development > Decision Support Systems: Decision, simulation, … (O); System, Computer, … (X)
Systems Development > Transaction Processing Systems: Accounting, Sales, … (O); System, Computer, … (X)
Manufacturing: Product, factory, … (O)
Manufacturing > Decision Support Systems: Decision, simulation, … (O)
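The two measures can be illustrated with a small sketch. It assumes that Support(w, f) is the fraction of documents in f that contain w, which is one reading of P(w|f); the function names and the toy documents are hypothetical.

```python
def support(term, docs):
    """Fraction of documents (each a set of terms) that contain the term."""
    if not docs:
        return 0.0
    return sum(1 for doc in docs if term in doc) / len(docs)


def r_value(term, folder_docs):
    """Representativeness of the term in the folder."""
    return support(term, folder_docs)


def d_value(term, folder_docs, sibling_docs):
    """Discriminativeness: support in the folder divided by the average
    support over the folder and its siblings."""
    groups = [folder_docs] + list(sibling_docs)
    avg = sum(support(term, g) for g in groups) / len(groups)
    return support(term, folder_docs) / avg if avg > 0 else 0.0


# Toy folders: Decision Support Systems and its sibling
# Transaction Processing Systems.
dss = [{"decision", "system"}, {"simulation", "system"}]
tps = [{"accounting", "system"}, {"sales", "system"}]

print(r_value("system", dss), d_value("system", dss, [tps]))
print(r_value("decision", dss), d_value("decision", dss, [tps]))
```

In this toy data "system" is fully representative of Decision Support Systems (r = 1.0) but not discriminative (d = 1.0, the same as in the sibling), while "decision" is less frequent but twice as common as the group average (d = 2.0). This matches the (O)/(X) marking in the slide's example.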
11 Incremental update of profile terms [Figure: documents X are added to folder f. For f, both r-values and d-values of the profile terms are updated; for the siblings of f, only d-values of the terms are updated.] X: the set of documents added to f
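The update can be sketched with per-folder document counts, as below. This is an assumption-laden illustration rather than IMind's data structure: `Node`, `add_documents`, and the count-based bookkeeping are hypothetical, and support is again taken to be the fraction of a folder's documents containing the term.

```python
from collections import Counter


class Node:
    def __init__(self, name, parent=None):
        self.name, self.parent, self.children = name, parent, []
        self.n_docs = 0
        self.df = Counter()      # term -> number of documents containing it
        self.r, self.d = {}, {}  # stored profile values
        if parent is not None:
            parent.children.append(self)

    def support(self, w):
        return self.df[w] / self.n_docs if self.n_docs else 0.0


def add_documents(f, docs, stop_words=frozenset()):
    """Add docs to folder f and propagate the update up to the root."""
    docs = [set(doc) - stop_words for doc in docs]  # drop stop words
    while f.parent is not None:                     # stop at the root
        f.n_docs += len(docs)
        for doc in docs:
            f.df.update(doc)
        group = f.parent.children                   # f plus its siblings

        def avg(w):
            return sum(b.support(w) for b in group) / len(group)

        for w in f.df:                              # f: update r and d
            f.r[w] = f.support(w)
            f.d[w] = f.r[w] / avg(w)
        for b in group:                             # siblings: update d only
            if b is not f:
                for w in b.df.keys() & f.df.keys():
                    b.d[w] = b.support(w) / avg(w)
        f = f.parent


root = Node("Root")
sd = Node("Systems Development", root)
dss = Node("Decision Support Systems", sd)
tps = Node("Transaction Processing Systems", sd)
add_documents(tps, [["accounting", "system"]])
add_documents(dss, [["decision", "system"], ["simulation", "system"]])
```

A sibling's d-value only needs recomputing for terms that also occur in f, because for any other term the support in f is zero before and after the update, so the group average does not move.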
12 Complexity of Incremental Mining Space complexity $O(Nt)$, where N is the total number of different terms accumulated, and t is the number of folders in the hierarchy Time complexity Profile mining (step 2) The maximum number of updates is $\sum_i B_i N$, where $B_i$ is the number of siblings of the level-i ancestor of f (i.e. the ancestor whose level is i) plus one (i.e. including the level-i ancestor) Specification derivation (step 3) The maximum number of operations required to update interest specifications is $\sum_i \sum_j \gamma_{i,j} N$, where $\gamma_{i,j}$ is the number of descendant goal folders of the j-th sibling of the level-i ancestor of f Note: The above numbers should be much smaller in practice, since each folder is quite unlikely to contain all terms (i.e. N terms)
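A short worked example of the step-2 bound, the sum over ancestor levels i of B_i times N. The function name and the numbers are made up for illustration only.

```python
def max_profile_updates(group_sizes, n_terms):
    """Upper bound on profile updates for one batch of added documents.

    group_sizes[i] is B_i: the level-i ancestor of f plus its siblings.
    n_terms is N: the total number of distinct terms accumulated.
    """
    return sum(b * n_terms for b in group_sizes)


# Hypothetical folder three levels below the root, with sibling groups
# of size 4, 3, and 5 along its path, and 1000 distinct terms overall.
bound = max_profile_updates([4, 3, 5], 1000)
print(bound)
```

The bound grows with the width of the hierarchy along the path to f, not with the total number of folders, which is why adding a document touches only a small part of a large hierarchy.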
13 Empirical Evaluation Experimental Data Source: Yahoo! Coverage: Computers & Internet, Society and Culture, and Science The larger hierarchy: 261 folders, among which 174 were leaf folders, among which 142 were not duplicates (and were set as goal folders) 2844 documents The smaller hierarchy: 169 folders, among which 119 were leaf folders, among which 109 were not duplicates (and were set as goal folders) 3615 documents
14 Evaluation method Sending the specifications to Yahoo! Other search engines were tried as well (Google, Lycos, Open Directory Project (ODP), AltaVista, and Netscape). However, they limited the number of terms in a query and/or did not return the category of the web sites Yahoo! returns web sites and their categories Top 200 web sites are considered In practice, the web scanner may process only a limited number of seeds Yahoo! claims to rank each web site by relevance using its own complicated and proprietary algorithm
15 Evaluation criteria Completeness Average sites found per folder Reliability Percentage of folders with sites retrieved
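Both criteria can be computed from a per-folder count of sites found, as in this minimal sketch. The function names and the input format (a mapping from goal folder to number of sites found) are assumptions, and the counts are invented.

```python
def completeness(sites_per_folder):
    """Average number of sites found per goal folder."""
    if not sites_per_folder:
        return 0.0
    return sum(sites_per_folder.values()) / len(sites_per_folder)


def reliability(sites_per_folder):
    """Percentage of goal folders for which at least one site was found."""
    if not sites_per_folder:
        return 0.0
    hits = sum(1 for n in sites_per_folder.values() if n > 0)
    return 100.0 * hits / len(sites_per_folder)


# Invented counts for four goal folders.
found = {"folder_a": 12, "folder_b": 0, "folder_c": 3, "folder_d": 5}
print(completeness(found), reliability(found))
```

The two numbers complement each other: a high average can hide folders with no seeds at all, which the reliability figure exposes.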
16 Systems evaluated IMind (with its term-selection parameter set to 10 and 20) Baselines (with the same number of terms as IMind) Vector-based approach Norm-of-the-folder (NOF) The profile of the folder was a vector constructed by averaging the document vectors in the folder Rocchio’s method (RO) The profile was a vector constructed by computing a weighted sum of the positive document vectors and the negative document vectors Probability-based approach Naive Bayes (NB) The profile was constructed by estimating the conditional probabilities of the terms in the folder Hierarchical approach Hierarchical Shrinkage (HS) The profile was constructed by employing the hierarchical relationships (e.g. sibling) among folders to refine the estimates of the conditional probabilities produced by NB
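The two vector-based baselines can be sketched as follows. This is a textbook-style illustration, not the configuration used in the evaluation: the function names, the sparse-dictionary representation, and the weights `beta` and `gamma` are assumptions, and the Rocchio form shown (weighted positive centroid minus weighted negative centroid) is the common one.

```python
def centroid(vectors):
    """NOF-style profile: the average of the document vectors (sparse dicts)."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def rocchio(positive, negative, beta=1.0, gamma=0.5):
    """Rocchio-style profile: beta * positive centroid - gamma * negative centroid.
    The weights beta and gamma are hypothetical defaults."""
    pos, neg = centroid(positive), centroid(negative)
    terms = set(pos) | set(neg)
    return {t: beta * pos.get(t, 0.0) - gamma * neg.get(t, 0.0) for t in terms}


# Toy term-weight vectors: documents inside and outside the folder.
inside = [{"cpu": 2.0, "chip": 1.0}, {"cpu": 4.0}]
outside = [{"cpu": 1.0, "sales": 2.0}]

nof_profile = centroid(inside)
ro_profile = rocchio(inside, outside)
```

To turn such a profile into a query of a given length, the highest-weighted terms would be selected; unlike the specification derived from folder ancestors, the result is a single flat list of terms.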
17 Results Average sites found per folder (the larger hierarchy)
18 Average sites found per folder (the smaller hierarchy)
19 Percentage of folders with sites retrieved (the larger hierarchy)
20 Percentage of folders with sites retrieved (the smaller hierarchy)
21 More specifically, the results showed that IMind derived more precise specifications Making seed finding both more complete and more reliable Some specifications derived by the baselines were too vague for Yahoo! to process Yahoo! did not respond to 2, 3, 19, and 78 queries generated by RO-20, NOF-20, NB-20, and HS-20, respectively IMind derived more comprehensible specifications Specifying each level of COD of each folder IMind improved more when more training data was given Contributing more significant improvements on the smaller hierarchy, which has more training documents IMind does not require feature set tuning Demonstrating more stable performance
22 IMind successfully controlled the time spent to process each document The time mainly depends on the number of terms in related folders, and that number should converge to a certain limit Time spent for individual documents sequentially added into the larger hierarchy (running on a PC with a 2.6 GHz CPU and 2 GB of RAM)
23 Conclusion Personalized web scanning needs to be guided by the user’s information interest, which is both implicit and evolving IMind is an incremental text mining system to derive precise and comprehensible interest specifications
24 Extension How can the user refine the specifications mined? An intelligent interface to guide the refinement How can the length of the specifications be determined more intelligently? Automatic thresholding Manual setting
25 More related extensions Information Scanning: Autonomous scanning, Adaptive discovery, Adaptive monitoring, & Adaptive elicitation Information Analysis: Exception management, Trend detection, Association detection, Event tracking, & Novelty detection Information/Knowledge Classification & Filtering: Semantic context recognition, Integrated filtering and classification, & Incremental context mining Environmental Information: Partners, Customers, Competitors, Government, & News providers Internal Information: Transaction Data, Knowledge shared, & Information shared Information/Knowledge Delivery: Intelligent information retrieval, Adaptive online guidance, Adaptive dissemination, People finding, Knowledge finding, Knowledge map, & Computer-Assisted Instruction
Thanks