CSE Data Mining, 2002 Lecture 11.1
Data Mining - CSE5230
Web Mining
CSE5230/DMS/2002/11

Lecture Outline
- How big is the web?
- What is “web data”?
- A taxonomy of web mining tasks
- Example: targeted advertising
- Example: personalization
- References

How big is the web?
- It is not easy to determine the size of the web
  - In 1999, one estimate was that there were approximately 350 million web pages, growing at about 1 million pages per day
  - In 2001, Google announced that they were indexing around 3 billion web documents
- No matter which of these is more accurate, it’s very big!
- We can view the web as the world’s biggest database
  - The word “database” is used loosely here, because the web has no real formal structure or database schema
    - This makes the application of data mining to the web potentially very useful, but also difficult

What is “web data”?
- Web data can be classified as follows [Dun2002]:
  - The actual content of web pages (text, images, multimedia)
  - Intrapage structure: the HTML or XML mark-up specifying the organization of the page content
  - Interpage structure: the links into and out of web pages
  - Usage data describing how the users of a web site access pages (navigation patterns)
  - User profiles: these can include demographic data obtained from a registration process, or perhaps IP addresses. They can also include information found in cookies

A taxonomy of web mining tasks (1)
- From [Dun2002], following [Zai1999].

Web Mining
  Web Content Mining
    Web Page Content Mining
    Search Result Mining
  Web Structure Mining
  Web Usage Mining
    General Access Pattern Tracking
    Customized Usage Tracking

A taxonomy of web mining tasks (2)
- Web content mining
  - Examines the contents of web pages (text, graphics)
  - Examines the results of web searches
    - Mining systems built on top of existing search engines
  - Similar to traditional information retrieval (text categorization, text filtering, etc.)
    - Often goes further than simple keyword search, e.g. may cluster similar pages
- Web structure mining
  - Looks at page structure
    - e.g. text inside particular tags may be more important
  - Looks at links between pages
    - e.g. pages with many incoming links may be more useful
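
The link-based idea on this slide (pages with many incoming links may be more useful) can be sketched by counting in-links over a small link graph. This is only a minimal illustration, not a method given in the lecture: the page names and the `links` dictionary are invented for the example, and practical link analysis is considerably more elaborate than a raw count.

```python
from collections import Counter

# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "home.html": ["news.html", "about.html"],
    "news.html": ["about.html", "scores.html"],
    "scores.html": ["about.html", "news.html"],
    "about.html": [],
}

def count_incoming_links(link_graph):
    """Return a Counter mapping each page to its number of incoming links."""
    incoming = Counter({page: 0 for page in link_graph})
    for source, targets in link_graph.items():
        for target in set(targets):   # count each source page at most once
            if target != source:      # ignore self-links
                incoming[target] += 1
    return incoming

in_links = count_incoming_links(links)

# Rank pages by in-link count (highest first), breaking ties by name.
ranked = sorted(in_links.items(), key=lambda item: (-item[1], item[0]))
for page, n in ranked:
    print(page, n)
```

In this toy graph the page linked to by every other page comes out on top, which is the intuition the slide describes.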

A taxonomy of web mining tasks (3)
- Web usage mining
  - Looks at log files of web access
  - General access tracking looks at history of pages visited
  - Customised usage tracking may be focused on particular kinds of usage, or particular users
  - Involves mining of sequential patterns
    - Can use association rule discovery, or HMMs
    - These patterns can be clustered to reveal users with similar access behaviour
  - Can be used to
    - Improve web site design
    - Customize presentation via collaborative filtering
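
As a rough sketch of what mining sequential patterns from access logs can look like, the example below groups log entries into one page sequence per user and keeps the page-to-page transitions that occur in at least a minimum number of sequences. The log format `(user_id, timestamp, page)`, the page names, and the function names are all assumptions made for illustration; this is a simple support count, not the association rule or HMM techniques the slide mentions.

```python
from collections import Counter, defaultdict

# Hypothetical, simplified access log: (user_id, timestamp_seconds, page).
log = [
    ("u1", 0, "/home"), ("u1", 30, "/fixtures"), ("u1", 70, "/tickets"),
    ("u2", 5, "/home"), ("u2", 40, "/fixtures"), ("u2", 90, "/news"),
    ("u3", 10, "/home"), ("u3", 55, "/fixtures"), ("u3", 95, "/tickets"),
]

def build_sessions(entries):
    """Group log entries by user and order each user's pages by time."""
    by_user = defaultdict(list)
    for user, timestamp, page in entries:
        by_user[user].append((timestamp, page))
    return {user: [page for _, page in sorted(visits)]
            for user, visits in by_user.items()}

def frequent_transitions(sessions, min_support):
    """Count consecutive page pairs; keep those seen in >= min_support sessions."""
    support = Counter()
    for pages in sessions.values():
        pairs = set(zip(pages, pages[1:]))  # count each pair once per session
        support.update(pairs)
    return {pair: n for pair, n in support.items() if n >= min_support}

sessions = build_sessions(log)
patterns = frequent_transitions(sessions, min_support=2)
for (src, dst), n in sorted(patterns.items()):
    print(f"{src} -> {dst}: {n} sessions")
```

Frequent transitions of this kind are the sort of evidence that could inform site redesign, for example by linking directly between pages that users often visit in sequence.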

Example: targeted advertising (1)
- In marketing, targeting is any technique used to direct marketing or advertising effort to the portion of the population thought to be most valuable to the business, e.g. those
  - Likely to purchase
  - Likely to spend a lot
- The business wants to avoid spending money on sending advertising to people who will not respond to it
- In the web context, this can mean displaying an ad for a web site on a different web site
- Can use web usage information to work out what kind of people use a site: target demographics
  - Sell advertising to companies wanting to target that demographic

Example: targeted advertising (2)
- For example, the Rugby Heaven web site is today hosting advertising for:
  - MLC life insurance
  - Fintrack Financial Services
  - Business Review Weekly (BRW)
- They appear to think that this site is likely to be popular with older people who have money!
- The URL for the BRW ad is:
  egments=2,13,23,31,35,77,81,88,93,94,153,855,976,993,1145,1301,1989,2320,2389,2394,2396,2477,2534,2576,2581,2689&Targets=535,2389,40,60,1834&Values=25,31,43,48,50,60,72,81,91,100,110,135,150,157,233,239,366,422,605,791,804,805,806,1203,1278,1403,1432,1476,1485,1499&RawValues=&Redirect=
- It is clear that some sophisticated targeting is going on
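
To see how the targeting information is carried in such a URL, the query string can be split into its named parameters. The sketch below uses a shortened, hypothetical query string that borrows the `Targets`, `Values`, `RawValues` and `Redirect` parameter names from the slide; what the numeric codes mean to the ad server is not known, and `parse_targeting` is an illustrative helper, not part of any ad system.

```python
from urllib.parse import parse_qs

# Shortened, hypothetical query string modelled on the slide's ad URL.
query = "Targets=535,2389,40,60,1834&Values=25,31,43&RawValues=&Redirect="

def parse_targeting(query_string):
    """Map each parameter name to its list of integer codes (empty if blank)."""
    # keep_blank_values=True so that empty parameters are not silently dropped.
    params = parse_qs(query_string, keep_blank_values=True)
    return {name: [int(code) for code in values[0].split(",") if code]
            for name, values in params.items()}

targeting = parse_targeting(query)
for name, codes in targeting.items():
    print(name, codes)
```

Each parameter becomes a list of opaque codes, which is consistent with the slide's point that the page is passing a detailed targeting profile to the ad server.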

Example: personalization (1)
- Personalization spans the areas of web content mining and web usage mining
- Personalization aims to modify document contents or access patterns to better match the preferences of a particular user
- Personalization can involve
  - Dynamically creating and serving web pages that are unique to an individual user
  - Determining which pages to retrieve or link to on a user-by-user basis

Example: personalization (2)
- Unlike targeting, personalization is applied to the target web page itself (a targeted advertisement, by contrast, promotes another site)
  - Simple example: including the name of the user in the page content
- Personalization techniques include
  - Use of cookies
  - Use of user databases
  - Use of web usage patterns to identify similar users (for use in collaborative filtering)
- Often requires a user to log in; this part is not data mining

Example: personalization (3)
- A classic example of personalization is recommending to a user
  - A product very similar to something they have bought before (if the web site is selling something)
  - Content that is similar to something they have used before
- Personalization techniques can be based on clustering, classification or even prediction
  - With classification, the desires of a user are determined based on the class to which he/she is assigned. Classes may be predetermined by experts.
  - With clustering, clusters of users with similar navigation or purchasing behaviour are found, and the user’s desires are determined on this basis

Example: personalization (4)
- Amazon.com makes use of personalization, as we will see in an on-line example
- They make use of the user’s own past behaviour
- They also use collaborative filtering: they recommend products bought by users who have similar profiles to the current user
  - Could use clustering, or information filtering techniques
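
The collaborative filtering idea (recommend products bought by users with similar profiles) can be sketched as follows. This is a minimal user-based example under stated assumptions: the purchase histories are invented, similarity is measured with the Jaccard coefficient over purchased items, and the function names are hypothetical. It is not a description of how Amazon.com or any other site actually computes recommendations.

```python
# Hypothetical purchase histories: user -> set of product ids.
purchases = {
    "alice": {"book_a", "book_b", "cd_x"},
    "bob": {"book_a", "book_b", "book_c"},
    "carol": {"cd_x", "cd_y"},
    "dave": {"dvd_q"},
}

def jaccard(a, b):
    """Similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user, histories, top_n=3):
    """Score unseen products by the summed similarity of users who bought them."""
    own = histories[user]
    scores = {}
    for other, items in histories.items():
        if other == user:
            continue
        similarity = jaccard(own, items)
        if similarity == 0.0:
            continue  # users with nothing in common contribute no evidence
        for product in items - own:
            scores[product] = scores.get(product, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return ranked[:top_n]

recommendations = recommend("alice", purchases)
print(recommendations)
```

Products bought by the most similar users rank highest, and users with no overlap in purchase history have no influence on the result.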

References
- [Dun2002] Margaret H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, Upper Saddle River, NJ, USA, 2002, pp
- [Zai1999] Osmar R. Zaïane, Resource and Knowledge Discovery from the Internet and Multimedia Repositories, PhD Thesis, Simon Fraser University, Canada, March 1999.