
1 Web Usage Mining: Classification. Fang Yao, MEMS 2002, 185029, Humboldt Uni zu Berlin

2 Contents:
- Definition and Usage
- Outputs of Classification
- Methods of Classification
- Application to EDOC
- Discussion on Incomplete Data
- Discussion Questions & Outlook

3 Classification: A Major Data Mining Operation
Given one target attribute (e.g. play), predict its value for new instances by means of the other available attributes.
Example rule: "People with age less than 40 and salary > 40k trade on-line."
Uses: behavior prediction, improving Web design, personalized marketing, ...

4 A Small Example: Weather Data (source: Witten & Frank, table 1.2)
outlook   temperature  humidity  windy  play
sunny     hot          high      false  no
sunny     hot          high      true   no
overcast  hot          high      false  yes
rainy     mild         high      false  yes
rainy     cool         normal    false  yes
rainy     cool         normal    true   no
...
(figure: decision tree with root outlook; overcast -> yes; sunny -> humidity (high -> no, ...); rainy -> windy (false -> ..., true -> ...))

5 Outputs of Classification
Decision tree (figure: root outlook; overcast -> yes; sunny -> humidity; rainy -> windy).
Classification rules:
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
...
A tree can be read off as one rule per root-to-leaf path, as the sketch below illustrates.
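Reading rules off a tree is mechanical. The following is a minimal sketch, assuming a hypothetical nested-dict tree representation of my own (not WEKA's output format): each root-to-leaf path becomes one if-then rule.

```python
# Minimal sketch: extract one if-then rule per root-to-leaf path.
# The nested-dict tree format is a hypothetical representation, not WEKA's.
def tree_to_rules(tree, conditions=()):
    if not isinstance(tree, dict):  # leaf node: emit a finished rule
        cond = " and ".join(f"{a} = {v}" for a, v in conditions) or "always"
        return [f"If {cond} then play = {tree}"]
    (attr, branches), = tree.items()
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + ((attr, value),))
    return rules

weather_tree = {"outlook": {
    "overcast": "yes",
    "sunny": {"humidity": {"high": "no", "normal": "yes"}},
    "rainy": {"windy": {"true": "no", "false": "yes"}},
}}
for rule in tree_to_rules(weather_tree):
    print(rule)  # prints one rule per leaf of the tree above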

6 Methods: divide-and-conquer (constructing decision trees)
Step 1: select the splitting attribute with the highest information gain:
Gain(outlook) = 0.247 bits > Gain(humidity) = 0.152 bits > Gain(windy) = 0.048 bits > Gain(temperature) = 0.029 bits
(figure: the four candidate splits with their yes/no counts per branch)

7 Methods: divide-and-conquer (constructing decision trees)
Calculating information gain:
Gain(outlook) = info([9,5]) - info([4,0],[3,2],[2,3]) = 0.247 bits
where info([4,0],[3,2],[2,3]) = (4/14) info([4,0]) + (5/14) info([3,2]) + (5/14) info([2,3])
is the informational value of creating a branch on "outlook".
(figure: the outlook split with its yes/no counts per branch)

8 Methods: divide-and-conquer (calculating information)
Formula for the information value (entropy):
entropy(p1, p2, ..., pn) = -p1 log p1 - p2 log p2 - ... - pn log pn
Logarithms are expressed in base 2, so the unit is bits; the arguments p1, ..., pn are fractions that add up to 1.
Example: info([2,3]) = entropy(2/5, 3/5) = -(2/5) log(2/5) - (3/5) log(3/5) = 0.971 bits
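As a check on the arithmetic, here is a short sketch (with hypothetical helper names) that reproduces the 0.971 bits above and the Gain(outlook) = 0.247 bits from the previous slide.

```python
# Sketch: entropy over class counts, and the weighted info() of a split.
from math import log2

def entropy(counts):
    """entropy(p1,...,pn) with pi = counts[i] / total; 0*log 0 treated as 0."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info(partitions):
    """Weighted average entropy of the branches created by a split."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * entropy(p) for p in partitions)

print(round(entropy([2, 3]), 3))  # 0.971 bits, as on this slide
gain_outlook = entropy([9, 5]) - info([[4, 0], [3, 2], [2, 3]])
print(round(gain_outlook, 3))     # 0.247 bits, as on slide 7
```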

9 Methods: divide-and-conquer (calculating information)

10 Methods: divide-and-conquer (constructing decision trees)
(figure: partial tree with root outlook; overcast -> yes; sunny -> ?; rainy -> ?)

11 Methods: divide-and-conquer
Step 2: select a daughter attribute for the branch outlook = sunny:
Gain(humidity) = 0.971 bits > Gain(temperature) = 0.571 bits > Gain(windy) = 0.020 bits
Do this recursively!

12 Methods: divide-and-conquer (constructing decision trees)
Final tree: root outlook; overcast -> yes; sunny -> humidity (high -> no, normal -> yes); rainy -> windy (true -> no, false -> yes).
Stop rules:
- stop when all leaf nodes are pure
- stop when no more attributes can be split
The sketch below puts the whole procedure together.
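The whole divide-and-conquer procedure fits in a few lines. This is a sketch of the basic ID3-style algorithm the slides describe (function names are mine, not WEKA's); run on the weather data, it reproduces exactly the tree above.

```python
# Sketch of divide-and-conquer tree construction: split on the attribute
# with the highest information gain, recurse, stop on the two rules above.
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr, target):
    before = entropy([r[target] for r in rows])
    after = 0.0
    for v in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == v]
        after += len(subset) / len(rows) * entropy(subset)
    return before - after

def build_tree(rows, attrs, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:    # stop rule 1: the node is pure
        return labels[0]
    if not attrs:                # stop rule 2: no attribute left to split
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a, target))
    return {best: {v: build_tree([r for r in rows if r[best] == v],
                                 [a for a in attrs if a != best], target)
                   for v in {r[best] for r in rows}}}

COLS = ["outlook", "temperature", "humidity", "windy", "play"]
DATA = [("sunny","hot","high","false","no"), ("sunny","hot","high","true","no"),
        ("overcast","hot","high","false","yes"), ("rainy","mild","high","false","yes"),
        ("rainy","cool","normal","false","yes"), ("rainy","cool","normal","true","no"),
        ("overcast","cool","normal","true","yes"), ("sunny","mild","high","false","no"),
        ("sunny","cool","normal","false","yes"), ("rainy","mild","normal","false","yes"),
        ("sunny","mild","normal","true","yes"), ("overcast","mild","high","true","yes"),
        ("overcast","hot","normal","false","yes"), ("rainy","mild","high","true","no")]
rows = [dict(zip(COLS, r)) for r in DATA]
print(build_tree(rows, COLS[:-1], "play"))
# {'outlook': {'overcast': 'yes', 'sunny': {'humidity': ...}, 'rainy': {'windy': ...}}}
```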

13 Methods: C4.5
Why C4.5? Real-world data is more complicated:
- numeric attributes
- missing values
The final solution needs more operations:
- pruning
- from trees to rules

14 Methods: C4.5
Numeric attributes: binary splits, with numeric thresholds placed halfway between adjacent values (see the sketch after this slide).
Missing values: simply ignoring them loses information; instead, split instances into partial instances across the branches.
Pruning decision trees: subtree replacement and subtree raising.
(figure: subtree raising with nodes A, B, C)
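A hedged sketch of the numeric-attribute handling: candidate thresholds sit halfway between adjacent distinct values, and the best one is picked by information gain. The temperature and play columns are the numeric version of the weather data (Witten & Frank, table 1.3); the function name is mine.

```python
# Sketch: C4.5-style binary split on a numeric attribute.
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_numeric_split(values, labels):
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_t, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold between two equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2  # halfway between the values
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        g = base - (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(pairs)
        if g > best_gain:
            best_t, best_gain = t, g
    return best_t, best_gain

temps = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]
play  = ["no", "no", "yes", "yes", "yes", "no", "yes",
         "no", "yes", "yes", "yes", "yes", "yes", "no"]
print(best_numeric_split(temps, play))  # (threshold, information gain)
```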

15 Application in WEKA

16 Application in WEKA
Data: clickstream from the EDOC log of 30th March
Method: J4.8 algorithm
Objective: prediction of dissertation reading
Attributes (each {1,0}): HOME, AU-START, DSS-LOOKUP, SH-OTHER, OTHER, AUHINWEISE, DSS-RVK, AUTBERATUNG, DSS-ABSTR, HIST-DISS, OT-PUB-READ, OT-CONF, SH-START, SH-DOCSERV, SH-DISS, OT-BOOKS, SH-START-E
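A hedged sketch of how such a run could be driven. The ARFF fragment, file name, the two toy sessions, and the choice of SH-DISS as the class attribute are all assumptions; the C4.5 learner the book calls J4.8 is the class weka.classifiers.trees.J48 in recent WEKA releases (older 3.x releases had it as weka.classifiers.j48.J48).

```python
# Sketch: write a toy ARFF file and run WEKA's J48 (J4.8) on it from Python.
# Requires weka.jar on the class path; attribute subset and data are made up.
import subprocess

ARFF = """@relation edoc-clickstream
@attribute HOME {1,0}
@attribute DSS-LOOKUP {1,0}
@attribute DSS-ABSTR {1,0}
@attribute SH-DISS {1,0}
@data
1,1,1,1
1,0,0,0
"""

with open("edoc.arff", "w") as f:
    f.write(ARFF)

# -t trains on the file and reports a 10-fold cross-validation by default.
subprocess.run(["java", "-cp", "weka.jar",
                "weka.classifiers.trees.J48", "-t", "edoc.arff"])
```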

17 Application in WEKA
Result: DSS-ABSTR (figure: resulting tree)

18 Application in WEKA
Result: DSS-Lookup (figure: resulting tree)

19 Discussion on Incomplete Data
Idea: site-centric data vs. user-centric data. Models built from incomplete (site-centric) data are inferior to those built from complete (user-centric) data.
Example (Padmanabhan, Zheng, and Kimbrough, 2001):
Site-centric data:
User1: Expedia1, Expedia2, Expedia3
User2: Expedia1, Expedia2, Expedia3, Expedia4
User-centric data:
User1: Cheaptickets1, Cheaptickets2, Travelocity1, Travelocity2, Expedia1, Expedia2, Travelocity3, Travelocity4, Expedia3, Cheaptickets3
User2: Expedia1, Expedia2, Expedia3, Expedia4
A small sketch of this filtering follows.
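The relation between the two views is just a filter, which this toy sketch makes concrete using the example above (page labels as in the slide):

```python
# Sketch: site-centric data is the user-centric clickstream restricted to
# one site's pages; here Expedia's log sees User1 and User2 almost alike.
user_centric = {
    "User1": ["Cheaptickets1", "Cheaptickets2", "Travelocity1", "Travelocity2",
              "Expedia1", "Expedia2", "Travelocity3", "Travelocity4",
              "Expedia3", "Cheaptickets3"],
    "User2": ["Expedia1", "Expedia2", "Expedia3", "Expedia4"],
}
site_centric = {user: [p for p in pages if p.startswith("Expedia")]
                for user, pages in user_centric.items()}
print(site_centric)  # {'User1': ['Expedia1', 'Expedia2', 'Expedia3'], ...}
```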

20 Discussion on Incomplete Data
Results: lift curves (source: Padmanabhan, Zheng, and Kimbrough, 2001, figures 6.6-6.9)
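For readers unfamiliar with lift curves, a minimal sketch of how one is computed (scores and labels invented for illustration): rank instances by predicted probability of the positive class, then plot the cumulative share of true positives captured against the fraction of instances targeted.

```python
# Sketch: points of a lift curve from classifier scores and true labels.
def lift_curve(scores, labels):
    ranked = [lab for _, lab in sorted(zip(scores, labels), reverse=True)]
    total_pos, captured, points = sum(ranked), 0, []
    for i, lab in enumerate(ranked, 1):
        captured += lab
        points.append((i / len(ranked), captured / total_pos))
    return points

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]  # invented predictions
labels = [1, 1, 0, 1, 0, 0, 1, 0]                  # 1 = positive class
for frac, cap in lift_curve(scores, labels):
    print(f"target {frac:.0%} of sessions -> capture {cap:.0%} of positives")
```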

21 Discussion Questions & Outlook
- What is the proper target attribute for an analysis of a non-profit site?
- What data would we prefer to have?
- Which improvements could be made to the data?

22 References:
Witten, I.H., & Frank, E. (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Diego, CA: Academic Press. Sections 3.1-3.3, 4.3, 6.1.
Padmanabhan, B., Zheng, Z., & Kimbrough, S. (2001). "Personalization from Incomplete Data: What You Don't Know Can Hurt."
http://www.cs.cmu.edu/~awm/tutorials

