1
Web Usage Mining: Classification
Fang Yao, MEMS 2002, 185029
Humboldt Uni zu Berlin
2
Contents:
- Definition and Usages
- Outputs of Classification
- Methods of Classification
- Application to EDOC
- Discussion on Incomplete Data
- Discussion Questions & Outlook
3
Classification: A Major Data Mining Operation
Given one attribute (e.g. play), try to predict its value for new instances by means of the other available attributes, e.g. "people with age less than 40 and salary > 40k trade on-line".
Usages:
- behavior prediction
- improving Web design
- personalized marketing
- ...
4
A Small Example
Weather data (source: Witten & Frank, table 1.2):

outlook    temperature  humidity  windy  play
sunny      hot          high      false  no
sunny      hot          high      true   no
overcast   hot          high      false  yes
rainy      mild         high      false  yes
rainy      cool         normal    false  yes
rainy      cool         normal    true   no
...

[Decision tree: outlook at the root; sunny -> humidity (high: no, ...); overcast -> yes; rainy -> windy (false/true)]
5
Outputs of Classification

Decision Tree:
[decision tree as on the previous slide]

Classification Rules:
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
...
6
Methods: divide-and-conquer (constructing decision trees)

Step 1: select a splitting attribute. The four candidate splits (shown on the slide as small trees with the yes/no class distribution in each branch) are compared by information gain:

Gain(outlook):     0.247 bits
Gain(humidity):    0.152 bits
Gain(windy):       0.048 bits
Gain(temperature): 0.029 bits

Outlook has the largest gain and is selected as the root split.
7
Methods: divide-and-conquer (constructing decision trees)

Calculating the information gain:

Gain(outlook) = info([9,5]) - info([4,0],[3,2],[2,3]) = 0.247 bits

where

info([4,0],[3,2],[2,3]) = (4/14) info([4,0]) + (5/14) info([3,2]) + (5/14) info([2,3])

is the informational value of creating a branch on "outlook" (yes/no counts per branch: overcast [4,0], rainy [3,2], sunny [2,3]).
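Filling in the numbers makes the arithmetic explicit. The intermediate entropy values below follow from the formula on the next slide and are the standard ones for this dataset (Witten & Frank, section 4.3):

```latex
\begin{align*}
\mathrm{info}([9,5]) &= \mathrm{entropy}\left(\tfrac{9}{14},\tfrac{5}{14}\right) \approx 0.940\ \text{bits}\\
\mathrm{info}([4,0],[3,2],[2,3]) &= \tfrac{4}{14}\cdot 0 + \tfrac{5}{14}\cdot 0.971 + \tfrac{5}{14}\cdot 0.971 \approx 0.693\ \text{bits}\\
\mathrm{Gain}(\text{outlook}) &= 0.940 - 0.693 = 0.247\ \text{bits}
\end{align*}
```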
8
Methods: divide-and-conquer (calculating information)

Formula for the information value:

entropy(p_1, p_2, ..., p_n) = -p_1 log p_1 - p_2 log p_2 - ... - p_n log p_n

Logarithms are expressed in base 2; the unit is 'bits'. The arguments p_i are fractions that add up to 1.

Example:
Info([2,3]) = entropy(2/5, 3/5) = -(2/5) log(2/5) - (3/5) log(3/5) ≈ 0.971 bits
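A minimal Java sketch of these two calculations (my own illustration; the method names are not from WEKA). It reproduces the 0.971-bit and 0.247-bit values above:

```java
public class InfoGain {
    /** Information value (entropy) of a class distribution, in bits. */
    static double entropy(double... counts) {
        double total = 0, e = 0;
        for (double c : counts) total += c;
        for (double c : counts)
            if (c > 0) {
                double p = c / total;
                e -= p * Math.log(p) / Math.log(2);  // logarithm in base 2
            }
        return e;
    }

    public static void main(String[] args) {
        System.out.println(entropy(2, 3));  // Info([2,3]) ≈ 0.971 bits

        // Gain(outlook) = info([9,5]) - weighted info of the three branches
        double before = entropy(9, 5);                 // ≈ 0.940 bits
        double after  = 4.0 / 14 * entropy(4, 0)
                      + 5.0 / 14 * entropy(3, 2)
                      + 5.0 / 14 * entropy(2, 3);      // ≈ 0.693 bits
        System.out.println(before - after);            // Gain(outlook) ≈ 0.247 bits
    }
}
```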
9
Methods: divide-and-conquer (calculating information)
10
Methods: divide-and-conquer (constructing decision trees)

[partial tree after step 1: outlook at the root; overcast -> yes; the sunny and rainy branches are still open (??)]
11
Methods: divide-and-conquer

Step 2: select a daughter attribute for the branch outlook = sunny (gains computed on the five sunny instances only):

Gain(humidity):    0.971 bits
Gain(temperature): 0.571 bits
Gain(windy):       0.020 bits

Humidity has the largest gain and becomes the next split. Do this recursively!
12
Methods: divide-and-conquer (constructing decision trees)

[final tree: outlook at the root; sunny -> humidity (high: no, normal: yes); overcast -> yes; rainy -> windy (true: no, false: yes)]

Stop rules:
- stop when all leaf nodes are pure
- stop when there is no more attribute to split on
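The whole divide-and-conquer scheme fits in one recursive function. Below is a minimal, self-contained Java sketch (my own illustration, not WEKA code): each instance is a String array whose last column holds the class label, and the two stop rules from the slide terminate the recursion.

```java
import java.util.*;

/** Minimal sketch of the recursive divide-and-conquer tree builder. */
class TreeSketch {

    static double entropy(Collection<Integer> counts) {
        double total = 0, e = 0;
        for (int c : counts) total += c;
        for (int c : counts)
            if (c > 0) { double p = c / total; e -= p * Math.log(p) / Math.log(2); }
        return e;
    }

    /** Class-label counts; the label sits in the last column of each row. */
    static Map<String, Integer> classCounts(List<String[]> rows) {
        Map<String, Integer> m = new HashMap<>();
        for (String[] r : rows) m.merge(r[r.length - 1], 1, Integer::sum);
        return m;
    }

    /** Information gain of splitting rows on attribute (column) a. */
    static double gain(List<String[]> rows, int a) {
        Map<String, List<String[]>> parts = new HashMap<>();
        for (String[] r : rows) parts.computeIfAbsent(r[a], k -> new ArrayList<>()).add(r);
        double after = 0;
        for (List<String[]> p : parts.values())
            after += (double) p.size() / rows.size() * entropy(classCounts(p).values());
        return entropy(classCounts(rows).values()) - after;
    }

    /** Returns a String leaf (class label) or a Map of branches. */
    static Object build(List<String[]> rows, List<Integer> attrs) {
        Map<String, Integer> counts = classCounts(rows);
        // Stop rules from the slide: node is pure, or no attribute left to split on.
        if (counts.size() == 1 || attrs.isEmpty())
            return Collections.max(counts.entrySet(), Map.Entry.comparingByValue()).getKey();

        int best = attrs.get(0);                    // attribute with maximum gain
        for (int a : attrs) if (gain(rows, a) > gain(rows, best)) best = a;

        Map<String, List<String[]>> parts = new HashMap<>();
        for (String[] r : rows) parts.computeIfAbsent(r[best], k -> new ArrayList<>()).add(r);

        List<Integer> rest = new ArrayList<>(attrs);
        rest.remove(Integer.valueOf(best));         // a used attribute is not reused

        Map<String, Object> node = new LinkedHashMap<>();
        for (Map.Entry<String, List<String[]>> e : parts.entrySet())
            node.put(e.getKey(), build(e.getValue(), rest));  // one branch per value
        return node;
    }
}
```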
13
Methods: C4.5

WHY C4.5?
Real-world data is more complicated:
- numeric attributes
- missing values
The final solution needs more operations:
- pruning
- from trees to rules
14
Methods: C4.5

Numeric attributes: binary splits, with the numeric threshold placed halfway between the observed values.
Missing values:
- simply ignoring them loses information
- instead, split affected instances into partial instances
Pruning the decision tree:
- subtree replacement
- subtree raising

[figure: example trees over nodes A, B, C illustrating subtree replacement and subtree raising]
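To illustrate the numeric-attribute handling, here is a small sketch (my own, not C4.5 source code) that enumerates the candidate thresholds; each candidate t would then be scored like a nominal split into value <= t vs. value > t, using the same information-gain measure:

```java
import java.util.*;

class NumericSplit {
    /** Candidate thresholds: midpoints halfway between adjacent
     *  distinct sorted values of a numeric attribute. */
    static List<Double> candidateThresholds(double[] values) {
        double[] v = values.clone();
        Arrays.sort(v);
        List<Double> thresholds = new ArrayList<>();
        for (int i = 1; i < v.length; i++)
            if (v[i] != v[i - 1])
                thresholds.add((v[i - 1] + v[i]) / 2.0);  // halfway point
        return thresholds;
    }

    public static void main(String[] args) {
        // e.g. some temperatures from a numeric version of the weather data
        System.out.println(candidateThresholds(new double[]{64, 65, 68, 69, 70, 71}));
        // -> [64.5, 66.5, 68.5, 69.5, 70.5]
    }
}
```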
15
Application in WEKA
16
Application in WEKA

Data: clickstream from the EDOC log of 30th March
Method: J4.8 algorithm
Objective: prediction of dissertation reading
Attributes (all binary {1,0}): HOME, AU-START, DSS-LOOKUP, SH-OTHER, OTHER, AUHINWEISE, DSS-RVK, AUTBERATUNG, DSS-ABSTR, HIST-DISS, OT-PUB-READ, OT-CONF, SH-START, SH-DOCSERV, SH-DISS, OT-BOOKS, SH-START-E
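For reference, roughly how the same J4.8/C4.5 tree could be built programmatically with WEKA's Java API rather than the GUI (the DataSource loader is from later WEKA releases, and the ARFF file name is a hypothetical):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.classifiers.trees.J48;

public class EdocTree {
    public static void main(String[] args) throws Exception {
        // Load the clickstream data (hypothetical ARFF file name).
        Instances data = new DataSource("edoc-clickstream.arff").getDataSet();
        // Predict DSS-ABSTR: make it the class attribute.
        data.setClassIndex(data.attribute("DSS-ABSTR").index());

        J48 tree = new J48();             // WEKA's C4.5 implementation
        tree.setConfidenceFactor(0.25f);  // default pruning confidence
        tree.buildClassifier(data);
        System.out.println(tree);         // prints the pruned decision tree
    }
}
```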
17
Application in WEKA
Result: DSS-ABSTR
18
Application in WEKA
DSS-Lookup
19
Discussion on Incomplete Data

Idea: site-centric vs. user-centric data. Models learned from incomplete (site-centric) data are inferior to models learned from complete (user-centric) data.

Example:
Site-centric data:
User1: Expedia1, Expedia2, Expedia3
User2: Expedia1, Expedia2, Expedia3, Expedia4

User-centric data:
User1: Cheaptickets1, Cheaptickets2, Travelocity1, Travelocity2, Expedia1, Expedia2, Travelocity3, Travelocity4, Expedia3, Cheaptickets3
User2: Expedia1, Expedia2, Expedia3, Expedia4

Padmanabhan, B., Zheng, Z., and Kimbrough, S. (2001)
20
Discussion on Incomplete Data

Results: lift curves (source: Padmanabhan, Zheng & Kimbrough (2001), figures 6.6-6.9)
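To make that evaluation concrete, a small sketch of how one point of a lift curve is computed (my own illustration, not code from the paper): rank instances by the model's predicted probability of the target class, take the top fraction, and report the share of all positives captured.

```java
import java.util.*;

class LiftSketch {
    /** Fraction of all positives captured in the top `fraction` of
     *  instances when ranked by predicted probability (descending). */
    static double liftAt(double[] scores, boolean[] positive, double fraction) {
        Integer[] idx = new Integer[scores.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) -> Double.compare(scores[b], scores[a]));

        int totalPositives = 0;
        for (boolean p : positive) if (p) totalPositives++;

        int take = (int) Math.round(fraction * scores.length);
        int hits = 0;
        for (int i = 0; i < take; i++) if (positive[idx[i]]) hits++;
        return (double) hits / totalPositives;
    }
}
```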
21
Discussion Questions & Outlook
- What is the proper target attribute for an analysis of a non-profit site?
- What data would we prefer to have?
- Which improvements could be made to the data?
22
References:
Witten, I. H., & Frank, E. (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Diego, CA: Academic Press. Sections 3.1-3.3, 4.3, 6.1.
Padmanabhan, B., Zheng, Z., & Kimbrough, S. (2001). "Personalization from Incomplete Data: What You Don't Know Can Hurt."
http://www.cs.cmu.edu/~awm/tutorials