M. Sulaiman Khan Dept. of Computer Science University of Liverpool 2009 COMP527: Data Mining Association Rule Mining March 5, 2009.


1 M. Sulaiman Khan (mskhan@liv.ac.uk), Dept. of Computer Science, University of Liverpool, 2009. COMP527: Data Mining. Association Rule Mining, March 5, 2009. Slide 1.

2 Introduction to the Course; Introduction to Data Mining; Introduction to Text Mining; General Data Mining Issues; Data Warehousing; Classification: Challenges, Basics; Classification: Rules; Classification: Trees; Classification: Trees 2; Classification: Bayes; Classification: Neural Networks; Classification: SVM; Classification: Evaluation; Classification: Evaluation 2; Regression, Prediction; Input Preprocessing; Attribute Selection; Association Rule Mining; ARM: A Priori and Data Structures; ARM: Improvements; ARM: Advanced Techniques; Clustering: Challenges, Basics; Clustering: Improvements; Clustering: Advanced Algorithms; Hybrid Approaches; Graph Mining, Web Mining; Text Mining: Challenges, Basics; Text Mining: Text-as-Data; Text Mining: Text-as-Language; Revision for Exam

3 Today's Topics: Introduction to Association Rule Mining (ARM); General Issues; Support; Confidence; Lift; Conviction; Complexity!; Frequent Itemsets

4 Introduction: We've spent a long time looking at various classification methods, but there's more to data mining than classification. Given a data set with no classes, just attributes, what might we want to do with it? Association Rule Mining: find patterns in the attribute values across instances. Instead of predicting an unknown value, we want to find interesting facts about the relationships between the known values.

5 Introduction: In ARM, these patterns take the form of rules about the co-occurrence of attributes. The easiest example to use is market basket analysis: finding patterns of things that are bought together in a supermarket. Shopping at a supermarket, you typically buy many things together (as opposed to shopping for a television, say), perhaps 30 different items; fewer than 10 is pretty rare. By comparing your shopping habits over time, the supermarket can learn about you and how best to make you spend more money, increasing its profits. It can also compare all shoppers' habits to find general rules, again with the hope of increasing profits.

6 Introduction: Basket1: bread, butter, jam; Basket2: bread, butter; Basket3: bread, butter, milk; Basket4: beer, bread; Basket5: beer, milk. What can we find from this? Some simple statistics: bread occurs in 80% of baskets; butter occurs in 60% of baskets. Less simple: 100% of baskets containing butter also contain bread; 100% of baskets containing butter and jam also contain bread.

7 Finding Rules: Basket1: bread, butter, jam; Basket2: bread, butter; Basket3: bread, butter, milk; Basket4: beer, bread; Basket5: beer, milk. Candidate rules: if (butter, jam) then bread; if butter then bread; if bread then butter. To find rules we find sets of items which occur together. The more frequently they occur together, the better our rule is. There are some particular factors involved in determining the 'goodness' of a rule...

8 Support: Basket1: bread, butter, jam; Basket2: bread, butter; Basket3: bread, butter, milk; Basket4: beer, bread; Basket5: beer, milk. Support is the percentage of baskets in which the item(s) occur: bread 80%, butter 60%, (bread, butter) 60%... So the support for a rule X => Y is the percentage of instances which contain both X and Y.
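As a rough illustration of the support figures on this slide, here is a minimal Python sketch. The names `BASKETS` and `support` are made up for this example and are not from the lecture; support is returned as a fraction rather than a percentage.

```python
# The five example baskets from the slide.
BASKETS = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"beer", "milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for basket in baskets if items <= basket) / len(baskets)


print(support({"bread"}, BASKETS))            # 0.8
print(support({"butter"}, BASKETS))           # 0.6
print(support({"bread", "butter"}, BASKETS))  # 0.6
```

The support of a rule X => Y is then just `support(X | Y, BASKETS)`, since the rule is supported by the baskets that contain both sides.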

9 Confidence: Basket1: bread, butter, jam; Basket2: bread, butter; Basket3: bread, butter, milk; Basket4: beer, bread; Basket5: beer, milk. We also need a confidence for each rule: how strongly we believe that rule to be true. Here, butter => bread is true 100% of the time, but bread => butter holds for only 3 of the 4 baskets that contain bread, so it is true 75% of the time. Confidence for X => Y is the number of instances that contain both X and Y divided by the number of instances that contain X.
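The confidence definition above can be sketched in Python as follows. The helper names `count` and `confidence` are illustrative assumptions; integer counts are used instead of percentages so that the division is exact for this small example.

```python
# The five example baskets from the slide.
BASKETS = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"beer", "milk"},
]


def count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for basket in baskets if items <= basket)


def confidence(x, y, baskets):
    """Confidence of X => Y: count(X and Y) / count(X)."""
    n_x = count(x, baskets)
    if n_x == 0:
        raise ValueError("antecedent never occurs, confidence is undefined")
    return count(set(x) | set(y), baskets) / n_x


print(confidence({"butter"}, {"bread"}, BASKETS))  # 1.0
print(confidence({"bread"}, {"butter"}, BASKETS))  # 0.75
```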

10 Rule Mining: Basket1: bread, butter, jam; Basket2: bread, butter; Basket3: bread, butter, milk; Basket4: beer, bread; Basket5: beer, milk. ARM algorithms have a minimum threshold for both support and confidence, and discard any rules below those thresholds. For example, jam => (butter, bread) has 100% confidence but only 20% support, because jam, butter and bread occur together in only one basket. On the other hand, butter => bread has 60% support and 100% confidence, a much more interesting rule to us.
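A small sketch of the thresholding step, under stated assumptions: the slide does not give threshold values, so `MIN_SUPPORT = 0.5` and `MIN_CONFIDENCE = 0.8` are picked here purely for illustration, and the three candidate rules are the ones discussed on the slides.

```python
# The five example baskets from the slide.
BASKETS = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"beer", "milk"},
]

MIN_SUPPORT = 0.5     # assumed threshold, not from the lecture
MIN_CONFIDENCE = 0.8  # assumed threshold, not from the lecture


def count(itemset, baskets):
    """Number of baskets that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for basket in baskets if items <= basket)


def rule_stats(x, y, baskets):
    """Return (support, confidence) of the rule X => Y."""
    n_both = count(set(x) | set(y), baskets)
    return n_both / len(baskets), n_both / count(x, baskets)


candidate_rules = [
    ({"jam"}, {"butter", "bread"}),
    ({"butter"}, {"bread"}),
    ({"bread"}, {"butter"}),
]

kept = []
for x, y in candidate_rules:
    sup, conf = rule_stats(x, y, BASKETS)
    if sup >= MIN_SUPPORT and conf >= MIN_CONFIDENCE:
        kept.append((x, y))

print(kept)  # only butter => bread survives both thresholds
```

With these particular thresholds, jam => (butter, bread) is dropped for low support and bread => butter is dropped for low confidence.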

11 Lift: Confidence and support are necessary but not sufficient to find interesting rules. Suppose that X => Y has a confidence of 60%: s(X+Y) / s(X) = 0.6. Sure, that looks interesting... there's a correlation between buying X and buying Y. But what if the probability of Y was 70% overall? Then if you buy X, you're less likely than normal to buy Y... certainly not what the rule is implying!

12 Lift: Lift is measured in terms of support: s(X+Y) / (s(X) * s(Y)). This takes into account the likelihood of Y, and it penalises 'obvious' rules where both X and Y are common. For example, bread => milk: if 90% of baskets contain bread and 85% of baskets contain milk, then the lowest support that bread => milk could have is 75%. (In that worst case, 10% of baskets don't contain bread but do contain milk and 15% don't contain milk but do contain bread, so at least 75% must contain both. The maximum is 85%, where all baskets with milk have bread, 5% have just bread and 10% have neither.)
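The 75% and 85% figures in the bread and milk example follow from a simple overlap argument, sketched below. The function name `joint_support_bounds` is made up for this illustration, and supports are given as whole-number percentages to keep the arithmetic exact.

```python
def joint_support_bounds(pct_x, pct_y):
    """Lowest and highest possible support (in %) for X and Y together.

    The two sets of baskets can only avoid each other in the part of the
    data that is left over, so the overlap is at least pct_x + pct_y - 100,
    and it can never exceed the rarer of the two items.
    """
    lowest = max(0, pct_x + pct_y - 100)
    highest = min(pct_x, pct_y)
    return lowest, highest


print(joint_support_bounds(90, 85))  # (75, 85)
```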

13 Lift: s(X+Y) / (s(X) * s(Y)). If the support for X is 0.25, Y is 0.7, and X+Y is 0.15, then we have 0.15 / (0.25 * 0.7) = 0.857. Because this is less than 1, there is a negative correlation. For the bread and milk example: 0.75 / (0.85 * 0.90) = 0.98 --> lift below 1 (negative correlation); 0.85 / (0.85 * 0.90) = 1.111 --> lift above 1 (positive correlation). The break-even point, where the lift is exactly 1, is a joint support of 0.85 * 0.90 = 0.765.
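The worked numbers on this slide can be checked with a few lines of Python. This is only a sketch; the function name `lift` and its argument names are chosen for this example.

```python
def lift(s_xy, s_x, s_y):
    """Lift of X => Y from the joint support and the two single supports."""
    return s_xy / (s_x * s_y)


print(round(lift(0.15, 0.25, 0.70), 3))  # 0.857 (below 1: negative correlation)
print(round(lift(0.75, 0.90, 0.85), 3))  # 0.98  (just below 1)
print(round(lift(0.85, 0.90, 0.85), 3))  # 1.111 (above 1: positive correlation)
```

A lift of exactly 1 means X and Y occur together as often as they would if they were independent, which for the bread and milk supports is a joint support of 0.765.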

14 Conviction: We can also judge a rule purely in terms of the baskets that contain A but not B. “if A then B” implies “not (A and not B)”. So the formula for conviction is: s(A) * s(not B) / s(A and not B). If B occurs in every basket that contains A, the denominator will be 0. Splat. (Treat the conviction as infinite.)
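A minimal sketch of the conviction formula, applied to the five example baskets from the earlier slides. The function name `conviction` is an assumption for this example; the formula is rearranged to use integer counts, and the zero-denominator case returns `math.inf` as the slide suggests.

```python
import math

# The five example baskets from the earlier slides.
BASKETS = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"beer", "milk"},
]


def conviction(a, b, baskets):
    """Conviction of A => B: s(A) * s(not B) / s(A and not B)."""
    a, b = set(a), set(b)
    n = len(baskets)
    n_a = sum(1 for t in baskets if a <= t)
    n_not_b = sum(1 for t in baskets if not b <= t)
    n_a_not_b = sum(1 for t in baskets if a <= t and not b <= t)
    if n_a_not_b == 0:
        # The rule is never violated (this also covers the degenerate case
        # where A never occurs): treat the conviction as infinite.
        return math.inf
    # (n_a/n) * (n_not_b/n) / (n_a_not_b/n) simplifies to the line below.
    return (n_a * n_not_b) / (n * n_a_not_b)


print(conviction({"bread"}, {"butter"}, BASKETS))  # 1.6
print(conviction({"butter"}, {"bread"}, BASKETS))  # inf
```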

15 Other Evaluation Metrics

16 Back to Rule Mining: The most common approach to finding rules is: 1. Find sets of 2 or more attributes that occur together in more instances than a minimum support threshold. 2. Generate rules from those sets. The most important thing to note is that any subset of a frequent itemset is also frequent. If (bread, milk, butter, beer) is frequent, then (bread, butter, beer) is also frequent, because it must occur at least as often as the full set.
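Step 2 above, generating rules from a frequent itemset, can be sketched as splitting the set into every possible antecedent and consequent. This is one straightforward way to do it, not necessarily the one used later in the course; the name `rules_from_itemset` is invented for this example. Each candidate rule would still have to be checked against the confidence threshold.

```python
from itertools import combinations


def rules_from_itemset(itemset):
    """Yield every split of `itemset` into (antecedent, consequent).

    Both sides are non-empty, so a set of k items gives 2**k - 2 rules.
    """
    items = sorted(itemset)
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            yield antecedent, consequent


for x, y in rules_from_itemset({"bread", "butter"}):
    print(x, "=>", y)

print(len(list(rules_from_itemset({"bread", "butter", "jam"}))))  # 6
```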

17 Naïve Approach: No problem. The algorithm is obvious: count every possible itemset that appears in any transaction. If our transactions are BC, BD, AC, BCD, ABD, ABCD, we count AB, AC, AD, BC, BD, CD, ABC, ABD, ACD, BCD, ABCD. Uhh... And what happens when you have as many different items as a supermarket? Say 100,000 different products? Ignoring the empty set and the single-item sets, that's 2^100000 - 100000 - 1 itemsets... You want to know how many that is?
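To make the naïve approach concrete, here is a small sketch that enumerates every itemset of two or more items for the six transactions on this slide and counts each one. The names `TRANSACTIONS`, `naive_counts` and `num_candidate_itemsets` are assumptions for this example. The last lines estimate the size of 2^100000 by its number of decimal digits rather than printing it.

```python
import math
from itertools import combinations

# The six example transactions from the slide.
TRANSACTIONS = [set("BC"), set("BD"), set("AC"), set("BCD"), set("ABD"), set("ABCD")]


def naive_counts(transactions):
    """Count every itemset of size >= 2 over all items, one scan per itemset."""
    items = sorted(set().union(*transactions))
    counts = {}
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            counts["".join(combo)] = sum(1 for t in transactions if set(combo) <= t)
    return counts


def num_candidate_itemsets(n_items):
    """All subsets, minus the single-item sets and the empty set."""
    return 2 ** n_items - n_items - 1


counts = naive_counts(TRANSACTIONS)
print(len(counts), num_candidate_itemsets(4))  # 11 11
print(counts["BD"], counts["ABCD"])            # 4 1

# Number of decimal digits in 2**100000 (subtracting 100001 does not change it).
print(math.floor(100_000 * math.log10(2)) + 1)  # 30103
```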

18 Naïve Approach: BAD!!! [The slide is filled with the decimal digits of that number, beginning 99900209301438450794...; written out in full, 2^100000 runs to 30,103 digits.]

19 Frequent Itemsets: Let's not try to work out the support for all possible combinations. Subsets of frequent itemsets are frequent: all subsets of a set that meets the minimum support will necessarily also meet it. Conversely, if we know a set is infrequent, any superset must also be infrequent. So, instead of trying all combinations, we'll generate itemsets of a particular size and scan the database to see which of them meet the support threshold. Because subsets of frequent sets are frequent and supersets of infrequent sets are infrequent, we don't need to check those.
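The level-by-level idea described above can be sketched as follows. This is only an illustration of the pruning principle under assumptions made for this example (a minimum count of 3, the six transactions from the naïve-approach slide, and invented names such as `frequent_itemsets`); it is not presented as the exact algorithm covered later in the course.

```python
from itertools import combinations

# The six example transactions from the naive-approach slide.
TRANSACTIONS = [frozenset(t) for t in ("BC", "BD", "AC", "BCD", "ABD", "ABCD")]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_count):
    """Grow itemsets one size at a time, pruning supersets of infrequent sets."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items]
    level = [s for s in level if count(s, transactions) >= min_count]
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = count(s, transactions)
        known = set(level)
        # Candidates of size k+1 are unions of two frequent sets of size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune without scanning: every size-k subset must itself be frequent.
        candidates = [c for c in candidates
                      if all(frozenset(sub) in known for sub in combinations(c, k))]
        # Only the surviving candidates are counted against the data.
        level = [c for c in candidates if count(c, transactions) >= min_count]
        k += 1
    return frequent


result = frequent_itemsets(TRANSACTIONS, min_count=3)
print(sorted("".join(sorted(s)) for s in result))  # ['A', 'B', 'BC', 'BD', 'C', 'D']
```

In this run the candidate BCD is never counted: its subset CD is infrequent, so it is discarded before the database scan.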

20 Itemset Lattice: [Figure: itemset lattice, with labels "Infrequent" and "Pruned supersets".] (Lattice borrowed from CSE980 @ MSU)

21 A Priori: The algorithm that does this is called A Priori, and most other ARM techniques are based on it. We will look at it in more detail next week.

22 Issues with WEKA and ARM: ARFF is a horrible, horrible format for ARM. Most datasets are very sparse, with each attribute simply being present or not present: Bread 0/1, Milk 0/1, etc. We want to record this as {bread, milk, cheese}, not as a huge table of 1s and 0s. Weka doesn't include many ARM algorithms... In fact it has three; thankfully one is A Priori. The book doesn't include much information, but Dunham has good coverage. We'll also look at some other ARM applications built by Frans Coenen and Paul Leng here at Liverpool.
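To illustrate the difference between the two representations, here is a small sketch that turns a 0/1 table into item sets. The column order, the rows (the five baskets from the earlier slides) and the name `rows_to_itemsets` are all assumptions for this example; it does not read or write ARFF files.

```python
# Hypothetical 0/1 table: one column per item, one row per basket.
COLUMNS = ["bread", "butter", "jam", "milk", "beer"]
ROWS = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1],
]


def rows_to_itemsets(columns, rows):
    """Keep only the names of the items that are present in each row."""
    return [{name for name, flag in zip(columns, row) if flag} for row in rows]


for basket in rows_to_itemsets(COLUMNS, ROWS):
    print(sorted(basket))
```

With thousands of products and a few dozen items per basket, the set form stores only the items that are present, while the table stores a 0 for everything else.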

23 Further Reading: Witten 4.5; Dunham 6.1, 6.2; Han 5.1; Berry and Browne 14.1-14.3; Berry and Linoff, Chapter 9; Zhang, Association Rule Mining, Chapter 1, 2.1, 2.2; Pal and Mitra 8.3
