Published by Allen Brooks. Modified over 9 years ago.
1
Data Mining
Jim King
2
What is Data Mining?
A.k.a. knowledge discovery
  The search for previously unknown relationships in large data sets
Why?
  Improved technology allows vast quantities of data to be gathered
  Those relationships can perhaps be used to make future decisions and strategies
3
How do we Data Mine?
Three considerations to be made
  Classification
  Association
  Sequential
4
Classification
Generate grouping rules
  Future data can then be classified quickly
Example: Disease classification based on symptoms may lead to better treatments
5
Association
Two conditions occur together
  Presumptive
  Objective
With some probability (confidence)
Cond1 => Cond2
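As a rough illustration of a rule Cond1 => Cond2 holding "with some probability", the sketch below estimates confidence as the share of records meeting Cond1 that also meet Cond2. The function name `rule_confidence` and the patient records are invented for this example and do not come from the slides.

```python
def rule_confidence(records, cond1, cond2):
    """Share of records satisfying cond1 that also satisfy cond2."""
    matching = [r for r in records if cond1(r)]
    if not matching:
        return 0.0
    return sum(1 for r in matching if cond2(r)) / len(matching)

# Hypothetical records, invented purely for illustration.
records = [
    {"fever": True, "cough": True},
    {"fever": True, "cough": False},
    {"fever": True, "cough": True},
    {"fever": False, "cough": True},
]

# Confidence of the rule "fever => cough": 2 of the 3 fever records have a cough.
fever_implies_cough = rule_confidence(
    records, lambda r: r["fever"], lambda r: r["cough"]
)
```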
6
Sequential
Event B follows Event A
Ex. In e-commerce, what links do people follow?
  After following links to a product, how often do they buy?
7
Classification Algorithms
Hard clustering vs. soft clustering
  Collection of classes { C1, C2, ..., Cn }
  Arbitrary object O
  Soft clustering: classes may overlap, so an object may belong to multiple classes
  Hard clustering: every object belongs to exactly one class; no overlap
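The difference between hard and soft clustering can be sketched in a few lines. The slides do not say how membership is decided, so this example assumes each class has a centre and that membership depends on the distance from the object O to that centre; the threshold rule for soft membership and all values are hypothetical.

```python
def hard_assign(distances):
    """Hard clustering: return the single class with the smallest distance."""
    return min(distances, key=distances.get)

def soft_assign(distances, threshold):
    """Soft clustering: return every class within `threshold`, so classes may overlap."""
    return {c for c, d in distances.items() if d <= threshold}

# Hypothetical distances from an object O to the centres of classes C1..C3.
distances_for_o = {"C1": 0.9, "C2": 1.1, "C3": 4.0}

hard = hard_assign(distances_for_o)       # exactly one class
soft = soft_assign(distances_for_o, 1.5)  # possibly several classes
```

With these numbers the hard assignment picks only C1, while the soft assignment places O in both C1 and C2.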
8
Classification
One way: Agglomerative
  Every object is its own cluster
  Find two objects with least distance
  Combine into one cluster
  Stop when only one cluster remains
  Returns hierarchy of the clustering
Need to decide on some distance function
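The agglomerative steps can be sketched as follows. This is a minimal, unoptimized version that assumes single linkage (the distance between two clusters is the distance between their closest members), which is one reading of "find two objects with least distance"; the function name and sample points are illustrative.

```python
def agglomerative(points, distance):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [(p,) for p in points]  # every object starts as its own cluster
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage (an assumption): use the closest pair of members.
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

# The distance function is supplied by the caller; here, absolute difference in 1-D.
merges = agglomerative([1.0, 2.0, 6.0, 7.5], lambda a, b: abs(a - b))
```

The returned merge history, read from first to last, is the hierarchy: each entry records which two clusters were combined and at what distance.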
9
Classification
Another way: Division method
  Everything initially in one cluster
  Split into two clusters
  Split each new cluster into two more clusters
  Stop when can’t divide any more
Requires more computational power, yet usually gives worse results
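A toy version of the division method is sketched below. The slides do not specify how a cluster is split, so this example makes a strong simplifying assumption: the data is one-dimensional and each cluster is cut at its widest gap, stopping when that gap falls below a hypothetical `min_gap` parameter or only one value is left.

```python
def divisive(values, min_gap):
    """Recursively split a 1-D cluster at its widest gap; return a nested hierarchy."""
    values = sorted(values)
    if len(values) < 2:
        return values  # a single object cannot be divided
    gaps = [values[k + 1] - values[k] for k in range(len(values) - 1)]
    widest = max(range(len(gaps)), key=gaps.__getitem__)
    if gaps[widest] < min_gap:
        return values  # no gap is wide enough to justify another split
    left, right = values[: widest + 1], values[widest + 1 :]
    return [divisive(left, min_gap), divisive(right, min_gap)]

# Leaves are flat lists of values; every split is a two-element list of subtrees.
tree = divisive([1.0, 2.0, 6.0, 7.5], min_gap=1.2)
```

Real divisive methods must search for a good split in many dimensions, which is where the extra computational cost comes from; the widest-gap rule here only works because the data is sorted along a single axis.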
10
Association Algorithms
Given constraints, minimize the criteria needed for a condition
Bought cereal & eggs -> Bought milk
  80% confidence
Bought cereal -> Bought milk
  90% confidence
11
Association
Pruning conditions that fall below a minimum improvement yields simpler rules
Other constraints:
  Minimum confidence (e.g. at least 30% of records with A also include B)
  Minimum support (e.g. at least 2% of records have both A and B)
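The three thresholds can be combined in a short sketch. It assumes "improvement" means the gain in confidence over the best rule obtained by dropping one condition from the left-hand side; that definition, the function names, the 0.05 improvement threshold, and the basket data are all assumptions made for illustration. The data is built so that the cereal and eggs rules have the 80% and 90% confidences used as examples in the slides.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    return sum(1 for b in baskets if items <= b) / len(baskets)

def confidence(baskets, lhs, rhs):
    """Fraction of baskets containing `lhs` that also contain `rhs`."""
    matching = [b for b in baskets if lhs <= b]
    return sum(1 for b in matching if rhs <= b) / len(matching) if matching else 0.0

def keep_rule(baskets, lhs, rhs, min_support, min_confidence, min_improvement):
    """Keep lhs -> rhs only if it passes all three thresholds."""
    conf = confidence(baskets, lhs, rhs)
    if support(baskets, lhs | rhs) < min_support or conf < min_confidence:
        return False
    # Assumed definition: compare against every rule with one condition removed.
    simpler = [lhs - {item} for item in lhs]
    best_simpler = max(confidence(baskets, s, rhs) for s in simpler)
    return conf - best_simpler >= min_improvement

# Hypothetical baskets: cereal -> milk is 90%, cereal & eggs -> milk is 80%.
baskets = (
    [{"cereal", "eggs", "milk"}] * 4
    + [{"cereal", "eggs"}]
    + [{"cereal", "milk"}] * 5
    + [{"bread"}] * 10
)

keep_simple = keep_rule(baskets, {"cereal"}, {"milk"}, 0.02, 0.30, 0.05)
keep_compound = keep_rule(baskets, {"cereal", "eggs"}, {"milk"}, 0.02, 0.30, 0.05)
```

Here the compound rule is pruned because adding "eggs" lowers confidence relative to the simpler cereal rule, while the cereal rule itself is kept.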
12
Sequential Algorithms
People buy basic camping equipment, and later buy other related items
Starting with basic item sets, try to concatenate them and look for the resulting sequence in customer behavior
13
Sequential
If the resulting item set is not supported (at all, or above a threshold), drop it
Sequences do not have to be contiguous
  i.e. if a customer buys A then B then C, the sequence A then C is valid
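One possible sketch of this concatenate-and-check step is shown below. It treats a sequence as supported by a customer when its items appear in the customer's purchase history in order, though not necessarily next to each other. The function names, the support threshold, and the purchase histories are hypothetical.

```python
def contains_sequence(history, pattern):
    """True if `pattern` occurs in `history` in order, not necessarily contiguously."""
    remaining = iter(history)
    # Each `in` test consumes the iterator up to the match, which preserves order.
    return all(item in remaining for item in pattern)

def sequence_support(histories, pattern):
    """Fraction of customers whose history contains `pattern`."""
    return sum(1 for h in histories if contains_sequence(h, pattern)) / len(histories)

def extend_sequences(histories, frequent, items, min_support):
    """Append one more item to each frequent sequence; drop unsupported results."""
    candidates = [seq + (item,) for seq in frequent for item in items]
    return [c for c in candidates if sequence_support(histories, c) >= min_support]

# Hypothetical purchase histories, one tuple per customer, in time order.
histories = [("A", "B", "C"), ("A", "C"), ("B", "A"), ("C", "A")]

longer = extend_sequences(histories, [("A",)], ["A", "B", "C"], min_support=0.5)
```

With this data only "A then C" survives: it is found in two of the four histories, including the one where B was bought in between.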
14
Case Study - SchulWeb
Search site for schools in Germany
How to improve performance and user satisfaction?
Use the log to track user navigation patterns (i.e. which URLs were requested, and in what order?)
Extract information from these
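To make the idea of mining navigation patterns from a log concrete, here is a small sketch. The slides do not describe SchulWeb's log format or URLs, so the `(user, url)` entries and the paths below are invented; the example assumes the log is already in time order and counts how often one URL directly follows another.

```python
from collections import Counter, defaultdict

def navigation_paths(log):
    """Group (user, url) entries, assumed to be in time order, into one path per user."""
    paths = defaultdict(list)
    for user, url in log:
        paths[user].append(url)
    return dict(paths)

def transition_counts(paths):
    """Count how often each URL was requested immediately after another."""
    counts = Counter()
    for urls in paths.values():
        counts.update(zip(urls, urls[1:]))
    return counts

# Hypothetical log entries; real logs would need parsing and session splitting.
log = [
    ("u1", "/search"), ("u2", "/search"), ("u1", "/state"),
    ("u1", "/results"), ("u2", "/state"), ("u2", "/type"),
]

paths = navigation_paths(log)
counts = transition_counts(paths)
most_common_step = counts.most_common(1)[0]
```

The most frequent transitions hint at which options users reach for first, which is the kind of observation the following slides interpret.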
15
Interpretations of Mining
Users don’t like to type text
Prefer to select from available choices
What were they looking for?
  Schools close to some region
  Used option to specify a state (for location)
  Used option to specify a school type (to limit search size)
16
Changes Made
Made “Near Town” the default
  Made the option obvious, so people started to use it
  Limited region size further, so shorter lists were produced
  Shorter lists are less intimidating, and more people found what they needed
17
Conclusions
Data mining is a useful tool with multiple algorithms that can be tuned for specific tasks
Can benefit business, medicine, science
More efficient algorithms needed to speed up data mining process
18
Conclusions
Making data mining easier to use
  Data with rich descriptions (more fields)
  More data/records
  Controlled/reliable data collection (automated vs. manual)
  Way to evaluate results
  Integrate information gained back into system
19
Final Questions? www.cs.unr.edu/~king