Data Mining Jim King

What is Data Mining?
- A.k.a. knowledge discovery
  - The search for previously unknown relationships in large data sets
- Why?
  - Improved technology allows for vast quantities of data to be gathered
  - Those relationships can perhaps be used to make future decisions and strategies

How do we Data Mine?
- Three considerations to be made
  - Classification
  - Association
  - Sequential

Classification  Generate grouping rules Future data can then be classified quicklyFuture data can then be classified quickly  Example: Disease classification based on symptoms may lead to better treatments

Association  Two conditions occur together PresumptiveObjective  With some probability (confidence) Cond1 => Cond2

Sequential  Event B follows Event A  Ex. In e-commerce, what links do people follow? After following links to a product, how often do they buy?After following links to a product, how often do they buy?

Classification Algorithms  Hard clustering vs. Soft clustering Collection of classes { C1, C2,.. Cn }Collection of classes { C1, C2,.. Cn } Arbitrary Object OArbitrary Object O Soft Clustering: Classes may overlap where an object belongs to multiple classesSoft Clustering: Classes may overlap where an object belongs to multiple classes Hard Clustering: Every object may belong to only one class. No overlapHard Clustering: Every object may belong to only one class. No overlap

Classification  One way: Agglomerative Every object is its own clusterEvery object is its own cluster Find two objects with least distanceFind two objects with least distance Combine into one clusterCombine into one cluster Stop when only one cluster remainsStop when only one cluster remains Returns hierarchy of the clusteringReturns hierarchy of the clustering Need to decide on some distance function

Classification  Another way: Division method Everything initially in one clusterEverything initially in one cluster Split into two clustersSplit into two clusters Split each new cluster into two more clustersSplit each new cluster into two more clusters Stop when can’t divide any moreStop when can’t divide any more Requires more computational power, but usually worse results

Association Algorithms  Given constraints, minimize the criteria need for a condition  Bought cereal & eggs -> Bought milk 80% confidence80% confidence  Bought cereal -> Bought milk 90% confidence90% confidence

Association  Prune conditions which fall below minimum improvement yields simplifications  Other constraints: Minimum confidence ( 30% with A include B)Minimum confidence ( 30% with A include B) Minimum support ( 2% have both A and B)Minimum support ( 2% have both A and B)

Sequential Algorithms  People buy basic camping equipment  Later buy other items related  Starting with basic item sets, try to concatenate and find the resulting set among customer behavior

Sequential  If resulting item set is not supported (at all or above a threshold), drop it  Sequences do not have to be contiguous i.e. A customer buys A then B then C, sequence A then C is validi.e. A customer buys A then B then C, sequence A then C is valid

Case Study - SchulWeb
- Search site for schools in Germany
- How to improve performance and user satisfaction?
- Use the log to track user navigation patterns (i.e., which URLs were requested, and in what order?)
- Extract information from these
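One simple way to extract navigation patterns from such a log is to count how often one page request directly follows another within a session. The log format, session IDs, and URLs below are hypothetical and are not taken from SchulWeb.

```python
from collections import Counter, defaultdict


def transition_counts(log):
    """Count page-to-page transitions per session.

    `log` is an iterable of (session_id, url) pairs in time order (assumed format).
    """
    sessions = defaultdict(list)
    for session_id, url in log:
        sessions[session_id].append(url)
    counts = Counter()
    for urls in sessions.values():
        counts.update(zip(urls, urls[1:]))  # consecutive requests within a session
    return counts


# Hypothetical log entries, interleaved across two sessions.
log = [
    ("s1", "/search"),
    ("s1", "/state"),
    ("s2", "/search"),
    ("s1", "/results"),
    ("s2", "/state"),
    ("s2", "/type"),
]
for (src, dst), n in transition_counts(log).most_common():
    print(src, "->", dst, n)
```

Frequent transitions point to the paths users actually take, which is the kind of evidence the case study interprets on the following slides.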

Interpretations of Mining
- Users don’t like to type text
- Prefer to select from available choices
- What were they looking for?
  - Schools close to some region
  - Used option to specify a state (for location)
  - Used option to specify a school type (to limit search size)

Changes Made  Made “Near Town” Default Made option obvious, people started to useMade option obvious, people started to use Limited region size further, short lists producedLimited region size further, short lists produced Shorter lists less intimidating, more people found what they needShorter lists less intimidating, more people found what they need

Conclusions  Data mining is a useful tool with multiple algorithms that can be tuned for specific tasks  Can benefit business, medicine, science  More efficient algorithms needed to speed up data mining process

Conclusions  Making Data mining easier to use Data with rich descriptions (more fields)Data with rich descriptions (more fields) More Data/RecordsMore Data/Records Controlled/Reliable Data Collection (automated vs. manual)Controlled/Reliable Data Collection (automated vs. manual) Way to evaluate resultsWay to evaluate results Integrate information gained back into systemIntegrate information gained back into system

Final Questions?
www.cs.unr.edu/~king
