1
Data Mining
Modified from Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, Jeff Ullman, Stanford University, http://www.mmds.org
3
Data contains value and knowledge
Knowledge discovered from data can be used for competitive advantage.
4
Data Mining ≈ Big Data ≈ Predictive Analytics ≈ Data Science
Also called Knowledge Discovery and Data mining (KDD).
But to extract the knowledge, data needs to be stored, managed, and ANALYZED.
Data mining: this term refers to applying the powerful tools of computer science to solve problems in science, industry, and many other application areas. Frequently, the key to a successful application is building a model of the data, that is, a summary or relatively succinct representation of the most relevant features of the data.
5
What is Data Mining?
Data mining is the extraction of useful patterns from data sources or models of data.
Given lots of data, discover patterns and models that are:
Valid: hold on new data with some certainty
Useful: should be possible to act on the item
Unexpected: non-obvious to the system
Understandable: humans should be able to interpret the pattern
6
Why is data mining important?
CS583, Bing Liu, UIC
Computerization of businesses produces huge amounts of data. How can we make the best use of this data?
Knowledge discovered from data can be used for competitive advantage.
Online businesses generate even larger data sets.
Online retailers (e.g., amazon.com) are largely driven by data mining.
Web search engines are information retrieval (text mining) and data mining companies.
Web Data Mining
7
Why is data mining necessary?
Make use of your data assets.
There is a big gap between stored data and knowledge, and the transition won’t occur automatically.
Many interesting things that one wants to find cannot be found using database queries:
“Find people likely to buy my products”
“Who is likely to respond to my promotion?”
“Which movies should be recommended to each customer?”
8
Why data mining?
The data is abundant (big data)
The computing power is not an issue
Data mining tools are available
The competitive pressure is very strong
Almost every company is doing (or has to do) it
9
Data Mining Tasks
Descriptive methods: find human-interpretable patterns that describe the data
Example: Classification, Clustering
Predictive methods: use some variables to predict unknown or future values of other variables
Example: Recommender systems
10
Data Mining Models and Tasks
Vijay Raghavan, University of Louisiana Lafayette
11
Classic data mining tasks
Classification: mining patterns that can classify future (new) data into known classes (supervised learning)
Association rule mining: finding sets of data items that occur together frequently
Any rule of the form X → Y, where X and Y are sets of data items
Example: Cheese, Milk → Bread [sup = 5%, confid = 80%]
Clustering: identifying a set of similarity groups in the data (unsupervised learning)
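The support and confidence values attached to an association rule can be computed directly from transaction data. The following is a minimal sketch, assuming the usual definitions (support of a rule is the fraction of transactions containing all of its items; confidence is the support of the whole rule divided by the support of its left-hand side). The function names and the toy baskets are hypothetical and were made up for illustration; they do not reproduce the 5% / 80% figures of the slide.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Confidence of the rule X -> Y: support(X and Y together) / support(X)."""
    sup_x = support(x, transactions)
    if sup_x == 0:
        return 0.0
    return support(set(x) | set(y), transactions) / sup_x


# Hypothetical toy baskets, invented for illustration only.
BASKETS = [
    frozenset({"cheese", "milk", "bread"}),
    frozenset({"cheese", "milk", "bread", "eggs"}),
    frozenset({"cheese", "milk"}),
    frozenset({"milk", "bread"}),
    frozenset({"eggs"}),
]

# Rule: Cheese, Milk -> Bread
rule_support = support({"cheese", "milk", "bread"}, BASKETS)
rule_confidence = confidence({"cheese", "milk"}, {"bread"}, BASKETS)
print(f"sup = {rule_support:.0%}, confid = {rule_confidence:.0%}")
```

On these toy baskets the rule holds in 2 of 5 transactions, and in 2 of the 3 transactions that contain both cheese and milk.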
12
Classic data mining tasks (contd)
Sequential pattern mining: sets of data items that occur together frequently in some sequences
A sequential rule A → B says that event A will be immediately followed by event B with a certain confidence
Deviation detection: discovering the most significant changes in data
Data visualization: using graphical methods to show patterns in data
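One way to make the confidence of a sequential rule concrete is to count how often event A is immediately followed by event B. This is only a sketch under an assumed definition (followed occurrences of A divided by all occurrences of A, where an A at the very end of a sequence counts as not followed); the function name and the event logs are hypothetical.

```python
from typing import List, Sequence


def sequential_confidence(a: str, b: str, sequences: List[Sequence[str]]) -> float:
    """Fraction of occurrences of `a` that are immediately followed by `b`."""
    occurrences = 0
    followed = 0
    for seq in sequences:
        for i, event in enumerate(seq):
            if event != a:
                continue
            occurrences += 1
            # An `a` in the last position has no successor and is not counted as followed.
            if i + 1 < len(seq) and seq[i + 1] == b:
                followed += 1
    return followed / occurrences if occurrences else 0.0


# Hypothetical event logs, invented for illustration only.
SEQUENCES = [
    ["login", "search", "buy"],
    ["login", "search", "logout"],
    ["search", "buy", "search"],
]

conf_search_buy = sequential_confidence("search", "buy", SEQUENCES)
print(f"search -> buy: confidence = {conf_search_buy:.0%}")
```

Here "search" occurs four times and is immediately followed by "buy" twice, so the rule search → buy has confidence 50% on this toy data.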
13
Meaningfulness of Analytic Answers
A risk with “data mining” is that an analyst can “discover” patterns that are meaningless.
Statisticians call it Bonferroni’s principle: roughly, if you look in more places for interesting patterns than your amount of data will support, you are bound to find bogus patterns. This is called “data dredging” in statistics.
Bonferroni’s Principle: If we are willing to view as an interesting feature of data something of which many instances can be expected to exist in random data, then we cannot rely on such features being significant. This observation limits our ability to mine data for features that are not sufficiently rare in practice.
14
Meaningfulness of Analytic Answers
Example: We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day
10^9 people being tracked
1,000 days
Each person stays in a hotel 1% of the time (1 day out of 100)
Hotels hold 100 people (so 10^5 hotels)
If everyone behaves randomly (i.e., no terrorists), will the data mining detect anything suspicious?
Expected number of “suspicious” pairs of people: 250,000
That is too many combinations to check; we need some additional evidence to find “suspicious” pairs of people in a more efficient way.
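The figure of roughly 250,000 can be reproduced with a short calculation. The sketch below assumes that people choose days and hotels independently and uniformly at random, and it counts the expected number of (pair of people, pair of days) coincidences; the function and parameter names are hypothetical.

```python
from math import comb


def expected_suspicious_pairs(n_people: int, n_days: int,
                              p_hotel: float, n_hotels: int) -> float:
    """Expected number of (pair of people, pair of days) hotel coincidences."""
    # P(two given people are both in a hotel on a given day and chose the same hotel)
    p_same_hotel_one_day = p_hotel * p_hotel / n_hotels
    # P(the same coincidence happens on two given distinct days), days being independent
    p_two_days = p_same_hotel_one_day ** 2
    # Number of candidate events: pairs of people times pairs of days
    return comb(n_people, 2) * comb(n_days, 2) * p_two_days


expected = expected_suspicious_pairs(
    n_people=10**9,   # people being tracked
    n_days=1000,      # days of observation
    p_hotel=0.01,     # each person is in a hotel 1% of the time
    n_hotels=10**5,   # hotels, each holding 100 people
)
print(f"expected suspicious pairs: about {expected:,.0f}")
```

With exact pair counts the result is just under 250,000; the rounder figure comes from approximating the number of pairs among n items by n^2 / 2. Either way, purely random behaviour already produces hundreds of thousands of "suspicious" pairs.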
15
What matters when dealing with data?
[Figure: three groups of labels. Challenges: Usage, Quality, Context, Streaming, Scalability. Data Modalities: Ontologies, Structured, Networks, Text, Multimedia, Signals. Data Operators: Collect, Prepare, Represent, Model, Reason, Visualize.]
16
Data Mining: Cultures
Data mining overlaps with:
Databases: large-scale data, simple queries
Machine learning: small data, complex models
CS Theory: (randomized) algorithms
Different cultures:
To a DB person, data mining is an extreme form of analytic processing: queries that examine large amounts of data. The result is the query answer.
To an ML person, data mining is the inference of models. The result is the parameters of the model.
[Diagram: Data Mining at the overlap of CS Theory, Machine Learning, and Database systems]
17
Data Mining: Cultures
Data mining overlaps with machine learning, statistics, artificial intelligence, and databases, but puts more stress on:
Scalability (big data)
Algorithms
Computing architectures
Automation for handling large data
[Diagram: Data Mining at the overlap of Statistics, Machine Learning, and Database systems]
18
Data Mining: Related Areas
[Diagram: Data Mining surrounded by related areas]
Related areas: Database Management Systems, Machine Learning, Statistics, Pattern Recognition, Visualization, Artificial Intelligence, Expert Systems
Other areas: Neural Networks, Evolutionary Methods, Information Retrieval, etc.
Diagram annotations: “The ‘parent’ of data mining”; “Sometimes considered a subarea of AI”
19
Data Mining vs. Databases
Data mining is the process of extracting hidden and actionable patterns from data
Database systems store and manage data
Queries return part of the stored data
Queries do not extract hidden patterns
Examples of querying databases:
Find all employees with income more than $250K
Find top spending customers in the last month
Find all students from the engineering college with a GPA above average
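To illustrate what "queries return part of stored data" means, here is a minimal sketch of the first example query using Python's built-in sqlite3 module. The table name, column names, and rows are hypothetical and were invented for illustration.

```python
import sqlite3


def employees_earning_more_than(conn: sqlite3.Connection, threshold: float) -> list:
    """Return the names of employees whose income exceeds `threshold`, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM employees WHERE income > ? ORDER BY name",
        (threshold,),
    ).fetchall()
    return [name for (name,) in rows]


# Hypothetical in-memory table, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, income REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ann", 300_000), ("Bob", 120_000), ("Cara", 260_000)],
)

high_earners = employees_earning_more_than(conn, 250_000)
print(high_earners)
conn.close()
```

The query only filters rows that are already stored. It does not say which employees are likely to buy a product or respond to a promotion; that kind of answer needs a model learned from the data.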
20
Knowledge Discovery and Data mining (KDD) process
Understand the application domain
Identify data sources and select target data
Pre-processing: cleaning, attribute selection, etc.
Data mining to extract patterns or models
Post-processing: identifying interesting or useful patterns/knowledge
Incorporate patterns/knowledge in real-world tasks
21
Knowledge Discovery and Data mining (KDD) process
Note: the order of the initial steps, and what is included in each step, may vary depending on who you talk to.
Patterns must be valid, novel, potentially useful, and understandable.
22
How It All Fits Together
High dim. data: Locality sensitive hashing, Clustering, Dimensionality reduction
Graph data: PageRank, SimRank, Community Detection, Spam Detection
Infinite data: Filtering data streams, Web advertising, Queries on streams
Machine learning: Support vector machines, Decision Trees, Perceptron, kNN
Apps: Recommender systems, Association Rules, Duplicate document detection
23
Examples of Data Mining Applications
Fraud/Spam Detection: identifying fraudulent credit card transactions or spam emails
You are given a user’s purchase history and a new transaction; identify whether the transaction is fraudulent or not
Determine whether a given email is spam or not
Frequent Patterns: extracting purchase patterns from existing records
beer ⇒ diapers (80%)
Beer and chips are not interesting patterns, for example.
Forecasting: forecasting future sales and needs according to some given samples
Finding Like-Minded Individuals: extracting groups of like-minded people in a given network
24
How do you want that data?
I ♥ data