DATA MINING Prof. Sin-Min Lee Surya Bhagvat CS 157B – Spring 2006
Making sense out of data As hard drives have become inexpensive, the amount of data that corporations store in databases has increased dramatically. Raw data sitting in a database is of no use unless someone makes sense of it. For example, one could store a decade of customer data, but for that data to become useful one needs to find the patterns in it that reveal customer behavior. Would SQL solve this problem?
Traditional SQL and Analytics Traditional SQL is useful for querying data, and one could argue that SQL is all that is necessary to get the information. This argument holds for small sets of data, but when a query runs against a huge database storing terabytes of data, the performance of SQL degrades. Also, identifying patterns in the data is not always feasible with traditional SQL querying. This is where the field of analytics comes into play.
Analytics Analytics is the identification of patterns in data in order to make better decisions. For example, if you maintain a commercial e-commerce web site, you would want to know your visitors' behavior patterns, such as which search engine they came from and how they go about searching for items on your site. The goal is to identify patterns of customer behavior that can be used later to target a particular customer with promotional offers.
Analytics (Continued…) Google recently made Google Analytics available for free. Right now one needs to sign up for an invitation; once it is accepted, all one needs to do is include the Google Analytics tracking code in the web site, and then one can start monitoring customer behavior.
Transactional Systems Transactional systems store information about day-to-day transactions. For example, retail stores like Safeway record each transaction at the time the purchase is made. Identifying patterns on transactional systems is relatively hard because the data stored in these systems usually runs to terabytes, and a SQL query performed across such a huge database may bring the whole system down. So what's the alternative?
Decision support Systems For decision-making activities, such as determining patterns or running complex SQL queries, a separate database or system is usually maintained; such systems are known as decision support systems. High-level data is pulled out of the transactional systems and stored in these databases, where analytics or data mining techniques are applied. The downside is that the data may not be real time. However, a service could be written that runs in the background and updates the decision support system in real time.
Decision support systems (contd…) Decision support systems can be classified into three kinds: statistical analysis, OLAP (On-line Analytical Processing), and data warehouses. If detailed statistical analysis of data needs to be performed, SQL is very limited and one needs to turn to commercial packages like SAS.
Decision support systems (contd…) OLAP provides very fast access to data. Data from the RDBMS is gathered and placed into multidimensional cubes, which are then made available to users. Cognos PowerPlay is the best-selling OLAP product.
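As a rough illustration of what placing relational data into a multidimensional cube can mean, the sketch below pre-aggregates a handful of sales rows over every combination of three dimensions. The rows, the dimension names (region, product, quarter), and the build_cube helper are all hypothetical, and a real OLAP product stores and indexes its cubes very differently.

```python
from collections import defaultdict

# Hypothetical fact rows pulled from an RDBMS: (region, product, quarter, sales).
FACTS = [
    ("West", "iPod", "Q1", 120),
    ("West", "iPod", "Q2", 150),
    ("West", "Dock", "Q1", 30),
    ("East", "iPod", "Q1", 90),
    ("East", "Dock", "Q2", 45),
]


def build_cube(facts):
    """Pre-aggregate sales for every combination of the three dimensions.

    A dimension that has been rolled up (summed over) is represented by "*".
    """
    cube = defaultdict(int)
    for region, product, quarter, sales in facts:
        for r in (region, "*"):
            for p in (product, "*"):
                for q in (quarter, "*"):
                    cube[(r, p, q)] += sales
    return cube


cube = build_cube(FACTS)
print(cube[("West", "*", "*")])   # total sales in the West region
print(cube[("*", "iPod", "Q1")])  # iPod sales in Q1 across all regions
print(cube[("*", "*", "*")])      # grand total
```

Because every roll-up is computed once in advance, each lookup is a single dictionary access, which is the intuition behind the fast access mentioned above.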
Data warehousing The third kind of decision support system is the data warehouse. Data mining is usually performed on data warehouses. The data in an enterprise is usually stored in various transactional systems or databases. For example, some data might be stored in an Oracle database, other data in DB2 or Teradata, and some simply in text files or Excel files. Combining all this data to look for patterns is very difficult, so the disparate data from the different sources is pulled together to form a data warehouse.
Data warehousing (Contd…) The steps involved in building a data warehouse include: 1) Getting the raw data from the different sources and storing it as is in a temporary staging area. Typically ETL tools are used for this process. 2) Cleansing the data from the temporary staging area and applying various business rules to load it into the actual data warehouse tables.
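The two steps above can be sketched in a few lines. This is only a toy, assuming two hypothetical CSV extracts and made-up business rules (a customer must have a state; duplicate customer ids collapse into one row). Production ETL tools handle far more than this.

```python
import csv
import io

# Hypothetical raw extracts from two sources, kept as is in the staging area.
ORACLE_EXTRACT = "cust_id,name,state\n1, alice ,ca\n2,Bob,NY\n"
TEXT_FILE_EXTRACT = "cust_id,name,state\n2,Bob,NY\n3,carol,\n"


def extract_to_staging(*raw_sources):
    """Step 1: copy raw rows from every source into staging, unchanged."""
    staging = []
    for raw in raw_sources:
        staging.extend(csv.DictReader(io.StringIO(raw)))
    return staging


def cleanse_and_load(staging):
    """Step 2: cleanse staged rows and apply the example business rules."""
    warehouse = {}
    for row in staging:
        state = row["state"].strip().upper()
        if not state:            # assumed rule: a customer must have a state
            continue
        cust_id = int(row["cust_id"])
        warehouse[cust_id] = {   # keyed by id, so duplicate customers collapse
            "name": row["name"].strip().title(),
            "state": state,
        }
    return warehouse


staging = extract_to_staging(ORACLE_EXTRACT, TEXT_FILE_EXTRACT)
warehouse = cleanse_and_load(staging)
print(len(staging), "staged rows;", len(warehouse), "warehouse rows")
```

Keeping the staging step free of any cleanup means the raw extracts can be reloaded and re-cleansed whenever the business rules change.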
Predictive analytics and Data Mining Data mining is about finding patterns in data and is chiefly used for predicting customer behavior. For example, data mining could be used to predict, based on customer complaints, whether a customer is going to move to a competitor. Applications of data mining are varied, ranging from CRM to earthquake prediction.
Predictive analytics and Data Mining Predictive analytics is based on a predictor, a single value. Predictive analytics is used extensively in CRM applications. A predictor for a customer could be the most recent purchase made. For example, if you are calling customers about promotions, then based on this predictor you would call the most recent customers first, followed by customers who purchased items about a month ago.
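A minimal sketch of ranking a call list by the recent-purchase predictor might look like the following. The customer records, the dates, and the call_order function are hypothetical and serve only to show the ordering.

```python
from datetime import date

# Hypothetical customers with the date of their most recent purchase.
customers = [
    {"name": "Ann", "last_purchase": date(2006, 3, 1)},
    {"name": "Raj", "last_purchase": date(2006, 4, 20)},
    {"name": "Lee", "last_purchase": date(2006, 1, 15)},
]


def call_order(customers, today):
    """Rank customers by the predictor: fewest days since last purchase first."""
    def days_since(customer):
        return (today - customer["last_purchase"]).days
    return sorted(customers, key=days_since)


ranked = call_order(customers, today=date(2006, 5, 1))
print([c["name"] for c in ranked])  # ['Raj', 'Ann', 'Lee']
```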
Procedures in Data Mining The key procedures used in data mining include: 1) Association rules 2) Classification 3) Clustering
Association rules Association rules have an associated population, which consists of a set of instances. For example, if one buys an iPod from Amazon.com, the products associated with it are the accessories Amazon displays alongside the iPod, including the Apple iPod Nano Armband Grey, Apple iPod Nano Dock, and Apple iPod Nano Lanyard Headphones. The measures for association rules are support and confidence.
Association rules Support: a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. For example, if very few customers buy both an iPod and a DVD player, the support for the rule iPod=>DVD player is very low. Confidence: a measure of how often the consequent is true when the antecedent is true. For example, the confidence of the rule iPod=>Apple iPod Nano Armband Grey might be, say, 80 percent.
Support and Confidence examples
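As one worked example, both measures can be computed directly from their definitions. The ten baskets below are hypothetical, chosen so that the rule iPod=>Armband reaches the 80 percent confidence used as an illustration earlier. The function names are not from any library.

```python
# Hypothetical purchase baskets; each set is one instance of the population.
baskets = [
    {"iPod", "Armband"},
    {"iPod", "Armband", "Dock"},
    {"iPod", "Armband"},
    {"iPod", "Armband", "Headphones"},
    {"iPod", "Dock"},
    {"DVD player"},
    {"DVD player", "Headphones"},
    {"Armband"},
    {"Dock"},
    {"Headphones"},
]


def support(baskets, antecedent, consequent):
    """Fraction of all baskets containing both antecedent and consequent."""
    both = antecedent | consequent
    return sum(both <= basket for basket in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Fraction of baskets with the antecedent that also hold the consequent."""
    with_antecedent = [b for b in baskets if antecedent <= b]
    if not with_antecedent:
        return 0.0
    hits = sum(consequent <= b for b in with_antecedent)
    return hits / len(with_antecedent)


print(support(baskets, {"iPod"}, {"Armband"}))      # 0.4
print(confidence(baskets, {"iPod"}, {"Armband"}))   # 0.8
print(support(baskets, {"iPod"}, {"DVD player"}))   # 0.0
```

Four of the ten baskets contain both an iPod and an armband (support 0.4), and four of the five iPod baskets contain an armband (confidence 0.8). No basket holds both an iPod and a DVD player, so that rule has zero support.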
Classification The most popular way to classify items is with decision tree classifiers. In the example, suppose the degree is masters and the person's income is 40K. Starting from the root, we follow the edge labeled masters and, from there, the edge labeled 25K to 75K to reach a leaf. The class at the leaf is "good", so we predict that the credit risk of that person is good.
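The traversal described above can be sketched as a lookup through a small hand-built tree. Only the masters / 25K to 75K / "good" path is taken from the example. Every other branch, income boundary, and leaf label below is invented for illustration, and a real classifier would learn the tree from training data rather than hard-code it.

```python
INF = float("inf")

# Root test: degree. Second-level test: income range (low inclusive, high exclusive).
# Only the masters / 25K-to-75K leaf comes from the example; the rest is made up.
TREE = {
    "masters": [
        (0, 25_000, "average"),
        (25_000, 75_000, "good"),
        (75_000, INF, "excellent"),
    ],
    "bachelors": [(0, 50_000, "average"), (50_000, INF, "good")],
    "none": [(0, INF, "bad")],
}


def classify(degree, income):
    """Follow the degree edge from the root, then the income edge, to a leaf."""
    for low, high, label in TREE[degree]:
        if low <= income < high:
            return label
    raise ValueError("no edge covers this income")


print(classify("masters", 40_000))  # good
```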
Clustering Clustering is about grouping similar data into clusters. The degree of association is strong between members of the same cluster and weak between members of different clusters. Clustering is based on distance measures such as Euclidean or probabilistic distance. K-means is one of the best-known clustering algorithms.
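A bare-bones sketch of k-means with Euclidean distance is shown below, under simplifying assumptions: a fixed random seed, a cap on the number of iterations, and six hypothetical 2-D points. It is meant to show the assign-then-recompute loop, not to stand in for a library implementation.

```python
import math
import random

# Hypothetical 2-D points forming two visibly separate groups.
POINTS = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8, 9)]


def kmeans(points, k, iterations=20, seed=0):
    """Cluster points into k groups using Euclidean distance."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # start from k randomly chosen points
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                mean = tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:  # converged: assignments no longer change
            break
        centroids = new_centroids
    return centroids, clusters


centroids, clusters = kmeans(POINTS, k=2)
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

On this data the three points near (1, 1) end up in one cluster and the three near (8, 8) in the other. On less cleanly separated data the result can depend on the initial centroids.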
Resources A. Silberschatz, H.F. Korth, S. Sudarshan, Database System Concepts, 5th Ed., McGraw-Hill.