Slide 1: Data Mining Chapter 1
Slide 2: Data vs Information
Society produces huge amounts of data
Potentially valuable resource
Needs to be organized – patterns underlying the data
Slide 3: Data Mining
People have always looked for patterns
Computing power makes this more feasible
Data Mining = extraction of implicit, previously unknown, and potentially useful information from data
… watch a baby learn – drop something, it falls, parents pick it up for them …
Slide 4: Introduction
• Info Hiding Behind Mountains of Data
• Money Can Be Made
• Helps People (Analysts)
• More Practical due to Hardware Advances
• Not a Miracle
Money – a small increase in response rate to a mailing can mean big bucks; a small decrease in default rates on loans can mean big bucks
Helps – HELPS, not automates, decision-making; a tool for business analysts
Practical – disk space is cheaper and more plentiful, computation is faster, and more and more info is available
Not a Miracle – the analyst must understand the business, formulate a specific problem, and deal with problems in the data
Slide 5: Data Mining Has Commercial Value
Example Practical Uses:
Database Marketing
Credit Approval
Fraud Detection (e.g. Credit Card)
NIB (not in book)
Marketing – identify good candidates for mailings, target promotions
Credit approval – determine appropriate criteria for credit approval
Fraud – determine triggers that might indicate a transaction is fraudulent (e.g. too many uses of a card in a short period)
Slide 6: 3 General Techniques
Statistical
Machine Learning
Visualization
NIB
Statistical – e.g. diaper and beer sales are correlated
Machine Learning – my emphasis in the DM class and the book's emphasis; e.g. if a community has a high (relative) rate of unemployment then it has a high crime rate
Visualization – e.g. police in Philadelphia use GIS displays of where cars are stolen and recovered – helps find chop shops and may help plan patrol patterns
Slide 7: Machine Learning
Skip the philosophy
Acquisition of knowledge and the ability to use it
Computer performance improved via indirect programming rather than direct programming
… I'm interested in success with data, not in whether it represents true intelligence … the program is written to tell the computer what to explore, not what the solution is …
Slide 8: Black Box vs Clear Box
Black Box – what the program has learned and how it is using it is incomprehensible to humans
Clear Box – what is learned is understandable
Structural descriptions represent patterns explicitly (e.g. as rules) …
Slide 9: Let's Look at Some Data and Structural Patterns
clraw – Access database with data related to community crime, including socioeconomic and law enforcement data
cl-revised-attribute-list – lists attribute names
NJDOHalldata – Access database with data related to community addictions
njcrimenominal – heavily processed from clraw – ready to run the Weka data mining tools on …
Slide 10: Run Weka
On njcrimenominal –
On my-weather –
Rules – Prism
Tree – ID3
… this actually creates a "decision list" – the order of the rules matters (like in a computer program). (I don't know what game is being played in the book's weather example – it is a famous example, from Quinlan. I have modified it in this example so it fits my sensibilities about doing something outside that qualifies as exercise.)
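For reference, here is a minimal sketch of what those runs look like through Weka's Java API rather than the GUI – assuming the njcrimenominal data has been exported to a hypothetical ARFF file named njcrimenominal.arff, and assuming a Weka version that still bundles the educational Prism and Id3 schemes under weka.classifiers:

    // Build the two learners from the slide on one dataset:
    // Prism (classification rules) and ID3 (decision tree).
    import weka.classifiers.rules.Prism;
    import weka.classifiers.trees.Id3;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RunWeka {
        public static void main(String[] args) throws Exception {
            // Load the (hypothetical) ARFF export of the njcrimenominal dataset.
            Instances data = new DataSource("njcrimenominal.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1); // last attribute = class

            Prism rules = new Prism();   // covering algorithm, learns rules in order
            rules.buildClassifier(data);
            System.out.println(rules);   // prints the learned rule list

            Id3 tree = new Id3();        // basic decision tree, nominal attributes only
            tree.buildClassifier(data);
            System.out.println(tree);    // prints the learned tree
        }
    }

The same two learners can also be selected from the Explorer's Classify tab; printing the classifier object is what shows the rules or tree in readable form.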
Slide 11: Black Box
Some machine learning / data mining methods may be successful but not produce understandable structural descriptions – e.g. neural networks. This book does not focus on them.
Frequently what is desired is a "take home message" for future HUMAN decision making – this requires an understandable result of learning.
Frequently, even if the results of learning may be used "automatically" by a computer program, the human decision to TRUST the automatic program may depend on the human being able to make sense of and evaluate what has been learned.
… unlike the book's contact lens example, clraw and NJDOHalldata are real datasets – some attributes in some examples have missing values – that has been cleaned up in the data we ran (njcrimenominal) …
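One common way to script that kind of cleanup – a sketch only, since the slide doesn't say how njcrimenominal was actually prepared, and clraw.arff is a hypothetical export of the Access database – is Weka's ReplaceMissingValues filter:

    // Replace missing values with the mean (numeric) or mode (nominal) of each attribute.
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.ReplaceMissingValues;

    public class CleanMissing {
        public static void main(String[] args) throws Exception {
            Instances raw = new DataSource("clraw.arff").getDataSet();
            ReplaceMissingValues filter = new ReplaceMissingValues();
            filter.setInputFormat(raw);                       // tell the filter the data format
            Instances cleaned = Filter.useFilter(raw, filter);
            System.out.println(cleaned.numInstances() + " instances, missing values filled in.");
        }
    }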
Slide 12: The Real World
The examples we learn from are NOT complete
Data is NOT complete
Frequently there are ERRORS or mistakes in the data
Slide 13: Lessons from Simple Examples – Weather
Numeric vs "Symbolic" attributes
The nominal/categorical weather data can have at most 3 × 3 × 2 × 2 = 36 possible combinations of values
With numeric values for temperature and humidity, there is a much larger set of possibilities, and the learning problem gets much more difficult – we may need to learn an inequality test (e.g. temp > 80)
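To see why, here is a small illustration (not from the book): with a numeric attribute, a learner typically has to consider a candidate threshold between every pair of adjacent distinct observed values, so the number of possible tests grows with the data instead of being fixed at a handful of nominal values. The temperatures below are made-up values resembling the weather data's temperature column:

    // Enumerate candidate split thresholds for a numeric attribute:
    // one midpoint between each pair of adjacent distinct sorted values.
    import java.util.Arrays;

    public class CandidateThresholds {
        public static void main(String[] args) {
            double[] temps = {64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85};
            Arrays.sort(temps);
            for (int i = 0; i + 1 < temps.length; i++) {
                if (temps[i] != temps[i + 1]) {
                    double threshold = (temps[i] + temps[i + 1]) / 2.0;
                    System.out.println("candidate test: temp > " + threshold);
                }
            }
        }
    }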
Slide 14: Lessons from Simple Examples – Weather
Classification vs Association tasks
In our examples so far, we learned rules / trees to predict the value of the "Play" attribute – it was pre-determined what was to be predicted
A less focused approach would be to look for any rule that can be inferred from the data (e.g. if temp = cool then humidity = normal)
It is possible to generate large sets of association rules; these must be carefully controlled
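A sketch of the less focused association task in Weka's API, using its Apriori implementation on the nominal weather data (weather.nominal.arff ships in Weka's data directory; capping the number of rules is one way to keep the rule set under control):

    // Mine association rules - any attribute may appear in the consequent,
    // not just a pre-determined class like "Play".
    import weka.associations.Apriori;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class MineAssociations {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("weather.nominal.arff").getDataSet();
            Apriori apriori = new Apriori();
            apriori.setNumRules(10);       // cap the output - rule sets explode quickly
            apriori.buildAssociations(data);
            System.out.println(apriori);   // prints rules such as: temperature=cool ==> humidity=normal
        }
    }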
Slide 15: Lessons from Simple Data – Contact Lenses
Decision Rules vs Decision Trees
A decision tree may be a more compact representation of the patterns in the data – see the next two slides
Slide 16: Figure 1.1 Rules for the contact lens data.
Slide 17: Figure 1.2 Decision tree for the contact lens data.
Slide 18: Lessons from Simple Data – CPU Performance
In some cases, what is desired is a numeric prediction rather than a classification (here, prediction of CPU performance) – see Table 1.5 on p. 15
This is more challenging, and some methods used for classification are not appropriate for numeric prediction
Statistical regression is the standard for comparison
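A sketch of that standard of comparison through the Java API – fitting a linear regression to a dataset whose class attribute is numeric (cpu.arff, a version of the CPU performance data, ships in Weka's data directory):

    // Numeric prediction: fit a linear regression as the baseline method.
    import weka.classifiers.functions.LinearRegression;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class PredictNumeric {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("cpu.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);   // class = numeric performance
            LinearRegression lr = new LinearRegression();
            lr.buildClassifier(data);
            System.out.println(lr);   // prints the fitted linear equation over the attributes
        }
    }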
Slide 19: Lessons from Simple Data – Labor Negotiations
As we saw with the crime dataset, realistic datasets sometimes have missing values. This is also seen in the labor negotiations dataset (sketched in Table 1.6 on p. 16)
Also – not all data mining involves thousands of records – generating this 57-record dataset was A LOT OF WORK
Slide 20: Lessons from Simple Data – Labor Negotiations
Training and Testing – a program that uses data to learn (train) should be tested on other data (as an independent judgment)
Overfitting – getting 100% correct on the training data may not always be the best thing to do
We may be adjusting for idiosyncratic aspects of the training data that will not generalize to yet-to-be-seen instances
Assuming the goal is to learn something that will be useful in the future, it is better to avoid "overfitting" what is learned to the training data – see next slide
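A sketch of that discipline in code: shuffle, hold out a third of the instances, train on the rest, and judge accuracy only on the held-out part. The labor.arff file ships with Weka; the J48 learner is my choice for illustration, not the slide's:

    // Train on ~2/3 of the data, test on the remaining ~1/3 it has never seen.
    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TrainAndTest {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("labor.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);
            data.randomize(new Random(42));                 // shuffle before splitting

            int trainSize = (int) Math.round(data.numInstances() * 2.0 / 3.0);
            Instances train = new Instances(data, 0, trainSize);
            Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

            J48 tree = new J48();                           // pruned C4.5-style decision tree
            tree.buildClassifier(train);                    // learn from training data only

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);                 // judge on the unseen test data
            System.out.println(eval.toSummaryString());
        }
    }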
Slide 21: Figure 1.3 Decision trees (a) and (b) for the labor negotiations data.
… the right side more precisely matches the training data – but it is a more complex thing to learn – and if the extra complexity doesn't pay off on future data, then it hurts us. A lot of the time a simpler learned pattern will be better for the future. We usually want to get an indication of how successful we expect learning to be by testing the results on an independent set of test data that was not used during training …
Slide 22: Fielded Applications – Loan Approval (for borderline cases)
20 attributes – age, years with current employer, years at current address, years with bank, other credit …
1000 training cases
Learned rules were 2/3 correct vs 1/2 correct for human decision makers
The rules could be used to explain decisions to rejected applicants
Slide 23: Fielded Applications – Marketing and Sales
E.g. a bank detecting when it might be in danger of losing a customer
E.g. bank-by-phone customers who call at times when response is slow
Market Basket Analysis (supermarkets and …)
Use of "association" techniques to determine groups of items that tend to occur together in transactions
Famous example – diapers and beer
May help in planning store layouts
Send out coupons for one of the items
Give out cash register coupons for one item when the other is purchased
Slide 24: Fielded Applications – Direct Marketing
A better response rate allows fewer mailings to produce the same result, or the same mailings to produce larger sales
May use data from outside the organization too (e.g. socio-economic data based on zip code)
Slide 25: Other Fielded Applications
Oil slick detection from images
Energy use (load) prediction
Diagnosis of machine faults
… oil slicks issue – unbalanced data: most examples are NOT oil slicks … load prediction issue – significant data prep to take into account the cyclical nature of the data … fault diagnosis issue – engineering of attributes: introducing new attributes that relate two (or more) primitive attributes …
Morals – iteration is needed to achieve success; the domain expert's view of the generated rules is important in the adoption of the system.
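On that last issue – attribute engineering – here is a minimal sketch using Weka's AddExpression filter to derive a new attribute from two primitive ones (the file name and the particular ratio a2/a3 are placeholders, not anything from the application described):

    // Attribute engineering: add a derived attribute that relates two primitive ones.
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.AddExpression;

    public class EngineerAttribute {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("machine_faults.arff").getDataSet(); // placeholder file
            AddExpression derive = new AddExpression();
            derive.setExpression("a2/a3");   // new attribute = ratio of attributes 2 and 3
            derive.setName("ratio_2_3");     // name for the derived attribute
            derive.setInputFormat(data);     // options must be set before this call
            Instances extended = Filter.useFilter(data, derive);
            System.out.println(extended.attribute(extended.numAttributes() - 1));
        }
    }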
Slide 26: 1.5 Learning as "Search"
The set of possible rule sets or possible decision trees is too large to consider them all and pick the best
Many learning algorithms follow a "hill climbing" approach to search – try one possibility, then look for small changes that will improve it; stop when no improvement can be found. Such a process IS NOT guaranteed to find the best solution. It is "heuristic" – it uses rules of thumb that are likely to help, but which are not a sure thing. But it is more tractable than an exhaustive brute-force search.
In many learning schemes, search is also "greedy" – decisions, once made, are not retracted
Learning methods have a "bias" – they are more likely to come to some conclusions than to others – for efficiency's sake … this is a very important idea!
… e.g. a heuristic for daily life – if it is sunny out, don't take an umbrella. A heuristic for machine learning, when learning rule sets – first make a rule using the attribute that best divides the records into their categories …
… if you start out building a decision tree by putting a particular attribute at the top of the tree, a greedy algorithm will not go back later and reconsider what should be at the top – generally this is for efficiency reasons – we can't exhaustively search through all possibilities …
… I have come to believe that over-generalizing might be evolutionarily adaptive … kids of a certain preschool age (without necessarily being taught) will start assuming that all past tenses end in 'ed' and all plurals in 's', and will even stop using irregulars that they previously knew (e.g. caught/catched) …
Most of the rest of this section (1.5) is probably above what you can understand right now
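A minimal sketch of that greedy first step for rule/tree learning – score each attribute by how well it divides the records into their categories (information gain, via Weka's InfoGainAttributeEval) and commit to the winner, with no backtracking:

    // Greedy first step of tree/rule building: pick the single attribute that
    // best divides the records into their categories, then never reconsider it.
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class GreedyFirstSplit {
        public static void main(String[] args) throws Exception {
            Instances data = new DataSource("weather.nominal.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            InfoGainAttributeEval gain = new InfoGainAttributeEval();
            gain.buildEvaluator(data);

            int best = -1;
            double bestGain = -1;
            for (int i = 0; i < data.numAttributes(); i++) {
                if (i == data.classIndex()) continue;       // never split on the class itself
                double g = gain.evaluateAttribute(i);
                if (g > bestGain) { bestGain = g; best = i; }
            }
            // Greedy: this choice is final - no backtracking to reconsider the root.
            System.out.println("root attribute: " + data.attribute(best).name()
                               + " (info gain " + bestGain + ")");
        }
    }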
Slide 27: 1.6 Data Mining and Ethics
Discrimination
Privacy policies
Human common sense in using conclusions
Profits vs service
… discrimination can happen even via substitute variables – e.g. zip code instead of race. And how about a black box such as a neural net – can you tell whether it is discriminating?
… people should know how their personal info is being used …
… "The point is that data mining is just a tool in the whole process; it is people who take the results, along with other knowledge, and decide what action to apply" …
… "Should the supermarket manager place the beer and chips near each other, to make it easier for shoppers? Or farther apart, maximizing shoppers' time in the store – leading to more impulse purchases? Should the manager move the most expensive, profitable diapers near the beer and add further luxury baby products nearby?" …
Data, information, knowledge, wisdom
Slide 28: End Chapter 1
Show coming into Weka from the Start menu, opening a file, and visualizing the data