Download presentation
Presentation is loading. Please wait.
1
Data Mining CSCI 307, Spring 2019 Lecture 3
Fielded Applications Ethics
2
Fielded Applications The result of learning -- or the learning method itself -- is deployed in practical applications These are actual ML problems put into use versus research problems or the toy problems....that we'll look at this semester. Processing loan applications Screening images for oil slicks Electricity supply forecasting Diagnosis of machine faults Marketing and sales (Big $$) Web Mining (PageRank, query content mining, etc.) Separating crude oil and natural gas (rules used in setting parameters and now take 10 minutes rather than a day) Reducing banding in rotogravure printing (reduced artifacts from 500 down to 30) Finding appropriate technicians for telephone faults (saved $10 million) Scientific applications: biology, astronomy, chemistry (analyzing cell structure accelerates drug discovery, cataloging celestial objects, etc.) Automatic selection of TV programs Monitoring intensive care patients
3
Processing Loan Applications
(American Express) Make a decision Decisions involving judgment. Given: Questionnaire with financial and personal information Question: Should money be lent? Simple statistical method covers 90% of cases Borderline cases (the other 10%) referred to loan officers But: 50% of accepted borderline cases defaulted! Solution: Reject all borderline cases? No! Borderline cases are most active customers....
4
Enter Machine Learning
1000 training examples of borderline cases 20 attributes: age years with current employer years at current address years with the bank other credit cards possessed, … Learned Rules: correct on 67% of cases Human Experts: correct only 50% of cases Rules could be used to explain decisions to customers
5
Screening Images Given: radar satellite images of coastal waters
Produce attributes Given: radar satellite images of coastal waters Problem: detect oil slicks in those images Oil slicks appear as dark regions with changing size and shape Not easy: Lookalike dark regions can be caused by weather conditions (e.g. high wind) Expensive process requiring highly trained personnel I am clearly not trained, I have no idea what these are.
6
Enter Machine Learning
Classification scheme is not what is deployed but the learning scheme itself. Enter Machine Learning Extract dark regions from normalized image Attributes: size of region shape, area intensity sharpness and jaggedness of boundaries proximity of other regions info about background Constraints/Problems: Few training examples—oil slicks are rare! Unbalanced data: most dark regions aren’t slicks Regions from same image form a batch—Different batches have different backgrounds Requirement: adjustable false-alarm rate Input: images Output: Smaller images with marked regions Use image processing algorithms to normalize AND produce the attributes. The learning scheme is applied to the attributes.
7
Fifteen years of data, plus "additive" factors.
Load Forecasting Do it faster Electricity supply companies need forecast of future demand for power Forecasts of min/max load for each hour ==> significant savings Given: Manually constructed load model that assumes “normal” climatic conditions Problem: Adjust for weather conditions Static model consists of: base load for the year load periodicity over the year effect of holidays Fifteen years of data, plus "additive" factors.
8
Enter Machine Learning
Prediction corrected using “most similar” days Attributes: temperature humidity wind speed cloud cover readings plus difference between actual load and predicted load Average difference among eight “most similar” days added to static model Linear regression coefficients form attribute weights in similarity function The resulting system yielded the same performance as expert forecaster, but took seconds to compute rather than hours.
9
Diagnosis of Machine Faults
Let the expert help by tweaking the rules Diagnosis of Machine Faults Diagnosis: Classical domain of expert systems Given: Fourier analysis of vibrations measured at various points of a device’s mounting Question: Which fault is present? Preventative maintenance of electromechanical motors and generators Information very noisy So far: diagnosis by expert/hand-crafted rules Not IF there is a fault, but WHAT is the fault?
10
Enter Machine Learning
Available: 600 faults with expert’s diagnosis ~300 unsatisfactory, rest used for training Expert was not satisfied with initial rules because they did not relate to his domain knowledge Further background knowledge resulted in more complex rules that were satisfactory Learned rules outperformed hand-crafted ones
11
Marketing and Sales I Make money Companies precisely record massive amounts of marketing and sales data Applications: Customer loyalty: Identifying customers that are likely to defect by detecting changes in their behavior (e.g. banks/phone companies) Special offers: Identifying profitable customers (e.g. reliable owners of credit cards that need extra money during the holiday season)
12
Marketing and Sales II Market Basket Analysis
Association techniques find groups of items that tend to occur together in a transaction (used to analyze checkout data) Historical analysis of purchasing patterns Identifying prospective customers Focusing promotional mailings (targeted campaigns are cheaper than mass-marketed ones) Recall Beer/Diapers on Thursdays (just a few items) and Saturdays (full shopping)
13
The Data Mining Process
14
Machine Learning and Statistics
Historical difference (grossly oversimplified): Statistics: testing hypotheses Machine learning: finding the right hypothesis But: huge overlap Decision trees Nearest-neighbor methods Today: perspectives have converged Most ML algorithms employ statistical techniques Chapter 4
15
Generalization as Search
Data Mining as search through all possible rule sets: Set of all possible rule sets is very large It's difficult to search the entire set Noise in the data could eliminate all rule sets Biases in searches help address issues Language Bias (How are rule sets described?) Search bias (Finding a good rule rather than the best rule) Overfitting avoidance bias (Simpler trees may be better at generalizing to new examples)
16
Data Mining and Ethics I
Ethical issues arise in practical applications Anonymizing data is difficult 85% of Americans can be identified from just zip code, birth date and sex Data mining often used to discriminate e.g. loan applications: using some information (e.g. sex, religion, race) is unethical Ethical situation depends on application e.g. same information ok in medical application Attributes may contain problematic information e.g. area code may correlate with race Anonymizing is hard. A Massachusetts governor received his health records in the mail after a MA hospital released “anonymized” data of records! Less damaging “problematic data” and the desire to get more data! I got a phone call “qualifying” me to go back to school. The benign questioning turned to the caller saying they had all my information, but they just needed to “verify” it.
17
Data Mining and Ethics II
Important questions: Who is permitted access to the data? For what purpose was the data collected? What kind of conclusions can be legitimately drawn from it? Caveats must be attached to results Purely statistical arguments are never sufficient! People must take results of DM along with their own knowledge and make decisions of how and whether to apply them. -- Fake LinkedIn accounts were created so that data from real accounts could be scraped and used. Also, the fake profiles themselves create an issue of trust in the system itself. ++ To prevent fires before they start, NYC Fire department used predictive models (using building age, vacancy rate, presence of suppression systems, etc.) to help compile a list of potentially-at-risk addresses so the city building inspectors can prioritize their work. (Used to choose buildings at random.) ++ To reduce waits in physician’s office, predict cancellations with more accuracy….with smart scheduling the claim is doctor’s can see lots more patients each week and wait times are reduced.
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.