2
Gio Wiederhold PDM 1 Profiting from Data Mining Gio Wiederhold November 2003
3
Gio Wiederhold PDM 2 Steps needed to profit
1. Obtaining relevant data
  – Always incomplete
2. Extracting relationships
  – Imputing causality
3. Finding applicability
  – Determining leverage points
4. Inventing candidate actions
  – Assessing likely outcomes and benefits
5. Selecting action to be taken
  – Measuring the outcome
Collecting data for next round? Model based
4
Gio Wiederhold PDM 3 Today's Problem: Disjointness
1. Database administrators
  – Focus on data collection, organization, currency
2. Analysts
  – Focus on slicing, dicing, relationships
3. Middle managers
  – Focus on their costs, profits
4. MBAs
  – Focus on business models, planning
5. Executives
  – Must make decisions based on diverse inputs
5
Gio Wiederhold PDM 4 1. Data Collection
Two choices
1. (rare) Collect data specifically for analysis
  – allows careful design -- model causes and effects
    Purchase = f(price, color, size, customer inc., gender, ...)
  – costly
  – often small to make collection manageable
  – imposes delays
2. (common) Use data collected for other purposes
  – take advantage of what is readily available
  – low cost
  – filtering, reformatting, integration
  – incomplete -- rarely covers all causes / effects
  – biased -- missing categories
    only people with phones, cars -- shopping in supermarkets
6
Gio Wiederhold PDM 5 1a. Data Integration
Needed when sources have inadequate coverage
in distinct DBs for
  – Prices, number purchased
  – Customer segments (supermarket, stores, on-line)
implies some expectations
append attributes where keys match: Joe
  include semantic match: Joe = 012 34 567
append rows where key types match: customer
  include semantic match: customer = owner
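The two integration operations on this slide can be sketched in plain Python. This is only an illustration: the sample records, the field names (`number_purchased`, `segment`, `store`) and the helper names `append_attributes` and `append_rows` are assumptions, not part of the presentation. The semantic matches (Joe = 012 34 567, customer = owner) are supplied as explicit lookup tables.

```python
# Hypothetical sample data; field names and values are illustrative assumptions.
purchases = [
    {"customer": "Joe", "item": "rice", "number_purchased": 3},
    {"customer": "Ann", "item": "tea", "number_purchased": 1},
]
segments = [
    {"customer_id": "012 34 567", "segment": "supermarket"},
]
# Semantic match: a name in one source and an ID in the other denote the same person.
same_entity = {"Joe": "012 34 567"}


def append_attributes(rows, other, key, other_key, key_map):
    """Append attributes from `other` to each row whose (mapped) key matches."""
    index = {r[other_key]: r for r in other}
    merged = []
    for row in rows:
        match = index.get(key_map.get(row[key], row[key]))
        extra = {k: v for k, v in match.items() if k != other_key} if match else {}
        merged.append({**row, **extra})
    return merged


def append_rows(tables, column_synonyms):
    """Stack rows from several sources after renaming synonymous key columns."""
    stacked = []
    for table in tables:
        for row in table:
            stacked.append({column_synonyms.get(k, k): v for k, v in row.items()})
    return stacked


# Append attributes where keys match (Joe = 012 34 567).
with_segments = append_attributes(
    purchases, segments, "customer", "customer_id", same_entity
)

# Append rows where key types match (customer = owner).
store_rows = [{"customer": "Joe", "store": "A"}]
online_rows = [{"owner": "Ann", "store": "web"}]
all_rows = append_rows([store_rows, online_rows], {"owner": "customer"})
```

Rows without a match (Ann above) are kept with no added attributes, which is one way the "always incomplete" nature of integrated data shows up in practice.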
7
Gio Wiederhold PDM 6 2. Data analysis
Find relationships
  – already known -- ignore or adjust in next round
    » requires comparison with expert knowledge
    » now have quantification
  – unknown
    » uninteresting per expert
    » interesting per expert
8
Gio Wiederhold PDM 7 3. Establish causality
Already known -- Prior Model
  – But is it complete, i.e., does it explain all effects?
Analyze relationships
  – use expertise to decide direction
    » often obvious: "common world knowledge"
    » sometimes ambiguous: smoking, cancer, not-smoking
    » often major true cause not captured in data
      food color 10%, food price 20%, buyer gender 2%, unknown 75%
      guess: ethnicity, income (purchase of Chinese vs other food)
      invent surrogates: names, ZIP codes; use temporal information
9
Gio Wiederhold PDM 8 Establishing causality is risky
Mined: Volvos have fewer accidents
1. Is a Volvo a safe car?
2. What causes accidents? Drivers!
3. Who buys Volvos? Careful drivers!
4. Must determine effect of safe drivers
  – percentage of safe drivers overall
  – percentage of safe drivers with Volvos
5. How much of the accident rate is now explained?
   The unexplained difference can be attributed to the car.
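The reasoning in steps 4 and 5 can be illustrated with a small stratified comparison. All counts below are invented for illustration, and weighting each driver group by its overall share (direct standardization) is one possible way to separate the driver effect from the car effect, not necessarily the method the author had in mind.

```python
# Hypothetical counts: (car, driver type) -> (number of drivers, number with an accident).
counts = {
    ("volvo", "careful"): (800, 12),
    ("volvo", "risky"): (200, 20),
    ("other", "careful"): (300, 6),
    ("other", "risky"): (700, 70),
}


def rate(car, driver=None):
    """Accident rate for a car type, optionally within one driver group."""
    cells = [v for (c, d), v in counts.items() if c == car and driver in (None, d)]
    return sum(acc for _, acc in cells) / sum(n for n, _ in cells)


def driver_share(driver):
    """Share of one driver group among all drivers, regardless of car."""
    total = sum(n for n, _ in counts.values())
    return sum(n for (car, d), (n, acc) in counts.items() if d == driver) / total


# What was mined: the raw difference in accident rates between car types.
crude_gap = rate("other") - rate("volvo")

# The same difference with the driver mix held fixed at the overall mix.
adjusted_gap = sum(
    driver_share(d) * (rate("other", d) - rate("volvo", d))
    for d in ("careful", "risky")
)

# Part of the mined difference that is explained by who drives the car.
explained_by_drivers = crude_gap - adjusted_gap
```

With these made-up numbers the crude gap is about 4.4 percentage points, but only about 0.3 points remain once the driver mix is held fixed; that remainder is the part that could be attributed to the car.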
10
Gio Wiederhold PDM 9 Change cause, create effects
To use results of data mining, have to understand direction of relationships
[Figure labels: Model; controllable causes; external causes; hidden; captured by data; interesting beneficial effects; side effects]
11
Gio Wiederhold PDM 10 4. Causes provide the leverage
Language of analyst / Language of modeling
Many causes -- independent variables
  – A few may be controllable
  – Some may be controlled by our competition
  – Others are forces-of-nature
Even more effects -- dependent variables
  – A few may be desired
  – Some may be disastrous
  – Many are poorly understood
Intermediate effects
  – Provide a means for measuring effectiveness
  – Allow correction of actions taken
12
Gio Wiederhold PDM 11 5. Planning & Assessment
Analyze Alternatives
  – Current Capabilities
  – Future Expectations
Process tasks:
  – List resources
  – Enumerate alternatives
  – Prune alternatives
  – Compare alternatives
now
Predict the future
13
Gio Wiederhold PDM 12 Prediction Requires Tools E-mail this book, Alfred Knopf, 1997
14
Gio Wiederhold PDM 13 Simulations predict
1. Back-of-the-envelope
  – Common
  – Adequate if model is simple
  – Assumptions are easily forgotten; after some time, not distinguished from data
  – "Why are we doing this?"
2. Spreadsheets
  – Most common computing tool
  – Specialist modeler can help
  – New, recent data can be pasted in
  – Awkward for the tree of future alternatives
3. Constructed to order
  – Costly, powerful technology
  – Specialist modelers required
  – Expressive simulation languages
  – Requires specialists to set up, run, and rerun with new data
15
Gio Wiederhold PDM 14 Simulation results: likelihoods
[Figure: tree of alternatives along a time axis, starting at "now", with next period alternatives and subsequent periods; branches are labeled with likelihoods, and uncertainty increases over time]
16
Gio Wiederhold PDM 15 Simulation services
Wide variety, but common principle: Inputs -> Model -> Output (time, $, place, ...)
1. Spreadsheets
  – Identify independent, controllable, and resulting values
2. Execution specific to query: what-if assessment
  – may require HPC power for adequate response
3. Continuously executing: weather prediction
  – Search for best match (location, time)
4. Past simulation results collected for future use
  – Typically sparse -- the dimension of the futures is too large
  – Tables in a design handbook: materials
  – Perform inter- or extrapolations to match query parameters
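Item 4 (answering a query from sparse, previously stored simulation results) can be sketched as a one-dimensional table lookup with linear interpolation. The table contents, the units, and the function name `lookup` are assumptions made for this example; real handbook tables usually have more dimensions.

```python
from bisect import bisect_left

# Hypothetical stored results: load (kN) -> simulated deflection (mm), sorted by load.
stored = [(10.0, 1.2), (20.0, 2.6), (40.0, 6.0)]


def lookup(x, table):
    """Linearly interpolate at x, or extrapolate from the nearest segment.

    Returns (estimate, extrapolated) so the caller can see when the query
    fell outside the range covered by stored results.
    """
    xs = [point[0] for point in table]
    i = bisect_left(xs, x)
    i = min(max(i, 1), len(table) - 1)  # clamp to a valid segment [i-1, i]
    (x0, y0), (x1, y1) = table[i - 1], table[i]
    estimate = y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    extrapolated = x < xs[0] or x > xs[-1]
    return estimate, extrapolated


inside, inside_flag = lookup(30.0, stored)    # between two stored points
outside, outside_flag = lookup(50.0, stored)  # beyond the last stored point
```

Returning the `extrapolated` flag alongside the estimate is a design choice for this sketch: extrapolated answers deserve less trust than interpolated ones, and the caller should be able to tell them apart.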
17
Gio Wiederhold PDM 16 6. Specify Value of Effects
Still needed: Value of alternative outcomes
Decision maker / owner input
  – Benefits and Costs
  – Potential Profit
  – Correct for risk, and adjust to present value
[Figure: timeline (past, now, futures) with outcome values 1000, 2000, 5000, 10000, -2000, -6000]
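The last bullet can be made concrete with a short calculation. The slide does not say how risk should be corrected for; the sketch below assumes the simplest option, weighting an outcome by its probability, combined with standard discounting PV = V / (1 + r)^t. The function names and the sample numbers are illustrative assumptions.

```python
def present_value(value, years, discount_rate):
    """Discount a future value back to today: V / (1 + r) ** t."""
    return value / (1.0 + discount_rate) ** years


def risk_adjusted_value(value, probability, years, discount_rate):
    """Assumed risk correction: weight the discounted value by its likelihood."""
    return probability * present_value(value, years, discount_rate)


# Hypothetical outcome: 10000 in two years, 10% likely, 10% annual discount rate.
example = risk_adjusted_value(10000, probability=0.1, years=2, discount_rate=0.1)
```

Under these assumptions an outcome that looks large on the chart (10000) is worth roughly 826 today, which is why values and likelihoods have to be considered together.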
18
Gio Wiederhold PDM 17 Having it all together
  – Relationships from analyses of past data
  – Data representing the current state
  – List of actionable alternatives
  – Tree of subsequent alternatives
  – Probabilities of those alternatives
  – Values of the outcomes
  – Ability to predict the likelihood of futures
[Figure: tree of alternatives with branch likelihoods and outcome values 1000, 2000, 5000, 10000, -2000, -6000]
19
Gio Wiederhold PDM 18 Vision: Putting it all together
Combine results mined from past data, current observations, and predictions into the future.
[Figure labels: time; Support specialists; Decision Maker]
20
Gio Wiederhold PDM 19 Needed: Information Systems that also project seamlessly into the Futures
Support of decision-making requires dealing with the futures, as well as the past
  – Databases deal well with the past
  – Streaming sensors supply current status
  – Spreadsheets, simulations deal with the likely futures
Future information systems should combine all these sources
[Figure: timeline (past, now, future)]
21
Gio Wiederhold PDM 20 Connecting it all
Build super systems
  – Coherent, consistent
  – Expensive
  – Unmaintainable
  – Too many cooks:
    » Database folk
    » Data miners
    » Analysts
    » Planners
    » Simulation specialists
    » Decision makers
Develop interfaces
  – Incremental
  – Composable as needed
  – Heterogeneous
  – Interfaces required: Metadata
    » Database to miners: SQL
    » Mined results to analysts: XML?
    » Analysts to planners: ?
    » Planners to simulations? SimQL
    » Decision makers: New tools!
22
Gio Wiederhold PDM 21 Interfaces enable integration
New: SimQL to access simulations
[Figure: timeline (past, now, futures)]
  – Databases and schemas, accessed via SQL or XML
  – Msg systems, sensors: streaming data
  – Simulations, accessed via SimQL and schema-compliant wrappers
23
Gio Wiederhold PDM 22 SimQL proof-of-concept implementation
[Figure: architecture diagram. Components: Parser, Metadata Manager, Query Manager, Schema Manager, Metadata, Wrapped Simulations. Actors: Customer, Developer. Labeled flows: Development Interaction, Production Interaction, Schema Commands, Query, Filing of Access Specs, Use of Access Specs, Initiation and Results of Simulations, Help, Error reports]
24
Gio Wiederhold PDM 23 Demonstration of SimQL
[Figure labels: Test Applications; Business planning spreadsheets; Weather on the Internet; Engineering simulation; Shipping location database; wrapper; Simple GUI; common language requirements]
25
Gio Wiederhold PDM 24 Information system use of simulation results
  – Simulation results are mapped to alternative courses of action
  – Information system should support the model driving the computation and recomputation of likelihoods
  – Likelihoods change as "now" moves forward and eliminates earlier alternatives
[Figure: tree of alternatives along a time axis, with a probability on each branch]
26
Gio Wiederhold PDM 25 The likelihoods multiply out to the end-effects; then their values can be applied to earlier nodes
[Figure: tree from "now" through next period alternatives and subsequent periods along a past/now/future time axis; branches are labeled with probabilities, end-effects with values (1000, 2000, 5000, 10000, -6000, -3000), and earlier nodes with the values propagated back to them]
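The two computations named on this slide can be sketched on a small tree. The tree shape, probabilities, and values below are invented for illustration and do not reproduce the numbers in the original figure; the nested-dict representation is likewise just one convenient assumption.

```python
# A node is either a leaf {"value": v} or {"branches": [(probability, child), ...]}.
# Hypothetical two-period tree; probabilities at each node sum to 1.
tree = {"branches": [
    (0.4, {"branches": [(0.5, {"value": 1000}), (0.5, {"value": 2000})]}),
    (0.6, {"branches": [(0.25, {"value": 5000}), (0.75, {"value": -2000})]}),
]}


def leaf_likelihoods(node, prob=1.0):
    """Multiply branch probabilities along each path down to the end-effects."""
    if "value" in node:
        return [(prob, node["value"])]
    out = []
    for p, child in node["branches"]:
        out.extend(leaf_likelihoods(child, prob * p))
    return out


def expected_value(node):
    """Apply end-effect values to an earlier node, weighted by probability."""
    if "value" in node:
        return node["value"]
    return sum(p * expected_value(child) for p, child in node["branches"])


leaves = leaf_likelihoods(tree)          # [(0.2, 1000), (0.2, 2000), ...]
value_now = expected_value(tree)         # value of the "now" node
value_first_branch = expected_value(tree["branches"][0][1])
```

Both routes agree: summing likelihood times value over the end-effects gives the same number as rolling the values back node by node, which is a useful consistency check on any such tree.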
27
Gio Wiederhold PDM 26 Recomputation is needed at the next time phase
Re-assess as time marches forward!
[Figure: "A Pruned Bush": the tree of alternatives after "now" has moved forward, with eliminated branches removed and remaining node values to be recomputed; inputs come from databases, spreadsheets and other simulations, msgs and sensors; time axis: past, now, future]
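One way to read the pruning step: once the first-period alternative is known, paths that began differently are dropped and the remaining likelihoods are renormalized. The sketch below assumes exactly that conditional-probability interpretation; the path labels, probabilities, and the function name `advance` are hypothetical.

```python
# Hypothetical path likelihoods seen from "now":
# (first-period alternative, second-period outcome) -> probability.
paths = {
    ("A", "a1"): 0.2,
    ("A", "a2"): 0.2,
    ("B", "b1"): 0.15,
    ("B", "b2"): 0.45,
}


def advance(path_probs, observed):
    """Move "now" forward one period.

    Keeps only paths that started with the observed alternative, strips that
    first step, and renormalizes so the remaining likelihoods sum to 1.
    """
    kept = {path[1:]: p for path, p in path_probs.items() if path[0] == observed}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("observed alternative had zero likelihood")
    return {path: p / total for path, p in kept.items()}


after_b = advance(paths, "B")  # {("b1",): 0.25, ("b2",): 0.75} up to rounding
```

In a fuller system the renormalized likelihoods would typically also be refreshed from new data and rerun simulations, since the model inputs themselves change as time passes.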
28
Gio Wiederhold PDM 27 Even the present needs SimQL
Not all data are current:
  – Is the delivery truck in X?
  – Is the right stuff on the truck?
  – Will the crew be at X?
  – Will the forces be ready to accept delivery?
[Figure: timeline (past, now, future) showing the last recorded observations, simple simulations to extrapolate data, and the point-in-time for situational assessment]
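A "simple simulation to extrapolate data" for the first question might be nothing more than dead reckoning from the last recorded observation. The constant-speed model, the units, the tolerance, and the function names below are all assumptions made for this sketch.

```python
def extrapolate_position(last_km, last_time_h, speed_kmh, now_h):
    """Estimate position along a route at `now_h`, assuming constant speed."""
    return last_km + speed_kmh * (now_h - last_time_h)


def is_truck_at(target_km, last_km, last_time_h, speed_kmh, now_h, tolerance_km=5.0):
    """Answer "is the truck at X?" for the assessment time, not the observation time."""
    estimate = extrapolate_position(last_km, last_time_h, speed_kmh, now_h)
    return abs(estimate - target_km) <= tolerance_km


# Hypothetical: last seen at km 120 at 09:00, moving at 60 km/h; X is at km 210.
at_x_at_1030 = is_truck_at(210.0, 120.0, 9.0, 60.0, 10.5)
at_x_at_0930 = is_truck_at(210.0, 120.0, 9.0, 60.0, 9.5)
```

The stored observation alone (km 120) would answer "no" at both times; only the extrapolation to the assessment time gives a useful answer.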
29
Gio Wiederhold PDM 28 Integrative information systems: research questions
  – What human interfaces can support the decision maker?
  – How to move seamlessly from the past to the future?
  – What system interfaces are good now and stay adaptable?
  – How can multiple futures be managed (indexed)?
  – How can multiple futures be compared, selected?
  – How should joint uncertainty be computed?
  – How can the NOW point be moved automatically?
30
Gio Wiederhold PDM 29 SimQL research questions
  – How little of the model needs to be exposed?
  – How can defaults be set rationally?
  – How should expected execution cost be reported?
  – How should uncertainty be reported?
  – Are there differences among application areas that require different language structures?
  – Are there differences among application areas that require different language features?
  – How will the language interface support effective partitioning and distribution?
31
Gio Wiederhold PDM 30 Moving to a Service Paradigm
  – Interfaces define service potentials
  – Server is an independent contractor, defines service
  – Client selects service, and specifies parameters
  – Server's success depends on value provided
  – Some form of payment is due for services x, y
Databases are a current example. Simulations have the same potential.
32
Gio Wiederhold PDM 31 Summary of SimQL
A new service for Decision Making:
follows database paradigm
  – (by about 25 years)
coherence in prediction
  – displacement of ad-hoc practices
seamless information integration
  – single paradigm for decision makers
simulation industry infrastructure
  – investment has a potential market
  – should follow database industry model
Interfaces promote new industries
33
Gio Wiederhold PDM 32 Summary: Today decision making support is disjoint, each community improves its area and ignores others
[Figure: separate communities: Distribution, Databases, Simulation, Planning, Science]
Extensions for network support are also disjoint; they do not interoperate
34
Gio Wiederhold PDM 33 The decision maker has few tools
[Figure labels along a past/now/future timeline: Databases; Data integration (distributed, heterogeneous); Spreadsheets; Planning of allocations; Other simulations; various point assessments; Intuition +; organized support; disjointed support]
35
Gio Wiederhold PDM 34 Coda: Put relevant work together and move on
Support integration of results mined from past data, current observations, and predictions about the futures.
[Figure labels: Databases; Data Mining; Modeling tools; Simulation; Support Services; Service interfaces; Human interfaces; Decision Maker; Information Systems; Real Information]