
1 The Next Frontier From a Personal Viewpoint Rakesh Agrawal Microsoft Search Labs Mountain View, CA.


1 1 The Next Frontier From a Personal Viewpoint Rakesh Agrawal Microsoft Search Labs Mountain View, CA

2 2 Central Message Data Mining has made tremendous strides in the last decade. It’s time to take data mining to the next level of contributions. We will need to expand our view of who we are and develop new abstractions, algorithms and systems, inspired by new applications.

3 3 Outline Retrospective on KDD-99 Keynote - “Data Mining: Crossing the Chasm” Developments since then New challenges

4 4 Data Mining: Crossing the Chasm* (Circa 1999) The greatest challenge facing data mining is to make the transition from being an early market technology to mainstream technology. Early Market: Techies: Try it! Visionaries: Get ahead of the herd! [Chasm] Mainstream Market: Pragmatists: Stick with the herd! Conservatives: Hold on! Skeptics: No way! *Geoffrey A Moore. Crossing the Chasm. Harper Business. 1991.

5 5 Backdrop: Quest Experience Started as a skunkworks project at IBM Almaden in the early nineties. Inspired by needs articulated by industry visionaries. New abstractions, technologies. IBM Intelligent Miner (Circa 1996): serious product; fast, scalable, multiple platforms (including SP2). “Early market” successes/awards.

6 6 Imperatives for Chasm Crossing (Circa 1999) Data Mining Standards Data Mining Benchmarks Auto-focus Data Mining Database Integration Web: Greatest Opportunity Personalization Watch for Privacy Pitfall

7 7 Outline Retrospective on KDD-99 Keynote - “Data Mining: Crossing the Chasm” Developments since ‘99 New challenges

8 8 Scorecard (Circa 2006) Data Mining Standards → PMML/CRISP; Data Mining Benchmarks → KDD Cups?; Auto-focus Data Mining → Embedded in Solutions; Database Integration → Commercial Offerings; Web → Under-estimated Importance; Personalization → Nascent; Privacy Pitfall → Privacy-Preserving Data Mining

9 9 PMML: Predictive Model Markup Language Markup language for sharing models between applications (mine rules with one application; use a different application to visualize, analyze, evaluate or otherwise use the discovered rules). … … … <AssociationRule support="1.0" confidence="1.0" antecedent="1" consequent="2" /> …
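
To make the model-sharing idea concrete, here is a minimal sketch of how a second application might read an association rule exported by a first one. Only the AssociationRule element and its attributes come from the slide; the wrapping AssociationModel element and the read_rules helper are assumptions for illustration, and real PMML documents carry a namespace and more structure that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed PMML-like fragment. Only the AssociationRule
# element and its attributes are taken from the slide.
PMML_FRAGMENT = """
<AssociationModel>
  <AssociationRule support="1.0" confidence="1.0" antecedent="1" consequent="2" />
</AssociationModel>
"""

def read_rules(xml_text):
    """Return each AssociationRule as a dict with numeric support/confidence."""
    root = ET.fromstring(xml_text)
    rules = []
    for node in root.iter("AssociationRule"):
        rules.append({
            "antecedent": node.get("antecedent"),
            "consequent": node.get("consequent"),
            "support": float(node.get("support")),
            "confidence": float(node.get("confidence")),
        })
    return rules

rules = read_rules(PMML_FRAGMENT)
```

The consuming application needs no knowledge of how the rule was mined; it only has to agree on the markup.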

10 10 Database Integration Tight coupling through user-defined functions and stored procedures. Use of SQL to express data mining operations. Composability: combine selections and projections. Object-relational extensions enhance performance. Benefits of database query optimization and parallelism carry over. SQL extensions.
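
One way to picture "SQL to express data mining operations" is support counting for item pairs written as a self-join. The sketch below uses Python's built-in sqlite3 module; the sales table, its columns, and the sample rows are assumptions made up for illustration, not part of the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical transaction table: one row per (transaction id, item).
conn.execute("CREATE TABLE sales (tid INTEGER, item TEXT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "milk"), (2, "eggs"),
    (3, "milk"), (3, "eggs"),
])

MIN_SUPPORT = 2
# Self-join on the transaction id; a.item < b.item counts each pair once.
PAIR_QUERY = """
SELECT a.item, b.item, COUNT(*) AS support
FROM sales a JOIN sales b
  ON a.tid = b.tid AND a.item < b.item
GROUP BY a.item, b.item
HAVING COUNT(*) >= ?
ORDER BY a.item, b.item
"""
frequent_pairs = conn.execute(PAIR_QUERY, (MIN_SUPPORT,)).fetchall()
```

Because the counting step is ordinary SQL, it can be composed with selections and projections and is planned by the database's own optimizer.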

11 11 Sovereign Information Integration (Sigmod 03, DIVO 04) Separate databases due to statutory, competitive, or security reasons. Selective, minimal sharing on a need-to-know basis. Example: Among those patients who took a particular drug, how many with a specified DNA sequence had an adverse reaction? Researchers must not learn anything beyond counts. Algorithms for computing joins and join counts while revealing minimal additional information. Minimal Necessary Sharing, with R = {a, u, v, x} and S = {b, u, v, y}: for R ∩ S = {u, v}, R must not know that S has b and y, and S must not know that R has a and x; for Count(R ∩ S), R and S do not learn anything except that the result is 2. [Figure: a Medical Research Inst. querying separate DNA Sequences and Drug Reactions databases.]
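
The join-count example can be sketched with commutative encryption, one technique used for this kind of minimal sharing. Everything below is a toy under stated assumptions: the modulus, the hash-to-group step, and the function names are illustrative choices, both parties run in one process, and nothing here has been vetted as secure. It only shows why double encryption lets two parties count common values without exchanging the values themselves.

```python
import hashlib
import math
import secrets

P = 2**127 - 1  # a Mersenne prime; toy modulus, NOT a vetted security parameter

def make_key():
    """Pick a secret exponent coprime to P-1 so x -> x**key mod P is a bijection."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k

def hash_to_group(value):
    """Map a string to an integer modulo P via SHA-256."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % P

def encrypt(values, key):
    """Exponentiation commutes: (v**a)**b == (v**b)**a (mod P)."""
    return {pow(v, key, P) for v in values}

def intersection_count(r_values, s_values):
    key_r, key_s = make_key(), make_key()
    # Each party hashes and encrypts its own values with its own key.
    r_once = encrypt({hash_to_group(v) for v in r_values}, key_r)
    s_once = encrypt({hash_to_group(v) for v in s_values}, key_s)
    # The encrypted sets are exchanged and encrypted again with the other key.
    r_twice = encrypt(r_once, key_s)
    s_twice = encrypt(s_once, key_r)
    # Equal plaintexts yield equal double encryptions, so only the count leaks.
    return len(r_twice & s_twice)

count = intersection_count({"a", "u", "v", "x"}, {"b", "u", "v", "y"})
```

With the slide's sets, the count is 2, and neither side ever sees the other's unencrypted values.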

12 12 Privacy Preserving Data Mining Preserves privacy at the individual patient level, but allows accurate data mining models to be constructed at the aggregate level. Adds random noise to individual values to protect patient privacy. EM algorithm estimates original distribution of values given randomized values + randomization function. Algorithms for building classification models and discovering association rules on top of privacy-preserved data with only small loss of accuracy. [Figure: patient records (128 | 130 | ..., 126 | 210 | ...) holding values such as Kevin’s LDL, Kevin’s weight, and Julie’s LDL pass through a Randomizer (e.g. 126+35) to give randomized records (161 | 165 | ..., 129 | 190 | ...); the distributions of LDL and of weight are reconstructed and fed to data mining algorithms, which produce the data mining model.] Sigmod00, KDD02, Sigmod05
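
The aggregate-versus-individual point can be illustrated with a much simpler stand-in for the EM reconstruction the slide mentions: plain moment matching. In this sketch the data are synthetic, and the noise width, sample size, and variable names are assumptions chosen for illustration. Each value is perturbed with zero-mean uniform noise, yet the mean and variance of the original values can still be estimated from the randomized values alone.

```python
import random
import statistics

rng = random.Random(0)
NOISE_HALF_WIDTH = 50.0  # noise is uniform on [-50, 50]; an illustrative choice

# Synthetic "true" LDL values; not data from the talk.
true_ldl = [rng.gauss(130.0, 20.0) for _ in range(20000)]

def randomize(value):
    """Perturb one value with zero-mean uniform noise."""
    return value + rng.uniform(-NOISE_HALF_WIDTH, NOISE_HALF_WIDTH)

randomized = [randomize(v) for v in true_ldl]

# The miner knows the noise distribution but never sees true_ldl.
noise_variance = (2 * NOISE_HALF_WIDTH) ** 2 / 12   # variance of uniform noise
est_mean = statistics.fmean(randomized)              # noise has mean zero
est_variance = statistics.variance(randomized) - noise_variance
```

A single randomized value says little about a patient, while the estimates over the whole population stay close to the true statistics. Reconstructing the full distribution, as the slide describes, needs the iterative procedure rather than these two moments.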

13 13 Enterprise Applications Galore! Example: SAS Customer Successes: Customer Relationship Management; Claims Prediction; Credit Scoring; Cross-Sell/Up-Sell; Customer Retention; Marketing Automation; Marketing Optimization; Segmentation Management; Strategic Enrollment Management; Drug Development; Financial Management; Activity-Based Management; Fraud Detection; Human Capital Management; Information Technology Management; Charge Management; Resource Management; Service Level Management; Value Management; Regulatory Compliance; Fair Banking; Performance Management; Balanced Score-carding; Quality Improvement; Risk Management; Supplier Relationship Management; Supply Chain Analysis; Demand Planning; Warranty Analysis; Web Analytics. http://www.sas.com/success/solution.html

14 14 Some Surprises Popular technology visions often overestimate near-term prospects, but they underestimate long-term developments. [Chart: impact of technology plotted against time.] Source: SRI Consulting Business Intelligence. Acknowledgement: Ray Amara

15 15 Identifying Social Links on the Web Input: Crawl of about 1 million pages. Find associations of what names occur together. WebFountain Project, IBM Almaden.

16 16 Discovering Technology Trends from Patent Databases Input: i) patent database ii) shape of interest. Find frequent patterns (e.g., sequence of phrases) in each time period. Use shape-based query language to identify trends over the change in support. Lent et al, “Discovering Trends in Text Databases”, KDD 97.

17 17 Discovering Online Micro-communities Frequently co-cited pages are related. Pages with large bibliographic overlap are related. Japanese elementary schools Turkish student associations Oil spills off the coast of Japan Australian fire brigades Aviation/aircraft vendors Guitar manufacturers R Kumar et al., “Trawling the web for emerging cyber-communities”, WWW 99.

18 18 Template Detection for the Web Templates: Collection of pages that share the same look and feel, and are controlled by a single authority. Templates violate the Hypertext IR principles (relevant linkage, topical unity, lexical affinity) Detection instance of frequent itemset counting problem. Eliminating templates results in “surprising increases in precision at all levels of recall” Bar-Yossef et al. “Template Detection via Data Mining and its Applications”, WWW 02.

19 19 Ranking Search Results in MSN Search results ranked dynamically by a neural net. Ranking function learnt using a gradient descent method. Training data: Some query/document pairs labeled for relevance (excellent, good, etc.). Feature set: query independent features (e.g. static page rank) plus query dependent features extracted from the query combined with additional sources (e.g. anchor text). Best net selected by computing NDCG metric on a validation set. Burges et al. “Learning to rank using gradient descent”, ICML 05.
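
The model-selection step can be sketched as follows. This is a minimal illustration of NDCG using the common gain (2^rel - 1) and log2 discount; the net names and the graded labels (0 to 3) are made-up assumptions, not data from the talk.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of labels listed in ranked order."""
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical validation results: relevance labels of the documents in the
# order each candidate net ranked them for one query.
validation_rankings = {
    "net_a": [3, 2, 0, 1],
    "net_b": [0, 1, 2, 3],
}
scores = {name: ndcg(labels) for name, labels in validation_rankings.items()}
best_net = max(scores, key=scores.get)
```

In practice the score would be averaged over many validation queries before picking a net; a single query is used here to keep the sketch short.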

20 20 Google’s Data Mining Platform MapReduce [1]: Programming model: map(ikey, ival) -> list(okey, tval); reduce(okey, list(tval)) -> list(oval). Automatic parallelization & distribution over 1000s of CPUs. Log mining, index construction, etc. BigTable [2]: Distributed, persistent, multi-level sparse sorted map. Tablets, column families. >400 Bigtable instances. Largest manages >300TB, >10B rows, several thousand machines, millions of ops/sec. Built on top of GFS. [Figure: a sorted-map cell for row cnn.com, column “contents”, with timestamps t3, t11, t17.] [1] Dean et al. “MapReduce: Simplified data processing on large clusters”, OSDI 04. [2] Hsieh. “BigTable: A distributed storage system for structured data”, Sigmod 06.
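
The map/reduce signatures on the slide can be exercised with a single-process sketch. The run_mapreduce driver below is an assumption for illustration only: it groups intermediate pairs in memory, whereas the real system distributes this work across many machines. The log lines are made up.

```python
from collections import defaultdict

def map_fn(ikey, ival):
    """map(ikey, ival) -> list(okey, tval): emit (token, 1) per token in a line."""
    return [(token, 1) for token in ival.split()]

def reduce_fn(okey, tvals):
    """reduce(okey, list(tval)) -> list(oval): sum the counts for one token."""
    return [sum(tvals)]

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver: apply map, group by output key, then apply reduce."""
    groups = defaultdict(list)
    for ikey, ival in records:
        for okey, tval in map_fn(ikey, ival):
            groups[okey].append(tval)
    return {okey: reduce_fn(okey, tvals) for okey, tvals in sorted(groups.items())}

# Hypothetical log lines keyed by a line id.
logs = [("line1", "GET /index GET /about"), ("line2", "GET /index")]
counts = run_mapreduce(logs, map_fn, reduce_fn)
```

The user writes only map_fn and reduce_fn; the grouping step in the middle is what the platform parallelizes.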

21 21 Data Mining Community KDNuggets subscribers grew from 50 in 1993 to 11,000 in 2002. 1500 SIGKDD members. Several “research-oriented” conferences: KDD, ICDM, SIAM DM, PKDD, PAKDD,… Strong attendance in the conferences. Strong ties with database, statistics, machine learning, and other communities. Courtesy Gregory Piatetsky-Shapiro, August 06.

22 22 A snapshot of progress Algorithmic innovations Database integration Foundations Usability Enterprise applications Unanticipated applications

23 23 Have we crossed the chasm? Yes Dorothy! So, what’s next?

24 24 Outline Retrospective on KDD-99 Keynote - “Data Mining: Crossing the Chasm” Developments since ‘99 Challenges

25 25 New Application Focus: From Enterprise to Humane Data Mining “Is it right? Is it just? Is it in the interest of mankind?” Woodrow Wilson Remarks at Suresnes Cemetery, May 30, 1919.

26 26 An expansive view of data mining Deriving value from a data collection by studying and understanding the structure of the constituent data

27 27 Some Ideas Personal data mining Enable people to get a grip on their world Enable people to become creative Enable people to make contributions to society Data-driven science

28 28 Some Ideas Personal data mining Enable people to get a grip on their world Enable people to become creative Enable people to make contributions to society Data-driven science

29 29 Changing nature of disease Leading causes of death in early 20th century: infectious diseases (e.g. tuberculosis, pneumonia, influenza), often exacerbated by public health problems (poor sanitation, inadequate nutrition, etc.) [1]. By the 1950s, infectious diseases greatly diminished; treating acute illness (e.g. heart attacks, strokes) becomes the focus. Proficiency of the current medical system in delivering episodic care turns acute episodes into survivable events [2]. New challenge: chronic conditions: illnesses and impairments expected to last a year or more, limit what one can do and may require ongoing care. In 2005, 133 million Americans lived with a chronic condition (up from 118 million in 1995) [3]. [1] CDC. [2] NIH. [3] Partnership for Solutions.

30 30 Changing nature of disease Leading causes of death in early 20th century: infectious diseases (e.g. tuberculosis, pneumonia, influenza). By the 1950s, infectious diseases greatly diminished because of better public health (sanitation, nutrition, etc.). CDC

31 31 Changing nature of disease Since the 1950s, treating acute illness (e.g. heart attacks, strokes) has become the focus. Proficiency of the current medical system in delivering episodic care has turned acute episodes into survivable events. NIH

32 32 Changing nature of disease New challenge: chronic conditions: illnesses and impairments expected to last a year or more, limit what one can do and may require ongoing care. In 2005, 133 million Americans lived with a chronic condition (up from 118 million in 1995). Partnership for Solutions

33 33 Chronic Conditions [Venn diagram with regions Activity Limitation Only, Both, and Chronic Illness Only; values shown: 33 Million, 5 Million, 104 Million.] Chronic Conditions: Making the case for ongoing care. Partnership for Solutions. Sept. 04.

34 34 Technology Trends Dramatic reduction in the cost and form factor for personal storage Tremendous simplification in the technologies for capturing useful personal information

35 35 Personal Analytics

36 36 Personal Health Analytics

37 37 Personal Data Mining Charts for appropriate demographics? Optimum level for Asian Indians: 150 mg/dL (much lower than 200 mg/dL for Westerners) Distributed computation and selection across millions of nodes Privacy Enas et al. Coronary Artery Disease In Asian Indians. Internet J. Cardiology. 2001.

38 38 The Patient’s Dilemma Partnership for Solutions

39 39 Some Ideas Personal data mining Enable people to get a grip on their world Enable people to become creative Enable people to make contributions to society Data-driven science

40 40 The tyranny of choices Chris Anderson. The Long Tail. 2006. How to find something here?

41 41 Some Ideas Personal data mining Enable people to get a grip on their world Enable people to become creative Enable people to make contributions to society Data-driven science

42 42 Tools to aid creativity Bawden’s four kinds of information to aid creativity: interdisciplinary, peripheral, speculative, exceptions and inconsistencies. Intriguing work of Prof Swanson: linking “non-interacting” literature. A1: Dietary fish oils lead to certain blood and vascular changes. A2: Similar changes benefit patients with Raynaud's syndrome. A1 ∩ A2 = ∅. Corroborated by a clinical test at Albany Medical College. Similarly, magnesium deficiency & migraine (11 factors); corroborated by eight studies. Will we provide the tools? Bawden. “Information systems and the stimulation of creativity”. Information Science 86. Swanson. “Medical literature as a potential source of new knowledge”. Bull Med Libr Assoc. 90. Litlinker@Washington
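
The idea of linking two literatures that share no articles can be sketched as a search for shared intermediate terms. The article ids, the term sets, and the linking_terms helper are hypothetical and chosen only to mirror the fish oil and Raynaud's example; a real tool would extract terms from titles or abstracts and rank the candidates.

```python
def linking_terms(literature_a, literature_c):
    """Return terms common to two literatures that share no articles.

    Each literature maps an article id to the set of terms it mentions.
    """
    shared_articles = set(literature_a) & set(literature_c)
    if shared_articles:
        return set()  # the literatures already interact; nothing hidden to find
    terms_a = set().union(*literature_a.values())
    terms_c = set().union(*literature_c.values())
    return terms_a & terms_c

# Hypothetical miniature literatures (A1 and A2 in the slide's notation).
fish_oil = {
    "A1-1": {"fish oil", "blood viscosity"},
    "A1-2": {"fish oil", "platelet aggregation"},
}
raynaud = {
    "A2-1": {"raynaud", "blood viscosity"},
    "A2-2": {"raynaud", "vasoconstriction"},
}
links = linking_terms(fish_oil, raynaud)
```

Here the two article sets are disjoint, and the shared term is the candidate bridge a researcher would then examine by hand.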

43 43 Some Ideas Personal data mining Enable people to get a grip on their world Enable people to become creative Enable people to make contributions to society Data-driven science

44 44 Education Collaboration Network Low teacher-student ratios; instruction material poor and often out-of-date; poorly trained teachers; high student drop-out rates. A hardware and software infrastructure built on industry standards that empowers teachers, educators, and administrators to collectively create, manage, and access educational material, impart education, and increase their skills: accumulation and re-use of teaching material; distributed, evolutionary content creation; new pedagogy: teacher as discussant; multi-lingual. Teachers are able to find material that helps them understand the subject matter and obtain access to teaching aids that others have found useful. Teachers also enhance the material with their own contributions, which are then available to others on the network. Experts come to the classroom virtually. Improving India’s Education System through Information Technology. IBM Report to the President of India. 2005.

45 45 Enabling Participation More than 3.5 million articles in 75 languages. Fashioned by more than 25,000 writers. 1 million articles in English (80,000 in Encyclopedia Britannica). Inspired by Wikipedia, but multiple viewpoints rather than one consensus version! How to personalize search to find the material suitable for one’s own style of teaching? Management of trust and authoritativeness?

46 46 Power of human participation Theory: When a star went supernova, we would detect neutrinos about three hours before we would see the burst in the visible spectrum. Supernova 1987A: Exploded at the edge of Tarantula Nebula 168,000 years earlier. The underground Kamiokande observatory in Japan detected twenty-four neutrinos in a burst lasting 13 secs on Feb 23, 1987 at 7:35 UT. Ian Shelton observed the bright light with his naked eyes at 10:00 UT in the Chilean Andes. Albert Jones in New Zealand did not see anything unusual at the Tarantula Nebula at 9:30 UT. Robert McNaught photographed the explosion at 10:30 UT in Australia. Thus a key theory explaining how the universe works was confirmed thanks to two amateurs in Australia and New Zealand, an amateur trying to turn pro in Chile, and professional physicists in U.S. and Japan. What’s the general platform for participation? Chris Anderson. The Long Tail. 2006.

47 47 Some Ideas Personal data mining Enable people to get a grip on their world Enable people to become creative Enable people to make contributions to society Data-driven science

48 48 Science Paradigms Thousand years ago: science was empirical, describing natural phenomena. Last few hundred years: theoretical branch, using models, generalizations. Last few decades: a computational branch, simulating complex phenomena. Today: data exploration (eScience), unifying theory, experiment, and simulation using data management and statistics: data captured by instruments or generated by simulator, processed by software; scientist analyzes database / files. Historically, Computational Science = simulation. New emphasis on informatics: capturing, organizing, summarizing, analyzing, visualizing. Courtesy Jim Gray, Microsoft Research.

49 49 Understanding ecosystem disturbances NASA satellite data to study: How is the global Earth system changing? How does the Earth system respond to natural & human-induced changes? What are the consequences of changes in the Earth system? Watch for changes in the amount of absorption of sunlight by green plants to look for ecological disasters. Transformation of a non-stationary time series to a sequence of disturbance events; association analysis of disturbance regimes. Vipin Kumar, U. Minnesota. Potter et al. “Recent History of Large-Scale Ecosystem Disturbances in North America Derived from the AVHRR Satellite Record", Ecosystems, 2005.

50 50 World Wide Telescope (Virtual Observatory) 20 Queries on Sloan Digital Sky Survey Data
Q1: Find all galaxies without saturated pixels within 1' of a given point of ra=75.327, dec=21.023.
Q2: Find all galaxies with blue surface brightness between 23 and 25 mag per square arcseconds, and -10<super galactic latitude (sgb) <10, and declination less than zero.
Q3: Find all galaxies brighter than magnitude 22, where the local extinction is >0.75.
Q4: Find galaxies with an isophotal surface brightness (SB) larger than 24 in the red band, with an ellipticity>0.5, and with the major axis of the ellipse having a declination of between 30" and 60" arc seconds.
Q5: Find all galaxies with a de Vaucouleurs profile (r^(1/4) falloff of intensity on disk) and the photometric colors consistent with an elliptical galaxy.
Q6: Find galaxies that are blended with a star, output the deblended galaxy magnitudes.
Q7: Provide a list of star-like objects that are 1% rare.
Q8: Find all objects with unclassified spectra.
Q9: Find quasars with a line width >2000 km/s and 2.5<redshift<2.7.
Q10: Find galaxies with spectra that have an equivalent width in Hα >40Å (Hα is the main hydrogen spectral line.)
Q11: Find all elliptical galaxies with spectra that have an anomalous emission line.
Q12: Create a gridded count of galaxies with u-g>1 and r<21.5 over 60<declination<70, and 200<right ascension<210, on a grid of 2', and create a map of masks over the same grid.
Q13: Create a count of galaxies for each of the HTM triangles which satisfy a certain color cut, like 0.7u-0.5g-0.2i<1.25 && r<21.75, output it in a form adequate for visualization.
Q14: Find stars with multiple measurements and have magnitude variations >0.1. Scan for stars that have a secondary object (observed at a different time) and compare their magnitudes.
Q15: Provide a list of moving objects consistent with an asteroid.
Q16: Find all objects similar to the colors of a quasar at 5.5<redshift<6.5.
Q17: Find binary stars where at least one of them has the colors of a white dwarf.
Q18: Find all objects within 30 arcseconds of one another that have very similar colors: that is where the color ratios u-g, g-r, r-i are less than 0.05m.
Q19: Find quasars with a broad absorption line in their spectra and at least one galaxy within 10 arcseconds. Return both the quasars and the galaxies.
Q20: For each galaxy in the BCG data set (brightest color galaxy), in 160<right ascension<170, -25<declination<35 count of galaxies within 30" of it that have a photoz within 0.05 of that galaxy.
Multi-Terabyte Astronomy data. Most queries translate to a single SQL statement. Most queries run in less than 20 seconds. Gray et al. “Data Mining the SDSS SkyServer Database”, Jan 02.
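
As a rough illustration of "most queries translate to a single SQL statement", here is how Q9 might look against a toy table. The quasars table, its column names, and the three sample rows are assumptions invented for this sketch; the real SkyServer schema is different and far richer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and column names; the real SkyServer schema differs.
conn.execute(
    "CREATE TABLE quasars (obj_id INTEGER, line_width_kms REAL, redshift REAL)"
)
conn.executemany("INSERT INTO quasars VALUES (?, ?, ?)", [
    (1, 2500.0, 2.6),   # satisfies both conditions
    (2, 1500.0, 2.6),   # line too narrow
    (3, 3000.0, 3.1),   # redshift out of range
])

# Q9: quasars with a line width >2000 km/s and 2.5<redshift<2.7.
Q9 = """
SELECT obj_id FROM quasars
WHERE line_width_kms > 2000 AND redshift > 2.5 AND redshift < 2.7
ORDER BY obj_id
"""
q9_matches = [row[0] for row in conn.execute(Q9)]
```

The English question maps almost word for word onto the WHERE clause, which is the point the slide makes about declarative access to survey data.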

51 Data Mining for Climate Data NASA ESE questions: How is the global Earth system changing? What are the primary forcings? How does the Earth system respond to natural & human-induced changes? What are the consequences of changes in the Earth system? How well can we predict future changes? Global snapshots of values for a number of variables on land surfaces or water. NASA DATA MINING REVEALS A NEW HISTORY OF NATURAL DISASTERS: NASA is using satellite data to paint a detailed global picture of the interplay among natural disasters, human activities and the rise of carbon dioxide in the Earth's atmosphere during the past 20 years…. http://www.nasa.gov/centers/ames/news/releases/2003/03_51AR.html Detection of Ecosystem Disturbances: An algorithm was designed to transform a non-stationary time series to a sequence of disturbance events in order to identify any significant and sustained declines in FPAR during an 18-year time period. Using this algorithm, we verified well-documented major wildfire events over the past 18 years such as those reported in East Kalimantan, Indonesia, Manitoba, Canada, and Mongolia. Nearly 400 Mha of the global land surface could be identified with at least one disturbance event over the 18-year time series. The majority of these potential disturbance events occurred in tropical savanna and shrublands or in boreal forest ecosystem classes. Disturbance regimes were further characterized by association analysis with historical climate anomalies. It was estimated that nearly 9 Pg of carbon could have been lost from the terrestrial biosphere to the atmosphere as a result of large-scale ecosystem disturbance over this 18-year time series. Disturbance Viewer: This interactive module displays the locations on the earth surface where significant disturbance events have been detected. Detection of Ecosystem Disturbances: Vipin Kumar, University of Minnesota, Minneapolis. http://www-users.cs.umn.edu/~kumar/nasa-umn/index.html C. Potter, P. Tan, V. Kumar, C. Kucharik, S. Klooster, V. Genovese, W. Cohen, S. Healey. "Recent History of Large-Scale Ecosystem Disturbances in North America Derived from the AVHRR Satellite Record", Ecosystems, 8(7), 808-824, November, 2005.
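
The step of turning a time series into disturbance events can be sketched in a simplified form. This is not the published algorithm: the baseline window, the two-sigma threshold, the minimum run length, and the sample FPAR values are all assumptions for illustration. The sketch flags any run of consecutive values that stays well below an early baseline, which captures the "significant and sustained decline" idea while ignoring seasonality.

```python
import statistics

def disturbance_events(series, baseline_len, drop_sigmas=2.0, min_run=3):
    """Return (start, end) index pairs of sustained declines below the baseline."""
    baseline = series[:baseline_len]
    threshold = (statistics.fmean(baseline)
                 - drop_sigmas * statistics.pstdev(baseline))
    events, start = [], None
    for i, value in enumerate(series):
        if value < threshold:
            if start is None:
                start = i            # a possible event begins here
        else:
            if start is not None and i - start >= min_run:
                events.append((start, i - 1))
            start = None             # short dips are discarded as noise
    if start is not None and len(series) - start >= min_run:
        events.append((start, len(series) - 1))
    return events

# Made-up FPAR-like values: one isolated dip, then a sustained decline.
fpar = [0.60, 0.62, 0.61, 0.59, 0.60, 0.61, 0.30,
        0.62, 0.25, 0.28, 0.31, 0.35, 0.60, 0.61]
events = disturbance_events(fpar, baseline_len=6)
```

The single low reading at index 6 is ignored, while the four-step decline starting at index 8 is reported as one event. The resulting event list is the kind of input that association analysis against climate anomalies would then consume.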

52 52 Some other data-driven science efforts Biomedical Informatics Research Network: Study brain disorders and obtain better statistics on the morphology of disease processes by standardizing and cross-correlating data from many different imaging systems. 100 TB/year. Earthscope: Study the structure and ongoing deformation of the North American continent by obtaining data from a network of multi-purpose geophysical instruments and observatories. 40 TB/year. Newman et al. “Data-Intensive e-Science Frontier Research in the Coming Decade”. CACM 03.

53 53 Let’s not forget human in the loop Theory: When a star went supernova, we would detect neutrinos about three hours before we would see the burst in the visible spectrum. Supernova 1987A: Exploded at the edge of Tarantula Nebula 168,000 years earlier. The underground Kamiokande observatory in Japan detected twenty-four neutrinos in a burst lasting 13 secs on Feb 23, 1987 at 7:35 UT. Ian Shelton observed the bright light with his naked eyes at 10:00 UT in the Chilean Andes. Albert Jones in New Zealand did not see anything unusual at the Tarantula Nebula at 9:30 UT. Robert McNaught photographed the explosion at 10:30 UT in Australia. Thus a key theory explaining how the universe works was confirmed thanks to two amateurs in Australia and New Zealand, an amateur trying to turn pro in Chile, and professional physicists in U.S. and Japan. Chris Anderson. The Long Tail. Hyperion, 2006.

54 54 Take Aways We ought to move the focus of our future work towards humane data mining (applications to benefit individuals): Personal data mining (e.g. personal health). Enable people to get a grip on their world (e.g. dealing with the long tail of search). Enable people to become creative (e.g. inventions arising from linking non-interacting scientific literature). Enable people to make contributions to society (e.g. education collaboration networks). Data-driven science (study ecological disasters, brain disorders). Rooting our future work in these applications will lead to new data mining abstractions, algorithms, and systems (the Quest lesson).

55 55 Thank you!

