1 Berendt: Advanced databases, winter term 2007/08, 1 Advanced databases – Inferring new knowledge.

1 Berendt: Advanced databases, winter term 2007/08, http://www.cs.kuleuven.be/~berendt/teaching/2007w/adb/ 1 Advanced databases – Inferring new knowledge from data(bases): Knowledge Discovery in Databases Bettina Berendt Katholieke Universiteit Leuven, Department of Computer Science http://www.cs.kuleuven.be/~berendt/teaching/2007w/adb/ Last update: 15 November 2007

2 Agenda
- Motivation: Application examples
- The process of knowledge discovery
- Origins and context
- Major issues in knowledge discovery
- A short overview of key techniques

3 What is the impact of genetically modified organisms?

4 Is our school system good for immigrants and/or children from poor backgrounds?

5 What are the effects of teaching in English at universities?

6 What makes people happy?

7 What do men and women like?

8 Is this a man or a woman?
[Figure label: "clicked on"]

9 Primary Tasks of Data Mining
- Classification: finding the description of several predefined classes and classifying a data item into one of them.
- Regression: mapping a data item to a real-valued prediction variable.
- Clustering: identifying a finite set of categories or clusters to describe the data.
- Summarization: finding a compact description for a subset of data.
- Dependency Modeling: finding a model which describes significant dependencies between variables.
- Deviation and change detection: discovering the most significant changes in the data.

11 „Data mining“ and „knowledge discovery“
- (informal definition): data mining is about discovering knowledge in (huge amounts of) data
- Therefore, it is clearer to speak about “knowledge discovery in data(bases)”

12 Recall: Data, information, and knowledge
- Data represents a fact or statement of event without relation to other things.
  - Ex: It is raining.
- Information embodies the understanding of a relationship of some sort, possibly cause and effect.
  - Ex: The temperature dropped 15 degrees and then it started raining.
- Knowledge represents a pattern that connects and generally provides a high level of predictability as to what is described or what will happen next.
  - Ex: If the humidity is very high and the temperature drops substantially, the atmosphere is often unlikely to be able to hold the moisture, so it rains.
(This is from knowledge-management theory. If you want to know about wisdom, check the Web page: G. Bellinger, D. Castro, & A. Mills: Data, Information, Knowledge, and Wisdom. http://www.systems-thinking.org/dikw/dikw.htm )

13 Why Data Mining?
- The Explosive Growth of Data: from terabytes to petabytes
  - Data collection and data availability
    - Automated data collection tools, database systems, Web, computerized society
  - Major sources of abundant data
    - Business: Web, e-commerce, transactions, stocks, …
    - Science: Remote sensing, bioinformatics, scientific simulation, …
    - Society and everyone: news, digital cameras, …
- We are drowning in data, but starving for knowledge!
- “Necessity is the mother of invention”: data mining, the automated analysis of massive data sets

14 Background: Evolution of Database Technology
- 1960s:
  - Data collection, database creation, IMS and network DBMS
- 1970s:
  - Relational data model, relational DBMS implementation
- 1980s:
  - RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  - Application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s:
  - Data mining, data warehousing, multimedia databases, and Web databases
- 2000s:
  - Stream data management and mining
  - Data mining and its applications
  - Web technology (XML, data integration) and global information systems

15 The KDD process
“The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” (Fayyad, Piatetsky-Shapiro, Smyth, 1996)
- non-trivial process: Multiple process
- valid: Justified patterns/models
- novel: Previously unknown
- useful: Can be used
- understandable: by human and machine

16 The process part of knowledge discovery
CRISP-DM (CRoss Industry Standard Process for Data Mining): a data mining process model that describes commonly used approaches that expert data miners use to tackle problems.

17 Knowledge discovery, machine learning, data mining
- Knowledge discovery = the whole process
- Machine learning = the application of induction algorithms and other algorithms that can be said to „learn“ (= the „modeling“ phase)
- Data mining
  - sometimes = KD, sometimes = ML

18 The KDD Process
[Figure: the KDD process in five numbered stages (1 to 5). Labels recovered from the figure, in extraction order:]
- Data organized by function
- Create/select target database
- Select sampling technique and sample data
- Supply missing values
- Normalize values
- Select DM task(s)
- Transform to different representation
- Eliminate noisy data
- Transform values
- Select DM method(s)
- Create derived attributes
- Extract knowledge
- Find important attributes & value ranges
- Test knowledge
- Refine knowledge
- Query & report generation
- Aggregation & sequences
- Advanced methods
- Data warehousing

20 Main Contributing Areas of KDD
- Databases: Store, access, search, update data (deduction)
- Statistics: Infer info from data (deduction & induction, mainly numeric data)
- Machine Learning: Computer algorithms that improve automatically through experience (mainly induction, symbolic data)
[Figure labels: KDD; data warehouses: integrated data; OLAP: On-Line Analytical Processing]

21 Data Mining: Classification Schemes
- General functionality
  - Descriptive data mining
  - Predictive data mining
- Different views lead to different classifications
  - Data view: Kinds of data to be mined
  - Knowledge view: Kinds of knowledge to be discovered
  - Method view: Kinds of techniques utilized
  - Application view: Kinds of applications adapted

22 Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by the contributing disciplines:]
- Database Technology
- Statistics
- Machine Learning
- Pattern Recognition
- Algorithm
- Visualization
- Other Disciplines

23 Why Not Traditional Data Analysis?
- Tremendous amount of data
  - Algorithms must be highly scalable to handle data volumes such as terabytes
- High dimensionality of data
  - A micro-array may have tens of thousands of dimensions
- High complexity of data
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data
  - Structured data, graphs, social networks and multi-linked data
  - Heterogeneous databases and legacy databases
  - Spatial, spatiotemporal, multimedia, text and Web data
  - Software programs, scientific simulations
- New and sophisticated applications

25 Data Mining: On What Kinds of Data?
- Database-oriented data sets and applications
  - Relational database, data warehouse, transactional database
- Advanced data sets and advanced applications
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Structured data, graphs, social networks and multi-linked data
  - Object-relational databases
  - Heterogeneous databases and legacy databases
  - Spatial data and spatiotemporal data
  - Multimedia database
  - Text databases
  - The World-Wide Web

26 Data Mining Functionalities
- Multidimensional concept description: Characterization and discrimination
  - Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
- Frequent patterns, association, correlation vs. causality
  - Diaper → Beer [support 0.5%, confidence 75%] (Correlation or causality?)
- Classification and prediction
  - Construct models (functions) that describe and distinguish classes or concepts for future prediction
    - E.g., classify countries based on (climate), or classify cars based on (gas mileage)
  - Predict some unknown or missing numerical values

27 Data Mining Functionalities (2)
- Cluster analysis
  - Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
  - Maximizing intra-class similarity & minimizing interclass similarity
- Outlier analysis
  - Outlier: Data object that does not comply with the general behavior of the data
  - Noise or exception? Useful in fraud detection, rare events analysis
- Trend and evolution analysis
  - Trend and deviation: e.g., regression analysis
  - Sequential pattern mining: e.g., digital camera → large SD memory
  - Periodicity analysis
  - Similarity-based analysis
- Other pattern-directed or statistical analyses
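To make the outlier bullet concrete, here is a minimal sketch of one simple way to flag objects that "do not comply with the general behavior of the data": z-score thresholding on a single numeric attribute. The slide does not prescribe a method; the function name `zscore_outliers`, the threshold of 2.0, and the sample values are assumptions made for illustration.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations
    from the mean. Illustrative only: threshold and method are assumptions,
    not taken from the slides."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []           # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical measurements with one obvious exception.
readings = [10, 11, 9, 10, 12, 10, 11, 95]
flagged = zscore_outliers(readings)
```

Whether a flagged value is noise to discard or an exception worth investigating (as in fraud detection) remains a judgment for the analyst, which is the question the slide raises.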

28 Are All the “Discovered” Patterns Interesting?
- Data mining may generate thousands of patterns: Not all of them are interesting
  - Suggested approach: Human-centered, query-based, focused mining
- Interestingness measures
  - A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.

29 Find All and Only Interesting Patterns?
- Find all the interesting patterns: Completeness
  - Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
  - Heuristic vs. exhaustive search
  - Association vs. classification vs. clustering
- Search for only interesting patterns: An optimization problem
  - Can a data mining system find only the interesting patterns?
  - Approaches
    - First generate all the patterns and then filter out the uninteresting ones
    - Generate only the interesting patterns: mining query optimization
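The first approach (generate everything, then filter) can be sketched as follows for association rules. This is a brute-force illustration under stated assumptions: the toy transactions, the function names, and the thresholds are all hypothetical, and real systems avoid exhaustive enumeration because the number of candidate rules grows exponentially with the number of items.

```python
from itertools import combinations

# Hypothetical toy transactions (not from the slides).
TRANSACTIONS = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "milk"},
    {"beer", "chips"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def all_rules(transactions):
    """Exhaustively yield every rule A => b with a single-item consequent."""
    items = sorted(set().union(*transactions))
    for size in range(2, len(items) + 1):
        for combo in combinations(items, size):
            for consequent in combo:
                yield frozenset(combo) - {consequent}, consequent

def interesting_rules(transactions, min_support, min_confidence):
    """Generate all candidate rules, then keep those passing both thresholds."""
    kept = []
    for antecedent, consequent in all_rules(transactions):
        sup_rule = support(antecedent | {consequent}, transactions)
        sup_ante = support(antecedent, transactions)
        if sup_ante == 0:
            continue  # antecedent never occurs, confidence undefined
        confidence = sup_rule / sup_ante
        if sup_rule >= min_support and confidence >= min_confidence:
            kept.append((set(antecedent), consequent, sup_rule, confidence))
    return kept

rules = interesting_rules(TRANSACTIONS, min_support=0.5, min_confidence=0.9)
```

The second approach on the slide would instead push the thresholds into the search itself, so that uninteresting candidates are never generated.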

30 Other Pattern Mining Issues
- Precise patterns vs. approximate patterns
  - Association and correlation mining: it is possible to find sets of precise patterns
    - But approximate patterns can be more compact and sufficient
    - How to find high-quality approximate patterns?
  - Gene sequence mining: approximate patterns are inherent
    - How to derive efficient approximate pattern mining algorithms?
- Constrained vs. non-constrained patterns
  - Why constraint-based mining?
  - What are the possible kinds of constraints? How to push constraints into the mining process?

31 Data Mining Query Languages
- Automated vs. query-driven?
  - Finding all the patterns autonomously in a database? Unrealistic, because the patterns could be numerous but uninteresting
- Data mining should be an interactive process
  - The user directs what is to be mined
- Users must be provided with a set of primitives to be used to communicate with the data mining system
- Incorporating these primitives in a data mining query language
  - More flexible user interaction
  - Foundation for design of graphical user interface
  - Standardization of data mining industry and practice

32 Primitives that Define a Data Mining Task
- Task-relevant data
- Type of knowledge to be mined
- Background knowledge
- Pattern interestingness measurements
- Visualization/presentation of discovered patterns

33 Primitive 1: Task-Relevant Data
- Database or data warehouse name
- Database tables or data warehouse cubes
- Condition for data selection
- Relevant attributes or dimensions
- Data grouping criteria

34 Primitive 2: Types of Knowledge to Be Mined
- Characterization
- Discrimination
- Association
- Classification/prediction
- Clustering
- Outlier analysis
- Other data mining tasks

35 Primitive 3: Background Knowledge
- A typical kind of background knowledge: Concept hierarchies
- Schema hierarchy
  - E.g., street < city < province_or_state < country
- Set-grouping hierarchy
  - E.g., {20-39} = young, {40-59} = middle_aged
- Operation-derived hierarchy
  - email address: hagonzal@cs.uiuc.edu
  - login-name < department < university < country
- Rule-based hierarchy
  - low_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50
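Three of these hierarchy kinds can be sketched as small functions. The age groups, the $50 margin rule, and the example address come from the slide. The function names are hypothetical, and the e-mail parser assumes addresses of the exact shape login@department.university.tld, with the top-level domain standing in for the country level.

```python
# Set-grouping hierarchy from the slide: {20-39} = young, {40-59} = middle_aged.
AGE_GROUPS = [(20, 39, "young"), (40, 59, "middle_aged")]

def age_group(age):
    """Map a raw age to its higher-level concept, or None if no group applies."""
    for low, high, label in AGE_GROUPS:
        if low <= age <= high:
            return label
    return None  # the slide defines no group for other ages

def low_profit_margin(price, cost):
    """Rule-based hierarchy: low_profit_margin(X) <= (P1 - P2) < $50."""
    return (price - cost) < 50

def email_hierarchy(address):
    """Operation-derived hierarchy: login-name < department < university < country.
    Assumption: the address looks like login@department.university.tld."""
    login, domain = address.split("@")
    department, university, tld = domain.split(".")
    return [login, department, university, tld]

levels = email_hierarchy("hagonzal@cs.uiuc.edu")
```

Replacing raw values by such higher-level concepts before mining is what lets patterns be reported at several levels of abstraction.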

36 Primitive 4: Pattern Interestingness Measure
- Simplicity: e.g., (association) rule length, (decision) tree size
- Certainty: e.g., confidence, P(A|B) = #(A and B) / #(B), classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
- Utility: potential usefulness, e.g., support (association), noise threshold (description)
- Novelty: not previously known, surprising (used to remove redundant rules, e.g., Illinois vs. Champaign rule implication support ratio)

37 Primitive 5: Presentation of Discovered Patterns
- Different backgrounds/usages may require different forms of representation
  - E.g., rules, tables, crosstabs, pie/bar chart, etc.
- Concept hierarchy is also important
  - Discovered knowledge might be more understandable when represented at a high level of abstraction
  - Interactive drill up/down, pivoting, slicing and dicing provide different perspectives on the data
- Different kinds of knowledge require different representation: association, classification, clustering, etc.

38 Architecture: Typical Data Mining System
[Figure: layered architecture, top to bottom:]
- Graphical User Interface
- Pattern Evaluation
- Data Mining Engine
- Database or Data Warehouse Server
- data cleaning, integration, and selection
- Data sources: Database, Data Warehouse, World-Wide Web, Other Info Repositories
[Side component: Knowledge Base]

39 Major Issues in Data Mining
- Mining methodology
  - Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  - Performance: efficiency, effectiveness, and scalability
  - Pattern evaluation: the interestingness problem
  - Incorporation of background knowledge
  - Handling noise and incomplete data
  - Parallel, distributed and incremental mining methods
  - Integration of the discovered knowledge with existing knowledge: knowledge fusion
- User interaction
  - Data mining query languages and ad-hoc mining
  - Expression and visualization of data mining results
  - Interactive mining of knowledge at multiple levels of abstraction
- Applications and social impacts
  - Domain-specific data mining & invisible data mining
  - Protection of data security, integrity, and privacy

41 Classification
“What factors determine cancerous cells?”
[Figure labels: Data; Examples; Cancerous Cell Data; Mining Algorithm; Classification Algorithm; General patterns]
- Rule Induction
- Decision tree
- Neural Network

42 Classification: Rule Induction
“What factors determine a cell is cancerous?”
- If Color = light and Tails = 1 and Nuclei = 2 Then Healthy Cell (certainty = 92%)
- If Color = dark and Tails = 2 and Nuclei = 2 Then Cancerous Cell (certainty = 87%)
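Once induced, rules like these are applied by simple matching. The sketch below encodes the two rules from the slide and classifies a cell by the first rule whose conditions all hold. The dictionary representation, the lowercase attribute names, and the function name `classify` are assumptions for illustration; how the rules were learned is not shown.

```python
# The two rules from the slide: (conditions, class label, certainty).
RULES = [
    ({"color": "light", "tails": 1, "nuclei": 2}, "healthy", 0.92),
    ({"color": "dark", "tails": 2, "nuclei": 2}, "cancerous", 0.87),
]

def classify(cell):
    """Return (label, certainty) of the first rule matching `cell`, else None."""
    for conditions, label, certainty in RULES:
        if all(cell.get(attr) == value for attr, value in conditions.items()):
            return label, certainty
    return None  # no rule covers this cell

result = classify({"color": "dark", "tails": 2, "nuclei": 2})
uncovered = classify({"color": "dark", "tails": 1, "nuclei": 1})
```

A cell that no rule covers returns None, which shows that a small rule set need not classify every possible input.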

43 Classification: Decision Trees
[Figure: decision tree with tests on Color (dark / light), #nuclei (1 / 2) and #tails (1 / 2); leaves labeled healthy or cancerous]

44 Classification: Neural Networks
“What factors determine a cell is cancerous?”
[Figure labels: Color = dark; # nuclei = 1; …; # tails = 2; Healthy; Cancerous]

45 Clustering
“Are there clusters of similar cells?”
[Figure labels: Light color with 1 nucleus; Dark color with 2 tails, 2 nuclei; 1 nucleus and 1 tail; Dark color with 1 tail and 2 nuclei]

46 Association Rule Discovery
- Task: Discovering association rules among items in a transaction database.
- An association among two items A and B means that the presence of A in a record implies the presence of B in the same record: A => B.
- In general: A1, A2, … => B

47 Association Rule Discovery
“Are there any associations between the characteristics of the cells?”
- If color = light and # nuclei = 1 then # tails = 1 (support = 12.5%; confidence = 50%)
- If # nuclei = 2 and Cell = Cancerous then # tails = 2 (support = 25%; confidence = 100%)
- If # tails = 1 then Color = light (support = 37.5%; confidence = 75%)
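Support and confidence of a rule can be computed by counting matching records. The sketch below uses the usual definitions: support is the fraction of all records satisfying both antecedent and consequent, and confidence is that count divided by the number of records satisfying the antecedent. The cell table behind the slide is not part of the transcript, so the eight records here are hypothetical, constructed only so that the three rules above reproduce the stated figures.

```python
# Hypothetical cell table (NOT the original data), chosen so that the
# three rules on the slide yield the stated support and confidence.
CELLS = [
    {"color": "light", "nuclei": 1, "tails": 1, "cell": "healthy"},
    {"color": "light", "nuclei": 1, "tails": 2, "cell": "healthy"},
    {"color": "light", "nuclei": 2, "tails": 1, "cell": "healthy"},
    {"color": "light", "nuclei": 2, "tails": 1, "cell": "healthy"},
    {"color": "dark", "nuclei": 2, "tails": 1, "cell": "healthy"},
    {"color": "dark", "nuclei": 2, "tails": 2, "cell": "cancerous"},
    {"color": "dark", "nuclei": 2, "tails": 2, "cell": "cancerous"},
    {"color": "dark", "nuclei": 1, "tails": 2, "cell": "cancerous"},
]

def matches(record, conditions):
    """True if the record has every attribute value listed in `conditions`."""
    return all(record[attr] == value for attr, value in conditions.items())

def support_confidence(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    n_ante = sum(matches(r, antecedent) for r in records)
    n_both = sum(
        matches(r, antecedent) and matches(r, consequent) for r in records
    )
    support = n_both / len(records)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence

rule1 = support_confidence(CELLS, {"color": "light", "nuclei": 1}, {"tails": 1})
rule2 = support_confidence(CELLS, {"nuclei": 2, "cell": "cancerous"}, {"tails": 2})
rule3 = support_confidence(CELLS, {"tails": 1}, {"color": "light"})
```

The second rule shows that high confidence can rest on very few records (two cells here), which is why both measures are reported together.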

48 Many Other Data Mining Techniques
- Genetic Algorithms
- Statistics
- Bayesian Networks
- Rough Sets
- Time Series
- Text Mining

49 A goal: From databases to deductive databases to inductive databases
- A deductive database system is a database system which can make deductions (i.e., conclude additional facts) based on rules and facts stored in the (deductive) database.
- Inductive databases
  - contain not only data, but also patterns.
  - In an IDB, inductive queries can be used to generate (mine), manipulate, and apply patterns.
  - The IDB framework supports the process of knowledge discovery in databases (KDD):
    - the results of one (inductive) query can be used as input for another
    - nontrivial multi-step KDD scenarios can be supported, rather than just single data mining operations.
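The deductive side can be illustrated with a tiny forward-chaining loop that keeps applying rules to stored facts until nothing new can be concluded. This is a sketch, not how any particular deductive database system works: the `parent` facts, the two `ancestor` rules, and the function name are hypothetical examples.

```python
# Hypothetical stored facts: parent(ann, bob), parent(bob, cid).
FACTS = {("parent", "ann", "bob"), ("parent", "bob", "cid")}

def derive_ancestors(facts):
    """Naive forward chaining with two hard-coded rules:
         ancestor(X, Y) <= parent(X, Y)
         ancestor(X, Z) <= parent(X, Y) and ancestor(Y, Z)
    Repeats until a pass adds no new fact (a fixpoint)."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        new = set()
        for pred, x, y in derived:
            if pred != "parent":
                continue
            new.add(("ancestor", x, y))
            for pred2, y2, z in derived:
                if pred2 == "ancestor" and y2 == y:
                    new.add(("ancestor", x, z))
        if not new <= derived:   # at least one fact was not known before
            derived |= new
            changed = True
    return derived

closure = derive_ancestors(FACTS)
```

The fact ancestor(ann, cid) is never stored explicitly; it is concluded from the rules, which is the sense of "deduction" used on the slide.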

50 Next lecture
- Motivation: Application examples
- The process of knowledge discovery
- Origins and context
- Major issues in knowledge discovery
- A short overview of key techniques
- Deductive databases

51 References / background reading; acknowledgements
- Knowledge discovery is now an established area with some excellent general textbooks. I recommend the following as examples of the 3 main perspectives:
  - a databases / data warehouses perspective: Han, J. & Kamber, M. (2001). Data Mining: Concepts and Techniques. San Francisco, CA: Morgan Kaufmann. http://www.cs.sfu.ca/%7Ehan/dmbook
  - a machine learning perspective: Witten, I.H., & Frank, E. (2005). Data Mining. Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed. Morgan Kaufmann. http://www.cs.waikato.ac.nz/%7Eml/weka/book.html
  - a statistics perspective: Hand, D.J., Mannila, H., & Smyth, P. (2001). Principles of Data Mining. Cambridge, MA: MIT Press. http://mitpress.mit.edu/catalog/item/default.asp?tid=3520&ttype=2
- pp. 9, 15, 18, 20, 41-44 were taken from
  - Tzacheva, A.A. (2006). SIMS 422. Knowledge Inference Systems & Applications. http://faculty.uscupstate.edu/atzacheva/SIMS422/OverviewI.ppt
- pp. 45-48 were taken from
  - Tzacheva, A.A. (2006). Knowledge Discovery and Data Mining. http://faculty.uscupstate.edu/atzacheva/SIMS422/OverviewII.ppt
- pp. 13, 14, 22, 23, 25-39 were taken from
  - Han, J. & Kamber, M. (2006). Data Mining: Concepts and Techniques. Chapter 1: Introduction. http://www.cs.sfu.ca/%7Ehan/bk/1intro.ppt

52 Picture credits; CRISP-DM reference
- p. 3: http://www.siu-weeds.com/publications/Wheat_field.jpg
- p. 4: http://www.dkimages.com/discover/previews/889/30039025.JPG
- p. 5: http://www.viebahnfinearts.com/website/Pages/Photos/Furniture/Mirror%201005.jpg
- p. 6: http://charles.robinsontwins.org/twinsdays_96/john/smiley.jpg
- p. 16: http://www.palagems.com/Images/ceylon_mining.jpg, http://www.crisp-dm.org/Images/187343_CRISPart.jpg
- The CRISP-DM phase model can be found at http://www.crisp-dm.org
