CENG 514. Data mining (knowledge discovery from data) – Extraction of interesting ( non-trivial, implicit, previously unknown and potentially useful)

CENG 514

Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Definition by Gartner Group:
“Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.”

(Deductive) query processing
Expert systems or small ML/statistical programs

The Explosive Growth of Data: from terabytes to petabytes
– Data collection and data availability: automated data collection tools, database systems, Web, computerized society
Data is everywhere, information is nowhere
Market: from focus on product/service to focus on customer
IT: from focus on up-to-date balances to focus on patterns in transactions
– Data Warehouses
– OLAP
Increase in complexity of data

[Figure: Data Mining shown in relation to Machine Learning, Database Management, Artificial Intelligence, Statistics, Visualization, and Algorithms]

Data Mining: History of the Field
Knowledge Discovery in Databases workshops started ’89
– Now a conference under the auspices of ACM SIGKDD
– IEEE conference series started 2001

A Brief History of Data Mining Society
1989 IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro)
– Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
1991-1994 Workshops on Knowledge Discovery in Databases
– Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
– Journal of Data Mining and Knowledge Discovery (1997)
1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD Explorations
More conferences on data mining
– PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.

Market Analysis, Customer Relationship Management (CRM)
Churn Analysis
Risk Analysis and Management
Fraud Detection, Counter Terrorism
Network Intrusion Detection
Web Site Restructuring
Recommendation
Scientific Applications

Corporate Analysis & Risk Management
Finance planning and asset evaluation
– cash flow analysis and prediction
– contingent claim analysis to evaluate assets
– cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
Resource planning
– summarize and compare the resources and spending
Competition
– monitor competitors and market directions
– group customers into classes and a class-based pricing procedure
– set pricing strategy in a highly competitive market

Fraud Detection & Mining Unusual Patterns
Approaches: clustering & model construction for frauds, outlier analysis
Applications: health care, retail, credit card service, telecomm.
– Auto insurance: ring of collisions
– Money laundering: suspicious monetary transactions
– Medical insurance
    Professional patients, ring of doctors, and ring of references
    Unnecessary or correlated screening tests
– Telecommunications: phone-call fraud
    Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
– Anti-terrorism
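
The outlier-analysis idea on this slide (flag calls that deviate from an expected norm) can be sketched as a simple z-score check on call duration. This is only an illustration under stated assumptions: the function name `flag_unusual_calls`, the threshold of 3 standard deviations, and the call durations are all made up for the example and do not come from the slides.

```python
from statistics import mean, stdev

def flag_unusual_calls(durations, threshold=3.0):
    """Return the indices of calls whose duration lies more than
    `threshold` sample standard deviations away from the mean."""
    mu = mean(durations)
    sigma = stdev(durations)
    if sigma == 0:
        return []  # all calls identical: nothing deviates from the norm
    return [i for i, d in enumerate(durations)
            if abs(d - mu) / sigma > threshold]

# Hypothetical call durations (minutes) for one subscriber.
calls = [3, 4, 2, 5, 3, 4, 3, 2, 4, 3, 5, 4, 3, 2, 4, 3, 4, 3, 5, 240]
print(flag_unusual_calls(calls))  # only the 240-minute call is flagged
```

A single extreme value also inflates the mean and standard deviation it is compared against, so a median-based score may behave better on small samples.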

Example: Use in retailing
Goal: Improved business efficiency
– Improve marketing (advertise to the most likely buyers)
– Inventory reduction (stock only needed quantities)
Information source: Historical business data
– Example: Supermarket sales records
– Size ranges from 50k records (research studies) to terabytes (years of data from chains)
– Data is already being warehoused
Sample question: what products are generally purchased together?
The answers are in the data, if only we could see them

Example: Churn Analysis
Business Problem: Prevent loss of customers, avoid adding churn-prone customers
Solution: Use neural nets, time series analysis to identify typical patterns of telephone usage of likely-to-defect and likely-to-churn customers
Benefit: Retention of customers, more effective promotions

Example: Clicks to Customers
Business problem: 50% of Dell’s clients order their computer through the web. However, the retention rate is 0.5%, i.e. 0.5% of visitors to Dell’s web page become customers.
Solution approach: Through the sequence of their clicks, cluster customers and design the website and interventions to maximize the number of customers who eventually buy.
Benefit: Increase revenues

What Can Data Mining Do?
Cluster
Classify
– Categorical, Regression
Summarize
– Summary statistics, Summary rules
Link Analysis / Model Dependencies
– Association rules
Sequence analysis
– Time-series analysis, Sequential associations
Detect Deviations

Clustering
Find groups of similar data items
Statistical techniques require some definition of “distance” (e.g. between travel profiles) while conceptual techniques use background concepts and logical descriptions
“Group people with similar travel profiles”
– George, Patricia
– Jeff, Evelyn, Chris
– Rob
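
As a rough illustration of the distance-based (statistical) approach, the sketch below groups people whose travel profiles lie close together. It uses single-linkage grouping with a distance cutoff, which is one possible technique and not necessarily the one the slides had in mind. The two profile features (trips per year, average trip length in days), the numbers, and the cutoff of 5 are assumptions chosen so the result matches the grouping on the slide.

```python
from math import dist

def cluster_by_distance(profiles, max_distance):
    """Single-linkage grouping: two people end up in the same cluster
    if a chain of people connects them, with every link in the chain
    no longer than max_distance (Euclidean)."""
    names = list(profiles)
    parent = {n: n for n in names}  # union-find forest

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if dist(profiles[a], profiles[b]) <= max_distance:
                parent[find(a)] = find(b)  # merge the two clusters

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())

# Hypothetical profiles: (trips per year, average trip length in days).
profiles = {
    "George": (2, 3), "Patricia": (3, 4),
    "Jeff": (20, 2), "Evelyn": (22, 1), "Chris": (21, 3),
    "Rob": (8, 30),
}
clusters = cluster_by_distance(profiles, max_distance=5)
print(clusters)
```

The choice of distance function and cutoff drives the result, which is the point the slide makes about statistical techniques needing a definition of “distance”.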

Classification
Find ways to separate data items into pre-defined groups
Requires “training data”: data items where group is known
“Route documents to most likely interested parties”
– English or non-English?
– Domestic or foreign?
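
To make the role of training data concrete, here is a minimal sketch of a 1-nearest-neighbour classifier for the “English or non-English?” question. It labels a new document with the group of the most similar training document, using word overlap (Jaccard similarity) as the similarity measure. The classifier choice, the function names, and the tiny training set are assumptions for illustration only.

```python
def jaccard(a, b):
    """Word-set overlap: |A ∩ B| / |A ∪ B| (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def classify(document, training_data):
    """Return the label of the training document most similar to
    `document`. Ties (including no overlap at all) go to the earliest
    training example."""
    words = document.lower().split()
    best_label, _ = max(
        ((label, jaccard(words, text.lower().split()))
         for text, label in training_data),
        key=lambda pair: pair[1],
    )
    return best_label

# Hypothetical training data: documents whose group is already known.
training_data = [
    ("the meeting is on monday", "english"),
    ("please send the report to the team", "english"),
    ("das treffen ist am montag", "non-english"),
    ("bitte senden sie den bericht an das team", "non-english"),
]

print(classify("send the report on monday", training_data))
print(classify("bitte den bericht senden", training_data))
```

With only four training documents the classifier is fragile; a document sharing no words with any training example simply gets the first label.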

Association Rules
Identify dependencies in the data:
– X makes Y likely
Indicate significance of each dependency
“Find groups of items commonly purchased together”
– People who purchase fish are extraordinarily likely to purchase wine
– People who purchase turkey are extraordinarily likely to purchase cranberries
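
One common way to quantify “X makes Y likely” and the significance of a dependency is with support, confidence, and lift. The slide does not name these measures, so treating them as the intended ones is an assumption; the baskets below are invented for the example.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule
    antecedent -> consequent over a list of baskets."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= set(t))
    n_c = sum(1 for t in transactions if consequent <= set(t))
    n_ac = sum(1 for t in transactions
               if (antecedent | consequent) <= set(t))
    support = n_ac / n                       # share of baskets with X and Y
    confidence = n_ac / n_a if n_a else 0.0  # P(Y | X)
    lift = confidence / (n_c / n) if n_c else 0.0  # P(Y | X) / P(Y)
    return support, confidence, lift

# Hypothetical supermarket baskets.
baskets = [
    {"fish", "wine", "bread"},
    {"fish", "wine"},
    {"fish", "lemon"},
    {"turkey", "cranberries"},
    {"turkey", "cranberries", "wine"},
    {"bread", "milk"},
    {"milk", "wine"},
    {"bread"},
]

print(rule_stats(baskets, {"fish"}, {"wine"}))
print(rule_stats(baskets, {"turkey"}, {"cranberries"}))
```

A lift above 1 means the consequent is more likely in baskets containing the antecedent than in baskets overall; here turkey → cranberries has a much higher lift than fish → wine.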

Sequential Associations
Find event sequences that are unusually likely
“Find common sequences of warnings/faults within 10 minute periods”
– Warn 2 on Switch C preceded by Fault 21 on Switch B
– Fault 17 on any switch preceded by Warn 2 on any switch
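
A minimal sketch of the counting step behind such sequences: for every event, count which events follow it within a 10-minute window. Deciding whether a count is “unusually likely” would need a comparison against a baseline, which is left out here. The event log, the label format, and the function name are assumptions made for the example.

```python
from collections import Counter

def count_followed_by(events, window_minutes=10):
    """events: iterable of (minute, label). Count ordered pairs
    (earlier_label, later_label) where the later event occurs no more
    than window_minutes after the earlier one."""
    events = sorted(events)
    counts = Counter()
    for i, (t1, first) in enumerate(events):
        for t2, second in events[i + 1:]:
            if t2 - t1 > window_minutes:
                break  # events are sorted, so later ones are even further away
            counts[(first, second)] += 1
    return counts

# Hypothetical log: (minute of day, "switch:event").
log = [
    (0, "B:Fault21"), (4, "C:Warn2"),
    (30, "B:Fault21"), (37, "C:Warn2"),
    (60, "C:Warn2"),
    (90, "B:Fault21"), (95, "C:Warn2"),
]

pairs = count_followed_by(log)
print(pairs.most_common(1))
```

In this log every Fault 21 on Switch B is followed by Warn 2 on Switch C within the window, which mirrors the first example on the slide.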

Recommendation Techniques
Given a database of user preferences, predict the preferences of a new user
Example:
– Predict what new movies you will like, based on
    your past preferences
    others with similar past preferences
    their preferences for the new movies
– Predict what books/CDs a person may want to buy (and suggest them, or give discounts to tempt the customer)
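
The movie example follows the pattern of user-based collaborative filtering, sketched below: measure how similar each other user is to you on the movies you have both rated, then predict your rating for a new movie as the similarity-weighted average of their ratings. The use of cosine similarity, the function names, and the ratings are assumptions for illustration; the slides do not prescribe a particular method.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity over the movies both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm = (sqrt(sum(a[m] ** 2 for m in shared))
            * sqrt(sum(b[m] ** 2 for m in shared)))
    return dot / norm if norm else 0.0

def predict_rating(user, movie, ratings):
    """Similarity-weighted average of other users' ratings for `movie`.
    Returns None when no similar user has rated it."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or movie not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        num += sim * other_ratings[movie]
        den += sim
    return num / den if den else None

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "you": {"A": 5, "B": 1},
    "ann": {"A": 5, "B": 1, "New": 5},  # same taste as you
    "bob": {"A": 1, "B": 5, "New": 1},  # opposite taste
}

print(predict_rating("you", "New", ratings))
```

The prediction lands closer to Ann's rating than to Bob's because Ann's past preferences match yours more closely.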

Knowledge Discovery in Databases: Process
[Figure: Data → Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]
adapted from: U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

Data Mining and Business Intelligence
[Figure: layered diagram; potential to support business decisions increases toward the top]
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA
Layers, top to bottom:
– Making Decisions
– Data Presentation: Visualization Techniques
– Data Mining: Information Discovery
– Data Exploration: OLAP, MDA; Statistical Analysis, Querying and Reporting
– Data Warehouses / Data Marts
– Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

Learning the application domain
– relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of effort!)
Data reduction and transformation
– Find useful features, dimensionality/variable reduction, invariant representation
Choosing functions of data mining
– summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
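
The steps above can be sketched as a chain of small functions, one per stage, applied to a toy sales log. This is a deliberately simplified illustration: the stage functions, the record layout, and the choice of pair counting as the mining task are all assumptions, and real projects spend most of their effort inside the cleaning and transformation stages.

```python
from collections import Counter
from itertools import combinations

def select(records, store):
    """Create the target data set: keep one store's records."""
    return [r for r in records if r["store"] == store]

def clean(records):
    """Drop records with missing or empty item lists."""
    return [r for r in records if r.get("items")]

def transform(records):
    """Normalize item names and reduce each record to a set of items."""
    return [frozenset(i.strip().lower() for i in r["items"])
            for r in records]

def mine(baskets):
    """Search for patterns: count how often each item pair co-occurs."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts

def evaluate(patterns, min_count):
    """Keep only patterns frequent enough to be worth presenting."""
    return {p: c for p, c in patterns.items() if c >= min_count}

# Hypothetical raw sales records.
raw = [
    {"store": "A", "items": ["Bread", "milk "]},
    {"store": "A", "items": ["bread", "Milk", "eggs"]},
    {"store": "A", "items": None},
    {"store": "B", "items": ["bread", "milk"]},
    {"store": "A", "items": ["eggs"]},
]

knowledge = evaluate(mine(transform(clean(select(raw, "A")))), min_count=2)
print(knowledge)
```

Keeping each stage as its own function makes it easy to revisit an earlier step when evaluation shows the patterns are not useful.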

Mining methodology
– Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
– Performance: efficiency, effectiveness, and scalability
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Handling noise and incomplete data
– Parallel, distributed and incremental mining methods
– Integration of the discovered knowledge with existing knowledge: knowledge fusion
User interaction
– Data mining query languages and ad-hoc mining
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
– Domain-specific data mining & invisible data mining
– Protection of data security, integrity, and privacy

(From J. Ullman’s Notes)
A big data-mining risk is that you will “discover” patterns that are meaningless.
Statisticians call it Bonferroni’s principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find meaningless results.
When looking for a property, make sure that the property does not allow so many possibilities that random data will surely produce facts “of interest.”
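
A back-of-the-envelope check in the spirit of this principle: multiply the number of places you look by the probability that a place looks “interesting” purely by chance. The scenario below (1,000 independent coin-flip attributes, 10 records, and “interesting” meaning two attributes agree on every record) is a made-up assumption chosen only to show the arithmetic.

```python
from math import comb

def expected_chance_patterns(n_candidates, p_chance):
    """Expected number of candidate patterns that appear purely by
    chance (linearity of expectation; candidates need not be independent)."""
    return n_candidates * p_chance

n_attributes = 1000  # hypothetical yes/no attributes, each a fair coin flip
n_records = 10       # hypothetical number of records

n_pairs = comb(n_attributes, 2)          # places we look: attribute pairs
p_perfect_agreement = 0.5 ** n_records   # a pair agrees on all 10 records
expected = expected_chance_patterns(n_pairs, p_perfect_agreement)

print(n_pairs, round(expected, 1))
```

Under these assumptions roughly 488 attribute pairs agree perfectly on random data, so finding such a pair says nothing about a real dependency.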

Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception. He devised (something like) an experiment where subjects were asked to guess 10 hidden cards, each either red or blue. He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!


He told these people they had ESP and called them in for another test of the same type. Alas, he discovered that almost all of them had lost their ESP.
What did he conclude?
– He concluded that you shouldn’t tell people they have ESP; it causes them to lose it.
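
The arithmetic behind the story can be checked directly, assuming each guess is an independent 50/50 choice between red and blue. The pool of 1,000 subjects is a hypothetical number used only to make the figures concrete.

```python
from fractions import Fraction

# Probability of guessing all 10 red/blue cards by pure chance.
p_all_correct = Fraction(1, 2) ** 10

subjects = 1000  # hypothetical size of the tested population

# Expected number of "psychics" produced by chance alone.
expected_psychics = subjects * p_all_correct

# Probability that at least one subject gets all 10 right by chance.
p_at_least_one = 1 - (1 - p_all_correct) ** subjects

print(p_all_correct, float(expected_psychics), float(p_at_least_one))
```

Chance alone yields a perfect score with probability 1/1024, which matches the “almost 1 in 1000” rate on the slide, and a chance “psychic” repeats the feat on a second test with the same 1/1024 probability, so nearly all of them would appear to lose their ESP.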
