Renata Benda Prokeinova Department of Statistics and Operation Research FEM SUA in Nitra.


Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating vast and growing amounts of data in different formats and different databases. The patterns, associations, or relationships among all this data can provide information.

Information can be converted into knowledge about historical patterns and future trends. For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior.

Dramatic advances in data capture, processing power, data transmission, and storage capabilities are enabling organizations to integrate their various databases into data warehouses. Data warehousing is defined as a process of centralized data management and retrieval. Data warehousing, like data mining, is a relatively new term, although the concept itself has been around for years. Data warehousing represents an ideal vision of maintaining a central repository of all organizational data. Centralization of data is needed to maximize user access and analysis. Technological advances are making this vision a reality for many companies. Equally, advances in data analysis software are allowing users to access this data freely. The data analysis software is what supports data mining.

Basic view: tons of data are collected, then quant wizards work their arcane magic, and then they know all of this amazing stuff. Data mining tells us, about very large and complex data sets, the kinds of information that would be readily apparent about small and simple things.

Data mining is a means of automating part of this process to detect interpretable patterns. Discovering information from data takes two major forms: description and prediction. At the scale we are talking about, it is hard to know what the data shows.

Data mining is used to simplify and summarize the data in a manner that we can understand, and then allow us to infer things about specific cases based on the patterns we have observed.

A company wants to launch an advertising campaign for a product. Among its present customers, the company wants to send product information only to those with a high probability of purchasing the product. The company has data describing past customer behaviour and personal data about each of its customers. Some customers have already been offered the product, e.g. in a trial period. The customers of the trial period are divided into two classes: those who bought the product and those who did not. With this data a prediction model is created to predict the probability of purchasing the product. The model is then used to predict the purchase probability for all other customers, and only those with a high predicted probability are contacted. As a side effect of this data mining analysis, the company learns which customer attributes are the relevant drivers of buying a specific product (or at least of being very interested in it).
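The campaign example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the trial-period customers, their two attributes (age in decades, number of past purchases) and the choice of logistic regression fitted by plain gradient descent are all hypothetical, since the slides do not say which model is used.

```python
import math

# Hypothetical trial-period customers:
# (age in decades, number of past purchases) -> bought the product (1) or not (0).
trial_customers = [
    ((2.5, 0), 0), ((3.0, 1), 0), ((3.5, 1), 0), ((4.0, 2), 0),
    ((3.0, 4), 1), ((4.5, 5), 1), ((5.0, 3), 1), ((5.5, 6), 1),
]

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, steps=5000, learning_rate=0.1):
    """Fit weights and bias of a logistic regression by gradient descent."""
    n_features = len(rows[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in rows:
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xj in enumerate(x):
                grad_w[j] += error * xj
            grad_b += error
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias

def purchase_probability(model, x):
    """Predicted probability that a customer with attributes x buys the product."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

model = fit_logistic(trial_customers)

# Score the remaining (hypothetical) customers and contact only the likely buyers.
other_customers = {"likely": (5.0, 5), "unlikely": (3.0, 0)}
scores = {name: purchase_probability(model, x) for name, x in other_customers.items()}
to_contact = [name for name, p in scores.items() if p > 0.5]
```

The fitted weights play the role of the "driver attributes" mentioned above: a larger positive weight means the attribute pushes the predicted purchase probability up more strongly.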

A story about analyzing local buying patterns: a grocery chain discovered that when men bought diapers on Thursdays and Saturdays, they also tended to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping on Saturdays. On Thursdays, however, they only bought a few items. The retailer concluded that they purchased the beer to have it available for the upcoming weekend. The grocery chain could use this newly discovered information in various ways to increase revenue. For example, they could move the beer display closer to the diaper display. They could also make sure beer and diapers were sold at full price on Thursdays.

WalMart is pioneering massive data mining to transform its supplier relationships. WalMart captures point-of-sale transactions from over 2,900 stores in 6 countries and continuously transmits this data to its massive 7.5 terabyte Teradata data warehouse. WalMart allows more than 3,500 suppliers to access data on their products and perform data analyses. These suppliers use this data to identify customer buying patterns at the store display level. They use this information to manage local store inventory and identify new merchandising opportunities.

Data.Mining.Fox can help in marketing to predict the purchase probability of customers for a specific product. Easy.Data.Mining can add value by being profitably applied to marketing challenges.

Classes
Clusters
Associations
Sequential patterns

Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.

Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.

Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.
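A minimal sketch of how such an association can be quantified, using the standard support, confidence, and lift measures. The shopping baskets below are invented for illustration and are not data from the beer-diaper story.

```python
# Hypothetical shopping baskets (one set of items per transaction).
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
    {"bread", "chips"},
    {"milk"},
]

def support(items, baskets):
    """Share of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Share of baskets with the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """Confidence relative to how common the consequent is overall (1 = no association)."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

rule_support = support({"diapers", "beer"}, baskets)          # 3 of 8 baskets
rule_confidence = confidence({"diapers"}, {"beer"}, baskets)  # 3 of the 4 diaper baskets
rule_lift = lift({"diapers"}, {"beer"}, baskets)
```

A lift above 1 means that diaper buyers purchase beer more often than shoppers in general, which is the kind of pattern the retailer in the story acted on.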

Sequential patterns: Data is mined to anticipate behaviour patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes.

When the data are metric and quantitative in nature, the most common and important form of analysis is statistical analysis. The number of respondents in a quantitative market research project can often exceed a thousand, which produces a large number of data points in the data set.

After the collected data have been cleaned of illogical and missing responses, a good next step is to tabulate and cross-tabulate the respondents' answers to all the questions. Cross-tabulation by segments, such as demographic segments, is useful for validating the responses and making sense of the data. Another useful statistic is the mean, or average. The average response of all respondents, or of a cluster within the sample, is a good starting summary of the data. For example, if the market research project is about understanding the ability to pay for a certain product, then the average income of the respondents could be the very first statistic that gives a sense of the data. Suppose the average income of 1000 respondents is $1000. The average for various clusters can then be calculated. Cluster averages, such as the average income of men and women, of various age groups, or of various geographic locations, give a better picture of the situation at hand.
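The steps above (overall average, cluster averages, cross-tabulation) can be sketched with Python's standard library. The six respondents and the field names `gender`, `age_group`, and `income` are made up for the example.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical cleaned survey responses.
respondents = [
    {"gender": "F", "age_group": "18-34", "income": 900},
    {"gender": "M", "age_group": "18-34", "income": 700},
    {"gender": "F", "age_group": "35-54", "income": 1300},
    {"gender": "M", "age_group": "35-54", "income": 1100},
    {"gender": "F", "age_group": "55+", "income": 1000},
    {"gender": "M", "age_group": "55+", "income": 1000},
]

def group_means(rows, key, value):
    """Average of `value` within each cluster defined by `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {group: mean(values) for group, values in groups.items()}

# Mean = sum of all values divided by the number of values.
overall_mean = mean(row["income"] for row in respondents)

# Cluster averages by gender and by age group.
mean_by_gender = group_means(respondents, "gender", "income")
mean_by_age = group_means(respondents, "age_group", "income")

# Cross-tabulation: number of respondents in each gender x age-group cell.
crosstab = Counter((row["gender"], row["age_group"]) for row in respondents)
```

Here the overall average hides the fact that the two gender clusters and the three age clusters have different averages, which is why the cluster breakdown gives a better picture.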

The mode is the value that has the maximum number of occurrences. The mode corresponds to the highest peak of the frequency distribution curve; for a normal distribution, that peak lies at the mean, so the mode and the mean coincide. The mode is a good measure of location when the variable is categorical.
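A short illustration of the mode using the `statistics` module from the Python standard library (Python 3.8 or later). The brand preferences are invented data.

```python
from statistics import mode, multimode

# Hypothetical categorical answers: preferred brand of each respondent.
preferred_brand = ["A", "C", "B", "A", "C", "A", "B"]

# The mode is the most frequent value; it works for categorical data,
# where a mean or median would make no sense.
most_common_brand = mode(preferred_brand)

# A data set can have several modes; multimode returns all of them.
tied_modes = multimode([1, 1, 2, 2, 3])
```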

The middle value of ranked data is the median. If the number of data points is even, the median is calculated by taking the average of the two middle values. Half of the values in the data set lie above the median and half lie below it. Therefore, the median is the 50th percentile. The median is a good measure of central tendency for ordinal data.
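The rule for odd and even numbers of data points can be written out directly. The function name `median_by_hand` and the sample values are illustrative; the result is compared with `statistics.median` from the standard library.

```python
from statistics import median

def median_by_hand(values):
    """Middle value of the ranked data; average of the two middle values if n is even."""
    ranked = sorted(values)
    n = len(ranked)
    middle = n // 2
    if n % 2 == 1:
        return ranked[middle]
    return (ranked[middle - 1] + ranked[middle]) / 2

odd_sample = [7, 1, 3]       # ranked: 1, 3, 7
even_sample = [7, 1, 3, 5]   # ranked: 1, 3, 5, 7

odd_median = median_by_hand(odd_sample)
even_median = median_by_hand(even_sample)
```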

The range measures the spread of the data: it is the distance between the largest and the smallest value. The range is therefore directly affected by outliers. It is advisable to identify outliers with a box plot or another tool, and to remove them where justified, before any statistical analysis.
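A sketch of how a single outlier inflates the range, and of the usual box-plot rule for flagging outliers (values more than 1.5 interquartile ranges outside the quartiles). The data are invented, and the 1.5 factor is the common convention rather than something stated in the slides.

```python
from statistics import quantiles

# Hypothetical measurements with one suspicious value.
values = [12, 14, 15, 15, 16, 17, 18, 19, 95]

def data_range(xs):
    """Distance between the largest and the smallest value."""
    return max(xs) - min(xs)

# Quartiles and interquartile range, as drawn in a box plot.
q1, q2, q3 = quantiles(values, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in values if x < lower_fence or x > upper_fence]
cleaned = [x for x in values if lower_fence <= x <= upper_fence]

range_with_outlier = data_range(values)
range_without_outlier = data_range(cleaned)
```

Whether a flagged value should actually be dropped depends on whether it is a recording error or a genuine observation; the rule only points out candidates.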

The difference between an observed value and the mean is called a deviation from the mean. The variance is the average of the squared deviations from the mean over all the values. The variance is never negative; it is zero only when all values are identical. If the data points are clustered closely around the mean, the variance is small. If the data points are widely scattered around the mean, the variance is large.

The standard deviation is the square root of the variance.
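The definitions of deviation, variance, and standard deviation can be checked on a small made-up data set. The sketch follows the definition given here (the average of the squared deviations, i.e. the population variance, `pvariance` in Python's `statistics` module); the sample variance, which divides by n - 1 instead of n, is a different quantity.

```python
from math import sqrt
from statistics import pstdev, pvariance

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical observations

def population_variance(xs):
    """Average of the squared deviations from the mean."""
    m = sum(xs) / len(xs)
    deviations = [x - m for x in xs]
    return sum(d ** 2 for d in deviations) / len(xs)

variance = population_variance(values)  # mean is 5, squared deviations sum to 32
std_dev = sqrt(variance)                # standard deviation = square root of variance
```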

The histogram is a summary graph showing a count of the data points falling in various ranges. The effect is a rough approximation of the frequency distribution of the data.

The groups of data are called classes, and in the context of a histogram they are known as bins, because one can think of them as containers that accumulate data and "fill up" at a rate equal to the frequency of that data class. The histogram of the frequency distribution can be converted to a probability distribution by dividing the tally in each group by the total number of data points to give the relative frequency.
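A minimal sketch of binning and of the conversion from counts to relative frequencies. Equal-width bins are an assumption made for the example (the slides do not prescribe how bins are chosen), and the data are invented.

```python
def histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins.

    Assumes the values are not all identical, so the bin width is positive.
    """
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The largest value would land one past the end, so clamp it into the last bin.
        index = min(int((v - lowest) / width), n_bins - 1)
        counts[index] += 1
    return lowest, width, counts

values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # hypothetical observations
lowest, width, counts = histogram(values, n_bins=4)

# Relative frequency: tally in each bin divided by the total number of data points.
relative_frequency = [c / len(values) for c in counts]

# Text rendering of the histogram, one row per bin.
rows = [
    f"[{lowest + i * width:4.1f}, {lowest + (i + 1) * width:4.1f}) {'#' * c}"
    for i, c in enumerate(counts)
]
```

The relative frequencies sum to 1, which is what lets the histogram be read as an approximate probability distribution.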

rs/03/Quiz/Quiz.swf