APPLIED DATA ANALYSIS/ANALYTICS using STATA
M.A. Isiaka, FCMA, ACA, CIIA, ANIMN, PhD
Department of Economics, Accounting & Finance, College of Management Sciences
Bells University of Technology, Ota, Ogun State, Nigeria
STATA Environment
Inspecting and Describing Data
Load Data Set-1.
Use the list command to display the two variables (price and quantity).
Use the browse command to open a spreadsheet view of the data in a new window.
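The steps on this slide can be sketched as a short do-file. The filename dataset1.dta is a placeholder, since the slides do not give the actual file name; the variable names price and quantity are taken from the slide.

```stata
* Hypothetical filename: replace with the actual path to Data Set-1
use "dataset1.dta", clear

* Display the two variables in the Results window
list price quantity

* Open the data in the (read-only) Data Browser window
browse
```

For a large dataset, `list price quantity in 1/10` limits the listing to the first ten observations.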
Examine Descriptive Statistics
Use the summarize command to display the number of observations, mean, standard deviation, minimum, and maximum.
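A minimal sketch of this step, again assuming the placeholder filename dataset1.dta and the variables price and quantity:

```stata
* Hypothetical filename: replace with the actual path to Data Set-1
use "dataset1.dta", clear

* Observations, mean, standard deviation, minimum and maximum
summarize price quantity

* The detail option adds percentiles, variance, skewness and kurtosis
summarize price quantity, detail
```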
Plot of the Data
Use the plot command to display the y-variable (vertical axis) against the x-variable (horizontal axis).
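As an illustration, the sketch below puts price on the vertical axis and quantity on the horizontal axis; that assignment is an assumption, since the slide does not say which variable is y. Note that plot is a legacy text-mode command and may not be documented in recent Stata releases, so the scatter command is shown as the usual graphical alternative.

```stata
* Hypothetical filename: replace with the actual path to Data Set-1
use "dataset1.dta", clear

* Legacy text-mode plot: first variable is y, second is x
plot price quantity

* Graphical scatter plot with the same axis assignment
scatter price quantity
```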
Generating New Variables
Use the generate command to create the log of each variable.
Use the list command to view all the new variables. What are your observations?
Use the plot command to view the relationship between the log variables.
Repeat the plot using the gr7, scatter, line, and twoway connected commands.
Comment briefly on the results.
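One way to work through this slide is sketched below. The new variable names lnprice and lnquantity are illustrative, the filename is a placeholder, and natural logs are assumed because the slide does not specify a base.

```stata
* Hypothetical filename: replace with the actual path to Data Set-1
use "dataset1.dta", clear

* Natural logs; ln() returns missing for zero or negative values
generate lnprice = ln(price)
generate lnquantity = ln(quantity)

* View the original and transformed variables side by side
list price quantity lnprice lnquantity

* Legacy text-mode plot of the log variables
plot lnprice lnquantity

* The same relationship with four other graph commands
gr7 lnprice lnquantity
scatter lnprice lnquantity
line lnprice lnquantity, sort
twoway connected lnprice lnquantity, sort
```

The sort option on line and twoway connected orders the points by the x-variable, so the connecting line does not zigzag across the graph.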
Basic Regression
Use the regress command with the robust option on the base variables and on the transformed variables.
Use the predict command to obtain the estimated values of the dependent variable.
Display the original and estimated values of the dependent variable.
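A possible sketch of these steps follows. It assumes quantity is the dependent variable and price the regressor, which the slide does not state; swap them if the exercise intends the reverse. The names quantity_hat, lnprice, lnquantity and lnquantity_hat are illustrative.

```stata
* Hypothetical filename: replace with the actual path to Data Set-1
use "dataset1.dta", clear

* Regression on the base variables with robust standard errors
* (vce(robust) is equivalent to the older robust option)
regress quantity price, vce(robust)

* Fitted values of the dependent variable
predict quantity_hat, xb
list quantity quantity_hat

* Same steps on the log-transformed variables
generate lnprice = ln(price)
generate lnquantity = ln(quantity)
regress lnquantity lnprice, vce(robust)
predict lnquantity_hat, xb
list lnquantity lnquantity_hat
```

In the log-log specification, the coefficient on lnprice can be read as an elasticity.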
Analysis of Survey Data
Load the census dataset.

Codebook
Id          unique identifier
Fullname    Firstname LASTNAME
Age         age in years
Gender      1 = male, 0 = female
Smoke       0 = non-smoker, 1 = smoker
Bloodtype   A, B, O, AB
Race        race (white, black, Hispanic, other)
Weight      weight in kg
Height      height in centimeters
Diabetes    0 = no diabetes, 1 = diabetes
Hand        0 = right-handed, 1 = left-handed
Dentist     numeric, number of visits

Use the describe command to examine the nature of the variables.
How many respondents are there?
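A minimal sketch for this slide, assuming the census data are saved under the placeholder name census_survey.dta (the slides do not give the file name):

```stata
* Hypothetical filename: replace with the actual path to the census dataset
use "census_survey.dta", clear

* Storage type, display format and label of every variable
describe

* Number of observations, i.e. respondents
count

* One-line summary per variable, useful for checking coded values
codebook, compact
```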
Examine Box Plot
Use graph box with the by() option to examine the distribution of weight for male and female respondents.
Do the same for height.
Generate bmi as weight divided by the square of height, multiplied by 10,000 (the conversion needed because height is recorded in centimeters).
Based on the box plot, which gender has the higher bmi?
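These steps might look as follows. The filename is a placeholder, and the lowercase variable names weight, height and gender are an assumption based on the codebook; Stata is case-sensitive, so adjust them to match the dataset. BMI is computed with the standard definition of kilograms per square meter, which with height in centimeters means multiplying by 10,000.

```stata
* Hypothetical filename: replace with the actual path to the census dataset
use "census_survey.dta", clear

* Box plots of weight and height, one panel per gender
graph box weight, by(gender)
graph box height, by(gender)

* BMI = kg / m^2; height is in cm, hence the factor 10000
generate bmi = weight / height^2 * 10000

* Compare the BMI distribution of male and female respondents
graph box bmi, by(gender)
```

Using `over(gender)` instead of `by(gender)` places both boxes in a single panel on a shared axis, which can make the comparison easier to read.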
Constructing Histograms
Construct a histogram of BMI frequency by smoking status.
Is the BMI distribution symmetrical for both smokers and non-smokers?
Using a histogram and a box plot, examine the age distribution of the respondents.
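A sketch of the histogram steps, with the same assumptions as before: a placeholder filename, lowercase variable names (smoke, age, weight, height), and bmi regenerated here so the block runs on its own.

```stata
* Hypothetical filename: replace with the actual path to the census dataset
use "census_survey.dta", clear

* BMI = kg / m^2, with height converted from cm to m
generate bmi = weight / (height / 100)^2

* Frequency histogram of BMI, one panel per smoking status
histogram bmi, frequency by(smoke)

* Age distribution: histogram and box plot
histogram age, frequency
graph box age
```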
Further Practices based on Dataset-3
How many observations are in this dataset?
How many variables are in this dataset?
Further Practices based on Dataset-3
Using the information from describe and from the data browser, identify the type of each variable.
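The commands below are one way to gather the information these two slides ask for. The filename dataset3.dta is a placeholder for Dataset-3.

```stata
* Hypothetical filename: replace with the actual path to Dataset-3
use "dataset3.dta", clear

* Header reports the number of observations and variables;
* the table reports each variable's storage type and label
describe

* The same counts are available as system values
display c(N)
display c(k)

* Inspect the actual values to judge each variable's type
codebook, compact
browse
```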
Further Practices based on Dataset-3
Generally, apgar scores of 3 or below are considered critical, scores of 4-6 are considered low, and scores of 7+ are considered normal.
Please create a new variable named status with the values 1=critical, 2=low and 3=normal.
What type of variable is status?
nominal categorical
ordinal categorical
continuous numeric
discrete numeric
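A minimal sketch of creating status, assuming the score is stored in a variable named apgar5 (the name used in the later questions) and using the placeholder filename dataset3.dta. The label name status_lbl is illustrative.

```stata
* Hypothetical filename: replace with the actual path to Dataset-3
use "dataset3.dta", clear

* Start with missing, then fill in each category
generate status = .
replace status = 1 if apgar5 <= 3
replace status = 2 if apgar5 >= 4 & apgar5 <= 6

* Stata treats missing as larger than any number, so exclude it explicitly
replace status = 3 if apgar5 >= 7 & !missing(apgar5)

* Attach value labels so tables show names instead of codes
label define status_lbl 1 "critical" 2 "low" 3 "normal"
label values status status_lbl

tabulate status
```

The explicit `!missing(apgar5)` check matters: without it, babies with no recorded score would be classified as normal.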
Further Practices based on Dataset
i. What percent of babies are considered "critical" according to the apgar5 score?
ii. What percent of babies are considered "normal" according to the apgar5 score?
iii. What is the mean systolic blood pressure?
iv. What is the median systolic blood pressure?
v. What is the 90th percentile of systolic blood pressure?
vi. What is the range for systolic blood pressure?
vii. What is the interquartile range (Q3 - Q1)?
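These questions can be approached with the commands sketched below. The variable name sbp for systolic blood pressure and the filename dataset3.dta are assumptions, since the slides name neither; status is rebuilt here with recode so the block runs on its own.

```stata
* Hypothetical filename: replace with the actual path to Dataset-3
use "dataset3.dta", clear

* 1=critical (3 or below), 2=low (4-6), 3=normal (7+); missing stays missing
recode apgar5 (min/3 = 1) (4/6 = 2) (7/max = 3), generate(status)

* Percent of babies in each category (questions i and ii)
tabulate status

* Mean, median (50%), 90th percentile, min and max (questions iii to vi)
* sbp is a hypothetical variable name
summarize sbp, detail

* Range and interquartile range from the stored results
display "Range: " r(max) - r(min)
display "IQR:   " r(p75) - r(p25)

* 90th percentile with a confidence interval
centile sbp, centile(90)
```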
Further Practices based on Dataset
a) What is the variance?
b) What is the standard deviation?
c) What is the skewness?
d) What is the kurtosis?
e) 25% of observations have a systolic blood pressure below _______.
f) What is the mean systolic blood pressure for babies with a diagnosis of germinal matrix hemorrhage? (Hint: use "summarize by"; review the example of age summary by smoking status.)
g) What is the mean systolic blood pressure for babies without a diagnosis of germinal matrix hemorrhage?
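A sketch for this set of questions follows. The variable names sbp (systolic blood pressure) and gmh (germinal matrix hemorrhage indicator) are hypothetical, as is the filename; substitute the names shown by describe.

```stata
* Hypothetical filename: replace with the actual path to Dataset-3
use "dataset3.dta", clear

* Variance, standard deviation, skewness, kurtosis and the 25% percentile
* sbp is a hypothetical variable name
summarize sbp, detail
display "Variance: " r(Var)
display "Std. dev: " r(sd)
display "Skewness: " r(skewness)
display "Kurtosis: " r(kurtosis)
display "Q1 (25%): " r(p25)

* Mean by diagnosis group; gmh is a hypothetical variable name
bysort gmh: summarize sbp

* Compact alternative: one row per group
tabstat sbp, by(gmh) statistics(mean n)
```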