DATA-DRIVEN STATISTICAL RESEARCH
By Xianghua Luo

Why does a statistical consultant need to do methodological research?
–It helps you do your consulting work better.
–It gives you closure on a study you have been involved in or a new method you have just learned.
–It drives you to learn new things.
How to find topics?
–Research how others have analyzed the same type of data. This takes a lot of reading.
–Ask what can be improved.
–How do you convince people to use the new method you propose? Publish!
–Write for both statistical journals and scientific journals. (This is proof that you care about their scientific problems, understand those problems, and speak their language.)
Do you need your own funding to support your research?
–It depends.
–If you do, it is usually not hard to find an existing data set or an ongoing study that you are already involved in. An R03/R21 for secondary analysis of existing data or an existing project might be a good starting point.
–Holding a PhD means you will be a PI one way or another someday. Better to practice early.
AN EXAMPLE OF DATA-DRIVEN RESEARCH
Analysis of Cigarette Purchase Task Instrument Data with a Left-Censored Mixed Effects Model
Joint work with Liao W, Le C, Chu H, et al.
Cigarette Purchase Task Survey
Imagine a TYPICAL DAY during which you smoke. The following questions ask how many cigarettes you would consume if they cost various amounts of money. Assume the following:
–Available cigarettes are your favorite brand.
–You have the same income/savings that you have now.
–You have NO ACCESS to any cigarettes or nicotine products other than those offered at these prices.
–You consume the cigarettes you request on that day (in other words, no stockpiling).
Participants were then asked to respond to the following set of questions:
How many cigarettes would you smoke if they were _____ each?
0¢ (free), 1¢, 5¢, 13¢, 25¢, 50¢, $1, $2, $3, $4, $5, $6, $11, $35, $70, $140, $280, $560, $1,120.
Figure. A typical cigarette demand curve for a smoker, derived from cigarette purchase task survey data (plotted on log-log axes).
Existing statistical methods:
–Individual-specific ordinary least squares model.
–Mixed effects model.

How are the extra zeros/missing values handled in existing methods?
–Ignore all zeros and missing values;
–Impute the first zero with an arbitrary small number ω (e.g., 0.1) and ignore any further zeros;
–Impute all zeros/missing values with ω.
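The three zero-handling conventions above can be compared side by side with a minimal Python sketch. The slides do not give the exact form of the individual-specific model, so the straight line in log-log coordinates is a simplifying assumption here, as are the function names, the consumption values (made up for one hypothetical smoker), and the choice to drop the free (0¢) price because its logarithm is undefined.

```python
import numpy as np

def handle_zeros(consumption, strategy, omega=0.1):
    """Apply one zero-handling convention; return (values, keep_mask)."""
    q = np.asarray(consumption, dtype=float)
    is_zero = np.isnan(q) | (q <= 0)      # zeros and missing values treated alike
    keep = ~is_zero
    out = q.copy()
    if strategy == "ignore":              # drop every zero/missing response
        pass
    elif strategy == "first_zero":        # impute only the first zero with omega
        idx = np.flatnonzero(is_zero)
        if idx.size:
            out[idx[0]] = omega
            keep[idx[0]] = True
    elif strategy == "all_zeros":         # impute every zero/missing with omega
        out[is_zero] = omega
        keep[:] = True
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return out, keep

def fit_loglog_ols(prices, consumption, strategy, omega=0.1):
    """Individual-specific OLS of log(consumption) on log(price)."""
    p = np.asarray(prices, dtype=float)
    q, keep = handle_zeros(consumption, strategy, omega)
    keep &= p > 0                         # assumption: drop the free price, log(0) undefined
    slope, intercept = np.polyfit(np.log(p[keep]), np.log(q[keep]), 1)
    return float(slope), float(intercept), int(keep.sum())

# Survey prices in dollars; consumption values are hypothetical.
prices = [0, 0.01, 0.05, 0.13, 0.25, 0.50, 1, 2, 3, 4, 5, 6,
          11, 35, 70, 140, 280, 560, 1120]
consumption = [30, 30, 28, 25, 20, 15, 10, 5, 3, 2, 1, 1,
               0, 0, 0, 0, 0, 0, 0]

for strategy in ("ignore", "first_zero", "all_zeros"):
    slope, intercept, n_used = fit_loglog_ols(prices, consumption, strategy)
    print(f"{strategy:>10}: slope={slope:.3f}, n_used={n_used}")
```

Running the three strategies on the same respondent shows how much the fitted slope depends on an arbitrary choice of ω and on how many zeros are kept, which is the motivation for treating the zeros more carefully.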
Any problems with the existing methods?
–Could the zeros be small values that are unobservable because they fall below a certain threshold (a limit of detection, LOD)?
  A left-censored mixed effects model.
–What if some zeros are true zero consumption (i.e., complete cessation of smoking)?
  A joint modeling approach, with a logistic regression component for cessation status and a mixed effects model for the data while complete cessation has not been achieved.
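To make the left-censoring idea concrete, here is a minimal Python sketch of the marginal log-likelihood for a random-intercept model in which responses below a limit contribute a normal CDF term instead of a density term. This is an illustration under stated assumptions, not the authors' implementation: the random-intercept-only structure, the linear mean in x, the Gauss-Hermite integration over the random effect, and all names (`censored_loglik`, `limit`, `n_nodes`) are hypothetical choices. The logistic component of the joint model is not covered.

```python
import math
import numpy as np

_erfc = np.vectorize(math.erfc)

def _norm_logpdf(z):
    """Log density of the standard normal."""
    return -0.5 * z**2 - 0.5 * math.log(2 * math.pi)

def _norm_logcdf(z):
    """Log CDF of the standard normal, Phi(z) = 0.5 * erfc(-z / sqrt(2))."""
    return np.log(np.maximum(0.5 * _erfc(-z / math.sqrt(2)), 1e-300))

def censored_loglik(params, x, y, censored, subject, limit, n_nodes=20):
    """Marginal log-likelihood of y_ij = b0 + u_i + b1 * x_ij + e_ij,
    with u_i ~ N(0, tau^2), e_ij ~ N(0, sigma^2), and left-censoring at `limit`.

    x, y      : log price and log consumption (y may be NaN where censored)
    censored  : boolean array, True where the response is below `limit`
    subject   : subject identifier for each observation
    """
    b0, b1, log_sigma, log_tau = params
    sigma, tau = math.exp(log_sigma), math.exp(log_tau)
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    u = math.sqrt(2.0) * tau * nodes                 # random-intercept values
    log_w = np.log(weights) - 0.5 * math.log(math.pi)

    total = 0.0
    for s in np.unique(subject):
        m = subject == s
        mu = b0 + b1 * x[m][:, None] + u[None, :]    # shape (n_i, n_nodes)
        z_obs = (y[m][:, None] - mu) / sigma
        z_cen = (limit - mu) / sigma
        contrib = np.where(censored[m][:, None],
                           _norm_logcdf(z_cen),              # P(Y < limit)
                           _norm_logpdf(z_obs) - log_sigma)  # density of observed y
        # Integrate over u_i: log-sum-exp across quadrature nodes.
        total += np.logaddexp.reduce(log_w + contrib.sum(axis=0))
    return float(total)
```

Maximizing this function over `params` with a general-purpose optimizer would give estimates for the sketch model. The point of the construction is that a censored response contributes the probability of falling below the threshold, so no arbitrary ω has to be imputed.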
What else can you do to improve your consulting work?
–Go to scientific seminars.
–Serve as a referee for scientific journals.
–Serve as a statistical reviewer on protocol/proposal review committees.