Lesson: Data and Diabetes
All of us hear about risk factors for disease and all of us undergo medical screening tests. This lesson uses data from the National Health and Nutrition Examination Survey (NHANES) to illustrate how a screening test can be constructed. Link to lesson document.
The lesson can be used with all audiences to introduce and work with proportions and cross-validation. All the computer commands are complete. The lesson provides several opportunities for students to make simple changes to the commands (e.g. changing a numerical threshold) and observe how the performance of the screening test changes.
The lesson introduces some vocabulary of medical screening tests and relates that vocabulary to specific proportions in the cross-validation.
NOTE: THIS IS A DRAFT PROTOTYPE OF A LESSON POSTING, written to help us at StatPREP consider how best to structure these things. The contents of this post are correct, and you’re welcome to use the lesson, but things may change!
Discussion questions
For health sciences students and others interested in medical screening, the lesson provides opportunities to go into more depth.
 What is the point of matching cases and controls? Can you see in the first few rows of the table which are the cases and which the corresponding controls? For one of the rows that doesn’t appear to have a matched case/control pair, scroll down through the list to find the row of the match.
Matched case/control studies are common when the people with the disease are sampled in a setting where the disease prevalence is high, for instance a medical clinic. Matching makes it possible to study even diseases that, unlike diabetes, have a very low prevalence.

 What is the prevalence of diabetes in the matched sampling data? Is this the same as in the Complete data set?
In the matched sampling data, the prevalence is 50%. That’s simply because half the rows are people with diabetes and the others are people without diabetes. In the Complete data set, there are many fewer people with diabetes than not.
Unless we know that the Complete data set is representative of the population (e.g. because it was created by random sampling), there’s no way to know whether the prevalence in Complete is itself representative of the population.
To do the prevalence calculation, you can use a statement like the following in any of the computation windows in the lesson.
Matched %>%
df_props( ~ diabetes)
 What is the “right” balance between false-positive and false-negative rates? Does it depend on the disease? How so?
 Building on the previous question, why isn’t “best” a simple matter of making the accuracy as high as possible?
 The data table Complete has many more cases in it. Compare the accuracy you get when evaluating a test on the Matched data and on the Complete data. When you use the same threshold throughout, do you get the same accuracy on the two data sets? Are they even similar? What does the prevalence have to do with it?
 Now do the same but with the sensitivity as the measure of the test’s performance. Are the sensitivities as estimated from Matched and from Complete similar to one another?
 Show, if you like, how to compute the accuracy, sensitivity, and specificity by hand from the table of counts.
 Discuss sensitivity and specificity of mammography. Lead up to an important question for helping patients understand the results of such tests: If the test result is positive, what is the probability that the patient has the disease?
 Are different screening tests appropriate for the different sexes or races or ages? What would you look for in the sensitivity and specificity or prevalence to see that this might be so?
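For the by-hand calculation suggested in the accuracy and sensitivity questions above, a minimal sketch in base R may help. The counts are hypothetical, invented only for illustration (they are not taken from Matched or Complete), and the helper accuracy_at() is an assumed name, not part of the lesson.

```r
# Hypothetical table of counts (NOT from NHANES): rows are the test result,
# columns are the true disease status.
counts <- matrix(
  c(45,  30,    # test positive: 45 with diabetes, 30 without
     5, 120),   # test negative:  5 with diabetes, 120 without
  nrow = 2, byrow = TRUE,
  dimnames = list(test = c("positive", "negative"),
                  diabetes = c("yes", "no"))
)

TP <- counts["positive", "yes"]  # true positives
FP <- counts["positive", "no"]   # false positives
FN <- counts["negative", "yes"]  # false negatives
TN <- counts["negative", "no"]   # true negatives

sensitivity <- TP / (TP + FN)           # 45/50   = 0.9
specificity <- TN / (TN + FP)           # 120/150 = 0.8
accuracy    <- (TP + TN) / sum(counts)  # 165/200 = 0.825
prevalence  <- (TP + FN) / sum(counts)  # 50/200  = 0.25

# Accuracy is a prevalence-weighted average of sensitivity and specificity,
# so the same test gives a different accuracy when the prevalence changes.
accuracy_at <- function(prev) sensitivity * prev + specificity * (1 - prev)

accuracy_at(0.25)  # 0.825, matches the table above
accuracy_at(0.50)  # 0.85, the prevalence of a matched sample
```

Sensitivity and specificity are each computed within one disease group, so they do not depend on how many people are in each group; accuracy mixes the two groups and therefore shifts with prevalence.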
Engaging Statistical practice
In the traditional statistics curriculum, the cross-tabulation might be used to illustrate the chi-squared test or the difference between two proportions.
Statistical techniques more aligned with contemporary practice focus on ratios and confidence intervals on those ratios rather than on differences in proportions: risk ratios, odds ratios, and confidence intervals on these ratios.
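As a rough illustration of the ratios mentioned above, the following base R sketch computes a risk ratio and an odds ratio with approximate 95% confidence intervals. The counts and variable names are hypothetical, and the Wald-type interval on the log scale is an assumption here: it is one common choice, not necessarily the method a future lesson would use.

```r
# Hypothetical counts (NOT from NHANES): people above vs. below some
# screening threshold, split by whether they have diabetes.
above_cases    <- 45   # above threshold, with diabetes
above_noncases <- 30   # above threshold, without diabetes
below_cases    <- 5    # below threshold, with diabetes
below_noncases <- 120  # below threshold, without diabetes

z <- qnorm(0.975)  # multiplier for an approximate 95% interval

# Risk ratio: risk of diabetes above the threshold relative to below it.
risk_above <- above_cases / (above_cases + above_noncases)  # 45/75
risk_below <- below_cases / (below_cases + below_noncases)  # 5/125
risk_ratio <- risk_above / risk_below                       # 15
se_log_rr  <- sqrt(1 / above_cases - 1 / (above_cases + above_noncases) +
                   1 / below_cases - 1 / (below_cases + below_noncases))
rr_ci <- exp(log(risk_ratio) + c(-1, 1) * z * se_log_rr)

# Odds ratio: odds of diabetes above the threshold relative to below it.
odds_ratio <- (above_cases * below_noncases) /
              (above_noncases * below_cases)                # 36
se_log_or  <- sqrt(1 / above_cases + 1 / above_noncases +
                   1 / below_cases + 1 / below_noncases)
or_ci <- exp(log(odds_ratio) + c(-1, 1) * z * se_log_or)

risk_ratio; rr_ci
odds_ratio; or_ci
```

Both intervals are built on the log scale and then exponentiated, which is why they are not symmetric around the estimate.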
Related lessons
[None yet]
Prospective lessons
This section is mainly intended for the developers of lessons and points to topics that can be covered with the same data.
 Risk ratios
 Odds ratios
 Bayesian inversion, e.g. given a positive test, what’s the probability that a patient has the disease?
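For the Bayesian inversion topic, a possible starting point is the sketch below, which applies Bayes' rule to turn sensitivity, specificity, and prevalence into the probability of disease given a positive test. The function name ppv() and the numerical values are assumptions chosen for illustration; they are not results from the NHANES data.

```r
# Probability of disease given a positive test (positive predictive value),
# via Bayes' rule. All inputs are proportions between 0 and 1.
ppv <- function(sensitivity, specificity, prevalence) {
  true_pos  <- sensitivity * prevalence              # P(positive and disease)
  false_pos <- (1 - specificity) * (1 - prevalence)  # P(positive and no disease)
  true_pos / (true_pos + false_pos)
}

# Hypothetical test: sensitivity 0.9, specificity 0.8.
ppv(0.9, 0.8, prevalence = 0.10)  # about 0.33
ppv(0.9, 0.8, prevalence = 0.50)  # about 0.82
```

The same test looks far more convincing at 50% prevalence than at 10%, which is one reason results estimated on matched data should not be read directly as probabilities for the general population.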