Lesson: Data and Diabetes

All of us hear about risk factors for disease, and all of us undergo medical screening tests. This lesson uses data from the National Health and Nutrition Examination Survey (NHANES) to illustrate how a screening test can be constructed. Link to lesson document.

The lesson can be used for all audiences to introduce and work with proportions and cross-validation. All the computer commands are complete. The lesson provides several opportunities for students to make simple changes to the commands (e.g. changing a numerical threshold) to observe how the performance of the screening test will change.

The lesson introduces some vocabulary of medical screening tests and relates that vocabulary to specific proportions in the cross tabulation.

NOTE: THIS IS A DRAFT PROTOTYPE OF A LESSON POSTING, written to help us at StatPREP consider how best to structure these things. The contents of this post are correct, and you’re welcome to use the lesson, but things may change!

Discussion questions

For health sciences students and others interested in medical screening, the lesson provides opportunities to go into more depth.

  • What is the point of matching cases and controls? Can you see in the first few rows of the table which are the cases and which the corresponding controls? For one of the rows that doesn’t appear to have a matched case/control pair, scroll down through the list to find the row of the match.

    Matched case/control studies are common when the people with the disease are sampled in a setting where the disease prevalence is high, for instance a medical clinic. This makes it possible to study even diseases whose prevalence, unlike that of diabetes, is very low.

  • What is the prevalence of diabetes in the matched sampling data? Is this the same as in the Complete data set?

    In the matched sampling data, the prevalence is 50%. That’s simply because half the rows are people with diabetes and the others are people without diabetes. In the Complete data set, there are many fewer people with diabetes than not.

    Unless we know that the Complete data set is representative of the population (e.g. because it was created by random sampling), there’s no way to know whether the prevalence in Complete is itself representative of the population.

    To do the prevalence calculation, you can use a statement like the following in any of the computation windows in the lesson.

        Matched %>%
          df_props( ~ diabetes)

  • What is the “right” balance between false-positive and false-negative rates? Does it depend on the disease? How so?
  • Building on the previous question, why isn’t the “best” test simply the one with the highest possible accuracy?
  • The data table Complete has many more cases in it. Compare the accuracy you get when evaluating a test on the Matched data and on the Complete data. When you use the same threshold throughout, do you get the same accuracy on the two data sets? Are they even similar? What does the prevalence have to do with it?
  • Now do the same but with the sensitivity as the measure of the test’s performance. Are the sensitivities as estimated from Matched and from Complete similar to one another?
  • Show, if you like, how to compute the accuracy, sensitivity, and specificity by hand from the table of counts.
  • Discuss sensitivity and specificity of mammography. Lead up to an important question for helping patients understand the results of such tests: If the test result is positive, what is the probability that the patient has the disease?
  • Are different screening tests appropriate for the different sexes or races or ages? What would you look for in the sensitivity and specificity or prevalence to see that this might be so?
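
For the by-hand computation and the accuracy comparison in the questions above, a minimal sketch in base R may help. The counts are made up for illustration and are not NHANES results, and the names `counts` and `accuracy_at` are hypothetical, not part of the lesson.

```r
# Hypothetical 2x2 table of counts (NOT from NHANES).
# Rows are the test result; columns are the true disease status.
counts <- matrix(c(90, 30,    # test positive: 90 with disease, 30 without
                   10, 70),   # test negative: 10 with disease, 70 without
                 nrow = 2, byrow = TRUE,
                 dimnames = list(test = c("positive", "negative"),
                                 disease = c("yes", "no")))

TP <- counts["positive", "yes"]   # true positives
FP <- counts["positive", "no"]    # false positives
FN <- counts["negative", "yes"]   # false negatives
TN <- counts["negative", "no"]    # true negatives

sensitivity <- TP / (TP + FN)            # 90 / 100: among people with the disease
specificity <- TN / (TN + FP)            # 70 / 100: among people without it
accuracy    <- (TP + TN) / sum(counts)   # 160 / 200: among everyone

# Accuracy is a prevalence-weighted average of sensitivity and specificity,
# so the same test scores differently when the prevalence changes.
accuracy_at <- function(prevalence) {
  prevalence * sensitivity + (1 - prevalence) * specificity
}

accuracy_at(0.5)   # matches `accuracy` above (50% prevalence, as in matched data)
accuracy_at(0.1)   # lower here, because specificity now carries more weight
```

Sensitivity and specificity are each computed within one column of the table, which is why they should not depend on how many people with and without the disease were sampled, while accuracy does.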

Engaging statistical practice

In the traditional statistics curriculum, the cross tabulation might be used to illustrate the chi-squared test or the difference between two proportions.

Statistical techniques more aligned with contemporary practice focus on ratios rather than on differences in proportions: risk ratios, odds ratios, and confidence intervals on those ratios.
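
As a rough illustration of these ratio-based summaries, the base R sketch below computes a risk ratio and an odds ratio with approximate 95% confidence intervals. The counts and the "exposed"/"unexposed" grouping are hypothetical, not NHANES results, and the intervals use the common large-sample normal approximation on the log scale, which is only one of several possible methods.

```r
# Hypothetical counts (NOT from NHANES): diabetes status by a risk factor.
exp_yes   <- 40;  exp_no   <- 160   # exposed group:   with / without diabetes
unexp_yes <- 20;  unexp_no <- 180   # unexposed group: with / without diabetes

# Risk ratio: ratio of the two proportions with diabetes.
risk_exposed   <- exp_yes   / (exp_yes   + exp_no)
risk_unexposed <- unexp_yes / (unexp_yes + unexp_no)
risk_ratio     <- risk_exposed / risk_unexposed

# Odds ratio: ratio of the two odds of diabetes.
odds_ratio <- (exp_yes / exp_no) / (unexp_yes / unexp_no)

# Large-sample standard errors on the log scale.
se_log_rr <- sqrt(1 / exp_yes - 1 / (exp_yes + exp_no) +
                  1 / unexp_yes - 1 / (unexp_yes + unexp_no))
se_log_or <- sqrt(1 / exp_yes + 1 / exp_no + 1 / unexp_yes + 1 / unexp_no)

# Approximate 95% intervals, computed on the log scale and transformed back.
z     <- qnorm(0.975)
rr_ci <- exp(log(risk_ratio) + c(-1, 1) * z * se_log_rr)
or_ci <- exp(log(odds_ratio) + c(-1, 1) * z * se_log_or)

risk_ratio; rr_ci
odds_ratio; or_ci
```

An interval that excludes 1 suggests the two groups differ in risk (or odds); an interval that includes 1 is consistent with no difference.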

Related lessons

[None yet]

Prospective lessons

This section is mainly intended for the developers of lessons and points to topics that can be covered with the same data.

  • Risk ratios
  • Odds ratios
  • Bayesian inversion, e.g. given a positive test, what’s the probability that a patient has the disease?
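
For the Bayesian inversion item, a minimal sketch of the calculation via Bayes' rule is given below. The function name `prob_disease_given_positive` and the sensitivity, specificity, and prevalence values are illustrative assumptions, not figures from the lesson or from NHANES.

```r
# Probability of disease given a positive test, by Bayes' rule.
# All three inputs are proportions between 0 and 1.
prob_disease_given_positive <- function(sensitivity, specificity, prevalence) {
  true_pos  <- sensitivity * prevalence               # P(positive and disease)
  false_pos <- (1 - specificity) * (1 - prevalence)   # P(positive and no disease)
  true_pos / (true_pos + false_pos)
}

# Same hypothetical test, two different prevalences.
prob_disease_given_positive(0.9, 0.9, prevalence = 0.5)    # 50/50, as in matched data
prob_disease_given_positive(0.9, 0.9, prevalence = 0.01)   # a rare disease
```

With these made-up numbers the answer falls from 0.9 at 50% prevalence to roughly 0.08 at 1% prevalence, which is why the prevalence in the population being screened matters when interpreting a positive result.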