Behind the Scenes: transforming traditional to data-centric. Part I
It seems so sensible to say that statistics should be data-centric. But traditionally statistics has been taught as an extension of mathematics. Data didn’t enter into it except as fodder for exercises and test questions.
In this Behind the Scenes series, we’ll look at some examples of turning traditional lessons into data-centric ones. But note: The best lessons start with data or with a process for collecting it. We are violating that advice here only to help develop the ability to recognize some differences between traditional and data-centric approaches.
Our traditional lesson here was contributed by Kathryn Kozak, Coconino Community College, who is one of the StatPREP team leaders. The original lesson is written here in non-italics, but our comments on it are in italics. Our data-centric translation is described in this blog post and the lesson itself, in a form appropriate for in-class use, can be reached through this link.
Diagnostic tests of medical conditions can have several types of results. The test result can be positive or negative, whether or not a patient has the condition. A positive test (+) indicates that the patient has the condition. A negative test (-) indicates that the patient does not have the condition. Remember, a positive test does not prove that the patient has the condition. Additional medical work may be required. Consider a random sample of 200 patients, some of whom have a medical condition and some of whom do not.
There are both good and bad aspects to this introduction. First, the setting, medical diagnosis, is relevant to just about everyone, as is the idea of false positives and false negatives. People teaching hypothesis testing may find motivation in the analogy between false positives and Type I error, and similarly between false negatives and Type II error. But be careful: Type I error is defined relative to an assumed hypothesis, the null. We don’t properly speak of the null hypothesis being true. But a patient can definitively not have a disease.
The use of the word “condition” is problematical. In the paragraph, the word refers to a state of health, but in the body of the problem, the probabilities presented are “conditional probabilities”. This is unnecessarily confusing, since the word “disease” can be used to refer to the state of health.
Another problem has to do with verisimilitude: having a problem that is realistic. Make the problem about some definite disease, not some vague “medical condition.” Then you can use numbers generated according to the observed patterns, e.g. prevalence, of that disease. (In the “data” used in the problem, the prevalence is 130/200, which is unusually high for something called a disease. Many people would call a state shared by 130 out of 200 people “normal,” not a disease.) And then there is the “random sample of 200 patients.” Some description should be given of a plausible sampling protocol. An opportunity is being lost here to teach something about how medicine and medical research are done. In the data-centric version of this lesson, the data are described as coming from a matched sampling, as would be done in a real case-control study. That context also explains the “prevalence” of 50%, since for every person in the sample with the disease there is a matched person without the disease.
Results of a new diagnostic test for the condition are shown.
| | Condition present | Condition absent | Row total |
|---|---|---|---|
| Test result + | 110 | 20 | 130 |
| Test result – | 20 | 50 | 70 |
Notice that the numbers are all multiples of 10. A student may not know whether that’s an important feature of the table. If they know it’s not, then the table sends a signal of being made up. Real data are more compelling than made-up data. Realistic made-up data are more compelling than obviously made-up data.
Traditional instructors might see the table as being “data.” But think about where these numbers might have come from. In the real world, most likely they came from a table where the unit of observation is an individual person, and the variables include whether the person has the disease, what the person’s test result was, and almost certainly additional covariates such as age, sex, etc.
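To make the distinction concrete, here is a minimal sketch in Python (our choice for illustration, not the software used in the lesson). It expands the four cell counts from the lesson's table into a hypothetical case-level table with one row per person, then tallies those rows back into the cells. The field names `disease` and `test` are assumptions. A real case-level table would also carry covariates such as age and sex, which cannot be recovered from the summary.

```python
from collections import Counter

# Cell counts copied from the lesson's table:
# (disease status, test result) -> number of people
cells = {
    ("present", "+"): 110,
    ("absent", "+"): 20,
    ("present", "-"): 20,
    ("absent", "-"): 50,
}

# Expand to one row per person: the unit of observation is an individual.
# The field names are hypothetical; real data would include covariates too.
people = [
    {"disease": disease, "test": test}
    for (disease, test), n in cells.items()
    for _ in range(n)
]

# Tallying the case-level rows recovers the summary table.
tally = Counter((p["disease"], p["test"]) for p in people)

print(len(people))               # number of people in the sample
print(tally == Counter(cells))   # does the tally reproduce the table?
```

The summary can always be rebuilt from the case-level rows, but not the other way around: once the rows are collapsed into four counts, anything else recorded about each person is gone.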
To help students think in a data-centric way, teach them to identify the unit of observation, the variables, and whether each variable is quantitative or categorical. Presenting the above table as “data” nullifies any opportunity to reinforce data skills. In the data-centric version of the lesson, we start with a real data table from an actual survey that includes many covariates of each person. The table given in the original problem is really a presentation of data in much the same way that a graphic is a presentation.
Assume the sample is representative of the entire population.
Earlier in the problem it was stated that the numbers come from “patients.” Why would patients be representative of the entire population? Again, a lack of verisimilitude.
For a person selected at random, compute the following probabilities:
a. P( + | condition present ); this is known as the sensitivity of the test.
b. P( – | condition present ); this is known as the false-negative rate.
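Not really, at least as we use the term: here the false-negative rate is P(- and condition present), the fraction of the whole sample that is a false negative. Be aware that many references do define the false-negative rate as the conditional probability in (b), that is, 1 minus the sensitivity.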
c. P( – | condition absent); this is known as the specificity of the test.
d. P( + | condition absent); this is known as the false-positive rate.
e. P(condition present and +); this is the predictive value of the test.
Not quite. There is no quantity named “predictive value” that applies to medical diagnosis tests. P(condition present | +) is the “positive predictive value”.
f. P(condition present and -).
These quantities can be computed by counting in the original data set. That’s an operation most people understand, even if they don’t understand the conditional probability notation. It would be better to say that “the right side of the | refers to a subgroup of people, and we find the proportion of people in that subgroup who have what’s on the left side of the |.”
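As a worked illustration of that counting, the following Python sketch computes the six quantities (a)-(f) directly from the four cell counts in the lesson's table. The helper names `conditional` and `joint` are our own, introduced only for this example. Exact fractions are used so the subgroup being counted stays visible in the denominator.

```python
from fractions import Fraction

# Counts from the lesson's table, keyed by (test result, disease status).
counts = {
    ("+", "present"): 110, ("+", "absent"): 20,
    ("-", "present"): 20,  ("-", "absent"): 50,
}

def conditional(test, disease):
    """Proportion with `test` among the subgroup whose status is `disease`."""
    subgroup = sum(n for (t, d), n in counts.items() if d == disease)
    return Fraction(counts[(test, disease)], subgroup)

def joint(test, disease):
    """Proportion of the whole sample with both `test` and `disease`."""
    total = sum(counts.values())
    return Fraction(counts[(test, disease)], total)

results = {
    "a. P(+ | present)": conditional("+", "present"),
    "b. P(- | present)": conditional("-", "present"),
    "c. P(- | absent)": conditional("-", "absent"),
    "d. P(+ | absent)": conditional("+", "absent"),
    "e. P(present and +)": joint("+", "present"),
    "f. P(present and -)": joint("-", "present"),
}

for label, value in results.items():
    print(f"{label} = {value} (about {float(value):.3f})")
```

The only difference between the two helpers is the denominator: the conditional quantities divide by the size of a subgroup (130 or 70), while the "and" quantities divide by the whole sample of 200.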
The translated lesson uses software functions such as df_props() to find the counts or proportions. The formula (test_result ~ diabetes) used as an argument to those functions reflects the notation used in (a)-(d).
Pedagogical opinions differ and another reasonable choice for the calculation would be to extract the subset of the data corresponding to the right-hand side of |, and then count the number who fit with the left-hand side of |.
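That subset-then-count alternative might look like the following Python sketch. It is not the code from the translated lesson: the rows are rebuilt from the traditional lesson's table rather than taken from real data, and the field names and the `proportion` helper are assumptions made for illustration.

```python
# Hypothetical case-level rows rebuilt from the lesson's table. A real data
# set would have one row per person as collected, plus covariates.
rows = (
    [{"disease": "present", "test": "+"}] * 110
    + [{"disease": "absent", "test": "+"}] * 20
    + [{"disease": "present", "test": "-"}] * 20
    + [{"disease": "absent", "test": "-"}] * 50
)

def proportion(rows, left, given):
    """Proportion of rows satisfying `left` within the subset satisfying `given`."""
    subset = [r for r in rows if given(r)]   # right-hand side of |
    hits = [r for r in subset if left(r)]    # left-hand side of |
    return len(hits) / len(subset)

# (a) P(+ | condition present), called the sensitivity in the lesson
sensitivity = proportion(
    rows,
    left=lambda r: r["test"] == "+",
    given=lambda r: r["disease"] == "present",
)

# (c) P(- | condition absent), called the specificity in the lesson
specificity = proportion(
    rows,
    left=lambda r: r["test"] == "-",
    given=lambda r: r["disease"] == "absent",
)

print(round(sensitivity, 3), round(specificity, 3))
```

Written this way, the two steps of the calculation mirror the two sides of the | directly, which may suit students who are not yet comfortable with formula notation.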
Much more important is that the lesson reflect real data practices. Seeing that the original data contain age and gender helps students to assimilate the practice of recording covariates and stimulates the belief that these might actually have to do with the problem at hand. Certainly in the familiar sorts of medical screening for diseases such as breast cancer or prostate cancer, age is an important criterion shaping the recommendations about who should or should not get screened. And a conversation about who should not be screened, and why not, would better help our students deal with the medical realities of their and their families’ lives.