# Lesson: Confidence in Taxis

The “Confidence in Taxis” tutorial introduces sampling distributions in a novel context. The lesson was developed by participants in our 2017 StatPREP Summer Workshop at the College of Saint Catherine. With the assistance of Minneapolis/St. Paul hub leader Dr. Joseph Roith, the workshop group looked at a trip-by-trip data set from the New York City Taxi and Limousine Commission. They used these real-world data to create a demonstration of a fundamental statistical concept: sampling variation. Tutorial document here

The tutorial guides students through collecting a random sample of data and computing a summary statistic. It leads students to the idea that the particular value of a summary statistic from the data sample is itself somewhat random, and it shows how to establish what “somewhat” means quantitatively.
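The sample-then-summarize loop described above can be sketched in a few lines. The lesson itself works in R; this sketch uses Python's standard library instead. The fares are synthetic stand-ins (a gamma-shaped population chosen purely for illustration), not the Taxi and Limousine Commission data, and the helper name `sample_statistic` is hypothetical.

```python
import random
import statistics

rng = random.Random(2017)  # fixed seed so the run is repeatable

# Hypothetical stand-in for the trip-by-trip fares: 10,000 skewed positive values.
fares = [2.50 + rng.gammavariate(2.0, 6.0) for _ in range(10_000)]

def sample_statistic(population, n, stat):
    """Draw a simple random sample of size n and return one summary statistic."""
    return stat(rng.sample(population, n))

# Each new sample gives a different value of the same statistic.
trials = [sample_statistic(fares, 10, max) for _ in range(1000)]

# One way to quantify "somewhat random": the middle 95% of the trial values.
cuts = statistics.quantiles(trials, n=40)  # 39 cut points, 2.5% apart
low, high = cuts[0], cuts[-1]
print(f"sample maxima mostly fall between {low:.2f} and {high:.2f}")
```

The width of that interval is one possible numerical answer to how random “somewhat” is for this statistic at this sample size.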

## A standard error in stats education

In many statistics courses, conceptual access to sampling distributions is provided through the *standard error* and formulas such as s/√n for the standard deviation of the mean of a sample of size n. Traditional though this be, it is problematic. Many students are uncomfortable even with algebraic notation like s/√n. Few students understand the logic behind the derivation of the formula, so it becomes nothing more than a black box. And then there is the confusing vocabulary such as “standard deviation” and “standard error”. After all, what’s the difference between a “deviation” and an “error,” and what’s “standard” about them? What’s more, the formula-based approach can work for only a few statistics such as the sample mean or the difference between means. Regrettably, this makes the mean the starring player in the story of statistics, distracting students from thinking about what it is they want to find out from their data. (It’s always the mean, right? No!)
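To make the contrast concrete, here is a rough sketch that computes the standard error of the mean two ways: once with s/√n from a single sample, and once by simulation. It is written in Python with synthetic data (an assumed gamma-shaped population), so the numbers illustrate the idea and say nothing about the taxi fares.

```python
import math
import random
import statistics

rng = random.Random(1)

# Synthetic population (an assumption, not the lesson's data).
population = [rng.gammavariate(2.0, 6.0) for _ in range(10_000)]
n = 25

# Formula route: one sample, then s / sqrt(n).
one_sample = rng.sample(population, n)
formula_se = statistics.stdev(one_sample) / math.sqrt(n)

# Simulation route: many samples, then the spread of their means.
means = [statistics.mean(rng.sample(population, n)) for _ in range(2000)]
simulated_se = statistics.stdev(means)

print(f"s/sqrt(n) from one sample: {formula_se:.2f}")
print(f"SD of 2000 sample means:   {simulated_se:.2f}")
```

The simulation route keeps working if `statistics.mean` is swapped for `max` or `statistics.median`. The s/√n formula does not carry over in the same way.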

Increasingly, statistics educators are finding the benefits of a simulation-based approach to statistical inference. Examples from mainstream commercial publishers are in the Lock Five approach and the [Statistical Investigations](http://ca.wiley.com/WileyCDA/WileyTitle/productCd-EHEP003487.html) book by Tintle *et al.* There are also complete open-source textbooks, such as the Open Intro’s *Intro stats with Randomization and Simulation* and Ismay and Kim’s Modern Dive *Introduction to Statistical and Data Sciences via R*.

The StatPREP workshop participants’ lesson uses the simulation approach to find the sampling distribution of the sample *maximum* fare. There’s nary a “standard” or “deviation” or “error” in the lesson, just the basic statistical concept of random sampling and the facilities for automating this in the `mosaic` package for R.

## Variance and bias

Yes, that’s right: the summary statistic used in the lesson is the sample *maximum*. This is because the question the lesson addresses is, “What is the worst-case taxi fare for a daily five-mile journey?”

There’s a good pedagogical reason for this focus on the maximum. It’s pretty easy to imagine why, if you take a small sample from a large set of numbers, you’re unlikely to get a maximum for the sample that’s as large as the maximum in the whole set. And it’s easy to see why the sample statistic is somewhat random: it’s a matter of luck which one number in your sample happens to be the largest.

Compare this to the situation with the sample mean. The population mean is a number, not necessarily the value for any member of the population. So it’s harder to understand why the sample mean shouldn’t be right on target. (And, in fact, it is on target on average. That’s what it means to be unbiased!)
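The difference between the two statistics shows up readily in simulation. The following sketch, in Python with a synthetic population (an assumption for illustration only), averages the sample maximum and the sample mean over many samples of size 10 and sets each beside its population counterpart. The helper `average_over_samples` is a hypothetical name.

```python
import random
import statistics

rng = random.Random(42)

# Synthetic fares (an assumption, not the lesson's data).
population = [2.50 + rng.gammavariate(2.0, 6.0) for _ in range(10_000)]

def average_over_samples(stat, n=10, reps=5000):
    """Average a statistic over many random samples of size n."""
    return statistics.mean([stat(rng.sample(population, n)) for _ in range(reps)])

avg_sample_max = average_over_samples(max)
avg_sample_mean = average_over_samples(statistics.mean)

print(f"population max  {max(population):6.2f}   average sample max  {avg_sample_max:6.2f}")
print(f"population mean {statistics.mean(population):6.2f}   average sample mean {avg_sample_mean:6.2f}")
```

With a setup like this, the average sample maximum should land well below the population maximum, while the average sample mean should sit close to the population mean.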

## Sample size

The last computation in the lesson displays the sampling distribution. It’s based on a sample of size n=10. The lesson asks you to try larger sample sizes and to decide whether they improve the estimate of the worst-case taxi fare. As you’ll see, investigating the relationship between sample size and the quality of the estimate is well within student capabilities.
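For instructors who want a quick preview of that exploration outside the tutorial, a minimal sketch follows. It is in Python and uses synthetic fares (an assumed gamma-shaped population), so it shows only the general pattern, not the lesson's actual results. The function name `sampling_distribution_of_max` is hypothetical.

```python
import random
import statistics

rng = random.Random(7)

# Synthetic fares (an assumption, not the lesson's data).
population = [2.50 + rng.gammavariate(2.0, 6.0) for _ in range(10_000)]
population_max = max(population)

def sampling_distribution_of_max(n, reps=500):
    """Maxima of `reps` independent random samples of size n."""
    return [max(rng.sample(population, n)) for _ in range(reps)]

typical_max = {}
for n in (10, 100, 1000):
    maxima = sampling_distribution_of_max(n)
    typical_max[n] = statistics.median(maxima)
    print(f"n={n:5d}  median sample max {typical_max[n]:6.2f}  "
          f"(population max {population_max:.2f})")
```

Plotting each list of maxima as a histogram, rather than reducing it to a median, would give a picture closer to the display the lesson ends with.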

## Discussion questions

- How come the maximum fare computed from a sample tends to be less than the maximum fare observed in the whole data set? Does your logic extend to other statistics such as the median and mean?
- Suppose you were interested in figuring out a budget for the next year based on your sample of taxi rides. Would the maximum fare be the right choice of a statistic to calculate? Suppose you anticipated needing to take 100 taxi rides in the year. Should your budget be 100 times the maximum fare? Would this over-estimate or under-estimate your likely actual expenses? If not the maximum fare, which other statistic might be appropriate? The mean? The median?

## Instructor feedback is important!

The workshop participants who wrote this lesson used the computing and data resources made available through the workshop in order to explore a new idea. None of them had any experience teaching a lesson on the sampling distribution of the sample maximum. They tried to figure out an approach that would work for students.

Always, when you use a new lesson with students, something unexpected happens. It might be some misconception that never occurred to you. It might be some insight or opportunity that the students see that you missed.

As you work with this lesson, it’s important to share your experience and suggestions with us. (You can do so by posting a comment in this blog entry.) That might help us improve the lesson. Or it might help us identify a good opportunity for a different lesson. And, while we like to think that all our ideas are good ones, that’s not always the case and we need to find out when it isn’t. So, whether your feedback is positive or negative, your input helps us all improve!