
 Statistical Sampling and Regression: Statistical Estimation

If you reviewed the topics of discrete and continuous probability distributions earlier in this course, you are already familiar with some of the concepts that will be covered in this discussion, such as mean, variance, and standard deviation. The previous sections of the course showed you how these summary measures are calculated for populations with the data presented in the form of a probability distribution. As there are different formulas for the mean of a probability distribution, a population, and a sample, it is crucial that you use the correct formula for the type of data you are measuring. This portion of the course reviews summary measures for populations and samples. This section concludes with a demonstration of a practical application of sample statistics through the development of confidence intervals.

Population Data

To demonstrate the calculation of summary measures for a population, this course will use as an example the population of employees in the accounting department of a small company. The random variable of interest is the tenure of the five employees of this accounting department. The table below lists the names of the employees along with the number of years they have worked in that department.


Population measures of central tendency
What can you infer from this population data? In its raw form, it is difficult to draw many conclusions. When it is organized you can begin to summarize information. One common way to summarize population data is by using the measures of central tendency: mode, median, and mean.

Population mode The mode is simply the value in the population that occurs most frequently. A population can have more than one mode. In the accounting department example, there are two employees who each have tenure of two years. This is the value with the highest frequency, so you can say that the modal value of tenure for this population is two.

You can use Excel to find population and sample modes.
Mode Excel Tutorial

Population median The median is the halfway point of a population, where the values of the random variable have been arranged in numerical order. If a population has an odd number of data points, the median is a single point. If the population has an even number of values, you would find the median by adding the two central values and dividing the sum by two. In the accounting department example, there are five employees or data points. Look at the table below where the employees have been arranged in ascending order of the years of tenure. The midway point is the third value, which is seven.

You can use Excel to find population and sample medians.
Median Excel Tutorial

Population mean Mean is the most commonly used measure of central tendency. If you reviewed "Discrete Probability Distributions" and "Continuous Probability Distributions" earlier in the course, you should already be familiar with the concept of mean. The formula for calculating the mean of a population is shown below.

$$\mu_x = \frac{1}{N}\sum_{i=1}^{N} x_i$$

$\mu_x$ = population mean of the random variable x
N = number of occurrences of the random variable x in the population
i = an instance of the random variable x
$x_i$ = the ith value of the random variable x

To see how the mean of a population is calculated, return to the accounting department example. Using the information from the population, the values for the formula variables are

$\mu_x$ = population mean
N = 5
$x_i$ = 2, 15, 7, 8, and 2 (tenure values)

To find the mean of the tenure of the accounting department, add together all of the tenure values, and then divide that sum by the total number of employees in the department, as shown below.

$$\mu_x = \frac{2 + 15 + 7 + 8 + 2}{5} = \frac{34}{5} = 6.8$$

The mean tenure of the accounting department is 6.8 years. To determine the level of variability in tenure in the department, you will need to calculate the variance and standard deviation of this population.
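As a quick check on the three measures of central tendency, the following minimal sketch recomputes them for the five tenure values using Python's standard `statistics` module. The variable names are illustrative and are not part of the course material.

```python
from statistics import mean, median, multimode

# Tenure (in years) of the five accounting employees, from the example above.
tenure = [2, 15, 7, 8, 2]

modes = multimode(tenure)   # every value tied for the highest frequency
mid = median(tenure)        # central value once the data are sorted
mu = mean(tenure)           # sum of the values divided by N

print(modes, mid, mu)       # [2] 7 6.8
```

`multimode` is used rather than `mode` because a population can have more than one mode, as noted earlier.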

Population measures of dispersion
Variance and standard deviation are measures of dispersion. They tell you to what degree a population's values are scattered around its mean.

You can use Excel to find population and sample means.
Mean Excel Tutorial

Population variance Variance is a measure of the spread of data around the mean. Like other summary measure formulas, the formula used to calculate variance depends on whether you are measuring a probability distribution, a sample, or a population. The formula for a population variance is shown below.

$$\sigma_x^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu_x)^2$$

$\sigma_x^2$ = population variance of the random variable x
$\mu_x$ = population mean of the random variable x
N = number of occurrences of the random variable x in the population
i = an instance of the random variable x
$x_i$ = the ith value of the random variable x

To find the variance of a population, the mean of the population is subtracted from each value of x. This difference is then squared so that negative and positive differences do not cancel each other out. The squared differences are then summed, and that sum is divided by N. Consider the accounting department example again.

$\sigma_x^2$ = population variance
$\mu_x$ = 6.8
N = 5
$x_i$ = 2, 15, 7, 8, and 2

Using the above information, the variance of this population would be calculated in the following manner.

$$\sigma_x^2 = \frac{(2 - 6.8)^2 + (15 - 6.8)^2 + (7 - 6.8)^2 + (8 - 6.8)^2 + (2 - 6.8)^2}{5} = \frac{114.8}{5} = 22.96$$

Population standard deviation The standard deviation of a population is the square root of the population variance. Therefore, the formula for population standard deviation can be written as

$$\sigma_x = \sqrt{\sigma_x^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu_x)^2}$$

Because population standard deviation is the square root of the variance, the variables in the formula are the same. (These variables were listed in the variance portion of this discussion.) The variance of the accounting department tenure was determined to be 22.96. The standard deviation of this population is the square root of this number.

$$\sigma_x = \sqrt{22.96} \approx 4.79$$

The standard deviation of this population is 4.79 years. Standard deviation is a more useful measure of dispersion than variance because it is expressed in the same units as the data from which it came. The variance in this example is expressed in years squared, which is difficult to interpret. The standard deviation can be read alongside the mean to understand the general level of dispersion of a population. In this example the mean tenure is 6.8 years with a standard deviation of 4.79 years.
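The population variance and standard deviation for the tenure data can be verified with a short sketch. It computes the variance once by hand, following the steps described above, and once with the standard library's `pvariance` and `pstdev`. The variable names are illustrative only.

```python
from math import sqrt
from statistics import pvariance, pstdev

tenure = [2, 15, 7, 8, 2]        # years of tenure, N = 5
N = len(tenure)
mu = sum(tenure) / N             # population mean, 6.8

# Step by step: squared deviations from the mean, summed, divided by N.
sigma_sq_manual = sum((x - mu) ** 2 for x in tenure) / N
sigma_manual = sqrt(sigma_sq_manual)

# The same quantities from the standard library (population versions).
sigma_sq = pvariance(tenure)
sigma = pstdev(tenure)

print(round(sigma_sq, 2), round(sigma, 2))   # 22.96 4.79
```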

Sample Data

When a population's entire data set is available, it is not necessary to sample the data set in order to infer population parameters. Frequently, however, it is impossible to collect data from an entire population. In these cases a sample from the population may be drawn, its characteristics measured, and an estimation of the parameters of the population developed.

To understand the concepts of statistical estimation, imagine yourself as a designer of new products for a manufacturer of lightweight, waterproof hiking jackets. Earlier designs of the jackets were built with inexpensive zippers that tended to break. These zippers caused a high level of customer dissatisfaction. In the end, the money spent replacing broken zippers overshadowed the initial cost savings projected for the inexpensive zippers.

Your challenge is to find another zipper that is relatively inexpensive, yet durable. For the purposes of this example, assume that zippers are subjected to a stress test in your company's lab and that stress is quantified in terms of a "test weight." Your design specifications call for zippers with a test weight of 10; any zipper with a test weight below 7 is considered to be below specification.

You have chosen a new zipper and tested a sample of 100 to ensure that they meet the required test weights. The results from the test are presented in the frequency table below.

Your sample of 100 randomly selected zippers is large enough to be considered representative of the population of zippers that will be used for this season's jackets. A histogram of this sample illustrates the shape of the sample distribution. Since the sample size is sufficiently large, the histogram provides an approximate estimation of the shape of the population distribution.

Sample measures of central tendency
Sample measures of central tendency—mode, median, and mean—help business analysts and researchers to understand the population from which the sample was drawn.

Sample mode Recall that the mode is simply the value in the distribution that occurs most frequently. In the zipper sample, the test weights 12 and 13 are the modes because they occur 9 times each, which is more often than any other value.

Sample median Recall that the median is the halfway point of a distribution that is arranged in numerical order. If the distribution has an odd number of data points, the median is a single point; if it has an even number, the median is the average of the two central values. In the zipper sample, there are 100 data points in all, so the midway point lies between the fiftieth and fifty-first values, both of which equal 12. Adding 12 to 12 and dividing the result by two gives a median of 12.

Sample mean Recall that the mean, the most commonly used measure of central tendency, is simply the average of the values of a random variable in a sample. The formula used to calculate sample mean is quite similar to the one used to calculate the mean of a population. However, two symbols are different. First, $\bar{x}$, the symbol for the sample mean, is used in place of $\mu_x$, the symbol for the population mean. Second, n, the number of occurrences of the random variable x in the sample, is used in place of N, the number of occurrences of the random variable x in the population.

The formula for calculating the mean of a sample is shown below.

You can use Excel to find population variance and standard deviation.
Population Variance and Standard Deviation Excel Tutorial

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

$\bar{x}$ = sample mean of the random variable x
n = number of occurrences of the random variable x in the sample
i = an instance of the random variable x
$x_i$ = the ith value of the random variable x

To see how the mean of a sample is calculated, return to the example of jacket zippers discussed earlier. Using the information from the sample distribution, the values for the formula variables are

$\bar{x}$ = sample mean
n = 100
$x_i$ = 7.5, 8.0, 8.5, . . . , 16.5 (test weights)

To find the mean of the sample of zippers, add together all of the test weights, and then divide that sum by the total number of observations in the sample, as shown below.

The mean of this sample is a test weight of 12.04. The sample data in this example is presented as clustered data. There are formulas to calculate sample mean, variance, and standard deviation for clustered sample data.
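When sample data are summarized in a frequency table, each distinct value can be weighted by its count instead of being listed once per observation. The sketch below illustrates the idea under a loud assumption: the frequency table in it is hypothetical and is NOT the zipper data, which is not reproduced here.

```python
# HYPOTHETICAL frequency table (value -> count), for illustration only.
# These are not the zipper test results.
freq = {9.0: 2, 10.0: 5, 11.0: 3}

n = sum(freq.values())                             # total observations: 10

# Sample mean: each value weighted by how often it occurs.
x_bar = sum(v * f for v, f in freq.items()) / n    # (18 + 50 + 33) / 10

# Sample variance: weighted squared deviations divided by n - 1.
s_sq = sum(f * (v - x_bar) ** 2 for v, f in freq.items()) / (n - 1)
s = s_sq ** 0.5

print(n, round(x_bar, 2), round(s_sq, 4))
```

The result is the same as expanding the table into ten individual observations and applying the ungrouped formulas.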

Now you know that the average test weight of your sample meets the requirement you have established for the zippers, which is a test weight of 10. To determine whether most of the zippers meet the required test weight, calculate the variance and standard deviation of the sample.

Sample Measures of Dispersion
The measures of sample dispersion, variance and standard deviation, tell you to what degree a distribution's values are scattered around its mean.

Sample Variance Recall that variance is a measure of the spread of data around the mean. The formula used to calculate sample variance differs slightly from that used to calculate the population variance. The symbol $s_x^2$, which denotes the variance of a sample, is used in place of $\sigma_x^2$, the symbol for population variance. Another difference is the (n - 1) term in the denominator. This adjustment accounts for the fact that the sample is only a subset of the entire population: because deviations are measured from the sample mean rather than the unknown population mean, dividing by n would tend to understate the population variance. The formula for a sample variance is shown below.

$$s_x^2 = \frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

$s_x^2$ = sample variance of the random variable x
$\bar{x}$ = sample mean of the random variable x
n = number of occurrences of the random variable x in the sample
i = an instance of the random variable x
$x_i$ = the ith value of the random variable x

To find the variance of a sample, the mean of the sample is subtracted from each value of x. This difference is then squared so that negative and positive differences do not cancel each other out. The squared differences are then summed, and that sum is divided by n - 1. Consider the zipper example again.

$s_x^2$ = sample variance
$x_i$ = 7.5, 8.0, . . . , 16.5
$\bar{x}$ = 12.04
n = 100

Using the above information, the variance of this sample of zippers would be calculated in the following manner.

$$s_x^2 = \frac{\sum_{i=1}^{100} (x_i - 12.04)^2}{100 - 1} = 4.83$$

You can use Excel to find sample variance and standard deviation.
Sample Variance and Standard Deviation Excel Tutorial

Sample standard deviation
The standard deviation of a sample is the square root of the sample variance. Therefore, the formula for sample standard deviation can be written as

$$s_x = \sqrt{s_x^2} = \sqrt{\frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Because sample standard deviation is the square root of the sample variance, the variables in the formula are the same. (These variables were listed in the variance portion of this discussion.) The variance of the sample of zippers was determined to be 4.83. The standard deviation of this sample is the square root of this number.

$$s_x = \sqrt{4.83} \approx 2.20$$

The standard deviation of this sample is 2.20.
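To see the effect of the n - 1 denominator in isolation, the sketch below applies both the sample and the population formulas to the same five numbers. It assumes, purely for illustration, that the five tenure values from the earlier example were a sample rather than a full population. The zipper observations themselves are not reproduced here, so they are not used.

```python
from statistics import variance, stdev, pvariance

# Assumption for illustration: treat the five tenure values as a SAMPLE.
values = [2, 15, 7, 8, 2]

s_sq = variance(values)       # divides by n - 1: 114.8 / 4
s = stdev(values)             # square root of the sample variance
sigma_sq = pvariance(values)  # divides by N:     114.8 / 5

print(round(s_sq, 2), round(s, 2), round(sigma_sq, 2))   # 28.7 5.36 22.96
```

With the same data, the sample variance is always the larger of the two, and the gap shrinks as n grows.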


Confidence Intervals

You've now seen how samples can be used to estimate the population parameters of mean, variance, and standard deviation. In the example of selecting zippers for a hiking jacket, recall that the design specification called for a zipper test weight of 10; in a sample of 100 zippers, the average test weight was determined to be 12. You might wonder how accurate your sample data are in estimating the population of zippers. Can you be confident that the population mean meets your design requirements of a test weight of 10 when you have taken a sample with a mean of 12? How close is the sample mean to the true mean of the population?

A common statistical method used to gauge the accuracy of a sample estimate is the construction of a confidence interval. A confidence interval identifies a range of values around the estimate, or sample statistic, that is expected to contain the true population parameter at a specified level of confidence.

Confidence intervals can be built around a number of parameters. The following information focuses on constructing a confidence interval around the mean.

The diagram below is a visual depiction of how a confidence interval works. Here, the distribution of the population approximates a normal distribution and the mean of the population is labeled $\mu$. The three notations below the distribution represent different samples that have been drawn from that population, each with a different sample mean $\bar{x}$. The line on either side of each of the sample means represents a confidence interval that has been constructed around the mean of each sample. You may notice that in each case the true population mean $\mu$ is within the range of values contained in the confidence interval. This means that the confidence interval has served its purpose of identifying the range of values within which the true population parameter lies.

You may also notice in the diagram that the confidence interval is larger for samples 2 and 3 than it is for sample 1. The size of a confidence interval depends on the sample size, the sample standard deviation, and the chosen level of confidence. Note that confidence intervals are based on sample data and, as such, it is not possible to construct an interval with 100 percent confidence of capturing the population parameter of interest. However, it is possible to construct an interval around a sample statistic with a high confidence level. The following formula is used to construct a confidence interval for a population mean.

$$\bar{x} \pm Z\,\frac{s}{\sqrt{n}}$$

$\bar{x}$ = the sample mean
s = sample standard deviation
n = sample size
Z = the standardized random variable for the desired level of confidence

You may recall encountering the standard random variable Z in the Standard Normal Distribution topic of the Continuous Probability Distributions section of this course. There, you standardized a normally distributed random variable as a way of quickly determining the probability associated with a range of values for that variable. When you construct confidence intervals, you again use the standard random variable, but in reverse: you now know the probability (or level of confidence) you are interested in, and you are trying to identify the value of Z associated with that probability. Many business situations require you to aim for a level of confidence that is 90 percent or above (a probability of .9 or greater). The following table provides Z scores associated with common confidence levels. You can also look up these values in the Z distribution table provided with this course.
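The reverse lookup described above can also be done in code rather than from a printed table. The following minimal sketch uses the standard library's `NormalDist.inv_cdf`. The helper name `z_for_confidence` and the three confidence levels shown are illustrative choices, not taken from the course table.

```python
from statistics import NormalDist

def z_for_confidence(level: float) -> float:
    """Two-sided Z value: leaves (1 - level) / 2 probability in each tail."""
    return NormalDist().inv_cdf((1 + level) / 2)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%}: Z = {z_for_confidence(level):.3f}")
# 90%: Z = 1.645
# 95%: Z = 1.960
# 99%: Z = 2.576
```

The argument to `inv_cdf` is (1 + level) / 2 rather than the level itself because the interval is two-sided: a 95 percent interval leaves 2.5 percent in each tail.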

For the purpose of calculating a confidence interval around the sample mean in the zipper example, assume that you want to determine the interval of the true population mean with a confidence level of 95 percent.

Using the information from the zipper example and the Z distribution table, you know that

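$\bar{x}$ = 12.04
s = 2.20
n = 100
Z = 1.960

Now, substitute this information into the formula to calculate the lower limit of the 95 percent confidence interval.

$$\bar{x} - Z\,\frac{s}{\sqrt{n}} = 12.04 - 1.960 \times \frac{2.20}{\sqrt{100}} = 12.04 - 0.4312 \approx 11.61$$

The lower limit of the 95 percent confidence interval is 11.61. To find the upper limit, use the same equation, substituting an addition sign for the subtraction sign.

$$\bar{x} + Z\,\frac{s}{\sqrt{n}} = 12.04 + 1.960 \times \frac{2.20}{\sqrt{100}} = 12.04 + 0.4312 \approx 12.47$$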

The 95 percent confidence interval is (11.61, 12.47). This means that you can be 95 percent confident that the true population mean of zipper test weight is between 11.61 and 12.47. In other words, intervals constructed in this way capture the population mean 95 percent of the time. This is good news; because the lower limit of the confidence interval is above the design specification of a zipper test weight of 10, you can be confident that the mean test weight of the population of zippers meets your requirement.
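The interval calculation can be wrapped in a small function. This is a minimal sketch assuming the Z-based formula used in this section; the function name `mean_confidence_interval` and its parameters are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(x_bar, s, n, level=0.95):
    """Return (lower, upper) limits of a Z-based interval for the mean."""
    z = NormalDist().inv_cdf((1 + level) / 2)   # about 1.960 for 95 percent
    margin = z * s / sqrt(n)                    # half-width of the interval
    return x_bar - margin, x_bar + margin

# Zipper example: sample mean 12.04, standard deviation 2.20, n = 100.
lower, upper = mean_confidence_interval(12.04, 2.20, 100)
print(f"({lower:.2f}, {upper:.2f})")            # (11.61, 12.47)
```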

If the sample were larger, the 95 percent confidence interval would be narrower. In general, the larger your sample size, the more precise the estimate of the mean becomes because you have more data representing the population. The calculation of the confidence interval involves division by the square root of the sample size, n, so quadrupling the sample size halves the width of the interval, all else being equal. (There are ways to determine how big the sample size should be to guarantee a certain level of accuracy. This is an advanced topic that you may study during your MBA program.)
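The effect of sample size on the interval can be seen by holding Z and s at their zipper-example values and varying n. The sample sizes 400 and 1600 below are hypothetical, chosen only to show the square-root relationship.

```python
from math import sqrt

Z, s = 1.960, 2.20      # values from the zipper example

def half_width(n: int) -> float:
    """Distance from the sample mean to either limit of the interval."""
    return Z * s / sqrt(n)

# n = 100 is the actual sample size; 400 and 1600 are hypothetical.
widths = {n: half_width(n) for n in (100, 400, 1600)}
for n, w in widths.items():
    print(n, round(w, 4))
# 100 0.4312
# 400 0.2156
# 1600 0.1078
```

Each fourfold increase in n cuts the half-width in half, so gains in precision become progressively more expensive in data.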

In statistical terms, if the sampling process were repeated many times and a new sample mean and confidence interval were calculated each time, about 95 percent of these confidence intervals would contain the true population mean. That is why you can be 95 percent confident that the interval (11.61, 12.47) contains the true mean.
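The repeated-sampling interpretation can be illustrated with a small simulation. The sketch below is built on assumptions that are not part of the course: a normally distributed population with a hypothetical mean of 12.0 and standard deviation of 2.2, a fixed random seed, and 1,000 repetitions. It draws repeated samples of 100, builds a 95 percent interval from each, and counts how often the interval contains the known mean.

```python
import random
from math import sqrt
from statistics import mean, stdev

def coverage(mu, sigma, n, trials, z=1.960, seed=0):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)             # fixed seed for repeatability
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar, s = mean(sample), stdev(sample)
        margin = z * s / sqrt(n)
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / trials

# Hypothetical population parameters, chosen only for illustration.
rate = coverage(mu=12.0, sigma=2.2, n=100, trials=1000)
print(f"{rate:.1%} of intervals contained the true mean")
```

The printed rate should land near 95 percent, with some variation from run to run if the seed is changed.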

Take a second look at the diagram describing how sample means with confidence intervals constructed around the mean help to estimate the true population mean. This time two more samples have been added: samples 4 and 5, each with its own sample mean $\bar{x}$. Notice that the confidence interval constructed around the mean of sample 4 does not include the population mean. This is a reminder that, when working with a 95 percent confidence interval, some samples will produce intervals that do not include the true population mean.

1. How is the mean of a sample calculated?

2. Why is the mean of a sample calculated differently from the mean of a probability distribution?

3. How is the variance related to the standard deviation?

4. How is the standard deviation of a sample calculated differently than the standard deviation of a probability distribution?

5. The table below provides the population of daily order volumes for a recent week. Calculate the mean, variance, and standard deviation of this population.

6. Assume the data set above is not the population of orders for a recent week. Rather, these data are a random sample of daily orders drawn from the past year's data. Calculate the mean, variance, and standard deviation of this sample.

7. The table below identifies the number of consultants of a financial consulting firm for each of its five largest U.S. offices. For this data set, calculate the mode, median, mean, variance, and standard deviation.

8. A well-known manufacturer of sugarless food products has invested a great deal of time and money in developing the formula for a new kind of sweetener. Although costly to develop, this sweetener is significantly less expensive to produce than the sweeteners the manufacturer had been using. The manufacturer would like to know if the new sweetener is as good as the traditional product. The manufacturer knows that when consumers are asked to indicate their level of satisfaction with the traditional sweeteners, they respond that on average their level of satisfaction is 5.5.

The manufacturer conducts market research to determine the level of acceptance of this new product. Consumer taste acceptance data are collected from 25 consumers of sugarless products. The data collected can be seen in the table below.

For this sample, determine the mode, median, and mean values, as well as the variance and standard deviation.

9. Assume that the manufacturer of sweeteners has taken a sample of 100 consumers to determine the level of satisfaction with a new sweetener. The results of the market research are as follows: the mean level of satisfaction was found to be 5.75 and the standard deviation of satisfaction was found to be 1.1.

Using this information, construct a 95 percent confidence interval for the mean level of satisfaction in the population.
