# # Objectives

• Understand the difference between a point estimate and an interval estimate
• Know how to use R to construct a confidence interval on a mean/proportion, a difference of two means/proportions, a variance, and a ratio of variances

# # Point Estimate (Not good)

[1] 0.04918487

[1] 0.203471

[1] 1.020872


Let's increase the sample size to 10000

[1] 0.005888358

[1] 0.005134332

[1] 0.9931025


Which point estimator is 'better'?
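The code that produced the numbers above is not shown. They are consistent with three point estimates (for example the sample mean, sample median, and sample variance) computed on simulated standard normal data. A minimal sketch under that assumption follows; the seed, the first sample size of 100, and the helper name `point_estimates` are illustrative choices, not taken from the original.

```r
# Assumption: the population is standard normal, so the true mean and
# median are 0 and the true variance is 1.
set.seed(1)  # illustrative seed

point_estimates <- function(n) {
  x <- rnorm(n)
  c(mean = mean(x), median = median(x), variance = var(x))
}

small <- point_estimates(100)    # assumed initial sample size
large <- point_estimates(10000)  # sample size stated in the text
print(small)
print(large)
```

Different seeds give different estimates, which is why a single number says little about its own accuracy.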

# # Interval Estimate

## # Confidence Interval on $\mu$ when $\sigma$ known

The interval $(\bar{X}-1.96\sigma/\sqrt{n},\bar{X}+1.96\sigma/\sqrt{n})$ contains the true population mean with probability 95%, where $\bar{X}$ is the sample mean, $n$ is the sample size, and $\sigma$ is the population standard deviation.

• How is this interval calculated?
• What does the "95% probability" mean?

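According to the Central Limit Theorem, $\bar{X}$ is approximately normally distributed with mean $\mu$ and variance $\sigma^2/n$, i.e.,

$Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0,1)$

approximately. Let $z_{\alpha/2}$ denote the value such that

$P(-z_{\alpha/2} < Z < z_{\alpha/2}) = 1-\alpha.$

Substituting $Z$ and rearranging for $\mu$ gives

$P\left(\bar{X} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1-\alpha.$

Since the confidence level is 95%, we know $\alpha = 1-95\% = 0.05$, that is, $\alpha/2 = 0.025$.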

How to find $z_{\alpha/2}$?
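In R, the standard normal quantile function `qnorm` returns this value. A short sketch, with `alpha` and `z` as illustrative variable names:

```r
alpha <- 0.05
z <- qnorm(1 - alpha / 2)  # value leaving an area of alpha/2 to the right
print(z)
```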

[1] 1.959964


Why is a confidence interval better?

The $100(1-\alpha)\%$ confidence interval provides an estimate of the accuracy of our point estimate.

Question: How do we construct a 99% confidence interval?

### # In-class Exercise: CI on $\mu$ with known variance

The average zinc concentration recovered from a sample of measurements taken in 36 different locations in a river is found to be 2.6 grams per milliliter.

Find the 95% and 99% confidence intervals for the mean zinc concentration in the river.

Assume that the population standard deviation is 0.3 gram per milliliter.
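One way to compute both intervals is sketched below. The helper `z_interval` and the variable names are illustrative, not part of the original notes.

```r
# Known quantities from the exercise
xbar  <- 2.6  # sample mean
sigma <- 0.3  # population standard deviation (assumed known)
n     <- 36   # sample size

# z-based confidence interval for the mean at confidence level `conf`
z_interval <- function(xbar, sigma, n, conf) {
  z <- qnorm(1 - (1 - conf) / 2)
  margin <- z * sigma / sqrt(n)
  c(lower = xbar - margin, upper = xbar + margin)
}

ci95 <- z_interval(xbar, sigma, n, 0.95)
ci99 <- z_interval(xbar, sigma, n, 0.99)
print(ci95)
print(ci99)
```

The 99% interval is wider than the 95% interval because a larger $z_{\alpha/2}$ is needed to capture $\mu$ more often.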

[1] 2.502002

[1] 2.697998

[1] 2.471209

[1] 2.728791


## # Confidence Interval on $\mu$ when $\sigma^2$ unknown

### # If we don't know $\sigma$: t-distribution

If the population is normally or approximately normally distributed, we can use the $t$-distribution.

If we have a random sample from a normal distribution, then the random variable

$T = \frac{\bar{X}-\mu}{S/\sqrt{n}}$

has a Student $t$-distribution with $n-1$ degrees of freedom.

Here $S$ is the sample standard deviation.

If $\bar{x}$ and $s$ are the mean and the standard deviation of a random sample of size $n$ from a $\textbf{normal}$ distribution with unknown variance $\sigma^2$, a $100(1-\alpha)\%$ confidence interval for $\mu$ is given by

$\bar{x} - t_{n-1,\alpha/2}\frac{s}{\sqrt{n}} < \mu < \bar{x} + t_{n-1,\alpha/2}\frac{s}{\sqrt{n}},$

where $t_{n-1,\alpha/2}$ is the $t$-value with $v = n-1$ degrees of freedom, leaving an area of $\alpha/2$ to the right.

### # Example 1 t.test

The contents of seven similar containers of sulfuric acid are 9.8, 10.2, 10.4, 9.8, 10.0, 10.2, and 9.6 liters.

Find a 95% confidence interval for the mean contents of all such containers, assuming an approximately normal distribution.
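A minimal sketch of the computation using the formula directly; the variable names are illustrative.

```r
sulfuricAcid <- c(9.8, 10.2, 10.4, 9.8, 10.0, 10.2, 9.6)

n      <- length(sulfuricAcid)
xbar   <- mean(sulfuricAcid)
s      <- sd(sulfuricAcid)       # sample standard deviation
t_crit <- qt(0.975, df = n - 1)  # t-value leaving 0.025 to the right

lower <- xbar - t_crit * s / sqrt(n)
upper <- xbar + t_crit * s / sqrt(n)
print(lower)
print(upper)
```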

[1] 9.738414

[1] 10.26159


A different way to get the confidence interval is to use t.test:
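The built-in `t.test` function reports the same interval in its `conf.int` component. A sketch, where the vector name `sulfuricAcid` is my own spelling:

```r
sulfuricAcid <- c(9.8, 10.2, 10.4, 9.8, 10.0, 10.2, 9.6)

# One-sample t-test; the default null value mu = 0 does not affect the interval
result <- t.test(sulfuricAcid, conf.level = 0.95)
print(result)

ci <- result$conf.int  # numeric vector: lower and upper limits
```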

    One Sample t-test

data:  sulfruicAcid
t = 93.541, df = 6, p-value = 1.006e-10
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
9.738414 10.261586
sample estimates:
mean of x
10


### # In-class Exercise: CI on $\mu$ with unknown variance

1. Regular consumption of presweetened cereals contributes to tooth decay, heart disease, and other degenerative diseases, according to studies conducted by Dr. W. H. Bowen of the National Institute of Health and Dr. J. Yudben, Professor of Nutrition and Dietetics at the University of London.
In a random sample consisting of 20 similar single servings of Alpha-Bits, the average sugar content was 11.3 grams with a standard deviation of 2.45 grams.
Assuming that the sugar contents are normally distributed, construct a 95% confidence interval for the mean sugar content for single servings of Alpha-Bits.
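Only summary statistics are given here, so `t.test` cannot be applied directly. A sketch that uses the formula with `qt`; the variable names are illustrative.

```r
n    <- 20     # sample size
xbar <- 11.3   # sample mean (grams)
s    <- 2.45   # sample standard deviation (grams)

t_crit <- qt(0.975, df = n - 1)
margin <- t_crit * s / sqrt(n)

lower <- xbar - margin
upper <- xbar + margin
print(lower)
print(upper)
```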

[1] 10.15336

[1] 12.44664

2. Please construct a 95% confidence interval for the mean of Sepal.Length in the iris data set.
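With raw data available, a one-line call to `t.test` is enough. A sketch using the `iris` data set that ships with R:

```r
result <- t.test(iris$Sepal.Length, conf.level = 0.95)
print(result)

ci <- result$conf.int  # lower and upper limits for the mean of Sepal.Length
```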

 One Sample t-test

data:  iris\$Sepal.Length
t = 86.425, df = 149, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
5.709732 5.976934
sample estimates:
mean of x
5.843333


# # Estimating the Difference between Two Means

Next, we want to investigate two populations with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$, respectively.

A point estimator of the difference between $\mu_1$ and $\mu_2$ is given by the statistic $\bar{X}_1-\bar{X}_2$.

## # Confidence Interval on $\mu_1-\mu_2$ when $\sigma_1^2$ and $\sigma_2^2$ known

If $\bar{x}_1$ and $\bar{x}_2$ are the means of independent random samples of sizes $n_1$ and $n_2$ from populations with known variances $\sigma_1^2$ and $\sigma_2^2$, respectively, a $100(1-\alpha)\%$ confidence interval for $\mu_1-\mu_2$ is given by

$(\bar{x}_1-\bar{x}_2) - z_{\alpha/2}\sqrt{\frac{\sigma^2_1}{n_1} +\frac{\sigma^2_2}{n_2}}< \mu_1-\mu_2 < (\bar{x}_1-\bar{x}_2) + z_{\alpha/2}\sqrt{\frac{\sigma^2_1}{n_1} +\frac{\sigma^2_2}{n_2}}$

where $z_{\alpha/2}$ is the $z$-value leaving an area of $\alpha/2$ to the right.

### # Example 2

A study was conducted in which two types of engines, $A$ and $B,$ were compared.

Gas mileage, in miles per gallon, was measured.

Fifty experiments were conducted using engine type $A$ and 75 experiments were done with engine type $B$.

The gasoline used and other conditions were held constant.

The average gas mileage was 36 miles per gallon for engine $A$ and 42 miles per gallon for engine $B$.

Find a 96% confidence interval on $\mu_B - \mu_A$, where $\mu_A$ and $\mu_B$ are population mean gas mileages for engines $A$ and $B$, respectively.

Assume that the population standard deviations are 6 and 8 for engines $A$ and $B$, respectively.
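A sketch of the calculation with the known-variance formula; the variable names are illustrative. Note that a 96% level means $\alpha/2 = 0.02$, so the quantile is `qnorm(0.98)`.

```r
n_A <- 50; xbar_A <- 36; sigma_A <- 6
n_B <- 75; xbar_B <- 42; sigma_B <- 8

z  <- qnorm(1 - 0.04 / 2)                     # alpha = 0.04
se <- sqrt(sigma_A^2 / n_A + sigma_B^2 / n_B) # standard error of the difference

diff_hat <- xbar_B - xbar_A  # point estimate of mu_B - mu_A
lower <- diff_hat - z * se
upper <- diff_hat + z * se
print(lower)
print(upper)
```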

[1] 3.42393

[1] 8.57607


### # In-class Exercise: CI on $\mu_1-\mu_2$ with known $\sigma_1$ and $\sigma_2$

1. Generate 50 normal random numbers with $\mu_1 = 3$ and $\sigma_1 = 3$ and get the sample mean $\bar{x}_1$.
Hint: rnorm

2. Generate 75 normal random numbers with $\mu_2 = 2$ and $\sigma_2 = 4$ and get the sample mean $\bar{x}_2$.

3. Please construct a 95% confidence interval on $\mu_1-\mu_2$.
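The three steps can be sketched as follows. The seed is an arbitrary illustrative choice, so the limits will differ from any other run.

```r
set.seed(123)  # illustrative seed; results depend on the random draw

x1 <- rnorm(50, mean = 3, sd = 3)
x2 <- rnorm(75, mean = 2, sd = 4)

diff_hat <- mean(x1) - mean(x2)
se <- sqrt(3^2 / 50 + 4^2 / 75)  # uses the known population standard deviations
z  <- qnorm(0.975)

ci <- c(lower = diff_hat - z * se, upper = diff_hat + z * se)
print(ci)
```

Because the standard deviations are known, the width of the interval is fixed at $2 z_{\alpha/2}\sqrt{\sigma_1^2/n_1+\sigma_2^2/n_2} \approx 2.46$; only its center changes from sample to sample.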

[1] 0.9058725

[1] 1.683297


Question: If we don't know the variances, what should we do?

## # Confidence Interval on $\mu_1-\mu_2$ when $\sigma_1^2 = \sigma_2^2$ but Both Unknown

If $\bar{x}_1$ and $\bar{x}_2$ are the means of independent random samples of sizes $n_1$ and $n_2$, respectively, from $\textit{approximately normal populations}$ with $\textbf{unknown but equal variances}$, a $100(1-\alpha)\%$ confidence interval for $\mu_1-\mu_2$ is given by

$(\bar{x}_1-\bar{x}_2) - t_{n_1+n_2-2,\alpha/2}s_p\sqrt{\frac{1}{n_1} +\frac{1}{n_2}}< \mu_1-\mu_2 < (\bar{x}_1-\bar{x}_2) + t_{n_1+n_2-2,\alpha/2}s_p\sqrt{\frac{1}{n_1} +\frac{1}{n_2}}$

where

$s_p^2 = \frac{(n_1-1)s_1^2 +(n_2-1)s_2^2 }{n_1+n_2-2}$

is the pooled estimate of the common population variance (so $s_p$ is the pooled estimate of the population standard deviation) and $t_{n_1+n_2-2,\alpha/2}$ is the $t$-value with $v = n_1+n_2-2$ degrees of freedom, leaving an area of $\alpha/2$ to the right.

### # Example 3 t.test

In a study conducted at Virginia Tech on the development of ectomycorrhizal, a symbiotic relationship between the roots of trees and a fungus, in which minerals are transferred from the fungus to the trees and sugars from the trees to the fungus, 20 northern red oak seedlings exposed to the fungus Pisolithus tinctorus were grown in a greenhouse.

All seedlings were planted in the same type of soil and received the same amount of sunshine and water.

Half received no nitrogen at planting time, to serve as a control, and the other half received 368 ppm of nitrogen in the form NaNO3.

The stem weights, in grams, at the end of 140 days were recorded as follows:

No Nitrogen:
0.32 0.53 0.28 0.37 0.47 0.43 0.36 0.42 0.38 0.43

Nitrogen:
0.26 0.43 0.47 0.49 0.52 0.75 0.79 0.86 0.62 0.46


Construct a 95% confidence interval for the difference in the mean stem weight between seedlings that receive no nitrogen and those that receive 368 ppm of nitrogen. Assume the populations to be normally distributed with equal variances.
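A sketch of the pooled-variance calculation done by hand; the vector names `noNitro` and `Nitro` follow the printed output later in this example, and the remaining names are illustrative.

```r
noNitro <- c(0.32, 0.53, 0.28, 0.37, 0.47, 0.43, 0.36, 0.42, 0.38, 0.43)
Nitro   <- c(0.26, 0.43, 0.47, 0.49, 0.52, 0.75, 0.79, 0.86, 0.62, 0.46)

n1 <- length(noNitro)
n2 <- length(Nitro)

# Pooled estimate of the common standard deviation
sp <- sqrt(((n1 - 1) * var(noNitro) + (n2 - 1) * var(Nitro)) / (n1 + n2 - 2))

t_crit   <- qt(0.975, df = n1 + n2 - 2)
margin   <- t_crit * sp * sqrt(1 / n1 + 1 / n2)
diff_hat <- mean(noNitro) - mean(Nitro)

lower <- diff_hat - margin
upper <- diff_hat + margin
print(lower)
print(upper)
```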

[1] -0.2991579

[1] -0.03284212


Can we use t.test? Yes!

We can use a two-sample t-test.

Don't forget to add the argument var.equal = TRUE.
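A sketch of the corresponding call, reusing the data from this example:

```r
noNitro <- c(0.32, 0.53, 0.28, 0.37, 0.47, 0.43, 0.36, 0.42, 0.38, 0.43)
Nitro   <- c(0.26, 0.43, 0.47, 0.49, 0.52, 0.75, 0.79, 0.86, 0.62, 0.46)

# var.equal = TRUE requests the pooled-variance interval with n1 + n2 - 2 df
result <- t.test(noNitro, Nitro, var.equal = TRUE, conf.level = 0.95)
print(result)

ci <- result$conf.int
```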

    Two Sample t-test

data:  noNitro and Nitro
t = -2.6191, df = 18, p-value = 0.01739
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.29915788 -0.03284212
sample estimates:
mean of x mean of y
0.399     0.565


### # In-class Exercise A: CI on $\mu_1-\mu_2$ with unknown $\sigma_1 = \sigma_2$

1. Generate 50 normal random numbers with $\mu_1 = 3$ and $\sigma_1 = 3$ and get the sample mean $\bar{x}_1$.
Hint: rnorm

2. Generate 75 normal random numbers with $\mu_2 = 2$ and $\sigma_2 = 3$ and get the sample mean $\bar{x}_2$.

3. Please construct a 95% confidence interval on $\mu_1-\mu_2$, supposing we only know that $\sigma_1 = \sigma_2$.

Hint: You can use t-test directly.
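A possible solution is sketched below. The seed is an arbitrary illustrative choice, so the interval will differ between runs.

```r
set.seed(2024)  # illustrative seed; results depend on the random draw

x1 <- rnorm(50, mean = 3, sd = 3)
x2 <- rnorm(75, mean = 2, sd = 3)

# Pooled two-sample t interval, since the variances are assumed equal
result <- t.test(x1, x2, var.equal = TRUE, conf.level = 0.95)
print(result)

ci <- result$conf.int
```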

  Two Sample t-test

data:  x1 and x2
t = 1.8673, df = 123, p-value = 0.06424
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.0618207  2.1201473
sample estimates:
mean of x mean of y
3.393593  2.364429


### # In-class Exercise B: CI on $\mu_1-\mu_2$ with unknown $\sigma_1 = \sigma_2$

Let's play with the iris data set. We want to compare the population mean of Sepal.Length between Setosa and Virginica.

Suppose we know these two species have the same variance.

Can you construct a confidence interval on $\mu_{virginica}-\mu_{setosa}$ with 95% confidence?
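A sketch using the built-in `iris` data; `x1` and `x2` match the names in the printed output, with virginica first so that the interval is for $\mu_{virginica}-\mu_{setosa}$.

```r
x1 <- iris$Sepal.Length[iris$Species == "virginica"]
x2 <- iris$Sepal.Length[iris$Species == "setosa"]

result <- t.test(x1, x2, var.equal = TRUE, conf.level = 0.95)
print(result)

ci <- result$conf.int
```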

 Two Sample t-test

data:  x1 and x2
t = 15.386, df = 98, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1.377958 1.786042
sample estimates:
mean of x mean of y
6.588     5.006


Note: t.test can do more. If the population variances are unknown and not assumed equal, you can still use t.test to construct the confidence interval on $\mu_1-\mu_2$; with its default setting var.equal = FALSE, it uses the Welch approximation to the degrees of freedom.

# # Interval Estimate of a Proportion/Difference between Proportions

A point estimator of the proportion $p$ in a binomial experiment is given by the statistic $\hat{p} =x/n$ , where $x$ represents the number of successes in $n$ trials.

If the unknown proportion $p$ is not expected to be too close to 0 or 1, we can establish a confidence interval for $p$ by considering the sampling distribution of $\hat{p}$.

If each trial is coded as 1 for a success and 0 for a failure, the sample proportion $\hat{p} = x/n$ is the sample mean of these $n$ values.

Hence, by the Central Limit Theorem, for $n$ sufficiently large, $\hat{P}$ is approximately normally distributed with mean $\mu_{\hat{P}} = p$ and variance $\sigma^2_{\hat{p}} = \frac{p(1-p)}{n}$.

Therefore

$Z = \frac{\hat{p}-p}{\sqrt{p(1-p)/n}} \sim \mathcal{N}(0,1)$

approximately, and

$P(-z_{\alpha/2} < Z < z_{\alpha/2}) = 1-\alpha,$

where $z_{\alpha/2}$ is the $z$-value leaving an area of $\alpha/2$ to the right.

Note: If we don't have the exact $p$, we will use $\hat{p}$ to approximate $p$ under the radical sign.

## # Large-Sample Confidence Interval on $p$

If $\hat{p}$ is the proportion of successes in a random sample of size $n$, an approximate $100(1 -\alpha)\%$ confidence interval for the binomial parameter $p$ is given by

$\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},$

where $z_{\alpha/2}$ is the $z$-value leaving an area of $\alpha/2$ to the right.

### # Example 4

In a random sample of $n = 500$ families owning television sets in the city of Hamilton, Canada, it is found that $x = 340$ subscribe to HBO.

Find a 95% confidence interval for the actual proportion of families with television sets in this city that subscribe to HBO.
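A sketch of the large-sample interval computed from the formula; the variable names are illustrative.

```r
n <- 500
x <- 340
p_hat <- x / n  # point estimate of p

z  <- qnorm(0.975)
se <- sqrt(p_hat * (1 - p_hat) / n)  # p_hat replaces p under the radical

lower <- p_hat - z * se
upper <- p_hat + z * se
print(lower)
print(upper)
```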

[1] 0.6391123

[1] 0.7208877


Similarly, we can obtain a confidence interval on $p_1-p_2$.

## # Large-Sample Confidence Interval on $p_1-p_2$

If $\hat{p}_1$ and $\hat{p}_2$ are the proportions of successes in random samples of size $n_1$ and $n_2$, respectively, $\hat{q}_1 = 1-\hat{p}_1$, and $\hat{q}_2 = 1-\hat{p}_2$, an approximate $100(1 -\alpha)\%$ confidence interval for the difference of two binomial parameters, $p_1-p_2$, is given by

$(\hat{p}_1-\hat{p}_2) - z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}} < p_1-p_2 < (\hat{p}_1-\hat{p}_2) + z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$

where $z_{\alpha/2}$ is the $z$-value leaving an area of $\alpha/2$ to the right.

### # Example 5

A certain change in a process for manufacturing component parts is being considered.

Samples are taken under both the existing and the new process so as to determine if the new process results in an improvement.

If 75 of 1500 items from the existing process are found to be defective and 80 of 2000 items from the new process are found to be defective, find a 90% confidence interval for the true difference in the proportion of defectives between the existing and the new process.
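A sketch of the calculation, taking $p_1$ as the existing process and $p_2$ as the new process; the variable names are illustrative.

```r
p1 <- 75 / 1500; n1 <- 1500  # existing process
p2 <- 80 / 2000; n2 <- 2000  # new process

z  <- qnorm(0.95)  # 90% confidence: alpha/2 = 0.05
se <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

lower <- (p1 - p2) - z * se
upper <- (p1 - p2) + z * se
print(lower)
print(upper)
```

Since the interval contains 0, the data do not rule out the two processes having the same proportion of defectives.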

[1] -0.001731239

[1] 0.02173124


### # In-class Exercise: Interval Estimate of a Proportion/Difference between Proportions

There are two classifiers to detect spam emails.

For classifier A, 70 of 1000 emails are found to be spam; for classifier B, 100 of 1500 emails are found to be spam.

Construct a 95% confidence interval for the true difference in the proportion of spam emails between these two classifiers.
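A possible solution, taking the difference as classifier A minus classifier B; the variable names are illustrative.

```r
pA <- 70 / 1000;  nA <- 1000
pB <- 100 / 1500; nB <- 1500

z  <- qnorm(0.975)
se <- sqrt(pA * (1 - pA) / nA + pB * (1 - pB) / nB)

lower <- (pA - pB) - z * se
upper <- (pA - pB) + z * se
print(lower)
print(upper)
```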

[1] -0.016901

[1] 0.02356767


# # Estimating the Variance and the Ratio of Two Variances

## # Interval Estimate of the Variance

We already know that $S^2$ is an unbiased estimator of $\sigma^2$.

If a sample of size $n$ is drawn from a normal population with variance $\sigma^2$, an interval estimate of $\sigma^2$ can be established by using the statistic

$X^2 = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$

### # Confidence Interval for $\sigma^2$

If $s^2$ is the variance of a random sample of size $n$ from a $\textit{normal population}$, a $100(1-\alpha)\%$ confidence interval for $\sigma^2$ is given by

$\frac{(n-1)s^2}{\chi^2_{\alpha/2,n-1}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,n-1}}$

where $\chi^2_{1-\alpha/2,n-1}$ and $\chi^2_{\alpha/2,n-1}$ are values of the chi-squared distribution with $n-1$ degrees of freedom, leaving areas of $1-\alpha/2$ and $\alpha/2$, respectively, to the right.

### # Example 6

The following are the weights, in decagrams, of 10 packages of grass seed distributed by a certain company: 46.4, 46.1, 45.8, 47.0, 46.1, 45.9, 45.8, 46.9, 45.2, and 46.0.

Find a 95% confidence interval for the variance of the weights of all such packages of grass seed distributed by this company, assuming a normal population.
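A sketch using `qchisq`; the variable names are illustrative. Because `qchisq(p, df)` returns the value with area `p` to the left, the value leaving $\alpha/2$ to the right is `qchisq(1 - alpha / 2, df)`.

```r
weights <- c(46.4, 46.1, 45.8, 47.0, 46.1, 45.9, 45.8, 46.9, 45.2, 46.0)

n     <- length(weights)
s2    <- var(weights)  # sample variance
alpha <- 0.05

lower <- (n - 1) * s2 / qchisq(1 - alpha / 2, df = n - 1)
upper <- (n - 1) * s2 / qchisq(alpha / 2, df = n - 1)
print(lower)
print(upper)
```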

[1] 0.1354167

[1] 0.9539365


### # In-class Exercise: Confidence Interval for $\sigma^2$

Please construct a 90% confidence interval for the variance of Sepal.Length of all records in iris data set.
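A possible solution with the built-in `iris` data, assuming Sepal.Length is approximately normal; the variable names are illustrative.

```r
x <- iris$Sepal.Length

n     <- length(x)
s2    <- var(x)
alpha <- 0.10  # 90% confidence

lower <- (n - 1) * s2 / qchisq(1 - alpha / 2, df = n - 1)
upper <- (n - 1) * s2 / qchisq(alpha / 2, df = n - 1)
print(lower)
print(upper)
```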

[1] 0.5724186

[1] 0.8389097


## # Estimating the Ratio of Two Variances

A point estimate of the ratio of two population variances $\sigma^2_1/\sigma_2^2$ is given by the ratio $s_1^2/s_2^2$ of the sample variances.

Hence, the statistic $S_1^2/S_2^2$ is called an estimator of $\sigma^2_1/\sigma_2^2$.

If $\sigma^2_1$ and $\sigma_2^2$ are the variances of normal populations, we can use the statistic

$F = \frac{\sigma_2^2S_1^2}{\sigma_1^2S_2^2} \sim F(n_1-1,n_2-1)$

to establish an interval estimate of $\sigma_1^2/\sigma_2^2$.

### # Confidence Interval for $\sigma_1^2/\sigma_2^2$

If $s_1^2$ and $s_2^2$ are the variances of a random sample of size $n_1$ and $n_2$, respectively, from normal populations, then a $100(1-\alpha)\%$ confidence interval for $\sigma_1^2/\sigma_2^2$ is given by

$\frac{s_1^2}{s_2^2} \frac{1}{f_{\alpha/2}(n_1-1,n_2-1)} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{s_1^2}{s_2^2}f_{\alpha/2}(n_2-1,n_1-1)$

where $f_{\alpha/2}(n_1-1,n_2-1)$ is an $f$-value with $n_1-1$ and $n_2-1$ degrees of freedom, leaving an area of $\alpha/2$ to the right, and $f_{\alpha/2}(n_2-1,n_1-1)$ is an $f$-value with $n_2-1$ and $n_1-1$ degrees of freedom.

### # Example 7

A confidence interval for the difference in the mean orthophosphorus contents at two stations on the James River can be constructed by assuming the normal population variances to be unequal.

Orthophosphorus was measured in milligrams per liter.

Fifteen samples were collected from station 1, and 12 samples were obtained from station 2.

The 15 samples from station 1 had an average orthophosphorus content of 3.84 milligrams per liter and a standard deviation of 3.07 milligrams per liter, while the 12 samples from station 2 had an average content of 1.49 milligrams per liter and a standard deviation of 0.80 milligram per liter.

Justify this assumption by constructing 98% confidence intervals for $\sigma_1^2/\sigma_2^2$ and for $\sigma_1/\sigma_2$, where $\sigma_1^2$ and $\sigma_2^2$ are the variances of the populations of orthophosphorus contents at station 1 and station 2, respectively.
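A sketch of the interval for $\sigma_1^2/\sigma_2^2$ using `qf`; the variable names are illustrative. Note the swapped degrees of freedom in the upper limit.

```r
n1 <- 15; s1 <- 3.07  # station 1
n2 <- 12; s2 <- 0.80  # station 2
alpha <- 0.02         # 98% confidence

ratio <- s1^2 / s2^2  # point estimate of sigma_1^2 / sigma_2^2

lower <- ratio / qf(1 - alpha / 2, df1 = n1 - 1, df2 = n2 - 1)
upper <- ratio * qf(1 - alpha / 2, df1 = n2 - 1, df2 = n1 - 1)
print(lower)
print(upper)

# Interval for sigma_1 / sigma_2: square roots of the limits
print(sqrt(c(lower, upper)))
```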

[1] 3.430136

[1] 56.90341


Taking square roots of the confidence limits, we find that a 98% confidence interval for $\sigma_1/\sigma_2$ is

$1.852 < \frac{\sigma_1}{\sigma_2} < 7.543$

Since this interval does not allow for the possibility of $\sigma_1/\sigma_2$ being equal to 1, we were correct in assuming that $\sigma_1 \ne \sigma_2$ or $\sigma_1^2 \ne \sigma^2_2$.

### # In-class Exercise: Confidence Interval for $\sigma_1^2/\sigma_2^2$

Please construct a 96% confidence interval for the ratio of variances of Sepal.Length between the two species Setosa and Virginica in the iris data set, i.e., $\sigma^2_{virginica}/\sigma^2_{setosa}$. Assume that Sepal.Length is approximately normally distributed in both species.
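A possible solution computed from the formula with `qf`; the variable names are illustrative. The built-in `var.test(x1, x2, conf.level = 0.96)` should report the same interval in its `conf.int` component.

```r
x1 <- iris$Sepal.Length[iris$Species == "virginica"]
x2 <- iris$Sepal.Length[iris$Species == "setosa"]

n1 <- length(x1)
n2 <- length(x2)
alpha <- 0.04  # 96% confidence

ratio <- var(x1) / var(x2)  # point estimate of the variance ratio

lower <- ratio / qf(1 - alpha / 2, df1 = n1 - 1, df2 = n2 - 1)
upper <- ratio * qf(1 - alpha / 2, df1 = n2 - 1, df2 = n1 - 1)
print(lower)
print(upper)
```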

[1] 1.796658

[1] 5.89452


# # Conclusions

1. If we know the variances, we can use the $z$ value to estimate $\mu$ or $\mu_1-\mu_2$.

2. If we don't know the variances, we should use the $t$ value to estimate $\mu$ or $\mu_1-\mu_2$, provided the populations are approximately normal.

3. To estimate $\sigma^2$ for a normal distribution, we need the $\chi^2$ distribution; to estimate $\sigma_1^2/\sigma^2_2$, we need the $F$-distribution.

# # References

1. Probability & Statistics for Engineers & Scientists, 9th Edition, Ronald E. Walpole, Raymond H. Myers, Sharon L. Myers, Keying Ye, Prentice Hall