# Objectives 目标

Understand the difference between a point estimate and an interval estimate.
了解点估计和区间估计的区别
Know how to use R to construct confidence intervals for a mean or proportion, the difference of two means or proportions, a variance, and a ratio of variances.
知道如何使用 R 构建均值 / 比例、两个均值 / 比例的差值、方差和方差比的置信区间

# Point Estimate (Not good) 点估计

	# Generate standard normal random numbers with sample size 100
	samplesize<- 100
	normsample<- rnorm(samplesize)
	mean(normsample)

[1] 0.04918487

	median(normsample)

[1] 0.203471

	var(normsample)

[1] 1.020872

Let's increase the sample size to 10000

	# Generate standard normal random numbers with sample size 10000
	samplesize<- 10000
	normsample<- rnorm(samplesize)
	mean(normsample)

[1] 0.005888358

	median(normsample)

[1] 0.005134332

	var(normsample)

[1] 0.9931025

Which point estimator is 'better'?
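
One way to make "better" concrete is to compare how much each estimator varies across repeated samples. The sketch below is only an illustration: the seed, the number of replications `nrep`, and the variable names are assumptions that do not come from the notes above.

```r
# Compare the sampling variability of the mean and the median
# for standard normal data (illustrative sketch; seed and nrep are arbitrary).
set.seed(1)
nrep <- 2000        # number of repeated samples (assumed)
samplesize <- 100   # same sample size as the first example

means   <- replicate(nrep, mean(rnorm(samplesize)))
medians <- replicate(nrep, median(rnorm(samplesize)))

# Both estimators are centered near the true value 0 ...
c(mean_of_means = mean(means), mean_of_medians = mean(medians))

# ... but for normal data the sample mean has the smaller variance
c(var_of_means = var(means), var_of_medians = var(medians))
```

For normal data and large samples, the variance of the sample median is roughly $\pi/2 \approx 1.57$ times that of the sample mean, so the mean is the more efficient of the two here.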

# Interval Estimate 区间估计

# Confidence Interval on $\mu$ when $\sigma$ known 已知总体标准差 $\sigma$ ，求均值 $\mu$ 的置信区间

The interval $(\bar{X}-1.96\sigma/\sqrt{n},\bar{X}+1.96\sigma/\sqrt{n})$ contains the true population mean with probability 95%, where $\bar{X}$ is the sample mean, $n$ is the sample size, and $\sigma$ is population standard deviation.
区间 $(\bar{X}-1.96\sigma/\sqrt{n},\bar{X}+1.96\sigma/\sqrt{n})$ 有 95% 的概率包含真实的总体均值，其中 $\bar{X}$ 是样本均值， $n$ 是样本大小， $\sigma$ 是总体标准差。

How was this calculated?
这是如何计算的？
What does the "95% probability" mean?
“95% 的概率” 是什么意思？

According to the Central Limit Theorem, $\bar{X}$ is approximately normally distributed with mean $\mu$ and variance $\sigma^2/n$,
根据中心极限定理，样本 $\bar{X}$ 取自总体均值为 $\mu$ ，总体方差为 $\sigma^2/n$ 的近似正态分布

i.e.,

$Z = \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0,1)$

Let $z_{\alpha/2}$ denote the value such that
假设 $z_{\alpha/2}$ 表示

$P(-z_{\alpha/2} < Z <z_{\alpha/2}) = 1- \alpha.$

Since the confidence level is 95%, $\alpha = 1-95\% = 0.05$, so $\alpha/2 = 0.025$.
由于置信度为 95%，则 $\alpha = 1-95\% = 0.05$ ，即 $\alpha/2 = 0.025$ 。

How do we find $z_{\alpha/2}$?
如何找到 $z_{\alpha/2}$ ？

	z_0.025 <- qnorm(0.975) # why?
	z_0.025

[1] 1.959964
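
The `# why?` comment can be answered with a quick check: `qnorm(p)` returns the value with area `p` to its left, so leaving $\alpha/2 = 0.025$ in the right tail means asking for the 0.975 quantile. The following sketch uses only base R functions; the variable name `z` is chosen here for illustration.

```r
z <- qnorm(0.975)                  # value with area 0.975 to its left
pnorm(z) - pnorm(-z)               # central area between -z and z
1 - pnorm(z)                       # area left in the right tail
qnorm(0.025, lower.tail = FALSE)   # same quantile, specified by its right-tail area
```

The central area comes out as 0.95 and the right-tail area as 0.025, which is exactly the condition $P(-z_{\alpha/2} < Z < z_{\alpha/2}) = 1-\alpha$ with $\alpha = 0.05$.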

Why is a confidence interval better?
为什么置信区间更好？

The $100(1-\alpha)\%$ confidence interval provides an estimate of the accuracy of our point estimate.
$100(1-\alpha)\%$ 置信区间提供了对点估计准确性的评价。
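
The "95%" describes the procedure rather than any single interval: across many repeated samples, about 95% of the intervals built this way cover the true mean. A small simulation sketch can illustrate this; the seed, the true parameters `mu` and `sigma`, the sample size, and the number of replications are all assumptions chosen for illustration.

```r
set.seed(42)
mu <- 5; sigma <- 2      # assumed true parameters
n <- 30                  # sample size per experiment (assumed)
nrep <- 5000             # number of repeated experiments (assumed)
z <- qnorm(0.975)

covered <- replicate(nrep, {
  xbar <- mean(rnorm(n, mean = mu, sd = sigma))
  half <- z * sigma / sqrt(n)              # half-width of the interval
  (xbar - half < mu) && (mu < xbar + half) # does this interval contain mu?
})

mean(covered)   # proportion of intervals containing mu
```

The reported proportion should land close to 0.95, and it changes slightly with the seed.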

Question: How do we construct a 99% confidence interval?
如何构建 99% 置信区间？

# In-class Exercise: CI on $\mu$ with known variance 已知总体方差，求总体均值 $\mu$ 的置信区间

The average zinc concentration recovered from a sample of measurements taken in 36 different locations in a river is found to be 2.6 grams per milliliter.
从河流中不同位置的 36 个测量样本中回收的平均锌浓度为 2.6 克 / 毫升。

Find the 95% and 99% confidence intervals for the mean zinc concentration in the river.
找出河流中平均锌浓度的 95% 和 99% 置信区间。

Assume that the population standard deviation is 0.3 gram per milliliter.
假设总体标准偏差为 0.3 克 / 毫升。

	n <- 36
	sigma <- 0.3
	xbar <- 2.6
	alpha <- 0.05
	loBound <- xbar - qnorm(1-alpha/2)*sigma/sqrt(n)
	loBound

[1] 2.502002

	upBound <- xbar + qnorm(1-alpha/2)*sigma/sqrt(n)
	upBound

[1] 2.697998

	alpha <- 0.01
	loBound <- xbar - qnorm(1-alpha/2)*sigma/sqrt(n)
	loBound

[1] 2.471209

	upBound <- xbar + qnorm(1-alpha/2)*sigma/sqrt(n)
	upBound

[1] 2.728791

# Confidence Interval on $\mu$ when $\sigma^2$ unknown 总体方差 $\sigma^2$ 未知，求总体均值 $\mu$ 的置信区间

# If we don't know $\sigma$ 如果不知道总体标准差 $\sigma$ `t-distribution`

If the population follows a normal or approximately normal distribution, we can use the $t$-distribution.
如果总体服从正态分布或近似正态分布，我们可以使用 $t-$ 分布。

If we have a random sample from a normal distribution, then the random variable
如果我们有一个来自正态分布的随机样本，那么随机变量

$T = \frac{\bar{X}-\mu}{S/\sqrt{n}}$

has a Student $t$ -distribution with $n-1$ degrees of freedom.
呈自由度为 $n-1$ 的 Student $t-$ 分布。

Here $S$ is the sample standard deviation.
这里 $S$ 指样本标准差。

If $\bar{x}$ and $s$ are the mean and the standard deviation of a random sample of size $n$ from a $\textbf{normal}$ distribution with unknown variance $\sigma^2$ , a $100(1-\alpha)\%$ confidence interval for $\mu$ is given by
如果大小为 $n$ 的随机样本，样本均值为 $\bar{x}$ 、样本标准差为 $s$ ，取自具有未知总体方差 $\sigma^2$ 的正态分布总体，则总体均值 $\mu$ 的 $100(1 -\alpha)\%$ 置信区间由下式给出

$\bar{x} - t_{n-1,\alpha/2}\frac{s}{\sqrt{n}} < \mu < \bar{x} + t_{n-1,\alpha/2}\frac{s}{\sqrt{n}},$

where $t_{n-1,\alpha/2}$ is the $t-$ value with $v =n-1$ degrees of freedom, leaving an area of $\alpha/2$ to the right.
其中 $t_{n-1,\alpha/2}$ 表示自由度为 $v =n-1$ 、 $\alpha/2$ 右侧区域的 $t-$ 值。
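
The $t$ critical value is larger than the corresponding $z$ value, which widens the interval to account for estimating $\sigma$ by $s$; the gap shrinks as the degrees of freedom grow. A quick comparison, where the particular degrees of freedom are arbitrary choices for illustration:

```r
df <- c(2, 6, 19, 30, 100, 1000)   # arbitrary degrees of freedom
tcrit <- qt(0.975, df)             # t values leaving 0.025 to the right
data.frame(df = df, t = round(tcrit, 4), z = round(qnorm(0.975), 4))
```

The `t` column decreases toward the `z` column as `df` increases, which is why the two intervals nearly coincide for large samples.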

# Example 1 `t.test`

The contents of seven similar containers of sulfuric acid are 9.8, 10.2, 10.4, 9.8, 10.0, 10.2, and 9.6 liters.
七个类似容器的硫酸容量分别为 9.8、10.2、10.4、9.8、10.0、10.2 和 9.6 升。

Find a 95% confidence interval for the mean contents of all such containers, assuming an approximately normal distribution.
假设总体近似正态分布，计算所有此类容器容量平均值的 95% 置信区间。

	sulfruicAcid<-c(9.8,10.2,10.4,9.8,10.0,10.2,9.6)
	xbar <- mean(sulfruicAcid)
	s <-sd(sulfruicAcid)
	n <- 7
	t6_0.025<- qt(0.975,n-1)
	lowerBound <- xbar-t6_0.025*s/sqrt(n)
	lowerBound

[1] 9.738414

	upperBound <- xbar+t6_0.025*s/sqrt(n)
	upperBound

[1] 10.26159

Another way to get the confidence interval:
获得置信区间的另一种方法

	t.test(sulfruicAcid, conf.level = .95)

    One Sample t-test

data:  sulfruicAcid
t = 93.541, df = 6, p-value = 1.006e-10
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
  9.738414 10.261586
sample estimates:
mean of x 
       10

# In-class Exercise: CI on $\mu$ with unknown variance 总体方差未知，求总体均值 $\mu$ 的置信区间

Regular consumption of presweetened cereals contributes to tooth decay, heart disease, and other degenerative diseases, according to studies conducted by Dr. W. H. Bowen of the National Institute of Health and Dr. J. Yudben, Professor of Nutrition and Dietetics at the University of London.
一项研究，经常食用预先加糖的谷物会导致蛀牙、心脏病和其他退行性疾病。
In a random sample consisting of 20 similar single servings of Alpha-Bits, the average sugar content was 11.3 grams with a standard deviation of 2.45 grams.
在由 20 份类似的单份 Alpha-Bits 组成的随机样本中，平均含糖量为 11.3 克，标准偏差为 2.45 克。
Assuming that the sugar contents are normally distributed, construct a 95% confidence interval for the mean sugar content for single servings of Alpha-Bits.
假设糖含量呈正态分布，为单份 Alpha-Bits 的平均糖含量构建 95% 的置信区间。

	n <- 20
	xbar <- 11.3
	s <- 2.45
	alpha <- 0.05
	tvalue <- qt(1-alpha/2,df = n-1)
	loBound <- xbar - tvalue *s/sqrt(n)
	loBound

[1] 10.15336

	upBound <- xbar + tvalue *s/sqrt(n)
	upBound

[1] 12.44664

Please construct a 95% confidence interval for the mean of Sepal.Length in the Iris data set.
为 Iris 数据集中的 Sepal.Length 的均值构建 95% 的置信区间。

	t.test(iris$Sepal.Length)

 One Sample t-test

data:  iris$Sepal.Length
t = 86.425, df = 149, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 5.709732 5.976934
sample estimates:
mean of x 
 5.843333
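
When only the bounds are needed, they can be read from the object that `t.test` returns: its `conf.int` component holds the interval. The short sketch below uses the same data and checks the result against the formula; the variable names are chosen here for illustration.

```r
res <- t.test(iris$Sepal.Length, conf.level = 0.95)
ci <- res$conf.int        # numeric vector: lower and upper bound
ci
attr(ci, "conf.level")    # confidence level stored with the interval

# Same interval computed directly from the t formula
x <- iris$Sepal.Length
n <- length(x)
manual <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)
manual
```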

# Estimating the Difference between Two Means 评价两个均值之间的差异

Next, we want to investigate two populations with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$ , respectively.
调查分别有均值 $\mu_1$ 和 $\mu_2$ ，方差 $\sigma_1^2$ 和 $\sigma_2^2$ 的两个群体。

A point estimator of the difference between $\mu_1$ and $\mu_2$ is given by the statistic $\bar{X}_1-\bar{X}_2$.
可以用统计量 $\bar{X}_1-\bar{X}_2$ 表示 $\mu_1$ 和 $\mu_2$ 之间的差异。

# Confidence Interval on $\mu_1-\mu_2$ when $\sigma_1^2$ and $\sigma_2^2$ known 总体方差 $\sigma_1^2$ 和 $\sigma_2^2$ 已知时，求总体均值差异 $\mu_1-\mu_2$ 的置信区间

If $\bar{x}_1$ and $\bar{x}_2$ are the means of independent random samples of sizes $n_1$ and $n_2$ from populations with known variances $\sigma_1^2$ and $\sigma_2^2$ , respectively, a $100(1-\alpha)\%$ confidence interval for $\mu_1-\mu_2$ is given by
如果大小 $n_1$ 和 $n_2$ 的独立随机样本的均值分别是 $\bar{x}_1$ 和 $\bar{x}_2$ ，取自总体方差为 $\sigma_1^2$ 和 $\sigma_2^2$ 的两个总体，则总体均值差异 $\mu_1-\mu_2$ 的 $100(1-\alpha)\%$ 的置信区间为

$(\bar{x}_1-\bar{x}_2) - z_{\alpha/2}\sqrt{\frac{\sigma^2_1}{n_1} +\frac{\sigma^2_2}{n_2}}< \mu_1-\mu_2 < (\bar{x}_1-\bar{x}_2) + z_{\alpha/2}\sqrt{\frac{\sigma^2_1}{n_1} +\frac{\sigma^2_2}{n_2}}$

where $z_{\alpha/2}$ is the $z-$ value leaving an area of $\alpha/2$ to the right.
$z_{\alpha/2}$ 代表在 $\alpha/2$ 右侧区域取得的 $z-$ 值。

# Example 2

A study was conducted in which two types of engines, $A$ and $B,$ were compared.
一项研究，比较 $A$ 和 $B$ 两种类型的发动机。

Gas mileage, in miles per gallon, was measured.
以英里 / 加仑为单位测量汽油里程。

Fifty experiments were conducted using engine type $A$ and 75 experiments were done with engine type $B$ .
$A$ 型进行了 50 次实验， $B$ 型进行了 75 次。

The gasoline used and other conditions were held constant.
使用的汽油和其他条件保持不变。

The average gas mileage was 36 miles per gallon for engine $A$ and 42 miles per gallon for engine $B$ .
$A$ 发动机的平均油耗为每加仑 36 英里。 $B$ 发动机每加仑行驶 42 英里。

Find a 96% confidence interval on $\mu_B - \mu_A,$ where $\mu_A$ and $\mu_B$ are population mean gas mileages for engines $A$ and $B$ , respectively.
$\mu_A$ $\mu_B$ 分别是 $A$ $B$ 的均值，找到 $\mu_B - \mu_A,$ 96% 的置信区间。

Assume that the population standard deviations are 6 and 8 for engines $A$ and $B$ , respectively.
假设 $A$ $B$ 发动机的总体标准差分别为 6 和 8。

	alpha <- 0.04
	n1 <- 50
	x1 <- 36
	n2 <- 75
	x2 <- 42
	sigma1 <- 6
	sigma2 <- 8
	standerror <- sqrt(sigma1^2 /n1 + sigma2^2/n2)
	lowerBound <- x2-x1 - qnorm(1-alpha/2)*standerror
	lowerBound

[1] 3.42393

	upperBound <- x2-x1 + qnorm(1-alpha/2)*standerror
	upperBound

[1] 8.57607
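
The same calculation can be wrapped in a small function so that the confidence level and the order of the two samples are explicit. This is a sketch: `z_interval_diff` is a hypothetical helper written for this example, not a base R function.

```r
# Hypothetical helper (not part of base R): z interval for mu_1 - mu_2
# when both population standard deviations are known.
z_interval_diff <- function(xbar1, xbar2, sigma1, sigma2, n1, n2, conf.level = 0.95) {
  alpha <- 1 - conf.level
  se <- sqrt(sigma1^2 / n1 + sigma2^2 / n2)   # standard error of xbar1 - xbar2
  (xbar1 - xbar2) + c(-1, 1) * qnorm(1 - alpha / 2) * se
}

# Example 2: interval for mu_B - mu_A, so engine B is passed first
ci_BA <- z_interval_diff(42, 36, sigma1 = 8, sigma2 = 6,
                         n1 = 75, n2 = 50, conf.level = 0.96)
ci_BA
```

Swapping the two samples only flips the signs of the bounds, giving the interval for $\mu_A - \mu_B$.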

# In-class Exercise: CI on $\mu_1-\mu_2$ with known $\sigma_1$ and $\sigma_2$ 已知总体方差 $\sigma_1$ 和 $\sigma_2$ ，求总体均值差异 $\mu_1-\mu_2$ 的置信区间

Generate 50 normal random numbers with $\mu_1 = 3$ and $\sigma_1 = 3$ and get the sample mean $x_1$ .
Hint: rnorm
从 $\mu_1 = 3$ 和 $\sigma_1 = 3$ 的正态中生成 50 个随机数，并得到样本均值 $x_1$ 。
Generate 75 normal random numbers with $\mu_2 = 2$ and $\sigma_2 = 4$ and get the sample mean $x_2$.
从 $\mu_2 = 2$ 和 $\sigma_2 = 4$ 的正态中生成 75 个随机数，并得到样本均值 $x_2$ 。
Please construct a 95% confidence interval on $\mu_1-\mu_2$.
请构建两个总体均值差异 $\mu_1-\mu_2$ 的 95% 置信区间。

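	# Note: this run draws 500 and 750 values, ten times the sample sizes
	# stated in the exercise. Replace 500 and 750 with 50 and 75 (here and in
	# standerr) to follow the statement exactly; the interval is then wider.
	x1<-rnorm(500,mean = 3, sd = 3)
	x2<-rnorm(750,mean = 2, sd = 4)
	x1bar <- mean(x1)
	x2bar <- mean(x2)
	standerr <- sqrt(3^2/500+4^2/750)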
	alpha <- 0.05
	zalpha2 <- qnorm(1-alpha/2)
	loBound <- x1bar - x2bar -zalpha2*standerr
	loBound

[1] 0.9058725

	upBound <- x1bar - x2bar + zalpha2*standerr
	upBound

[1] 1.683297

Question: If we don't know the variances, what should we do?
如果我们不知道方差，我们应该怎么做？

# Confidence Interval on $\mu_1-\mu_2$ when $\sigma_1^2 = \sigma_2^2$ but Both Unknown 已知 $\sigma_1^2 = \sigma_2^2$ 但两者都未知，求 $\mu_1-\mu_2$ 的置信区间

If $\bar{x}_1$ and $\bar{x}_2$ are the means of independent random samples of sizes $n_1$ and $n_2$, respectively, from $\textit{approximately normal populations}$ with $\textbf{unknown but equal variances}$, a $100(1-\alpha)\%$ confidence interval for $\mu_1-\mu_2$ is given by
从近似正态、方差相等但未知的两个群体中，分别取大小为 $n_1$ 和 $n_2$ 的独立随机样本，样本均值分别为 $\bar{x}_1$ 和 $\bar{x}_2$ 。总体均值差异 $\mu_1-\mu_2$ 的 $100(1-\alpha)\%$ 置信区间为

$(\bar{x}_1-\bar{x}_2) - t_{n_1+n_2-2,\alpha/2}s_p\sqrt{\frac{1}{n_1} +\frac{1}{n_2}}< \mu_1-\mu_2 < (\bar{x}_1-\bar{x}_2) + t_{n_1+n_2-2,\alpha/2}s_p\sqrt{\frac{1}{n_1} +\frac{1}{n_2}}$

where

$s_p^2 = \frac{(n_1-1)S_1^2 +(n_2-1)S_2^2 }{n_1+n_2-2}$

is the pooled estimator of the common population variance (so $s_p$ estimates the common standard deviation) and $t_{n_1+n_2-2,\alpha/2}$ is the $t$-value with $v =n_1+n_2-2$ degrees of freedom, leaving an area of $\alpha/2$ to the right.
其中 $s_p^2$ 是共同总体方差的合并估计量， $t_{n_1+n_2-2,\alpha/2}$ 是自由度 $v =n_1+n_2-2$ 、右侧面积为 $\alpha/2$ 的 $t$ 值

# Example 3 `t.test`

In a study conducted at Virginia Tech on the development of ectomycorrhizal, a symbiotic relationship between the roots of trees and a fungus, in which minerals are transferred from the fungus to the trees and sugars from the trees to the fungus, 20 northern red oak seedlings exposed to the fungus Pisolithus tinctorus were grown in a greenhouse.
关于外生菌根发展的研究中，研究树根和真菌之间的共生关系，其中矿物质从真菌转移到树木，糖从树转移到真菌，20 棵暴露在真菌中的幼苗生长在温室中。

All seedlings were planted in the same type of soil and received the same amount of sunshine and water.
所有幼苗都种植在同一类型的土壤中，并得到同样数量的阳光和水。

Half received no nitrogen at planting time, to serve as a control, and the other half received 368 ppm of nitrogen in the form NaNO₃.
一半在种植时没有接受氮作为对照组，另一半以 NaNO₃ 的形式接受 368ppm 的氮。

The stem weights, in grams, at the end of 140 days were recorded as follows:
在 140 天结束时，以克为单位的茎重量记录如下：

No Nitrogen: 
0.32 0.53 0.28 0.37 0.47 0.43 0.36 0.42 0.38 0.43

Nitrogen: 
0.26 0.43 0.47 0.49 0.52 0.75 0.79 0.86 0.62 0.46

Construct a 95% confidence interval for the difference in the mean stem weight between seedlings that receive no nitrogen and those that receive 368 ppm of nitrogen. Assume the populations to be normally distributed with equal variances.
为不接受氮的幼苗和接受 368 ppm 氮的幼苗之间的平均茎重差异构建 95% 置信区间。假设总体呈正态分布，方差相等。

	noNitro<- c(0.32,0.53 ,0.28, 0.37, 0.47, 0.43, 0.36, 0.42,0.38,0.43)
	x1 <- mean(noNitro)
	s1 <- sd(noNitro)
	n1 <- 10
	Nitro<- c(0.26,0.43,0.47,0.49,0.52,0.75,0.79,0.86, 0.62,0.46)
	x2 <- mean(Nitro)
	s2<- sd(Nitro)
	n2<- 10
	sp<- sqrt(((n1-1)*s1^2 + (n2-1)*s2^2)/(n1+n2-2))  # pooled standard deviation
	alpha <- 0.05
	loBound<- x1-x2 - qt(1-alpha/2,n1+n2-2)*sp*sqrt(1/n1+1/n2)
	loBound

[1] -0.2991579

	upBound<-x1-x2 + qt(1-alpha/2,n1+n2-2)*sp*sqrt(1/n1+1/n2)
	upBound

[1] -0.03284212

Can we use t.test ? Yes!

We can use two sample t-test.
我们可以使用双样本 t 检验。

Don't forget to add the argument var.equal = TRUE.
不要忘记添加参数 var.equal = TRUE 。

	t.test(noNitro, Nitro, var.equal = TRUE, conf.level = .95)

    Two Sample t-test

data:  noNitro and Nitro
t = -2.6191, df = 18, p-value = 0.01739
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.29915788 -0.03284212
sample estimates:
mean of x mean of y 
    0.399     0.565

# In-class Exercise A: CI on $\mu_1-\mu_2$ with unknown $\sigma_1 = \sigma_2$

Generate 50 normal random numbers with $\mu_1 = 3$ and $\sigma_1 = 3$ and get the sample mean $x_1$ .
Hint: rnorm
从 $\mu_1 = 3$ 和 $\sigma_1 = 3$ 的正态总体取 50 个随机数，并得到样本均值 $x_1$ 。
Generate 75 normal random numbers with $\mu_2 = 2$ and $\sigma_2 = 3$ and get the sample mean $x_2$.
从 $\mu_2 = 2$ 和 $\sigma_2 = 3$ 的正态总体取 75 个随机数，并得到样本均值 $x_2$ 。
Please construct a 95% confidence interval on $\mu_1-\mu_2$, supposing we only know that $\sigma_1 = \sigma_2$.
请构建总体均值差异 $\mu_1-\mu_2$ 的 95% 置信区间，假设只知道 $\sigma_1 = \sigma_2$ 。

Hint: You can use t-test directly.

	x1<-rnorm(50,mean = 3, sd = 3)
	x2<-rnorm(75,mean = 2, sd = 3)
	t.test(x1,x2, var.equal = TRUE, conf.level = .95)

  Two Sample t-test
 
data:  x1 and x2
t = 1.8673, df = 123, p-value = 0.06424
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  -0.0618207  2.1201473
sample estimates:
mean of x mean of y 
    3.393593  2.364429

# In-class Exercise B: CI on $\mu_1-\mu_2$ with unknown $\sigma_1 = \sigma_2$

Let's play with the iris data set. We want to compare the population mean of Sepal.Length between Setosa and Virginica.
我们要比较 Setosa 和 Virginica 的 Sepal.Length 的总体均值

Suppose we know these two species have the same variance.
假设我们知道这两个物种具有相同的方差。

Can you construct a confidence interval on $\mu_{virginica}-\mu_{setosa}$ with 95% confidence?
建立 $\mu_{virginica}-\mu_{setosa}$ 的 95% 置信区间

	x1<-iris$Sepal.Length[iris$Species == 'virginica']
	x2<-iris$Sepal.Length[iris$Species == 'setosa']
	t.test(x1,x2, var.equal = TRUE, conf.level = .95)

 Two Sample t-test

data:  x1 and x2
t = 15.386, df = 98, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  1.377958 1.786042
sample estimates:
mean of x mean of y 
    6.588     5.006

Note: t.test can do more than this. If the population variances are unknown and unequal, you can still use t.test to construct the confidence interval on $\mu_1-\mu_2$: with var.equal = FALSE (the default) it uses the Welch approximation to the degrees of freedom.
t.test 其实可以提供更多的功能。如果总体方差未知但不同，仍然可以使用 t.test 来构建 $\mu_1-\mu_2$ 置信区间。
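
As an illustration of the unequal-variance case, the sketch below reruns the iris comparison without the equal-variance assumption. The object name `welch` is chosen here for illustration; `var.equal = FALSE` is the default of `t.test` and is written out only for clarity.

```r
x1 <- iris$Sepal.Length[iris$Species == "virginica"]
x2 <- iris$Sepal.Length[iris$Species == "setosa"]

# Welch two-sample interval: variances are not assumed equal
welch <- t.test(x1, x2, var.equal = FALSE, conf.level = 0.95)
welch$method       # name of the procedure that was applied
welch$conf.int     # interval for mu_virginica - mu_setosa
welch$parameter    # approximate, generally non-integer, degrees of freedom
```

The degrees of freedom are smaller than the pooled value $n_1+n_2-2 = 98$, so this interval is somewhat more conservative than the pooled one.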

# Interval Estimate of a Proportion/Difference between Proportions 比例 / 比例间差异的区间估计

A point estimator of the proportion $p$ in a binomial experiment is given by the statistic $\hat{p} =x/n$ , where $x$ represents the number of successes in $n$ trials.
在二项式实验中，比例的点估计 $p$ 值，由统计量 $\hat{p} = x/n$ 给出，其中 $x$ 表示 $n$ 次试验中的成功次数。

If the unknown proportion $p$ is not expected to be too close to 0 or 1, we can establish a confidence interval for $p$ by considering the sampling distribution of $\hat{p}$ .
如果未知比例 $p$ 预计不会太接近 0 或 1，可以通过考虑 $\hat{p} = x/n$ 的抽样分布来建立 $p$ 的置信区间。

If each trial is recorded as 1 for a success and 0 for a failure, the sample proportion $\hat{p} = x/n$ is the sample mean of these $n$ values.
样本比例 $\hat{p} = x/n$ 是这些 $n$ 值的样本均值。

Hence, by the Central Limit Theorem, for $n$ sufficiently large, $\hat{P}$ is approximately normally distributed with mean $\mu_{\hat{P}} = p$ and variance $\sigma^2_{\hat{P}} = \frac{p(1-p)}{n}$.
根据中心极限定理， $n$ 足够大， $\hat{P}$ 接近均值为 $\mu_{\hat{P}} = p$ ，方差为 $\sigma^2_{\hat{p}} = \frac{p(1-p)}{n}$ 的正态分布

Therefore

$Z = \frac{\hat{p}-p}{\sqrt{p(1-p)/n}} \sim \mathcal{N}(0,1)$

and $P(-z_{\alpha/2} < Z < z_{\alpha/2}) \approx 1-\alpha$, where $z_{\alpha/2}$ is the $z$-value leaving an area of $\alpha/2$ to the right.
其中 $z_{\alpha/2}$ 指 $\alpha/2$ 右侧区域取得的 $z$ 值

Note: If we don't have the exact $p$ , we will use $\hat{p}$ to approximate $p$ under the radical sign.
如果 $p$ 值不确定，将 $\hat{p}$ 表示使用 $p$ 近似值

# Large-Sample Confidence Interval on $p$ 大样本比例 $p$ 的置信区间

If $\hat{p}$ is the proportion of successes in a random sample of size $n$, an approximate $100(1 -\alpha)\%$ confidence interval for the binomial parameter $p$ is given by
如果 $\hat{p}$ 是大小为 $n$ 的随机样本中成功的比例，则二元参数 $p$ 的大致 $100(1 -\alpha)\%$ 置信区间为

$\hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},$

where $z_{\alpha/2}$ is the $z$ -value leaving an area of $\alpha/2$ to the right.
其中 $z_{\alpha/2}$ 指 $\alpha/2$ 右侧区域取得的 $z$ 值

# Example 4

In a random sample of $n = 500$ families owning television sets in the city of Hamilton, Canada, it is found that $x = 340$ subscribe to HBO.
在拥有电视机的 $n = 500$ 的家庭随机抽样中，发现平均 $x = 340$ 订阅 HBO。

Find a 95% confidence interval for the actual proportion of families with television sets in this city that subscribe to HBO.
构建本市拥有电视机的家庭的实际比例的 95% 置信区间。

	n <- 500 # sample size is large enough
	x <- 340
	phat <- x/n
	sigma <- sqrt((1-phat)*phat)
	alpha <- 0.05
	loBound <- phat - qnorm(1-alpha/2)*sigma/sqrt(n)
	loBound

[1] 0.6391123

	upBound <- phat + qnorm(1-alpha/2)*sigma/sqrt(n)
	upBound

[1] 0.7208877
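
The large-sample formula can be packaged as a reusable function. This is a sketch: `wald_interval` is a hypothetical helper written for this example, not a base R function.

```r
# Hypothetical helper (not part of base R): large-sample interval for p
wald_interval <- function(x, n, conf.level = 0.95) {
  alpha <- 1 - conf.level
  phat <- x / n
  se <- sqrt(phat * (1 - phat) / n)     # estimated standard error of phat
  phat + c(-1, 1) * qnorm(1 - alpha / 2) * se
}

ci95 <- wald_interval(340, 500)                      # Example 4 at 95%
ci99 <- wald_interval(340, 500, conf.level = 0.99)   # wider interval at 99%
ci95
ci99
```

Base R's `prop.test` reports a one-sample interval based on a different (score) method, by default with a continuity correction, so its bounds will generally not match this formula exactly.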

Similarly, we can obtain a confidence interval on $p_1-p_2$ .

# Large-Sample Confidence Interval on $p_1-p_2$ 大样本比例差异 $p_1-p_2$ 的置信区间

If $\hat{p}_1$ and $\hat{p}_2$ are the proportions of successes in random samples of size $n_1$ and $n_2$ , respectively, $\hat{q}_1 = 1-\hat{p}_1$ , and $\hat{q}_2 = 1-\hat{p}_2$ , an approximate $100(1 -\alpha)\%$ confidence interval for the difference of two binomial parameters, $p_1-p_2$ , is given by
大小为 $n_1$ 和 $n_2$ 的两个随机样本中的成功比例分别为 $\hat{p}_1$ 和 $\hat{p}_2$ ，（失败比例） $\hat{q}_1 = 1-\hat{p}_1$ 、 $\hat{q}_2 = 1-\hat{p}_2$ ，两个二元参数的差异 $p_1-p_2$ 的 $100(1 -\alpha)\%$ 置信区间为：

$(\hat{p}_1-\hat{p}_2) - z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}} < p_1-p_2 < (\hat{p}_1-\hat{p}_2) + z_{\alpha/2}\sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}}$

where $z_{\alpha/2}$ is the $z$ -value leaving an area of $\alpha/2$ to the right.
其中 $z_{\alpha/2}$ 指 $\alpha/2$ 右侧区域取得的 $z$ 值

# Example 5

A certain change in a process for manufacturing component parts is being considered.
在考虑对制造组件的过程进行某种更改。

Samples are taken under both the existing and the new process so as to determine if the new process results in an improvement.
在现有流程和新流程下都抽取样本，以确定新流程是否会带来改进。

If 75 of 1500 items from the existing process are found to be defective and 80 of 2000 items from the new process are found to be defective, find a 90% confidence interval for the true difference in the proportion of defectives between the existing and the new process.
如果发现现有流程的 1500 个项目中有 75 个有缺陷，而新流程的 2000 个项目中有 80 个被发现有缺陷，则找到现有流程和新流程之间缺陷比例的真实差异的 90% 置信区间过程。

	n1<-1500
	n2<-2000
	p1<- 75/n1
	p2<- 80/n2
	se<- sqrt(p1*(1-p1)/n1+p2*(1-p2)/n2)
	alpha <- 0.1
	loBound<- p1-p2 - qnorm(1-alpha/2)*se
	loBound

[1] -0.001731239

	upBound<- p1-p2 + qnorm(1-alpha/2)*se
	upBound

[1] 0.02173124
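
For two proportions, base R's `prop.test` can produce the same interval when the continuity correction is switched off with `correct = FALSE`. The sketch below reruns Example 5 this way; the object name `res` is chosen here for illustration.

```r
# Example 5 via prop.test: x = number of defectives, n = sample sizes
res <- prop.test(x = c(75, 80), n = c(1500, 2000),
                 conf.level = 0.90, correct = FALSE)
res$conf.int    # interval for p1 - p2 (existing minus new process)
res$estimate    # the two sample proportions
```

With the default `correct = TRUE`, the interval is slightly wider than the one computed by hand.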

# In-class Exercise: Interval Estimate of a Proportion/Difference between Proportions

There are two classifiers to detect spam emails.
有两个分类器可以检测垃圾邮件。

For classifier A, 70 of 1000 emails are found to be spam; for classifier B, 100 of 1500 emails are found to be spam.
对于分类器 A，发现 1000 封电子邮件中有 70 封是垃圾邮件；对于分类器 B，发现 1500 封电子邮件中有 100 封是垃圾邮件。

Construct a 95% confidence interval for the true difference in the proportion of spam emails between these two classifiers.
为这两个分类器之间垃圾邮件比例的真实差异构建 95% 置信区间。

	n1 <- 1000
	n2 <- 1500
	p1hat <- 70/n1
	p2hat <- 100/n2
	se <- sqrt(p1hat*(1-p1hat)/n1 + p2hat*(1-p2hat)/n2)
	alpha <- 0.05
	loBd <- p1hat-p2hat - qnorm(1-alpha/2)*se
	loBd

[1] -0.016901

	upBd <- p1hat-p2hat + qnorm(1-alpha/2)*se
	upBd

[1] 0.02356767

# Estimating the Variance and the Ratio of Two Variances 估计方差和两个方差的比率

# Interval Estimate of the Variance 方差的区间估计

We already know $S^2$ is an unbiased estimator of $\sigma^2$.
已知 $S^2$ 是总体方差 $\sigma^2$ 的无偏估计量。

If a sample of size $n$ is drawn from a normal population with variance $\sigma^2$ , an interval estimate of $\sigma^2$ can be established by using the statistic
如果大小为 $n$ 的样本，取自方差 $\sigma^2$ 的总体， $\sigma^2$ 的区间估计可以使用以下公式计算

$X^2 = \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$

# Confidence Interval for $\sigma^2$ 总体方差 $\sigma^2$ 的置信区间

If $s^2$ is the variance of a random sample of size $n$ from a $\textit{normal population}$ , a $100(1-\alpha)\%$ confidence interval for $\sigma^2$ is given by
大小为 $n$ 的随机样本取自正态分布总体，样本方差为 $s^2$ ，则总体方差 $\sigma^2$ 的 $100(1-\alpha)\%$ 置信区间可以表示为

$\frac{(n-1)S^2}{\chi^2_{\alpha/2,n-1}} < \sigma^2 < \frac{(n-1)S^2}{\chi^2_{1-\alpha/2,n-1}}$

where $\chi^2_{1-\alpha/2,n-1}$ and $\chi^2_{\alpha/2,n-1}$ are values of the chi-squared distribution with $n-1$ degrees of freedom, leaving areas of $1-\alpha/2$ and $\alpha/2$ , respectively, to the right.
其中， $\chi^2_{1-\alpha/2,n-1}$ 和 $\chi^2_{\alpha/2,n-1}$ 表示自由度为 $n-1$ 的， $1-\alpha/2$ 和 $\alpha/2$ 区域的卡方分布的值

# Example 6

The following are the weights, in decagrams, of 10 packages of grass seed distributed by a certain company: 46.4, 46.1, 45.8, 47.0, 46.1, 45.9, 45.8, 46.9, 45.2, and 46.0.
某公司经销草种 10 包重量如下

Find a 95% confidence interval for the variance of the weights of all such packages of grass seed distributed by this company, assuming a normal population.
假设为正态总体，请计算该公司分发的所有此类草种包的重量的方差的 95% 置信区间。

	weight <-c(46.4,46.1, 45.8, 47.0, 46.1, 45.9, 45.8, 46.9, 45.2,46.0)
	n<- length(weight)
	ssquare <- var(weight)
	alpha <- 0.05 # 95% confidence interval
	loBd <- (n-1)*ssquare/qchisq(1-alpha/2,n-1)
	loBd

[1] 0.1354167

	upBd <- (n-1)*ssquare/qchisq(alpha/2,n-1)
	upBd

[1] 0.9539365
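
The chi-squared interval can also be wrapped in a function that takes the raw data. This is a sketch under the same normality assumption: `var_interval` is a hypothetical helper written for this example, not a base R function.

```r
# Hypothetical helper (not part of base R): chi-squared interval for sigma^2,
# assuming the data come from a normal population.
var_interval <- function(x, conf.level = 0.95) {
  alpha <- 1 - conf.level
  n <- length(x)
  num <- (n - 1) * var(x)
  c(lower = num / qchisq(1 - alpha / 2, df = n - 1),
    upper = num / qchisq(alpha / 2, df = n - 1))
}

weight <- c(46.4, 46.1, 45.8, 47.0, 46.1, 45.9, 45.8, 46.9, 45.2, 46.0)
ci_var <- var_interval(weight)
ci_var          # interval for the variance
sqrt(ci_var)    # interval for the standard deviation
```

Taking square roots of both limits gives an interval for $\sigma$ at the same confidence level, because the square root is an increasing function.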

# In-class Exercise: Confidence Interval for $\sigma^2$

Please construct a 90% confidence interval for the variance of Sepal.Length of all records in iris data set.
请为数据集 iris 中 Sepal.Length 所有记录的方差构建一个 90% 的置信区间。

	len <- iris$Sepal.Length
	n <- length(len)
	sampleVar <- var(len)
	alpha <- 0.1
	loBd <- (n-1)*sampleVar/qchisq(1-alpha/2,n-1)
	loBd

[1] 0.5724186

	upBd <- (n-1)*sampleVar/qchisq(alpha/2,n-1)
	upBd

[1] 0.8389097

# Estimating the Ratio of Two Variances 估计两个方差比

A point estimate of the ratio of two population variances $\sigma^2_1/\sigma_2^2$ is given by the ratio $s_1^2/s_2^2$ of the sample variances.
通过样本方差比为 $s_1^2/s_2^2$ ，对两个总体方差比 $\sigma^2_1/\sigma_2^2$ 进行点估计。

Hence, the statistic $S_1^2/S_2^2$ is called an estimator of $\sigma^2_1/\sigma_2^2$ .
因此，统计量 $S_1^2/S_2^2$ 被称为 $\sigma^2_1/\sigma_2^2$ 的估计。

If $\sigma^2_1$ and $\sigma_2^2$ are the variances of normal populations, we can use the statistic
如果 $\sigma^2_1$ 和 $\sigma_2^2$ 是正态总体的方差，我们可以使用统计量

$F = \frac{\sigma_2^2S_1^2}{\sigma_1^2S_2^2} \sim F(n_1-1,n_2-1)$

to establish an interval estimate of $\sigma_1^2/\sigma_2^2$ .
来建立 $\sigma_1^2/\sigma_2^2$ 的区间估计值。

# Confidence Interval for $\sigma_1^2/\sigma_2^2$ 总体方差比 $\sigma_1^2/\sigma_2^2$ 的置信区间

If $s_1^2$ and $s_2^2$ are the variances of a random sample of size $n_1$ and $n_2$ , respectively, from normal populations, then a $100(1-\alpha)\%$ confidence interval for $\sigma_1^2/\sigma_2^2$ is given by
两个随机样本取自正态总体，大小分别为 $n_1$ 和 $n_2$ ，方差分别为 $s_1^2$ 和 $s_2^2$ 。总体方差比 $\sigma_1^2/\sigma_2^2$ 的 $100(1-\alpha)\%$ 置信区间可以表示为

$\frac{s_1^2}{s_2^2} \frac{1}{f_{\alpha/2}(n_1-1,n_2-1)} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{s_1^2}{s_2^2}f_{\alpha/2}(n_2-1,n_1-1)$

where $f_{\alpha/2}(n_1-1,n_2-1)$ is an $f$ -value with $n_1-1$ and $n_2-1$ degrees of freedom, leaving an area of $\alpha/2$ to the right, and $f_{\alpha/2}(n_2-1,n_1-1)$ is an $f$ -value with $n_2-1$ and $n_1-1$ degrees of freedom.
其中， $f_{\alpha/2}(n_1-1,n_2-1)$ 指取自自由度为 $n_1-1$ 和 $n_2-1$ ， $\alpha/2$ 右侧区域的 $f$ 值； $f_{\alpha/2}(n_2-1,n_1-1)$ 指自由度 $n_2-1$ 和 $n_1-1$ 的 $f$ 值

# Example 7

A confidence interval for the difference in the mean orthophosphorus contents, measured in milligrams per liter, at two stations on the James River can be constructed by assuming the normal population variances to be unequal.
河的两个站，假设正态总体的方差是不相等的，以每升毫克来测量两个站的平均磷含量的方差的置信区间。

Orthophosphorus was measured in milligrams per liter.
磷以每升毫克为单位测量。

Fifteen samples were collected from station 1, and 12 samples were obtained from station 2.
从 1 号站收集了 15 个样本，从 2 号站采集了 12 个样本。

The 15 samples from station 1 had an average orthophosphorus content of 3.84 milligrams per liter and a standard deviation of 3.07 milligrams per liter, while the 12 samples from station 2 had an average content of 1.49 milligrams per liter and a standard deviation of 0.80 milligram per liter.
1 号站的 15 个样品平均磷含量为每升 3.84 毫克，标准差为每升 3.07 毫克；2 号站的 12 个样品平均含量为每升 1.49 毫克，标准差为每升 0.80 毫克。

Justify this assumption by constructing 98% confidence intervals for $\sigma_1^2/\sigma_2^2$ and for $\sigma_1/\sigma_2$, where $\sigma_1^2$ and $\sigma_2^2$ are the variances of the populations of orthophosphorus contents at station 1 and station 2, respectively.
$\sigma_1^2$ 和 $\sigma_2^2$ 分别是 1 号站和 2 号站的磷含量总体方差，评估 $\sigma_1^2/\sigma_2^2$ 和 $\sigma_1/\sigma_2$ 的 98% 的置信区间

	n1 <- 15
	n2 <- 12
	s1 <- 3.07
	s2 <- 0.80
	alpha <- 0.02
	loBd <- s1^2/s2^2/qf(1-alpha/2, n1-1, n2-1)
	loBd

[1] 3.430136

	upBd <- s1^2/s2^2*qf(1-alpha/2, n2-1, n1-1)
	upBd

[1] 56.90341

Taking square roots of the confidence limits, we find that a 98% confidence interval for $\sigma_1/\sigma_2$ is
取置信区间的平方根，我们发现 $\sigma_1/\sigma_2$ 98% 的置信区间是

$1.852 < \frac{\sigma_1}{\sigma_2} < 7.543$

Since this interval does not allow for the possibility of $\sigma_1/\sigma_2$ being equal to 1, we were correct in assuming that $\sigma_1 \ne \sigma_2$ or $\sigma_1^2 \ne \sigma^2_2$.
由于此区间不存在 $\sigma_1/\sigma_2$ 等于 1 的可能性，因此 $\sigma_1 \ne \sigma_2$ 或 $\sigma_1^2 \ne \sigma^2_2$ 的假设是正确的

# In-class Exercise: Confidence Interval for $\sigma_1^2/\sigma_2^2$ 总体方差比 $\sigma_1^2/\sigma_2^2$ 的置信区间

Please construct a 96% confidence interval for the ratio of variances of Sepal.Length between the two species Setosa and Virginica in the iris data set, i.e., $\sigma^2_{virginica}/\sigma^2_{setosa}$, assuming that Sepal.Length in both species is approximately normally distributed.
请计算两个群体 Setosa 和 Virginica 的 Sepal.Length 的方差比的 96% 置信区间。假设这两个总体近似正态分布。

	x1<-iris$Sepal.Length[iris$Species == 'virginica']
	x2<-iris$Sepal.Length[iris$Species == 'setosa']
	n1 <- length(x1)
	sampleVar1 <- var(x1)
	n2 <- length(x2)
	sampleVar2 <- var(x2)
	alpha <- 0.04
	loBd <- sampleVar1/sampleVar2/qf(1-alpha/2,n1-1, n2 -1)
	loBd

[1] 1.796658

	upBd <- sampleVar1/sampleVar2*qf(1-alpha/2,n2-1, n1 -1)
	upBd

[1] 5.89452
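
Base R's `var.test` reports a confidence interval for the ratio of two variances, which gives a way to cross-check the hand calculation. The sketch below uses the same two samples; the names `res` and `manual` are chosen here for illustration.

```r
x1 <- iris$Sepal.Length[iris$Species == "virginica"]
x2 <- iris$Sepal.Length[iris$Species == "setosa"]

res <- var.test(x1, x2, conf.level = 0.96)
res$estimate    # sample variance ratio s1^2 / s2^2
res$conf.int    # interval for sigma1^2 / sigma2^2

# Same limits from the F quantiles used above
alpha <- 0.04
manual <- c(var(x1) / var(x2) / qf(1 - alpha / 2, length(x1) - 1, length(x2) - 1),
            var(x1) / var(x2) * qf(1 - alpha / 2, length(x2) - 1, length(x1) - 1))
manual
```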

# Conclusions

If we know the variances, we can use $z$ value to estimate $\mu$ or $\mu_1-\mu_2$ .
如果我们知道总体方差，我们可以使用 $z$ 值估计总体均值 $\mu$ 或者总体均值差 $\mu_1-\mu_2$ 。
If we don't know the variances, we should use the $t$ value to estimate $\mu$ or $\mu_1-\mu_2$, provided the populations are approximately normal.
如果我们不知道总体方差，我们应该使用 $t$ 值估计总体均值 $\mu$ 或者总体均值差 $\mu_1-\mu_2$ ，并且总体的分布近似正态。
To estimate $\sigma^2$ for normal distributions, we need the $\chi^2$ distribution; to estimate $\sigma_1^2/\sigma^2_2$, we need the $F$-distribution.
估计正态分布的 $\sigma^2$ ，需要 $\chi^2$ 分布；估计 $\sigma_1^2/\sigma^2_2$ ，需要 $F$ 分布。

# References

Probability & Statistics for Engineers & Scientists, 9th Edition, Ronald E. Walpole, Raymond H. Myers, Sharon L. Myers, Keying Ye, Prentice Hall
