# Objectives 目标

  • Understand Bernoulli Distribution and Binoimal Distribution
    了解伯努利分布二项分布
    Bernoulli and Binomial distributions are related to experiments with binary outcomes (success or failure).
    伯努利分布和二项分布与二元结果(成功或失败)的实验有关。
  • Understand the interpretation of these distributions
    理解这些分布的解释
  • Know how to do the simulations of these two experiments and barplot
    知道如何做这两个实验的模拟和 barplot
  • Know how to use sample to generate training and testing data set
    知道如何使用 sample 生成训练和测试数据集

# Bernoulli Distribution 伯努利分布

Bernoulli Experiment 伯努利实验
A single random experiment with outcome either success or failure, and the probability of the success is pp.
一个随机实验,结果要么成功要么失败,成功的概率是 pp
Examples: Flip a single coin; Attempt a free throw
例子: 抛一枚硬币;尝试罚球
Bernoulli Random Variable 伯努利随机变量
A random variable that takes either 11 or 00, where it takes 1 when the outcome of the Bernoulli trial is a success and 0 when the outcome of the Bernoulli trial is a failure.
一个随机变量 11 或者 00,当伯努利试验的结果成功时取 1,当伯努利试验的结果失败时取 0。
Bernoulli Distribution 伯努利分布
The distribution of a Bernoulli random variable is called the Bernoulli distribution.
伯努利随机变量的分布称为伯努利分布。
  • Probability Mass Function of Bernoulli Distribution/Random Variable:
    伯努利分布 / 随机变量的概率质量函数:
xx01
f(x)f(x)1p1-ppp
  • Expectation and Variance 期望值和方差值

μ=E(X)=(0)(1p)+(1)p=p,σ2=var(X)=(0)2(1p)+(1)2pp2=p(1p)\mu = E(X) = (0)(1-p)+(1)p = p, \\ \sigma^2 = var(X) = (0)^2(1-p)+(1)^2p -p^2 = p(1-p)

# A Simple Example: We toss a fair coin. 扔一枚公平的硬币

  • Each time of the tossing coin is a Bernoulli Experiment. Consider head is success (1).
    每次抛硬币都是伯努利实验。考虑头像面是成功(1)。

  • What is the probability pp?
    概率pp 是多少?

    n <- 10 # number of random experiments
    x <- c(0,1) # sample space for tossing a coin, 0--Tail, 1--Head
    coinToss <- sample(x, size=n, replace=TRUE)
    # generate n outcomes of the experiments
    mean(coinToss)
    [1] 0.4
    
  • What about 100 times? 100 次呢?

    n <- 100 # number of random experiments
    x <- c(0,1) # sample space for tossing a coin, 0--Tail, 1--Head
    coinToss <- sample(x, size=n, replace=TRUE)
    # generate n outcomes of the experiments
    mean(coinToss)
    [1] 0.5
    
  • Seems close to pp. Let's try n=1000000n = 1000000.
    似乎接近pp 了,再试试 n=1000000n = 1000000

    n <- 1000000 # number of random experiments
    x <- c(0,1) # sample space for tossing a coin, 0--Tail, 1--Head
    coinToss <- sample(x, size=n, replace=TRUE)
    # generate n outcomes of the experiments
    mean(coinToss)
    [1] 0.5
    
  • We can use a barplot to generate the frequence for coinToss
    我们可以使用 abarplot 来生成频率 coinToss

    barplot(table(coinToss)) #

# the sample function is also used to split a data set

sample 函数还用于拆分数据集

例如机器学习数据挖掘部分,将考虑 80% 作为训练数据,20% 作为测试数据。

cars
i <- sample(2, nrow(cars), replace=TRUE, prob=c(0.8, 0.2)) # prob 后面跟着的是两个概率
carsTrainingData <- cars[i==1,] # 等于 1 的数据归于训练集
carsTestingData <- cars[i==2,] # 等于 2 的数据归于测试集
summary(cars)
    speed           dist       
Min.   : 4.0   Min.   :  2.00  
1st Qu.:12.0   1st Qu.: 26.00  
Median :15.0   Median : 36.00  
Mean   :15.4   Mean   : 42.98  
3rd Qu.:19.0   3rd Qu.: 56.00  
Max.   :25.0   Max.   :120.00  
summary(carsTrainingData)
    speed            dist      
Min.   : 4.00   Min.   :  2.0  
1st Qu.:12.00   1st Qu.: 26.0  
Median :15.00   Median : 35.0  
Mean   :15.19   Mean   : 41.9  
3rd Qu.:18.75   3rd Qu.: 54.0  
Max.   :24.00   Max.   :120.0  
summary(carsTestingData)
    speed           dist      
Min.   : 7.0   Min.   :16.00  
1st Qu.:11.0   1st Qu.:23.50  
Median :18.0   Median :52.00  
Mean   :16.5   Mean   :48.62  
3rd Qu.:21.0   3rd Qu.:68.50  
Max.   :25.0   Max.   :85.00  

# In-class Exercise

  1. Please use the above the code to divide the iris data set into two parts, 80%80\% in training data set and 20%20\% in testing data set

    请使用上面的代码将 iris 数据集分成两部分,80%80\% 在训练数据集和 20%20\% 在测试数据集

    iris
    i <- sample(2, nrow(iris), replace=TRUE, prob=c(0.8, 0.2)) # prob 后面跟着的是两个概率
    irisTrainingData <- iris[i==1,] # 等于 1 的数据归于训练集
    irisTestingData <- iris[i==2,] # 等于 2 的数据归于测试集
    summary(iris)
     Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
    Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
    1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
    Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
    Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
    3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
    Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500             
    
  2. Do some data exploration in these two data sets. E.g. summary , histogram , boxplot , or barplot

    在这两个数据集中做一些数据探索。

    summary(irisTrainingData)
     Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
    Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :40  
    1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:44  
    Median :5.800   Median :3.000   Median :4.400   Median :1.400   virginica :41  
    Mean   :5.853   Mean   :3.074   Mean   :3.783   Mean   :1.217                  
    3rd Qu.:6.400   3rd Qu.:3.400   3rd Qu.:5.100   3rd Qu.:1.800                  
    Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500          
    
    summary(irisTestingData)
     Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
    Min.   :4.400   Min.   :2.200   Min.   :1.200   Min.   :0.200   setosa    :10  
    1st Qu.:5.000   1st Qu.:2.600   1st Qu.:1.500   1st Qu.:0.200   versicolor: 6  
    Median :5.700   Median :3.000   Median :4.000   Median :1.300   virginica : 9  
    Mean   :5.796   Mean   :2.976   Mean   :3.632   Mean   :1.112                  
    3rd Qu.:6.300   3rd Qu.:3.200   3rd Qu.:5.200   3rd Qu.:1.800                  
    Max.   :7.700   Max.   :3.900   Max.   :6.700   Max.   :2.300  
    
    par(mfrow=c(1,3))
    hist(iris$Sepal.Length)
    hist(irisTrainingData$Sepal.Length)
    hist(irisTestingData$Sepal.Length)

    par(mfrow=c(1,3))
    boxplot(iris$Petal.Width)
    boxplot(irisTrainingData$Petal.Width)
    boxplot(irisTestingData$Petal.Width)

    par(mfrow=c(1,3))
    barplot(table(iris$Species))
    barplot(table(irisTrainingData$Species))
    barplot(table(irisTestingData$Species))

# If we have an unfair coin 如果扔一个不公平的硬币

what should we do? 我们应该怎么做?

  • Suppose we have one unfair coin, whose probabilities of heads are p=0.25p = 0.25.
    假设我们有一枚不公平的硬币,其正面概率为p=0.25p = 0.25.

  • How many heads will we get, if we flip the coin 10 times? 100 times? 1000 times?
    如果我们抛硬币 10 次,我们会得到多少个正面?抛 100 次?抛 1000 次?

    unfaircoin
    p = 0.25;
    n = 10;
    x = c(0,1)
    unfairCoinToss <- sample(x, size=n, replace=TRUE, prob = c(1-p,p))
    sum(unfairCoinToss)
    [1] 2
    
    n = 100;
    unfairCoinToss <- sample(x, size=n, replace=TRUE, prob = c(1-p,p))
    sum(unfairCoinToss)
    [1] 30
    
    n = 1000;
    unfairCoinToss <- sample(x, size=n, replace=TRUE, prob = c(1-p,p))
    sum(unfairCoinToss)
    [1] 241
    
  • Question: Do these results make sense?
    问题:这些结果有意义吗?

# In-class Exercise

  1. Suppose I have an unfair coin, whose probabilities of heads are p=0.6p = 0.6. Can you modify the code above to find the number of heads when flipping the code 1e61e6 times?

    假设我有一枚不公平的硬币,其正面概率为 p=0.6p = 0.6. 能不能修改上面的代码,求抛 1e61e6 次正面朝上的次数?

    p = 0.6
    n = 1e6
    x = c(0, 1)
    unfairCoinToss <- sample(x, size=n, replace=TRUE, prob = c(1-p,p))
    sum(unfairCoinToss)

# Binomial Distribution 二项分布

Binomial Experiment 二项式实验
Repeat Bernoulli experiment independently for a certain number (nn) of times, each repetition has the same possible outcomes (success or failure), and the probability of each success (pp) is consistent for all trials. This experiment is called Binomial trial.
独立重复伯努利实验一定数量(nn) 次,每次重复都有相同的可能结果(成功或失败),以及每次成功的概率(pp) 对于所有试验都是一致的。这个实验叫做二项式试验。
Binomial Random Variable 二项式随机变量
The number of XX of a binomial experiment is called a Binomial random variable.
二项式实验结果的发生次数 X 称为二项式随机变量。
Binomial Distribution 二项式分布
The distribution of a Binomial random variable is called the binomial distribution
二项式随机变量的分布称为二项式分布

# Examples: Number of heads in 40 coin flips, Number of hits in 20 free throws

示例:40 次抛硬币的正面次数,20 次罚球的命中次数

  • PMF: b(x;n,p)=(nx)px(1p)nxb(x;n,p) = {n\choose x}p^x(1-p)^{n-x}, x=0,1,,nx=0,1,\ldots,n

  • Expectation and Variance 期望值和方差值

    μ=E(X)=np,σ2=var(X)=np(1p).\mu = E(X) = np, \\ \sigma^2 = var(X) = np(1-p).

    • Let Yi,i=1,,nY_i, i = 1, \ldots, n denote the Bernoulli random variable.
      nn 表示伯努利随机变量。

    • This is becasue YiY_i are independent and identical distributed (iid).
      这是因为 YiY_i 是独立同分布的(iid)。

E(X) & = E(Y_1+\cdots+Y_n) = E(Y_1)+\cdots+E(Y_n) = np,\\ var(X) & = var(Y_1+\cdots+Y_n) = var(Y_1)+\cdots+var(Y_n) = np(1-p).

# Another Simple Example: We toss a fair coin 扔一枚公平的硬币

  • Each time of the tossing coin is a Bernoulli Experiment. Consider head is success (1).
    每次抛硬币都是伯努利实验。考虑头像面是成功(1)。

  • Suppose we toss this coin 10 times, how many heads can we get?
    假设我们抛这枚硬币 10 次,我们能得到几个正面?

    n <- 10 # number of random experiments
    x <- c(0,1) # sample space for tossing a coin, 0--Tail, 1--Head
    coinToss <- sample(x, size=n, replace=TRUE)
    # generate n outcomes of the experiments
    sum(coinToss)
    [1] 6
    
  • 1000000 times?

    n <- 10000000 # number of random experiments
    x <- c(0,1) # sample space for tossing a coin, 0--Tail, 1--Head
    coinToss <- sample(x, size=n, replace=TRUE)
    # generate n outcomes of the experiments
    sum(coinToss)
    [1] 5001636
    
  • Let's explore how this value varies in repeated experiments.
    这个值在重复实验中是如何变化的

    distrib-of-heads
    n <- 10000
    x <- c(0,1) 
    sumsInRepeatedSampling10Reps<-replicate(10, sum(sample(x, size=n, replace=TRUE)))
    sumsInRepeatedSampling100Reps<-replicate(100, sum(sample(x, size=n, replace=TRUE)))
    sumsInRepeatedSampling1000Reps<-replicate(1000, sum(sample(x, size=n, replace=TRUE)))
    par(mfrow=c(1,3))
    hist(sumsInRepeatedSampling10Reps,main = paste("Number of heads- 10 reps"))
    hist(sumsInRepeatedSampling100Reps,main = paste("Number of heads- 100 reps"))
    hist(sumsInRepeatedSampling1000Reps,main = paste("Number of heads- 1000 reps"))

# Conclusions 结论

The number of heads observed--we know it is XX--when tossing a coin nn times is a statistics computed from the sample of nn coin tosses.
头像面数 XX,抛硬币nn 次是从抛nn 次硬币样本计算的统计数据。

XX should follow the Binomial distribution.
XX 应该服从二项分布

The experiments above suggest that the sampling distribution of XX appears to have a mean around the expected number of heads when a fair coin is tossed, which is about n/2n/2 or npnp.
上述实验表明,抽样分布XX 当抛一枚公平的硬币时,似乎在预期的正面次数附近有一个平均值,大约是n/2n/2 或者npnp

The more times we repeat the experiment of nn coin tosses, the closer XX gets to its expected value -- this can be measured by looking at both the mean and the variance of XX
我们重复掷硬币的实验次数nn 越多,XX 离其预期值越近 —— 这可以通过查看 XX 的均值和方差来衡量

distrib-of-heads-continued
meansOfX<-c(mean(sumsInRepeatedSampling10Reps),
mean(sumsInRepeatedSampling100Reps),
mean(sumsInRepeatedSampling1000Reps))
varsOfX<-c(var(sumsInRepeatedSampling10Reps),var(sumsInRepeatedSampling100Reps),var(sumsInRepeatedSampling1000Reps))
meansOfX
[1] 4967.200 4993.110 5001.822
varsOfX
[1] 3494.400 2393.452 2355.378

Question: is it possible that something similar to this always happens?
问题:是否有可能总是发生类似的事情?

As we will see, the sampling distribution of XX is approximately normal with mean equal to the expected value of XX. In other words, the example above illustrates a known result--the Central Limit Theorem, one of the cornerstone results used in inference!
正如我们将看到的,抽样分布 XX 近似正态,均值等于期望值XX。 换句话说,上面的例子说明了一个已知的结果 —— 中心极限定理,这是推理中使用的基石结果之一!

# In-class Exercise: Roll A Die

  1. Suppose we have a fair die. Modify the code above to simulate rolling a die 10 times, 1000 times, and 100000 times.

    假设我们有一个公平的骰子。修改上面的代码来模拟掷骰子 10 次、1000 次和 100000 次。

    10 times
    n <- 10
    x <- 1:6
    dieRoll <- sample(x, size=n, replace=TRUE)
    barplot(table(dieRoll))
    mean(dieRoll)
    n <- 1000
    dieRoll <- sample(x, size=n, replace=TRUE)
    barplot(table(dieRoll))
    mean(dieRoll)
    n <- 100000
    dieRoll <- sample(x, size=n, replace=TRUE)
    barplot(table(dieRoll))
    mean(dieRoll)
  2. Do a barplot of the results in 1

    对 1 中的结果进行条形图



  3. Calculate the mean of the results in 1.

    计算 1 中结果的平均值。

    [1] 4.2
    [1] 3.435
    [1] 3.49603
    
  4. Set n=10000n = 10000 to calculate the mean value. Then repeat your similuation 10, 100, and 1000 times. Save the results and do a histogram.

    设定 n=10000n = 10000 计算平均值。然后重复模拟 10、100 和 1000 次。保存结果并绘制直方图。

    n <- 10000
    x = 1:6;
    meanInRepeatedSampling10Reps<-replicate(10, mean(sample(x, size=n, replace=TRUE)))
    meanInRepeatedSampling100Reps<-replicate(100, mean(sample(x, size=n, replace=TRUE)))
    meanInRepeatedSampling1000Reps<-replicate(1000, mean(sample(x, size=n, replace=TRUE)))
    par(mfrow=c(1,3))
    hist(meanInRepeatedSampling10Reps,main = paste("Mean of DieRolling- 10 reps"))
    hist(meanInRepeatedSampling100Reps,main = paste("Mean of DieRolling reps"))
    hist(meanInRepeatedSampling1000Reps,main = paste("Mean of DieRolling- 1000 reps"))

  5. One chanllenge question: Suppose this die is unfair. We can get number 1,3, and 5 with probability 1/91/9 and 2, 4, and 6 with probability 2/92/9. How can we do the similar experiments as the in-class exercise?

    假设这个骰子是不公平的。我们可以以1/91/9 概率得到数字 1、3 和 5, 2/92/9 的概率得到 2、4 和 6。 我们如何做类似的实验?