# Objectives

• Understand the Bernoulli distribution and the Binomial distribution. Both describe experiments with binary outcomes (success or failure).
• Understand the interpretation of these distributions.
• Know how to simulate these two experiments and draw a barplot of the results.
• Know how to use `sample` to generate training and testing data sets.

# Bernoulli Distribution

Bernoulli Experiment
A single random experiment whose outcome is either success or failure, where the probability of success is $p$.

Examples: flipping a single coin; attempting a free throw.

Bernoulli Random Variable
A random variable that takes the value $1$ when the outcome of the Bernoulli trial is a success and $0$ when it is a failure.

Bernoulli Distribution
The distribution of a Bernoulli random variable is called the Bernoulli distribution.

• Probability Mass Function of the Bernoulli Distribution/Random Variable:

| $x$ | 0 | 1 |
|---|---|---|
| $f(x)$ | $1-p$ | $p$ |

• Expectation and Variance:

$$\mu = E(X) = (0)(1-p)+(1)p = p$$

$$\sigma^2 = var(X) = E(X^2) - \mu^2 = (0)^2(1-p)+(1)^2p - p^2 = p(1-p)$$
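As a quick numerical check of the two formulas above, the mean and variance can be computed directly from the PMF table. This is a minimal R sketch; the value `p <- 0.3` is an arbitrary illustrative choice and does not come from these notes.

```r
# Bernoulli PMF for an illustrative success probability (assumed value)
p <- 0.3
x <- c(0, 1)          # possible values of X
f <- c(1 - p, p)      # f(0) = 1 - p, f(1) = p

mu     <- sum(x * f)            # E(X), should equal p
sigma2 <- sum(x^2 * f) - mu^2   # var(X) = E(X^2) - mu^2, should equal p * (1 - p)

c(mean = mu, variance = sigma2)
```

Changing `p` and rerunning shows that the variance is largest at `p = 0.5` and shrinks toward 0 as `p` approaches 0 or 1.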

## A Simple Example: We toss a fair coin

• Each coin toss is a Bernoulli experiment. Treat heads as success (1).

• What is the probability $p$? The proportion of heads in a short run of simulated tosses gives a first estimate:

[1] 0.4

• What about 100 times?

[1] 0.5

• This seems close to $p$. Let's try $n = 1000000$.

[1] 0.5

• We can use a barplot to display the frequencies of heads and tails in `coinToss`.
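A minimal sketch of how such proportions can be simulated and plotted, assuming base R's `sample` with equal probabilities for 0 and 1. The helper `toss_coin`, the seed, and the choice of 10 tosses for the first run are assumptions made for illustration; only the name `coinToss` comes from the notes. Because the seed differs, the printed proportions will not match the values above exactly.

```r
set.seed(1)  # arbitrary seed, only so that reruns give the same tosses

# Hypothetical helper: n tosses of a fair coin, coded 1 = heads, 0 = tails
toss_coin <- function(n) sample(c(0, 1), size = n, replace = TRUE)

coinToss <- toss_coin(10)
mean(coinToss)            # proportion of heads in 10 tosses
mean(toss_coin(100))      # proportion of heads in 100 tosses
mean(toss_coin(1000000))  # proportion of heads in one million tosses

# Frequencies of tails (0) and heads (1) in coinToss, shown as a barplot
counts <- table(factor(coinToss, levels = c(0, 1)))
barplot(counts, names.arg = c("tails (0)", "heads (1)"), ylab = "Frequency")
```

Wrapping the values in `factor(..., levels = c(0, 1))` keeps both bars in the plot even if a short run happens to contain only heads or only tails.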

## The `sample` function is also used to split a data set

Summary of the full `cars` data set:

        speed           dist
    Min.   : 4.0   Min.   :  2.00
    1st Qu.:12.0   1st Qu.: 26.00
    Median :15.0   Median : 36.00
    Mean   :15.4   Mean   : 42.98
    3rd Qu.:19.0   3rd Qu.: 56.00
    Max.   :25.0   Max.   :120.00

Summary of the training subset:

        speed            dist
    Min.   : 4.00   Min.   :  2.0
    1st Qu.:12.00   1st Qu.: 26.0
    Median :15.00   Median : 35.0
    Mean   :15.19   Mean   : 41.9
    3rd Qu.:18.75   3rd Qu.: 54.0
    Max.   :24.00   Max.   :120.0

Summary of the testing subset:

        speed           dist
    Min.   : 7.0   Min.   :16.00
    1st Qu.:11.0   1st Qu.:23.50
    Median :18.0   Median :52.00
    Mean   :16.5   Mean   :48.62
    3rd Qu.:21.0   3rd Qu.:68.50
    Max.   :25.0   Max.   :85.00
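One way to produce a split like the one summarized above is to treat each row as a Bernoulli trial: a row goes to the training set with probability 0.8 and to the testing set otherwise. This is a sketch under that assumption, not necessarily the exact code behind the output; the names `idx`, `train`, and `test` and the seed are illustrative. With this approach the split is roughly, not exactly, 80/20.

```r
set.seed(1)  # arbitrary seed for reproducibility

# One TRUE/FALSE label per row: TRUE (training) with probability 0.8
idx <- sample(c(TRUE, FALSE), size = nrow(cars), replace = TRUE,
              prob = c(0.8, 0.2))

train <- cars[idx, ]    # rows labelled TRUE
test  <- cars[!idx, ]   # the remaining rows

summary(cars)
summary(train)
summary(test)
```

If an exact 80/20 split is preferred, drawing row numbers without replacement, for example `sample(nrow(cars), size = 0.8 * nrow(cars))`, fixes the training set size in advance.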


### In-class Exercise

1. Use the code above to divide the `iris` data set into two parts: $80\%$ in the training data set and $20\%$ in the testing data set.

 Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

2. Do some data exploration on these two data sets, e.g. `summary`, histogram, boxplot, or barplot.

 Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :40
1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:44
Median :5.800   Median :3.000   Median :4.400   Median :1.400   virginica :41
Mean   :5.853   Mean   :3.074   Mean   :3.783   Mean   :1.217
3rd Qu.:6.400   3rd Qu.:3.400   3rd Qu.:5.100   3rd Qu.:1.800
Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

 Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
Min.   :4.400   Min.   :2.200   Min.   :1.200   Min.   :0.200   setosa    :10
1st Qu.:5.000   1st Qu.:2.600   1st Qu.:1.500   1st Qu.:0.200   versicolor: 6
Median :5.700   Median :3.000   Median :4.000   Median :1.300   virginica : 9
Mean   :5.796   Mean   :2.976   Mean   :3.632   Mean   :1.112
3rd Qu.:6.300   3rd Qu.:3.200   3rd Qu.:5.200   3rd Qu.:1.800
Max.   :7.700   Max.   :3.900   Max.   :6.700   Max.   :2.300


## If we have an unfair coin

What should we do?

• Suppose we have an unfair coin whose probability of heads is $p = 0.25$.

• How many heads will we get if we flip the coin 10 times? 100 times? 1000 times?

[1] 2

[1] 30

[1] 241

• Question: Do these results make sense?
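Counts like the three above can be simulated by passing unequal probabilities to `sample` through its `prob` argument. The helper `count_heads` and the seed are assumptions for illustration. For comparison, the expected counts $np$ are 2.5, 25, and 250.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical helper: number of heads in n tosses of a coin with P(heads) = p
count_heads <- function(n, p) {
  sum(sample(c(0, 1), size = n, replace = TRUE, prob = c(1 - p, p)))
}

count_heads(10, 0.25)
count_heads(100, 0.25)
count_heads(1000, 0.25)
```

The order of `prob` must match the order of the values: `1 - p` belongs to 0 (tails) and `p` to 1 (heads).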

### In-class Exercise

1. Suppose I have an unfair coin whose probability of heads is $p = 0.6$. Can you modify the code above to find the number of heads when flipping the coin $10^6$ times?

# Binomial Distribution

Binomial Experiment
Repeat a Bernoulli experiment independently a fixed number ($n$) of times, where every repetition has the same two possible outcomes (success or failure) and the same probability of success $p$. The whole experiment is called a Binomial experiment.

Binomial Random Variable
The number of successes $X$ in a Binomial experiment is called a Binomial random variable.

Binomial Distribution
The distribution of a Binomial random variable is called the Binomial distribution.

## Examples: Number of heads in 40 coin flips, number of hits in 20 free throws

• PMF: $b(x;n,p) = {n\choose x}p^x(1-p)^{n-x}$, $x=0,1,\ldots,n$

• Expectation and Variance:

$$\mu = E(X) = np$$

$$\sigma^2 = var(X) = np(1-p)$$

• Let $Y_i$, $i = 1, \ldots, n$, denote the Bernoulli random variable for the $i$-th trial, so that $X = Y_1 + \cdots + Y_n$.

• The formulas for the mean and variance hold because the $Y_i$ are independent and identically distributed (iid), each with $E(Y_i) = p$ and $var(Y_i) = p(1-p)$:

$$\begin{aligned} E(X) &= E(Y_1+\cdots+Y_n) = E(Y_1)+\cdots+E(Y_n) = np,\\ var(X) &= var(Y_1+\cdots+Y_n) = var(Y_1)+\cdots+var(Y_n) = np(1-p). \end{aligned}$$
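The PMF and the two moment formulas can be checked numerically with base R's `dbinom`. This is a minimal sketch; `n <- 40` and `p <- 0.5` are illustrative values borrowed from the coin-flip example, and the variable names are assumptions.

```r
n <- 40
p <- 0.5
x <- 0:n

pmf <- dbinom(x, size = n, prob = p)   # b(x; n, p) for x = 0, 1, ..., n

sum(pmf)                               # the probabilities add up to 1
mu     <- sum(x * pmf)                 # E(X), should equal n * p
sigma2 <- sum((x - mu)^2 * pmf)        # var(X), should equal n * p * (1 - p)

c(mean = mu, variance = sigma2)
```

The same `pmf` vector can be compared with the closed form `choose(n, x) * p^x * (1 - p)^(n - x)` to confirm that `dbinom` implements the formula given above.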

## Another Simple Example: We toss a fair coin

• Each coin toss is a Bernoulli experiment. Treat heads as success (1).

• Suppose we toss this coin 10 times. How many heads do we get?

[1] 6

• 10000000 times?

[1] 5001636

• Let's explore how this value varies in repeated experiments.
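The steps above can be sketched as follows: count the heads in one run of tosses, then repeat the whole run many times to see how the count varies. The helper `heads_in`, the seed, and the choice of 1000 repetitions are assumptions for illustration.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical helper: X = number of heads in n tosses of a fair coin
heads_in <- function(n) sum(sample(c(0, 1), size = n, replace = TRUE))

heads_in(10)   # one realization of X for n = 10

# Repeat the 10-toss experiment 1000 times and record X each time
X <- replicate(1000, heads_in(10))
table(X)       # how often each number of heads occurred

hist(X, breaks = seq(-0.5, 10.5, by = 1),
     main = "Heads in 10 tosses, 1000 repetitions", xlab = "Number of heads")
```

Base R's `rbinom(1000, size = 10, prob = 0.5)` draws from the same distribution in a single call and is considerably faster for large `n`.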

## Conclusions

The number of heads observed when tossing a coin $n$ times, which we denote by $X$, is a statistic computed from the sample of $n$ coin tosses.

$X$ follows the Binomial distribution.

The experiments above suggest that the sampling distribution of $X$ is centered near the expected number of heads for a fair coin, $np = n/2$.

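The more times we repeat the experiment of $n$ coin tosses, the closer the sample mean and sample variance of the observed values of $X$ get to $E(X) = np$ and $var(X) = np(1-p)$. The values below are consistent with $n = 10000$ tosses per experiment, for which $np = 5000$ and $np(1-p) = 2500$.

Sample means of $X$:

[1] 4967.200 4993.110 5001.822

Sample variances of $X$:

[1] 3494.400 2393.452 2355.378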
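A sketch of how means and variances of this kind can be computed, assuming $n = 10000$ tosses per experiment and 10, 100, and 1000 repetitions. Both of these settings are assumptions inferred from the magnitude of the numbers and from the exercise that follows, and `rbinom` is used here in place of summing individual tosses.

```r
set.seed(1)  # arbitrary seed for reproducibility
n <- 10000   # tosses per experiment (assumed)
p <- 0.5     # fair coin

reps <- c(10, 100, 1000)   # number of repeated experiments (assumed)

# rbinom(r, n, p) draws r independent Binomial(n, p) head counts
sims <- lapply(reps, function(r) rbinom(r, size = n, prob = p))

sapply(sims, mean)   # sample means, to compare with n * p = 5000
sapply(sims, var)    # sample variances, to compare with n * p * (1 - p) = 2500
```

With more repetitions both the sample mean and the sample variance tend to settle closer to their theoretical values, although any single run can still deviate.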


Question: Is it possible that something similar to this always happens?

As we will see, the sampling distribution of $X$ is approximately normal with mean equal to the expected value of $X$. In other words, the example above illustrates a known result: the Central Limit Theorem, one of the cornerstone results used in inference!

### In-class Exercise: Roll A Die

1. Suppose we have a fair die. Modify the code above to simulate rolling a die 10 times, 1000 times, and 100000 times.

2. Draw a barplot of the results from step 1.

3. Calculate the mean of the results from step 1.

[1] 4.2
[1] 3.435
[1] 3.49603

4. Set $n = 10000$ and calculate the mean value. Then repeat your simulation 10, 100, and 1000 times. Save the results and draw a histogram.

5. One challenge question: Suppose this die is unfair. We get each of the numbers 1, 3, and 5 with probability $1/9$ and each of 2, 4, and 6 with probability $2/9$. How can we run experiments similar to those in this in-class exercise?