# Objectives

• Understand the difference between a population and a sample
• Know how to compute the sample mean, sample median, and sample variance, and how to interpret these measures
• Know how to create a histogram and a boxplot in R, and understand how to interpret these two plots

# Introduction to Statistics

What is statistics?
Statistics is the study of the collection, organization, analysis, and interpretation of data.

## Basics of statistics

Population
the entire group of individuals that we want information about.

Sample
a part of the population that we actually examine in order to gather information about the whole population.

## Types of statistics

Descriptive statistics
uses numerical and graphical methods to look for patterns in a data set, to summarize the information revealed in a data set, and to present that information in a convenient form.

Inferential statistics
uses a fact about a sample to estimate the truth about the whole population.

### Descriptive Statistics

There are two general types of data:

Qualitative data
observations that cannot be measured on a numerical scale; they can only be classified into one of a group of categories.

Example: species of fish, eye color, marital status, etc.

Quantitative data
measurements that are recorded on a naturally occurring numerical scale.

Example: height of a person, score on a test, etc.

# Numerical Methods

• Consider a quantitative dataset with $n$ observations, denoted by $\{x_1,\cdots,x_n\}$.

## Location Measures

Sample mean
the arithmetic average, denoted by $\bar{x}$.

The corresponding population parameter is the population mean, denoted by $\mu$.

$\bar{x} = \frac{x_1+\cdots+x_n}{n} = \frac{1}{n}\sum\limits_{i=1}^n x_i=\frac{1}{n}\sum x_i$

Sample median
the middle number when the observations are arranged in ascending order, denoted by $\tilde{x}$.

Writing the sorted observations as $x_{(1)} \le \cdots \le x_{(n)}$:

$\tilde{x} = \begin{cases} x_{((n+1)/2)}, & \text{ if } n \text{ is odd,} \\ \frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right), & \text{ if } n \text{ is even.} \end{cases}$
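As a minimal sketch of the definition above, the function below sorts the data and applies the odd/even rule directly. The name `sample_median` is a hypothetical helper introduced only for illustration; in practice R's built-in `median()` does the same job.

```r
# Hypothetical helper: sample median computed straight from the definition.
sample_median <- function(x) {
  x_sorted <- sort(x)            # arrange observations in ascending order
  n <- length(x_sorted)
  if (n %% 2 == 1) {
    x_sorted[(n + 1) / 2]        # odd n: the single middle observation
  } else {
    # even n: average the two middle observations
    (x_sorted[n / 2] + x_sorted[n / 2 + 1]) / 2
  }
}

odd_example  <- sample_median(c(4, -3, 0))     # three observations
even_example <- sample_median(c(1, 2, 3, 4))   # four observations
```

Comparing `sample_median(x)` with `median(x)` on a few vectors is a quick way to check the rule.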

The median is less sensitive than the mean to extremely large or small observations, which makes it a more robust measure of location when outliers are present.

For example, for the dataset $\{-3,0,4\}$, the sample mean is $1/3$ and the sample median is $0$.

If the observation 4 is changed to 18, then the sample mean becomes $5$, while the sample median stays unchanged as $0$.
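The comparison above can be checked in R with a short sketch. The variable names are illustrative only.

```r
x         <- c(-3, 0, 4)    # original dataset
x_changed <- c(-3, 0, 18)   # same data with the observation 4 replaced by 18

mean_before   <- mean(x)            # 1/3
mean_after    <- mean(x_changed)    # 5
median_before <- median(x)          # 0
median_after  <- median(x_changed)  # 0, unaffected by the extreme value
```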

## Variability Measures

Sample range
a measure of variability: the difference between the largest and smallest observations, $x_{\max} - x_{\min}$.

Sample variance
a measure of how spread out the observations are around the sample mean, denoted by $s^2$.

The corresponding population parameter is the population variance, denoted by $\sigma^2$.

$s^2 = \frac{(x_1-\bar{x})^2+\cdots+(x_n-\bar{x})^2}{n-1} = \frac{\sum\limits_{i=1}^n (x_i-\bar{x})^2}{n-1}$

Sample standard deviation
$s = \sqrt{s^2}$, a common way of measuring how far observations lie from the mean.

The corresponding population parameter is the population standard deviation, denoted by $\sigma$.

## Computing with R

• Consider a tiny dataset with three observations: 0, 1, and 5. Find the sample mean, sample median, sample variance, and sample standard deviation.

$\bar{x} = (0+1+5)/3 = 2$

$\tilde{x} = 1$

$s^2 = \frac{1}{3-1}\left[(0-2)^2+(1-2)^2+(5-2)^2\right] = \frac{4+1+9}{3-1} = 7$

$s = \sqrt{7}$
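The hand computation for the dataset 0, 1, 5 can be cross-checked in R. This is a small sketch using base functions; `s2_manual` is an illustrative name for the variance computed directly from the formula with the $n-1$ denominator.

```r
x <- c(0, 1, 5)

x_bar <- mean(x)       # sample mean
x_med <- median(x)     # sample median

# Sample variance from the definition: squared deviations divided by n - 1.
s2_manual <- sum((x - x_bar)^2) / (length(x) - 1)

s2 <- var(x)           # built-in sample variance (also uses n - 1)
s  <- sd(x)            # built-in sample standard deviation

x_range <- max(x) - min(x)   # sample range
```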

• How can we use R to find the sample mean, sample median, and sample variance?

Let's look at a small dataset, iris, which comes with the default installation of R. To see the sample mean and sample median of each numeric column, just type summary(iris).
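The call described above, as it might be typed at the R console. The last two lines are an added illustration showing the same location measures for a single column; the variable names are illustrative.

```r
# iris ships with base R (datasets package), so nothing needs to be installed.
summary(iris)

# Location measures for one column only:
sepal_mean   <- mean(iris$Sepal.Length)
sepal_median <- median(iris$Sepal.Length)
```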

Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
Median :5.800   Median :3.000   Median :4.350   Median :1.300
Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
Species
setosa    :50
versicolor:50
virginica :50

• Use var to find the variance of a single attribute, Petal.Width.

[1] 0.5810063
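The value shown above can be reproduced with a call like the following. The `$` operator selects one column of the data frame; the `sd` line is an extra, included only to show the matching standard deviation.

```r
petal_var <- var(iris$Petal.Width)   # sample variance (n - 1 denominator)
petal_sd  <- sd(iris$Petal.Width)    # sample standard deviation
petal_var
```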


# Graphical Methods

## Frequency table

The following table records the lifetimes of 40 similar car batteries, to the nearest tenth of a year. The batteries are guaranteed to last 3 years.

## Histogram

• Graphically displays the contents of the frequency table.

• The class intervals in the frequency table form the scale of the horizontal axis.

• A vertical bar is placed over each class interval, with height equal to either the class frequency or the class relative frequency.

• Used to plot the density of the data.

• How do we plot a histogram in R? Use the function hist. The following is a histogram of Petal.Width in the iris dataset.
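A minimal sketch of the points above, using Petal.Width. The class width of 0.5 and the plot labels are assumptions chosen for illustration, not values taken from the original notes.

```r
x <- iris$Petal.Width

# Default histogram; main and xlab are optional labels added for readability.
hist(x, main = "Histogram of Petal.Width", xlab = "Petal.Width")

# Frequency table on assumed class intervals (0, 0.5], (0.5, 1], ..., (2, 2.5].
breaks <- seq(0, 2.5, by = 0.5)
freq   <- table(cut(x, breaks = breaks))
freq

# The same class intervals passed to hist(): each bar's height is the
# corresponding class frequency from the table.
h <- hist(x, breaks = breaks,
          main = "Petal.Width with width-0.5 classes", xlab = "Petal.Width")
```

Passing `freq = FALSE` to `hist` switches the vertical axis from counts to density.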

## Boxplot

• Graphically depicts groups of numerical data through their quartiles.

first quartile (Q1, 25th percentile)
the middle number between the smallest number (not the “minimum”) and the median of the dataset.

third quartile (Q3, 75th percentile)
the middle value between the median and the highest value (not the “maximum”) of the dataset.

interquartile range (IQR)
the distance from the 25th to the 75th percentile, Q3 - Q1.

“maximum”
Q3 + 1.5*IQR

“minimum”
Q1 - 1.5*IQR

Observations beyond these two limits are drawn as individual points and treated as potential outliers.

• Use the boxplot function to create a boxplot of Petal Width.
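The sketch below draws the boxplot and computes the quantities defined above by hand. The names `upper_fence`, `lower_fence`, and `outliers` are illustrative. Note that `boxplot` computes its hinges through `boxplot.stats`, which can differ slightly from `quantile` on some datasets.

```r
x <- iris$Petal.Width

boxplot(x, main = "Boxplot of Petal.Width", ylab = "Petal.Width")

q1  <- unname(quantile(x, 0.25))   # first quartile
q3  <- unname(quantile(x, 0.75))   # third quartile
iqr <- q3 - q1                     # interquartile range

upper_fence <- q3 + 1.5 * iqr      # the "maximum" in the notes
lower_fence <- q1 - 1.5 * iqr      # the "minimum" in the notes

# Observations outside the fences are flagged as potential outliers.
outliers <- x[x < lower_fence | x > upper_fence]
```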

• par(mfrow = c(1, 2)) creates a 1x2 grid for plots: it divides the plot area into a grid so that several plots appear on the same page rather than separately. Try changing the 1 and the 2 to something else!
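For instance, a histogram and a boxplot of Petal.Width can be placed side by side as sketched below. Saving and restoring the previous settings through `old_par` is an added habit, not something the notes require.

```r
old_par <- par(mfrow = c(1, 2))   # 1 row, 2 columns; remember the old layout

hist(iris$Petal.Width, main = "Histogram", xlab = "Petal.Width")
boxplot(iris$Petal.Width, main = "Boxplot", ylab = "Petal.Width")

par(old_par)                      # restore the single-plot layout
```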

# In-class exercise

## Variance

1. Open RStudio on your machine
2. File > New File > R Markdown ...
3. Modify summary(cars) in the first code block to find the variance of Sepal.Length from the iris dataset
4. Click Knit HTML to produce an HTML file.
5. Save your Rmd file as InClassEx2.Rmd

## Histogram

1. Keep working on your Rmd file InClassEx2.Rmd
2. Use the hist function to create a histogram of Petal Length.

## Boxplot

1. Keep working on your Rmd file InClassEx2.Rmd

2. Use the boxplot function to create a boxplot of Petal Length

3. Write your conclusion about outliers in Petal.Length in your Rmd file

4. Use par(mfrow = c(1, 2)) to show the boxplots of Petal.Length and Petal.Width side by side

## A Comprehensive Exercise

1. Add the following code block to InClassEx2.Rmd

2. Find the mean, variance, and standard deviation of carBatteries

3. Find the mean, variance, and standard deviation of carBatteries when carBatteries 3

4. Create a histogram of carBatteries

5. Create a boxplot of carBatteries

# Lab

## 1. Hello World!

Here's an R code chunk that prints the text 'Hello world!'.
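The chunk itself did not survive in these notes; a minimal version, assuming the usual `print` call, might look like this:

```r
print("Hello world!")
```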

## 2. Creating sequences

We just learned about the c() function, which forms a vector from its arguments. If we're trying to build a vector containing a sequence of numbers, there are several useful tools at our disposal, including the colon operator : and the sequence function seq().

### seq function: seq(from, to, by)

To learn more about a function, type ?functionname into your console. E.g., ?seq pulls up a Help file with the R documentation for the seq function.
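A few illustrative calls, as a sketch; the variable names are arbitrary.

```r
one_to_ten <- 1:10                             # colon operator: integers 1 through 10
odds       <- seq(from = 1, to = 10, by = 2)   # 1, 3, 5, 7, 9
quarters   <- seq(0, 1, by = 0.25)             # steps of 0.25 from 0 to 1

# c() can combine sequences into a single vector.
combined <- c(1:3, seq(10, 30, by = 10))
```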

## 3. Cars data

We'll look at data frames and plotting in much more detail in later classes. For a preview of what's to come, here's a very basic example.

For this example we'll use a very simple dataset. The cars data comes with the default installation of R. To see the first few rows of the data, just type head(cars).

We'll do a bad thing here and use the attach() command, which will allow us to access the speed and dist columns of cars as though they were vectors in our workspace.
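A sketch of what this looks like in practice, with `detach` added to undo the attach afterwards and the `$` form shown as the alternative that avoids `attach` altogether. The variable names are illustrative.

```r
head(cars)                    # first six rows of the data frame

attach(cars)                  # speed and dist are now visible by name
first_speeds <- speed[1:5]
detach(cars)                  # undo the attach so the names do not linger

# Equivalent access without attach():
same_speeds <- cars$speed[1:5]
```

Part of the reason `attach` is considered bad practice is that attached columns can be masked by, or confused with, other objects of the same name in the workspace.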

### (a) Calculate the average and standard deviation of speed and distance.

We can easily produce a histogram of stopping distance using the qplot function.

### (b) Produce a histogram of stopping distance using the hist function with 10 bins.

The qplot(x, y, ...) function can also be used to plot a vector y against a vector x. You can type ?qplot into the Console to learn more about the basic qplot function.