# Objectives 目标

  • Understand the difference between a population and a sample
    了解总体与样本之间的差异
  • Know how to compute the sample mean, sample median, and sample variance, and how to interpret these measures
    知道如何计算样本均值、样本中位数和样本方差,以及如何解释这些度量
  • Know how to create a histogram and a boxplot in R, and how to interpret these two plots
    知道如何在 R 中创建直方图和箱线图,并理解这两种图的含义

# Introduction to Statistics 统计学概论

What is statistics? 什么是统计?
Statistics is the study of the collection, organization, analysis, and interpretation of data.
统计学是对数据的收集、组织、分析和解释的研究。

# Basics of statistics 统计学基础

Population 人群
the entire group of individuals that we want information about.
我们想要了解的整个个人群体。
Sample 样本
a part of the population that we actually examine in order to gather information about the whole population.
我们实际检查的一部分人口,以收集有关整个人口的信息。

# Types of statistics 统计类型

Descriptive statistics 描述性统计
utilizes numerical and graphical methods to look for patterns in a data set, to summarize the information revealed in a data set, and to present the information in a convenient form.
利用数值和图形方法在数据集中寻找模式,总结数据集中揭示的信息,并以方便的形式呈现信息。
Inferential statistics 推论统计
use a fact about a sample to estimate the truth about the whole population.
使用关于样本的事实来估计关于整个总体的真相。
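
The relationship between a population, a sample, and an estimate can be sketched in R. This is only a toy illustration: it pretends that the 150 values of iris$Sepal.Length are the whole population, and the seed and the sample size of 30 are arbitrary choices, not part of the course material.

```r
# Toy setup (assumption): treat all 150 sepal lengths as the "population"
population <- iris$Sepal.Length
mu <- mean(population)                 # population mean, normally unknown

set.seed(1)                            # arbitrary seed, only for reproducibility
smp <- sample(population, size = 30)   # simple random sample without replacement
x_bar <- mean(smp)                     # sample mean, used as an estimate of mu

c(population_mean = mu, sample_mean = x_bar)
```

The two numbers are usually close but not equal; a different seed gives a different sample and a slightly different estimate.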

# Descriptive Statistics 描述性统计

There are generally two types of data:
一般有两种类型的数据:

Qualitative data 定性数据
observations that cannot be measured on a numerical scale. They can only be classified into one of a group of categories.
无法在数字尺度上测量的观察结果。它们只能归入一组类别中的一个。
Example: species of fish, eye color, marital status etc.
例如:鱼的种类、眼睛颜色、婚姻状况等。
Quantitative data 定量数据
measurements that are recorded on a naturally occurring numerical scale.
以自然发生的数字标度记录的测量值。
Example: height of person, score of test, etc.
例如:人的身高、考试成绩等。

# Numerical Methods 数值方法

  • Consider a quantitative dataset with $n$ observations, denoted as $\{x_1,\cdots,x_n\}$.
    考虑一个包含 $n$ 个观测值的定量数据集,记为 $\{x_1,\cdots,x_n\}$。

# Location Measures 位置测量

Sample Mean 样本均值
arithmetic average, denoted by $\bar{x}$.
算术平均值,用 $\bar{x}$ 表示。
The corresponding population parameter is the population mean, denoted by $\mu$.
对应的总体参数为总体均值,表示为 $\mu$。

$$\bar{x} = \frac{x_1+\cdots+x_n}{n} = \frac{1}{n}\sum\limits_{i=1}^n x_i = \frac{1}{n}\sum x_i$$

Sample median 样本中位数
middle number when the observations are arranged in ascending order, denoted by $\tilde{x}$.
当观测值按升序排列时的中间数,记为 $\tilde{x}$。

$$\tilde{x} = \begin{cases} x_{(n+1)/2}, & \text{if } n \text{ is odd,} \\ \frac{1}{2}\left(x_{n/2} + x_{n/2+1}\right), & \text{if } n \text{ is even.} \end{cases}$$

The median is less sensitive than the mean to extremely large or small observations, which makes it the more robust measure of location when outliers are present.
中位数对极大或极小的观察值不如均值敏感,因此在存在离群值时是更稳健的位置度量。
For example, for the dataset $\{-3,0,4\}$, the sample mean is $1/3$ and the sample median is $0$.
例如,对于数据集 $\{-3,0,4\}$,样本均值为 $1/3$,样本中位数为 $0$。
If the observation 4 is changed to 18, then the sample mean becomes $5$, while the sample median stays unchanged at $0$.
如果将观测值 4 更改为 18,则样本均值变为 $5$,而样本中位数保持不变,仍为 $0$。
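
The example above can be checked directly in R with the built-in mean and median functions. The variable names x and x_out are only illustrative.

```r
x <- c(-3, 0, 4)
mean(x)        # 1/3
median(x)      # 0

x_out <- c(-3, 0, 18)   # the observation 4 replaced by 18
mean(x_out)    # 5
median(x_out)  # 0
```

One extreme observation moves the mean from 1/3 to 5, while the median does not move at all.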

# Variability Measures 变异性度量

Sample range 样本范围
measure of variability, $x_{\max} - x_{\min}$
变异性的度量,$x_{\max} - x_{\min}$
Sample variance 样本方差
measure of variability (spread), denoted by $s^2$.
变异性(分散程度)的度量,用 $s^2$ 表示。
The corresponding population parameter is the population variance, denoted as $\sigma^2$.
对应的总体参数为总体方差,记为 $\sigma^2$。

$$s^2 = \frac{(x_1-\bar{x})^2+\cdots+(x_n-\bar{x})^2}{n-1} = \frac{\sum\limits_{i=1}^n (x_i-\bar{x})^2}{n-1}$$

Sample standard deviation 样本标准差
$s = \sqrt{s^2}$, a common way of measuring how far observations are from the mean.
测量观测值与均值相差多远的常用方法。
The corresponding population parameter is the population standard deviation, denoted as $\sigma$.
对应的总体参数为总体标准差,记为 $\sigma$。
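
As a minimal sketch, the variability measures defined above can be computed step by step from their formulas and compared with R's built-in var and sd. The dataset $\{-3,0,4\}$ is reused from the median example; the variable names are illustrative.

```r
x <- c(-3, 0, 4)                     # dataset from the median example
n <- length(x)
x_bar <- sum(x) / n                  # sample mean
s2 <- sum((x - x_bar)^2) / (n - 1)   # sample variance, divisor n - 1
s <- sqrt(s2)                        # sample standard deviation
x_range <- max(x) - min(x)           # sample range

all.equal(s2, var(x))   # TRUE: var() also divides by n - 1
all.equal(s, sd(x))     # TRUE
```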

# Computing with R 使用 R 来计算

  • Consider a tiny dataset with three observations 0, 1 and 5. Find the sample mean, sample median, sample variance, and sample standard deviation.
    考虑一个包含三个观测值 0、1 和 5 的小数据集。找出样本均值、样本中值、样本方差和样本标准差。

    $\bar{x} = (0+1+5)/3 = 2$

    $\tilde{x} = 1$

    $s^2 = \frac{1}{3-1}\left[(0-2)^2+(1-2)^2+(5-2)^2\right] = \frac{4+1+9}{3-1} = 7$

    $s = \sqrt{7}$

  • How can we use R to find the sample mean, sample median, and sample variance?
    如何使用 R 求样本均值、样本中位数和样本方差?

    Let's look at a small dataset, iris, which comes with the default installation of R. To check the sample mean and sample median of each numeric column, just type summary(iris).
    我们来看一个小数据集 iris,它随 R 的默认安装一起提供。要查看各数值列的样本均值和样本中位数,只需键入 summary(iris)。

    # Generate mean, median, percentile for numeric attributes and frequency for categorical attributes
    summary(iris)
    Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
    Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
    1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
    Median :5.800   Median :3.000   Median :4.350   Median :1.300  
    Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
    3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
    Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
        Species  
    setosa    :50  
    versicolor:50  
    virginica :50  
    
  • Use var to find the variance of a single attribute, Petal.Width.
    使用 var 找到属性 Petal.Width 的方差。

    # Generate variance for numeric attributes
    var(iris$Petal.Width)
    [1] 0.5810063
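
The hand calculation for the tiny dataset 0, 1, 5 can be confirmed with the same built-in functions, and var and sd can be applied to every numeric column of iris at once. The helper name num_cols is only illustrative.

```r
x <- c(0, 1, 5)
mean(x)     # 2
median(x)   # 1
var(x)      # 7
sd(x)       # sqrt(7), about 2.645751

# Variance and standard deviation of every numeric column of iris
num_cols <- iris[, sapply(iris, is.numeric)]
sapply(num_cols, var)
sapply(num_cols, sd)
```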
    

# Graphical Methods 图形方法

# Frequency table 频率表

The following is a table, which specifies the life of 40 similar car batteries recorded to the nearest tenth of a year. The batteries are guaranteed to last 3 years.
下表列出了 40 个类似汽车电池的寿命,记录到最接近的十分之一年。电池保证可以使用 3 年。
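
A frequency table of this kind can be built in R with cut and table. The sketch below assumes that the 40 battery lives are the values stored in the carBatteries vector used in the comprehensive exercise later in these notes, and it assumes class intervals of width 0.5 starting at 1.5; both are choices made for illustration.

```r
# Assumption: the same 40 battery lives as the carBatteries vector used later
carBatteries <- c(
  2.2, 4.1, 3.5, 4.5, 3.2, 3.7, 3.0, 2.6, 3.4, 1.6, 3.1, 3.3, 3.8, 3.1, 4.7, 3.7, 2.5,
  4.3, 3.4, 3.6, 2.9, 3.3, 3.9, 3.1, 3.3, 3.1, 3.7, 4.4, 3.2, 4.1, 1.9, 3.4, 4.7,
  3.8, 3.2, 2.6, 3.9, 3.0, 4.2, 3.5
)

# Class intervals [1.5, 2.0), [2.0, 2.5), ..., [4.5, 5.0)
breaks <- seq(1.5, 5.0, by = 0.5)
classes <- cut(carBatteries, breaks = breaks, right = FALSE)

freq <- table(classes)                     # class frequency
rel_freq <- freq / length(carBatteries)    # class relative frequency

data.frame(class = names(freq),
           frequency = as.vector(freq),
           relative_frequency = as.vector(rel_freq))
```

Setting right = FALSE makes each interval include its left endpoint, so a life of exactly 3.0 years is counted in [3.0, 3.5).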

# Histogram 直方图

  • Graphically displays the contents in the frequency table
    以图形方式显示频率表中的内容

    • The class intervals in the frequency table form the scale of the horizontal axis.
      频率表中的类间隔形成了横轴的刻度。

    • A vertical bar is placed over each class interval, with height equal to either the class frequency or class relative frequency.
      在每个组距上放置一个垂直条,其高度等于该组的频数或相对频率。

    • Used to plot the density of the data
      用于绘制数据的密度

  • How do we plot a histogram in R? Use the function hist. The following is a histogram of Petal.Width in the iris dataset.
    如何在 R 中绘制直方图?使用函数 hist。下面是 iris 数据集中 Petal.Width 的直方图。

    # Histogram of Petal Width with 10 bins (11 break points)
    hist(iris$Petal.Width, breaks = seq(0, 2.5, length.out = 11), col = "blue", main = "Histogram of Petal Width", xlab = "Petal Width")

    # Histogram of Petal Width with 20 bins (21 break points)
    hist(iris$Petal.Width, breaks = seq(0, 2.5, length.out = 21), col = "blue", main = "Histogram of Petal Width", xlab = "Petal Width")
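
To see how the bars relate to a frequency table, hist can be asked to return the numbers behind the plot instead of drawing it. This is a small sketch using the plot = FALSE and freq = FALSE arguments of hist; the names breaks and h are illustrative.

```r
breaks <- seq(0, 2.5, length.out = 11)   # 11 break points give 10 bins
h <- hist(iris$Petal.Width, breaks = breaks, plot = FALSE)

h$counts                          # class frequencies (bar heights by default)
h$counts / sum(h$counts)          # class relative frequencies
sum(h$density * diff(h$breaks))   # 1: on the density scale the bars have total area one

# The same histogram drawn on the density scale
hist(iris$Petal.Width, breaks = breaks, freq = FALSE,
     main = "Histogram of Petal Width", xlab = "Petal Width")
```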

# Boxplot 箱形图

  • Graphically depicting groups of numerical data through their quartiles.
    通过四分位数以图形方式描述数字数据组。

    first quartile Q1/25th Percentile 第一个四分位数
    the middle number between the smallest number (not the “minimum”) and the median of the dataset.
    最小数字(不是 “最小值”)和数据集的中位数之间的中间数字。
    third quartile Q3/75th Percentile 第三四分位数
    the middle value between the median and the highest value (not the “maximum”) of the dataset.
    数据集的中位数和最大值(不是 “最大值”)之间的中间值。
    interquartile range IQR 四分位距
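    the distance from the 25th to the 75th percentile, IQR = Q3 - Q1.
    (IQR):第 25 到第 75 个百分位数之间的距离,即 Q3 - Q1。
    “maximum”
    Q3 + 1.5*IQR; observations above this limit are drawn individually as outliers.
    “minimum”
    Q1 - 1.5*IQR; observations below this limit are drawn individually as outliers.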
  • Use the boxplot function to create a boxplot of Petal Width.
    使用 boxplot 函数创建花瓣宽度的箱线图。

    boxplot(iris$Petal.Width, horizontal = TRUE)

  • par(mfrow = c(1, 2)) creates a grid of size 1x2 for plots; it divides the plot area into a grid so you see several plots on the same page as opposed to separately. Try changing the 1 and the 2 to something else!
    par(mfrow = c(1, 2)) 为绘图创建一个大小为 1x2 的网格;它将绘图区域划分为一个网格,因此你可以在同一页面上看到多个绘图,而不是单独显示。尝试将 1 和 2 更改为其他内容!

    par(mfrow = c(1, 2))
    boxplot(iris$Sepal.Length)
    boxplot(iris$Sepal.Width)
    abline(h = min(iris$Sepal.Width), col = "Blue")
    abline(h = max(iris$Sepal.Width), col = "Yellow")
    abline(h = median(iris$Sepal.Width), col = "Green")
    abline(h = quantile(iris$Sepal.Width, c(0.25, 0.75)), col = "Red")
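
The quantities that define a boxplot can also be computed numerically. The sketch below uses quantile for Q1 and Q3; note that R's boxplot derives its box from hinges (see fivenum), which can differ slightly from these quantiles on some datasets. Variable names are illustrative.

```r
x <- iris$Sepal.Width

q1 <- unname(quantile(x, 0.25))   # first quartile
q3 <- unname(quantile(x, 0.75))   # third quartile
iqr <- q3 - q1                    # interquartile range, same as IQR(x)

upper <- q3 + 1.5 * iqr           # the "maximum" limit
lower <- q1 - 1.5 * iqr           # the "minimum" limit

# Observations outside the limits are the candidates for outliers
outliers <- x[x < lower | x > upper]
outliers
```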

# In-class exercise 课堂练习

# Variance 方差

  1. Open RStudio on your machine
  2. File > New File > R Markdown ...
  3. Modify summary(cars) in the first code block to find the variance of Sepal.Length
    在第一个代码块中修改 summary(cars) 以找到 Sepal.Length 的方差
  4. Click Knit HTML to produce an HTML file.
  5. Save your Rmd file as InClassEx2.Rmd

# Histogram 直方图

  1. Keep working on your Rmd file InClassEx2.Rmd
    继续处理你的 Rmd 文件 InClassEx2.Rmd
  2. Use the hist function to create a histogram of Petal Length.
    使用该 hist 函数创建花瓣长度的直方图。

# Boxplot 箱线图

  1. Keep working on your Rmd file InClassEx2.Rmd

  2. Use the boxplot function to create a boxplot of Petal Length
    使用该 boxplot 函数创建花瓣长度的箱线图

  3. Type your conclusion of outliers of Petal.Length in your Rmd file
    在你的 Rmd 文件中输入 Petal.Length 离群值的结果

  4. Use par(mfrow = c(1, 2)) to combine the boxplot of Petal.Length and Petal.Width
    使用 par(mfrow = c(1, 2)) 结合 Petal.LengthPetal.Width 到箱线

# A Comprehensive Exercise 综合练习

  1. Add the following code block to InClassEx2.Rmd
    将以下代码块添加到 InClassEx2.Rmd

    carBatteries <- c(
    2.2,4.1,3.5, 4.5, 3.2, 3.7, 3.0, 2.6, 3.4, 1.6, 3.1, 3.3, 3.8, 3.1, 4.7, 3.7, 2.5, 
    4.3, 3.4, 3.6, 2.9, 3.3, 3.9, 3.1, 3.3, 3.1, 3.7, 4.4, 3.2, 4.1, 1.9, 3.4, 4.7, 
    3.8, 3.2, 2.6, 3.9, 3.0, 4.2, 3.5
    )
  2. Find the mean, variance, and standard deviation of carBatteries

    mean(carBatteries)
    var(carBatteries)
    sd(carBatteries)
  3. Find the mean, variance, and standard deviation of carBatteries when carBatteries > 3

    carBattGT3 <- carBatteries[carBatteries>3]
    mean(carBattGT3)
    var(carBattGT3)
    sd(carBattGT3)
  4. Create a histogram of carBatteries
    创建 carBatteries 的 histogram

    hist(carBatteries)
  5. Create a boxplot of carBatteries
    创建 carBatteries 的 boxplot
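
One possible way to complete the last step is sketched below. The horizontal layout, the title, and the dashed reference line at the 3-year guarantee are optional choices made for illustration, not requirements of the exercise.

```r
carBatteries <- c(
  2.2, 4.1, 3.5, 4.5, 3.2, 3.7, 3.0, 2.6, 3.4, 1.6, 3.1, 3.3, 3.8, 3.1, 4.7, 3.7, 2.5,
  4.3, 3.4, 3.6, 2.9, 3.3, 3.9, 3.1, 3.3, 3.1, 3.7, 4.4, 3.2, 4.1, 1.9, 3.4, 4.7,
  3.8, 3.2, 2.6, 3.9, 3.0, 4.2, 3.5
)

boxplot(carBatteries, horizontal = TRUE,
        main = "Boxplot of Car Battery Life", xlab = "Life (years)")
abline(v = 3, lty = 2)      # dashed line at the 3-year guarantee

median(carBatteries)        # 3.4, the thick line inside the box
```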

# Lab

# 1. Hello World!

Here's an R code chunk that prints the text 'Hello world!'.

print("Hello world!")

# (a) Modify the code chunk below to print your name

print("Mayuri Mizuki")

# 2. Creating sequences

We just learned about the c() operator, which forms a vector from its arguments. If we're trying to build a vector containing a sequence of numbers, there are several useful functions at our disposal. These are the colon operator : and the sequence function seq() .

# : Colon operator:

1:10 # Numbers 1 to 10
127:132 # Numbers 127 to 132

# seq function: seq(from, to, by)

seq(1,10,1) # Numbers 1 to 10
seq(1,10,2) # Odd numbers from 1 to 10
seq(2,10,2) # Even numbers from 2 to 10

To learn more about a function, type ?functionname into your console. E.g., ?seq pulls up a Help file with the R documentation for the seq function.
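
Besides the by argument, seq also accepts length.out, which fixes the number of values instead of the step size; this is the form used for the histogram breaks earlier in these notes. A few short examples, with illustrative variable names:

```r
a <- seq(0, 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00
b <- seq(10, 1, by = -3)         # 10 7 4 1 (a negative step counts down)
k <- seq_len(5)                  # 1 2 3 4 5

a
b
k
```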

# (a) Use : to output the sequence of numbers from 3 to 12

3:12

# (b) Use seq() to output the sequence of numbers from 3 to 30 in increments of 3

seq(3,30,3)

# (c) Save the sequence from (a) as a variable x , and the sequence from (b) as a variable y . Output their product x*y

x <- 3:12
y <- seq(3,30,3)
x*y

# 3. Cars data

We'll look at data frames and plotting in much more detail in later classes. For a preview of what's to come, here's a very basic example.

For this example we'll use a very simple dataset. The cars data comes with the default installation of R. To see the first few rows of the data, just type head(cars).

head(cars)

We'll do a bad thing here and use the attach() command, which will allow us to access the speed and dist columns of cars as though they were vectors in our workspace.

attach(cars) # Using this command is poor style.  We will avoid it in the future.
speed
dist
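
For comparison, here is a sketch of two common ways to reach the same columns without attach(): the $ operator and the with() function. The variable names are illustrative.

```r
speed_mean <- mean(cars$speed)       # column access with $
dist_mean <- with(cars, mean(dist))  # evaluate an expression inside the data frame

speed_mean   # 15.4
dist_mean    # 42.98
```

If cars has already been attached, detach(cars) removes it from the search path again.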

# (a) Calculate the average and standard deviation of speed and distance.

mean(speed) # average of speed
mean(dist) # average of distance
sd(speed) # standard deviation of speed
sd(dist) # standard deviation of distance

We can easily produce a histogram of stopping distance using the qplot function from the ggplot2 package.

library(ggplot2) # qplot() is provided by the ggplot2 package
qplot(dist, bins = 40) # Histogram of stopping distance

# (b) Produce a histogram of stopping distance using the hist function with 10 bins.

hist(dist, breaks = seq(min(dist), max(dist), length.out = 11), col = "pink", main = "Histogram of Stopping Distance", xlab = "Distance")

The qplot(x,y,...) function can also be used to plot a vector y against a vector x . You can type ?qplot into the Console to learn more about the basic qplot function.

# (c) Use the qplot(x,y) function to create a scatterplot of dist against speed.

qplot(speed, dist)
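
If ggplot2 is not installed, or if qplot() produces a deprecation warning (recent ggplot2 releases mark it as deprecated), a similar scatterplot can be drawn with base R. This is a sketch of an alternative, not a replacement for the exercise; the axis labels and title are illustrative.

```r
# Base R scatterplot of stopping distance against speed; no attach() needed
n_points <- nrow(cars)
plot(dist ~ speed, data = cars,
     xlab = "Speed", ylab = "Stopping distance",
     main = "Stopping Distance vs. Speed")
```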

# (d) Use the boxplot function to create a boxplot of speed.

boxplot(speed, horizontal = TRUE)