# Part One

## 1. Types of random variables

### Instructions

*Review the definition of discrete and continuous random variables.*

A **variable** is a quantity whose value changes.

A **discrete variable** is a variable whose value is obtained by counting. Examples:

- number of students present
- number of red marbles in a jar
- number of heads when flipping three coins
- students' grade level

A **continuous variable** is a variable whose value is obtained by measuring. Examples:

- height of students in class
- weight of students in class
- time it takes to get to school
- distance traveled between classes

A **random variable** is a variable whose value is a numerical outcome of a random phenomenon.

- A random variable is denoted with a capital letter.
- The probability distribution of a random variable $X$ tells what the possible values of $X$ are and how probabilities are assigned to those values.
- A random variable can be discrete or continuous.

A **discrete random variable** $X$ has a countable number of possible values.

- Example: Let $X$ represent the sum of two dice.
- To graph the probability distribution of a discrete random variable, construct a probability histogram.
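
For readers working in Python, the two-dice example can be sketched as follows. This is a minimal illustration that assumes two fair six-sided dice; the names `counts` and `distribution` are chosen here for illustration and are not part of the course material.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely outcomes of two fair six-sided dice
# and count how often each sum occurs.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Convert the counts to exact probabilities P(X = x).
distribution = {x: Fraction(n, 36) for x, n in sorted(counts.items())}

# Text version of the probability histogram: one '#' per outcome.
for x, p in distribution.items():
    print(f"{x:2d}  {'#' * counts[x]:<6}  {p}")
```

The bars rise from one outcome at $X = 2$ to six outcomes at $X = 7$ and fall back to one at $X = 12$, and the probabilities sum to 1.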

A **continuous random variable** $X$ takes all values in a given interval of numbers.

- The probability distribution of a continuous random variable is shown by a **density curve**.
- The probability that $X$ lies in an interval of numbers is the area under the density curve between the interval endpoints.
- The probability that a **continuous random variable** $X$ is exactly equal to a number is zero.
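
The "area under the density curve" idea can be checked numerically. The sketch below assumes, purely for illustration, that $X$ is uniform on the interval from 0 to 10; the functions `density` and `area_under_density` are hypothetical helpers, not part of any library.

```python
def density(x, low=0.0, high=10.0):
    """Uniform density on [low, high]: constant height 1 / (high - low)."""
    return 1.0 / (high - low) if low <= x <= high else 0.0


def area_under_density(a, b, steps=10_000):
    """Approximate P(a <= X <= b) with the midpoint rule."""
    width = (b - a) / steps
    return sum(density(a + (i + 0.5) * width) for i in range(steps)) * width


p_interval = area_under_density(2.0, 5.0)  # rectangle of width 3 and height 0.1
p_point = area_under_density(4.0, 4.0)     # zero-width interval

print(round(p_interval, 6), p_point)
```

The interval from 2 to 5 gets probability 0.3, while the single point 4 gets probability 0 because an interval of zero width encloses no area.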

### Tasks

Classify the following random variables as discrete or continuous:

X: the number of automobile accidents per year in Virginia.

discrete

Y: the length of time to play 18 holes of golf.

continuous

M: the amount of milk produced yearly by a particular cow.

continuous

N: the number of eggs laid each month by a hen.

discrete

P: the number of building permits issued each month in a certain city.

discrete

Q: the weight of grain produced per acre.

continuous

## 2. Choosing a measure of location to summarize the data

### Instructions

*We have learned two ways to measure the location or centrality of the data: the sample mean and the sample median. Review their definitions and how to compute them (in R or Python, of course!).*

### Task

A certain polymer is used for evacuation systems for aircraft. It is important that the polymer be resistant to the aging process.

Twenty specimens of the polymer were used in an experiment. Ten were assigned randomly to be exposed to an accelerated batch aging process that involved exposure to high temperatures for 10 days.

Measurements of tensile strength of the specimens were made, and the following data were recorded on tensile strength in psi:

```
No aging: 227 222 218 216 218 217 225 229 228 221
Aging: 219 214 218 203 215 211 209 204 201 205
```

`# You can use the following code to create a data frame`

`strength <- c(227, 222, 218, 216, 218, 217, 225, 229, 228, 221, 219, 214, 218, 203, 215, 211, 209, 204, 201, 205)`

`aging <- as.factor(c(rep(0, 10), rep(1, 10)))`

`polymerData <- data.frame(strength, aging)`

(a) Do a dot plot of the data. Hint: You can use `qplot` from `ggplot2` to include the two attributes `strength` and `aging`.

`library(ggplot2)`

`qplot(strength, aging, data = polymerData)`

(b) From your plot, does it appear as if the aging process has had an effect on the tensile strength of this polymer? Explain.

Yes. In the dot plot, the aged specimens sit at lower strengths (201 to 219 psi) than the unaged specimens (216 to 229 psi), and the two groups overlap only slightly. Aging therefore appears to reduce the tensile strength of the polymer.

(c) Calculate the sample mean tensile strength of the two samples.

`mean(polymerData[1:10, 1])`

`mean(polymerData[11:20, 1])`

(d) Calculate the median for both. Discuss the similarity or lack of similarity between the mean and median of each group.

`median(polymerData[1:10, 1])`

`median(polymerData[11:20, 1])`

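The mean and median are close in each group (no aging: mean 222.1, median 221.5; aging: mean 209.9, median 210.0). This suggests that each distribution is roughly symmetric, with no extreme values pulling the mean away from the median.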
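
For readers using Python instead of R, the same means and medians can be reproduced with the standard library. This is a minimal sketch; the list names `no_aging` and `aging` are chosen here for illustration.

```python
from statistics import mean, median

# Tensile strength in psi, copied from the data listing above.
no_aging = [227, 222, 218, 216, 218, 217, 225, 229, 228, 221]
aging = [219, 214, 218, 203, 215, 211, 209, 204, 201, 205]

for label, sample in (("no aging", no_aging), ("aging", aging)):
    print(f"{label:>8}: mean = {mean(sample):.1f} psi, "
          f"median = {median(sample):.1f} psi")
```

With ten values per group, the median is the average of the fifth and sixth sorted values, which is why it can fall between two observed strengths.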

## 3. Choosing a measure of variability to summarize the data

### Instructions

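*We have learned about two statistics that capture data variability: variance and standard deviation. Review their meaning and units: the standard deviation is the square root of the variance, so it is expressed in the units of the data, while the variance is expressed in squared units.*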

### Task

The previous problem showed tensile strength data for two samples, one in which specimens were exposed to an aging process and one in which there was no aging of the specimens.

(a) Calculate the sample variance as well as standard deviation in tensile strength for both samples.

`var(polymerData[1:10, 1])`

`sd(polymerData[1:10, 1])`

`var(polymerData[11:20, 1])`

`sd(polymerData[11:20, 1])`

(b) Does there appear to be any evidence that aging affects the variability in tensile strength?

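Yes. The sample variance of the aged group (42.1 psi²) is nearly twice that of the unaged group (about 23.7 psi²), which suggests that aging makes tensile strength more variable. With only ten specimens per group, this is an indication rather than firm evidence.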
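
A Python counterpart to the R calls is sketched below, using only the standard library. `statistics.variance` and `statistics.stdev` use the $n - 1$ denominator, matching R's `var` and `sd`; the list names are chosen here for illustration.

```python
from statistics import stdev, variance

# Tensile strength in psi, copied from the data listing in the previous problem.
no_aging = [227, 222, 218, 216, 218, 217, 225, 229, 228, 221]
aging = [219, 214, 218, 203, 215, 211, 209, 204, 201, 205]

for label, sample in (("no aging", no_aging), ("aging", aging)):
    # variance is in psi squared; the standard deviation is back in psi.
    print(f"{label:>8}: variance = {variance(sample):.2f} psi^2, "
          f"sd = {stdev(sample):.2f} psi")
```

The standard deviation is simply the square root of the variance, which is why it is often easier to interpret next to the raw measurements.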

# Part Two: Working With Data

### Instructions

*Read the following information if you need it before you begin:*

- Obtaining the wine quality dataset

### Tasks

For the following exercises, work with the `winequality-red` data set. Use either `Python` or `R` to solve each problem.

Type a comment stating that you are working on a random data set we downloaded.

Locate the "Run" button and note whether there is a keyboard shortcut.

Execute the comment from the previous exercise. What is the output? Explain your answer.

Import the following packages:

a. For `Python`, import the `pandas` and `numpy` packages. Rename the `pandas` package "`pd`" and rename the `numpy` package "`np`".

b. For `R`, import the `ggplot2` package. Make sure you both install and open the package.

Import the `winequality-red` data set and name it `winequalRed`.

Hint for the R version: if you downloaded the data set as a .csv file, you can read it in with `winequalRed <- read.csv("~/Documents/datasets/winequality-red.csv", sep=";")` and view it with `View(winequalRed)`. Change these commands as needed and delete the hint comments before submitting your work.

`# Import the winequality-red.csv`

`winequalRed <- read.csv("winequality-red.csv", sep=";")`

`# View(winequalRed)`

Create a table of the `quality` and `alcohol` attributes from the `winequalRed` data set. Do not save the output from the code.

Hint: if you have two data columns named X and Y in your data frame, you can use code like `table(my.data.set$X, my.data.set$Y)` to create a table.

`table(winequalRed$quality, winequalRed$alcohol)`

Save the first nine records of the `winequalRed` data set as their own data frame.

`firstNine <- head(winequalRed, 9)`

`firstNine`

Save the `density` and `pH` records of the `winequalRed` data set as their own data frame.

`# select both columns at once so the result is a data frame rather than two separate vectors`

`densityPH <- winequalRed[, c("density", "pH")]`

Separate the wine data into a low quality class (quality $\le 5$) and a high quality class (quality $> 5$), and find the mean and standard deviation of the two attributes `total.sulfur.dioxide` and `alcohol` for the two classes. Based on the statistical information, describe whether these two attributes differ between the low quality and high quality red wines.

`lowQuality <- winequalRed[which(winequalRed$quality <= 5), ]`

`highQuality <- winequalRed[which(winequalRed$quality > 5), ]`

`mean(lowQuality$total.sulfur.dioxide)`

`mean(lowQuality$alcohol)`

`sd(lowQuality$total.sulfur.dioxide)`

`sd(lowQuality$alcohol)`

`mean(highQuality$total.sulfur.dioxide)`

`mean(highQuality$alcohol)`

`sd(highQuality$total.sulfur.dioxide)`

`sd(highQuality$alcohol)`
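
The split-then-summarize pattern can also be sketched in plain Python. Everything in the block below is an assumption made for illustration: the helper `summarize` is hypothetical, and the four rows are made-up values, not records from `winequality-red`. In practice the rows would come from the CSV file.

```python
from statistics import mean, stdev

# Made-up rows for illustration only; NOT taken from the wine data set.
rows = [
    {"quality": 5, "alcohol": 9.0, "total.sulfur.dioxide": 40.0},
    {"quality": 4, "alcohol": 10.0, "total.sulfur.dioxide": 60.0},
    {"quality": 6, "alcohol": 11.0, "total.sulfur.dioxide": 30.0},
    {"quality": 7, "alcohol": 13.0, "total.sulfur.dioxide": 20.0},
]


def summarize(rows, column, threshold=5):
    """Return {class: (mean, sd)} for quality <= threshold and quality > threshold."""
    groups = {"low": [], "high": []}
    for row in rows:
        key = "low" if row["quality"] <= threshold else "high"
        groups[key].append(row[column])
    # stdev needs at least two values per class.
    return {name: (mean(values), stdev(values)) for name, values in groups.items()}


for column in ("total.sulfur.dioxide", "alcohol"):
    print(column, summarize(rows, column))
```

Comparing the class means relative to the class standard deviations gives a first impression of whether an attribute separates the two quality classes.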

To investigate the distribution of the `quality` attribute, which plot would you use, a boxplot or a histogram? Show your result.

Both histograms and box plots allow you to visually assess the central tendency and the amount of variation in the data, as well as the presence of gaps, outliers, or unusual data points.

Histograms are preferred for determining the underlying probability distribution of a data set. Box plots, on the other hand, are more useful when comparing several data sets.

Although histograms are better at displaying the distribution of data, box plots can be used to tell whether the distribution is symmetric or skewed.

`boxplot(winequalRed$quality)`

`# half-integer breaks give each integer quality score (3 to 8) its own bar`

`hist(winequalRed$quality, breaks = seq(2.5, 8.5, by = 1), labels = TRUE)`

### Extra Points

Without quitting R, load the `winequality-white.csv` file into the workspace. Create a data frame by using the first 50 records of red wines and the first 50 records of white wines, and show a plot of `quality`.

Hint: You need to use the `merge` function.

`# Import the winequality-white.csv`

`winequalWhite <- read.csv("winequality-white.csv", sep=";")`

`# View(winequalWhite)`

`# the first 50 records of red wines`

`redWine <- head(winequalRed, 50)`

`# the first 50 records of white wines`

`whiteWine <- head(winequalWhite, 50)`

`# merge data`

`mergedWine <- merge(redWine, whiteWine, all = TRUE)`

`boxplot(mergedWine$quality)`

`# half-integer breaks computed from the data give each integer quality score its own bar`

`hist(mergedWine$quality, breaks = seq(min(mergedWine$quality) - 0.5, max(mergedWine$quality) + 0.5, by = 1), labels = TRUE)`