# Part One

# 1. Types of random variables

# Instructions

Review the definitions of discrete and continuous random variables.

  • A variable is a quantity whose value changes.

  • A discrete variable is a variable whose value is obtained by counting.

    Examples:

    • number of students present
    • number of red marbles in a jar
    • number of heads when flipping three coins
    • students’ grade level
  • A continuous variable is a variable whose value is obtained by measuring.

    Examples:

    • height of students in class
    • weight of students in class
    • time it takes to get to school
    • distance traveled between classes
  • A random variable is a variable whose value is a numerical outcome of a random phenomenon.

    • A random variable is denoted with a capital letter, such as X.
    • The probability distribution of a random variable X tells what the possible values of X are and how probabilities are assigned to those values.
    • A random variable can be discrete or continuous.
  • A discrete random variable X has a countable number of possible values.

    • Example: Let X represent the sum of two dice.
    • To graph the probability distribution of a discrete random variable, construct a probability histogram.
  • A continuous random variable X takes all values in a given interval of numbers.

    • The probability distribution of a continuous random variable is shown by a density curve.
    • The probability that X falls within an interval of numbers is the area under the density curve between the interval's endpoints.
    • The probability that a continuous random variable X is exactly equal to any single number is zero.
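
As a rough illustration of the discrete case, the sketch below enumerates the probability distribution of X, the sum of two dice, in Python (one of the two languages the tasks allow). It assumes both dice are fair and six-sided; the function name `two_dice_distribution` is made up for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def two_dice_distribution():
    """Return {sum: probability} for the sum of two fair six-sided dice."""
    # Assumption: each of the 36 ordered outcomes (d1, d2) is equally likely.
    counts = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))
    return {total: Fraction(n, 36) for total, n in sorted(counts.items())}


dist = two_dice_distribution()
for total, prob in dist.items():
    # One '#' per favorable outcome: a crude text probability histogram.
    print(f"{total:2d}  {prob}  {'#' * int(prob * 36)}")
```

Because X takes only the eleven values 2 through 12, its probabilities can be listed one by one and drawn as bars. A continuous random variable has no such list, which is why its distribution is described by areas under a density curve instead.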

# Tasks

Classify the following random variables as discrete or continuous:

  1. X: the number of automobile accidents per year in Virginia.

    discrete

  2. Y : the length of time to play 18 holes of golf.

    continuous

  3. M: the amount of milk produced yearly by a particular cow.

    continuous

  4. N: the number of eggs laid each month by a hen.

    discrete

  5. P: the number of building permits issued each month in a certain city.

    discrete

  6. Q: the weight of grain produced per acre.

    continuous

# 2. Choosing a measure of location to summarize the data

# Instructions

We have learned two ways to measure the location or centrality of the data: the sample mean and the sample median. Review their definitions and how to compute them (in R or Python, of course!).

# Task

A certain polymer is used for evacuation systems for aircraft. It is important that the polymer be resistant to the aging process.

Twenty specimens of the polymer were used in an experiment. Ten were assigned randomly to be exposed to an accelerated batch aging process that involved exposure to high temperatures for 10 days.

Measurements of tensile strength of the specimens were made, and the following data were recorded on tensile strength in psi:

No aging: 227 222 218 216 218 217 225 229 228 221
Aging: 219 214 218 203 215 211 209 204 201 205

    # You can use the following code to create a data frame
    strength <- c(227, 222, 218, 216, 218, 217, 225, 229, 228, 221, 219, 214, 218, 203, 215, 211, 209, 204, 201, 205)
    aging <- as.factor(c(rep(0, 10), rep(1, 10)))
    polymerData <- data.frame(strength, aging)

  1. (a) Make a dot plot of the data. Hint: You can use qplot from ggplot2 to plot the two attributes strength and aging.

    library(ggplot2)
    qplot(strength, aging, data = polymerData)
  2. (b) From your plot, does it appear as if the aging process has had an effect on the tensile strength of this polymer? Explain.

    Yes. In the dot plot, the aged specimens cluster at lower strengths (201 to 219 psi) than the unaged specimens (216 to 229 psi), with little overlap between the two groups. Aging therefore appears to reduce the tensile strength of this polymer.

  3. (c) Calculate the sample mean tensile strength of the two samples.

    mean(polymerData[1:10,1])
    mean(polymerData[11:20,1])
  4. (d) Calculate the median for both. Discuss the similarity or lack of similarity between the mean and median of each group.

    median(polymerData[1:10,1])
    median(polymerData[11:20,1])

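    Within each group the mean and median are close: 222.1 vs. 221.5 psi without aging, and 209.9 vs. 210.0 psi with aging. This suggests that both distributions are roughly symmetric, with no extreme values pulling the mean away from the median.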
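
For readers working in Python rather than R, the following minimal sketch cross-checks the same two measures of location using only the standard library. The variable names `no_aging` and `aging` are chosen for this example; the values are copied from the task statement.

```python
from statistics import mean, median

# Tensile strength in psi, copied from the task statement.
no_aging = [227, 222, 218, 216, 218, 217, 225, 229, 228, 221]
aging = [219, 214, 218, 203, 215, 211, 209, 204, 201, 205]

for label, sample in (("no aging", no_aging), ("aging", aging)):
    # With ten values, the median is the average of the 5th and 6th sorted values.
    print(f"{label}: mean = {mean(sample):.1f} psi, median = {median(sample):.1f} psi")
```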

# 3. Choosing a measure of variability to summarize the data

# Instructions

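We have learned about two statistics that capture data variability: variance and standard deviation. Review their meaning and units: the standard deviation is the square root of the variance, so it is expressed in the same units as the data, while the variance is expressed in squared units.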

# Task

The previous problem showed tensile strength data for two samples, one in which specimens were exposed to an aging process and one in which there was no aging of the specimens.

  1. (a) Calculate the sample variance as well as standard deviation in tensile strength for both samples.

    var(polymerData[1:10,1])
    sd(polymerData[1:10,1])
    var(polymerData[11:20,1])
    sd(polymerData[11:20,1])
  2. (b) Does there appear to be any evidence that aging affects the variability in tensile strength?

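    Yes. The sample variance of the aged group (about 42.1 psi², standard deviation about 6.49 psi) is larger than that of the unaged group (about 23.7 psi², standard deviation about 4.86 psi), which suggests that aging makes tensile strength more variable. With only ten specimens per group, this is an indication rather than firm evidence.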
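
A Python counterpart to the R calls above can be sketched with the standard library's `statistics` module, whose `variance` and `stdev` functions use the same n − 1 denominator as R's `var` and `sd`. The variable names are again chosen for this example only.

```python
from statistics import stdev, variance

# Tensile strength in psi, copied from the task statement.
no_aging = [227, 222, 218, 216, 218, 217, 225, 229, 228, 221]
aging = [219, 214, 218, 203, 215, 211, 209, 204, 201, 205]

for label, sample in (("no aging", no_aging), ("aging", aging)):
    # Variance is in squared units (psi^2); standard deviation is back in psi.
    print(f"{label}: variance = {variance(sample):.2f} psi^2, sd = {stdev(sample):.2f} psi")
```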

# Part Two: Working With Data

# Instructions

Read the following information if you need it before you begin:

  • Obtaining the wine quality dataset

# Tasks

For the following exercises, work with the winequality-red data set. Use either Python or R to solve each problem.

  • Type a comment stating that you are working on a random data set we downloaded.

  • Locate the "Run" button and note whether there is a keyboard shortcut.

  • Execute the comment from the previous exercise. What is the output? Explain your answer.

  • Import the following packages:

    a. For Python, import the pandas and numpy packages. Rename the pandas package "pd" and rename the numpy package "np".
    b. For R, import the ggplot2 package. Make sure you both install and load the package.

  1. Import the winequality-red data set and name it winequalRed.

    # here is a hint for the R version
    # -- change these commands as needed and delete these comments before submitting your work --
    # if you downloaded the data set as a .csv file then you can read it in as follows:
    # winequalRed <- read.csv("~/Documents/datasets/winequality-red.csv", sep=";")
    # To view the data set
    #   View(winequalRed)
    
    # Import the winequality-red.csv
    winequalRed <- read.csv("winequality-red.csv", sep=";")
    # View(winequalRed)
  2. Create a table of the quality and alcohol attributes from the winequalRed data set. Do not save the output from the code.

    # hint: if you have two data columns named X and Y in your data frame, you can use code like this to create a table:
    # table(my.data.set$X, my.data.set$Y)
    
    table(winequalRed$quality, winequalRed$alcohol)
  3. Save the first nine records of the winequalRed data set as their own data frame.

    firstNine <- head(winequalRed, 9)
    firstNine
  4. Save the density and pH records of the winequalRed data set as their own data frame.

    # select both columns at once so the result is a single data frame
    densityPH <- winequalRed[, c("density", "pH")]
  5. Separate the wine data into a low quality class (quality ≤ 5) and a high quality class (quality > 5), then find the mean and standard deviation of the two attributes total.sulfur.dioxide and alcohol for each class. Based on these statistics, describe whether the two attributes differ between low quality and high quality red wines.

    lowQuality <- winequalRed[which(winequalRed$quality <= 5),]
    highQuality <- winequalRed[which(winequalRed$quality > 5),]
    mean(lowQuality$total.sulfur.dioxide)
    mean(lowQuality$alcohol)
    sd(lowQuality$total.sulfur.dioxide)
    sd(lowQuality$alcohol)
    mean(highQuality$total.sulfur.dioxide)
    mean(highQuality$alcohol)
    sd(highQuality$total.sulfur.dioxide)
    sd(highQuality$alcohol)
  6. To investigate the distribution of the quality attribute, which plot would you use: a boxplot or a histogram? Show your result.

    Both histograms and box plots let you visually assess the central tendency, the amount of variation in the data, and the presence of gaps, outliers, or unusual data points.
    Histograms are preferred for judging the underlying probability distribution of a data set. Box plots, on the other hand, are more useful for comparing several data sets.
    Although histograms are better at displaying the distribution of the data, box plots can still show whether the distribution is symmetric or skewed.

    boxplot(winequalRed$quality) 
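    # quality is integer-valued: half-integer breaks give each score its own bar (seq(3,8) merges 3 and 4)
    hist(winequalRed$quality, breaks = seq(2.5, 8.5, by = 1), labels = TRUE)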
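
For readers taking the Python route, the low/high quality split from exercise 5 can be sketched with the standard library alone. This is only an illustration: the helper `summarize_by_class` and the toy rows are hypothetical and are not taken from the real data set. The sketch also assumes the raw CSV header uses spaces (`total sulfur dioxide`), which R's `read.csv` turns into dots.

```python
import csv
import io
from statistics import mean, stdev

# Hypothetical toy rows in the file's semicolon-delimited layout; NOT real data.
TOY_CSV = "\n".join([
    '"total sulfur dioxide";"alcohol";"quality"',
    "30;9.5;5",
    "60;9.8;4",
    "45;9.2;5",
    "20;11.0;6",
    "25;12.1;7",
    "35;10.8;6",
])


def summarize_by_class(handle, columns=("total sulfur dioxide", "alcohol")):
    """Return {class: {column: (mean, sd)}} for low (<= 5) and high (> 5) quality."""
    values = {"low": {c: [] for c in columns}, "high": {c: [] for c in columns}}
    for row in csv.DictReader(handle, delimiter=";"):
        label = "low" if int(row["quality"]) <= 5 else "high"
        for c in columns:
            values[label][c].append(float(row[c]))
    # stdev uses the n - 1 denominator, like R's sd().
    return {
        label: {c: (mean(v), stdev(v)) for c, v in cols.items()}
        for label, cols in values.items()
    }


summary = summarize_by_class(io.StringIO(TOY_CSV))
for label, cols in summary.items():
    for column, (m, s) in cols.items():
        print(f"{label:>4} quality, {column}: mean = {m:.2f}, sd = {s:.2f}")
```

To run the same helper on the downloaded file, pass it an open file handle instead of the toy string, for example `with open("winequality-red.csv", newline="") as f: summarize_by_class(f)`.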

# Extra Points

  1. Without quitting R, load the winequality-white.csv file into the workspace. Create a data frame from the first 50 records of red wines and the first 50 records of white wines, and show a plot of quality.

    Hint: You need to use the merge function.

    # Import the winequality-white.csv
    winequalWhite <- read.csv("winequality-white.csv", sep=";")
    # View(winequalWhite)
    # the first 50 records of red wines
    redWine <- head(winequalRed, 50)
    # the first 50 records of white wines
    whiteWine <- head(winequalWhite, 50)
    # merge data
    mergedWine <- merge(redWine, whiteWine, all = TRUE)
    boxplot(mergedWine$quality) 
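    # derive half-integer breaks from the data so every quality score gets its own bar
    hist(mergedWine$quality, breaks = seq(min(mergedWine$quality) - 0.5, max(mergedWine$quality) + 0.5, by = 1), labels = TRUE)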