# Objectives

• Know how to standardize numeric variables
• Know how to choose the optimal visualization tool to analyze the covariation of two variables

1. Chapter 7, R for Data Science by Garrett Grolemund and Hadley Wickham, available freely online. https://r4ds.had.co.nz/exploratory-data-analysis.html

2. Chapters 3 & 4, Data Science Using Python and R, Print ISBN: 9781119526810, Online ISBN: 9781119526865. https://onlinelibrary.wiley.com/doi/book/10.1002/9781119526865

# Data Preparation

## Standardizing the Numeric Fields

Certain algorithms perform better when the numeric fields are standardized so that the field mean equals 0 and the field standard deviation equals 1, as follows:

$z = \frac{x-\bar{x}}{s}$
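The formula above can be sketched in R as follows. This is a minimal illustration that assumes the built-in `cars` data set as example data; the variable names `z_manual` and `z_scale` are hypothetical.

```r
# Standardize a numeric field: subtract the mean, divide by the standard deviation.
x <- cars$dist                      # built-in data set, used only as an example

z_manual <- (x - mean(x)) / sd(x)   # z = (x - x_bar) / s, written out by hand
z_scale  <- as.numeric(scale(x))    # scale() centers and scales by default

all.equal(z_manual, z_scale)        # the two approaches should agree
mean(z_manual)                      # approximately 0
sd(z_manual)                        # 1
```

`scale()` returns a one-column matrix with attributes, so `as.numeric()` is used here to get a plain vector back.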

## Identifying Outliers

A rough rule of thumb is that a data value is an outlier if its $z$-value is either greater than 3 or less than -3. Why?
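One way to check this rule is sketched below. It assumes the data are roughly normal and uses the `speed` column of the built-in `cars` data set as example data; the names `z_speed` and `outliers` are hypothetical.

```r
# Tail probabilities of a standard normal variable beyond 3 standard deviations.
pnorm(-3)        # P(Z < -3)
1 - pnorm(3)     # P(Z > 3), the same value by symmetry

# Flag the rows whose z-value lies outside [-3, 3].
z_speed  <- as.numeric(scale(cars$speed))
outliers <- cars[abs(z_speed) > 3, ]
outliers         # a data frame with the flagged rows (possibly none)
```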

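For a standard normal variable, the probability of a value below -3 is 0.001349898, and by symmetry the probability of a value above 3 is the same. About 99.73% of the values therefore lie within three standard deviations of the mean, so a value outside that range is rare.

Applied to a data frame with the columns `speed` and `dist`, the rule flags no outliers (the result has 0 rows).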


### In-class Exercise

Standardize the field `car$dist`. Produce a listing of all the car items that are outliers at the high end of the scale.

# Exploratory Data Analysis (EDA)

Using visualization and transformation to explore your data in a systematic way is called exploratory data analysis, or EDA for short.

• Use graphics to explore the relationship between the predictor variables and the target variable.

• Use graphics and tables to derive new variables that will increase predictive value.

Let's import the library first.
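A minimal sketch of the loading step, assuming the tidyverse package is already installed:

```r
# Attach the core tidyverse packages (ggplot2, dplyr, tibble, and others).
library(tidyverse)
```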

Loading tidyverse 1.3.1 attaches ggplot2 3.3.5, purrr 0.3.4, tibble 3.1.4, dplyr 1.0.7, tidyr 1.1.3, stringr 1.4.0, readr 2.0.1, and forcats 0.5.1, and then reports any function-name conflicts.


More information about tibbles is available at https://r4ds.had.co.nz/tibbles.html#tibbles.

Today we will play with `diamonds`, a built-in data set in the package `ggplot2`.
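A first look at the data might go like this. It is a sketch that loads only `ggplot2`, which is enough to make `diamonds` available.

```r
library(ggplot2)   # diamonds ships with ggplot2

head(diamonds)     # first six rows
dim(diamonds)      # number of rows and columns
summary(diamonds)  # per-column summary: quartiles for numeric fields, counts for factors
```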

The preview of the data is a 6 × 10 tibble. Summary of the numeric fields:

| Statistic | carat | depth | table | price | x | y | z |
|---|---|---|---|---|---|---|---|
| Min. | 0.2000 | 43.00 | 43.00 | 326 | 0.000 | 0.000 | 0.000 |
| 1st Qu. | 0.4000 | 61.00 | 56.00 | 950 | 4.710 | 4.720 | 2.910 |
| Median | 0.7000 | 61.80 | 57.00 | 2401 | 5.700 | 5.710 | 3.530 |
| Mean | 0.7979 | 61.75 | 57.46 | 3933 | 5.731 | 5.735 | 3.539 |
| 3rd Qu. | 1.0400 | 62.50 | 59.00 | 5324 | 6.540 | 6.540 | 4.040 |
| Max. | 5.0100 | 79.00 | 95.00 | 18823 | 10.740 | 58.900 | 31.800 |

Counts for the categorical fields:

| cut | Count | color | Count | clarity | Count |
|---|---|---|---|---|---|
| Fair | 1610 | D | 6775 | SI1 | 13065 |
| Good | 4906 | E | 9797 | VS2 | 12258 |
| Very Good | 12082 | F | 9542 | SI2 | 9194 |
| Premium | 13791 | G | 11292 | VS1 | 8171 |
| Ideal | 21551 | H | 8304 | VVS2 | 5066 |
| | | I | 5422 | VVS1 | 3655 |
| | | J | 2808 | (Other) | 2531 |


## Barplot Revisit

From the above information, we know cut is a categorical field.

We can use barplot to visualize the distribution of the categorical/discrete variable.
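A base-R version might look like the sketch below; the name `cut_counts` is hypothetical.

```r
library(ggplot2)                     # provides the diamonds data set

cut_counts <- table(diamonds$cut)    # frequency of each cut level
cut_counts

# One bar per level, with bar height equal to the count.
barplot(cut_counts, xlab = "cut", ylab = "count")
```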

### `geom_bar()` to construct a Barplot with Overlay
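A possible sketch of a barplot with an overlay is shown below. It assumes that "overlay" means coloring each bar by a second categorical variable, and it picks `color` as that variable purely for illustration.

```r
library(ggplot2)

# One bar per cut; each bar is split by color (the overlay).
p_bar <- ggplot(diamonds, aes(x = cut, fill = color)) +
  geom_bar()
print(p_bar)

# position = "fill" rescales every bar to height 1, which makes
# proportions comparable across cuts of very different size.
p_bar_fill <- ggplot(diamonds, aes(x = cut, fill = color)) +
  geom_bar(position = "fill")
print(p_bar_fill)
```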

#### In-class Exercise: `geom_bar()`

1. Please use `barplot` and `geom_bar()` to investigate the `clarity` attribute in the `diamonds` data set.

2. Please use `geom_bar()` to investigate the relationship between `clarity` and `cut`.

### `geom_count()` to visualize the covariation between categorical variables

The size of each circle in the plot displays how many observations occurred at each combination of values. Covariation will appear as a strong correlation between specific x values and specific y values.
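A plot of this kind can be sketched as follows, using `cut` and `color` as an illustrative pair of categorical variables.

```r
library(ggplot2)

# Each circle sits at one (cut, color) combination;
# its size is the number of diamonds with that combination.
p_count <- ggplot(diamonds, aes(x = cut, y = color)) +
  geom_count()
print(p_count)
```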

Another approach is to compute the count with `dplyr`:
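A sketch of that approach, again with `color` and `cut` as the illustrative pair; the name `cut_color_n` is hypothetical.

```r
library(ggplot2)
library(dplyr)

# One row per (color, cut) combination, with the count in column n.
cut_color_n <- diamonds %>%
  count(color, cut)
cut_color_n

# The counts can then be drawn as a heat map: darker or lighter tiles
# show which combinations are common.
p_tile <- ggplot(cut_color_n, aes(x = color, y = cut)) +
  geom_tile(aes(fill = n))
print(p_tile)
```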

#### In-class Exercise: Investigate Two Categorical Variables

1. Please use `geom_count()` to investigate the relationship between `clarity` and `cut`.

2. Please use `geom_tile()` to investigate the relationship between `clarity` and `cut`.

## Histogram Revisit

From the above information, we know price is a continuous variable.

We can use hist to visualize the distribution of the continuous variable.
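A base-R sketch is shown below; the number of breaks is an arbitrary choice made for illustration.

```r
library(ggplot2)   # provides the diamonds data set

# hist() bins the values and draws one bar per bin;
# breaks = 50 is only a suggestion to R for the number of bins.
h <- hist(diamonds$price, breaks = 50,
          main = "Histogram of price", xlab = "price")
```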

### `geom_histogram()` to construct a Histogram with Overlay
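A possible sketch of a histogram with an overlay follows. It assumes the overlay is a categorical variable mapped to the fill color, and the bin width of 500 is an arbitrary illustrative value.

```r
library(ggplot2)

# Histogram of price; each bar is split by cut (the overlay).
p_hist <- ggplot(diamonds, aes(x = price, fill = cut)) +
  geom_histogram(binwidth = 500)
print(p_hist)
```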

#### In-class Exercise: `geom_histogram()`

1. Please use `hist` and `geom_histogram()` to investigate the `depth` attribute in the `diamonds` data set.

2. Please use `geom_histogram()` to investigate the relationship between `depth` and `cut`.

### `geom_freqpoly()` to show the covariation of two variables

The default appearance of `geom_freqpoly()` is not that useful for comparing the distribution of a continuous variable across the levels of a categorical variable, because the height is given by the count.

That means if one of the groups is much smaller than the others, it's hard to see the differences in shape.

For example, let’s explore how the price of a diamond varies with its quality:
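The sketch below draws the count-based version and, as one common remedy, a density-based version in which the area under each line is 1. The bin width of 500 is an illustrative choice, and `after_stat()` assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Default: the height is the count, so small groups are hard to read.
p_freq <- ggplot(diamonds, aes(x = price, colour = cut)) +
  geom_freqpoly(binwidth = 500)
print(p_freq)

# Density on the y-axis standardizes each group, so shapes can be compared.
p_density <- ggplot(diamonds,
                    aes(x = price, y = after_stat(density), colour = cut)) +
  geom_freqpoly(binwidth = 500)
print(p_density)
```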

#### In-class Exercise: Investigate a Categorical and a Continuous Variable

1. Please use `geom_freqpoly()` to investigate the relationship between `depth` and `cut`.

## Boxplot Revisit

Another way to display the distribution of a continuous variable broken down by a categorical variable is the boxplot.

### `geom_boxplot()` to investigate a Categorical and a Continuous Variable
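A minimal sketch of the price distribution by cut:

```r
library(ggplot2)

# One box per cut: the box spans the quartiles, the line marks the median.
p_box <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot()
print(p_box)
```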

We see much less information about the distribution, but the boxplots are much more compact so we can more easily compare them (and fit more on one plot).

It supports the counterintuitive finding that better quality diamonds are cheaper on average!

To make the trend easier to see, we can reorder `cut` based on the median value of `price`:
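One way to do this is with `reorder()`, as in the sketch below; the name `cut_by_median` is hypothetical.

```r
library(ggplot2)

# reorder() sorts the levels of cut by the median price within each level.
cut_by_median <- reorder(diamonds$cut, diamonds$price, FUN = median)
levels(cut_by_median)   # levels from lowest to highest median price

p_reordered <- ggplot(diamonds,
                      aes(x = reorder(cut, price, FUN = median), y = price)) +
  geom_boxplot() +
  labs(x = "cut (ordered by median price)")
print(p_reordered)
```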

If you have long variable names, `geom_boxplot()` will work better if you flip it 90°.

You can do that with `coord_flip()`.
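A short sketch of the flipped boxplot:

```r
library(ggplot2)

# coord_flip() swaps the axes, so the category labels are written horizontally.
p_flip <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot() +
  coord_flip()
print(p_flip)
```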

#### In-class Exercise: Investigate a Categorical and a Continuous Variable by `geom_boxplot()`

1. Please use `geom_boxplot()` to investigate the relationship between `depth` and `cut`. You can choose to flip the boxplot or not.

### `geom_point()` for Two Continuous Variables

We can use a regular scatterplot to visualize the relationship between two continuous variables.

Draw a scatterplot with `geom_point()`.
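A sketch with `carat` and `price` as an illustrative pair; the transparency value is an assumption added to make dense regions easier to read.

```r
library(ggplot2)

# Each point is one diamond; alpha makes overlapping points semi-transparent.
p_point <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1)
print(p_point)
```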

Let's play with the `iris` data set.
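A base-R sketch follows; the choice of `Sepal.Length` and `Sepal.Width` is only for illustration.

```r
# Base-R scatterplot of two continuous fields, colored by species.
species_code <- as.integer(iris$Species)   # 1, 2, 3 used as color indices

plot(iris$Sepal.Length, iris$Sepal.Width,
     col = species_code, pch = 19,
     xlab = "Sepal.Length", ylab = "Sepal.Width")
legend("topright", legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)), pch = 19)
```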

What about `ggplot`?
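A `ggplot2` sketch of a scatterplot on `iris`, again with an illustrative pair of variables:

```r
library(ggplot2)

# Mapping Species to colour adds a legend automatically.
p_iris <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) +
  geom_point()
print(p_iris)
```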

#### In-class Exercise: Two Continuous Variables by `geom_point()`

1. Please use `geom_point()` to investigate the relationship between `depth` and `price`.

2. Please use `geom_point()` to investigate the relationship between `Petal.Length` and `Sepal.Length` by `Species` in the `iris` data set.

### `geom_bin2d()` and `geom_hex()` for Two Continuous Variables

When a scatterplot has too many overlapping points to read, another solution is to bin the data.

Previously you used `geom_histogram()` and `geom_freqpoly()` to bin in one dimension.

Now you'll learn how to use `geom_bin2d()` and `geom_hex()` to bin in two dimensions.
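A sketch of both geoms, with `carat` and `price` as an illustrative pair. `geom_hex()` needs the separate `hexbin` package, so that part only runs when the package is installed.

```r
library(ggplot2)

# Rectangular bins: the fill color shows how many points fall in each bin.
p_bin2d <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_bin2d()
print(p_bin2d)

# Hexagonal bins: same idea, but requires the hexbin package.
if (requireNamespace("hexbin", quietly = TRUE)) {
  p_hex <- ggplot(diamonds, aes(x = carat, y = price)) +
    geom_hex()
  print(p_hex)
}
```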

#### In-class Exercise

1. Please use `geom_bin2d()` to investigate the relationship between `depth` and `price`.

2. Please use `geom_hex()` to investigate the relationship between `depth` and `price`.

### `geom_boxplot()` for Two Continuous Variables

We can also bin one continuous variable so it acts like a categorical variable.

Then you can use one of the techniques for visualizing the combination of a categorical and a continuous variable that you learned about.

For example, you could bin carat and then for each group, display a boxplot:
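One way to do this is with `cut_width()` from `ggplot2`, as sketched below. The width of 0.1 carat is an illustrative choice, and `carat_bins` is a hypothetical name.

```r
library(ggplot2)

# cut_width() turns a continuous vector into a factor of equal-width bins.
carat_bins <- cut_width(diamonds$carat, width = 0.1)
nlevels(carat_bins)   # number of bins

# One boxplot of price per 0.1-carat bin.
p_binned_box <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_boxplot(aes(group = cut_width(carat, 0.1)))
print(p_binned_box)
```

A smaller width gives more boxes with fewer diamonds in each, so the choice trades detail against noise.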

#### In-class Exercise: Two Continuous Variables by `geom_boxplot()`

1. Please use `geom_boxplot()` to investigate the relationship between `depth` and `price`. How would you choose the width?

2. Please use `geom_boxplot()` to investigate the relationship between `Sepal.Length` and `Sepal.Width`. How would you choose the width?

# References

1. R for Data Science by Garrett Grolemund and Hadley Wickham, available freely online. https://r4ds.had.co.nz/exploratory-data-analysis.html
2. Data Science Using Python and R, Print ISBN: 9781119526810, Online ISBN: 9781119526865. https://onlinelibrary.wiley.com/doi/book/10.1002/9781119526865