# # Outline

• Continuous and Continuous Variables
连续变量和连续变量
• Pearson's correlation coefficient
皮尔逊相关系数
• Categorical and Continuous Variables
分类变量和连续变量
• ANOVA Test
方差分析检验
• Categorical and Categorical Variables
分类变量和分类变量
• Chi-squared Test
卡方检验

# # Correlation between Numerical Variables 数值变量之间的相关性

To investigate the correlation, we can use the `pairs` function to draw a scatter plot for every pair of numerical variables.
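
A minimal sketch of such a call, assuming the built-in iris data set that the rest of these notes use. The colouring by species is an optional extra, not something the notes require.

```r
# Keep only the four numeric measurements; column 5 (Species) is categorical.
num_vars <- iris[, 1:4]

# Draw one scatter plot per pair of variables.
# Points are coloured by species (1, 2, 3 = setosa, versicolor, virginica).
pairs(num_vars, col = as.integer(iris$Species), pch = 19)
```

Each off-diagonal panel plots one variable against another, so a tight upward or downward band of points suggests a strong linear relationship.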

How to interpret the scatter plot?

To get the Pearson correlation coefficients $r$, we can use the `cor` function:

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
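
The matrix above is the kind of result `cor` returns for the four numeric columns of iris. A minimal sketch of the call, assuming the default Pearson method:

```r
# Pearson correlation matrix of the four numeric columns
# (method = "pearson" is the default of cor).
r <- cor(iris[, 1:4])
print(r)

# Correlation for a single pair of variables.
r_petal <- cor(iris$Petal.Length, iris$Petal.Width)
print(r_petal)
```

The matrix is symmetric with ones on the diagonal, because every variable is perfectly correlated with itself.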

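The sign of $r$ gives the direction of the relationship, and its absolute value $|r|$ gives the strength:

• $|r| < 0.3$: weak correlation 弱相关性

• $0.3 \le |r| < 0.7$: moderate correlation 中等相关性

• $|r| \ge 0.7$: high correlation 高相关性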
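
The rule of thumb above can be sketched as a small helper. The function name `correlation_strength` is hypothetical, and it makes two assumptions: the strength is judged on the absolute value of r, and the boundary values 0.3 and 0.7 go to the stronger class.

```r
# Hypothetical helper: label the strength of a correlation coefficient.
correlation_strength <- function(r) {
  cut(abs(r),
      breaks = c(0, 0.3, 0.7, 1),
      labels = c("weak", "moderate", "high"),
      right = FALSE,          # intervals are [0, 0.3), [0.3, 0.7), [0.7, 1]
      include.lowest = TRUE)  # so that |r| = 1 falls in the last interval
}

r <- cor(iris[, 1:4])
s_high     <- correlation_strength(r["Sepal.Length", "Petal.Length"])
s_moderate <- correlation_strength(r["Sepal.Width", "Petal.Length"])
s_weak     <- correlation_strength(r["Sepal.Length", "Sepal.Width"])
print(c(as.character(s_high), as.character(s_moderate), as.character(s_weak)))
```

With the coefficients shown in the matrix, 0.87 is labelled high, -0.43 moderate, and -0.12 weak.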

# # Association between One Numerical Variable and One Categorical Variable 一个数值变量和一个分类变量之间的关联

We can draw an overlay boxplot first.

Let's play with the iris data set.
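
One possible sketch of such a plot, assuming "overlay" means drawing the individual observations on top of the boxes. The choice of `Sepal.Length` against `Species` matches the ANOVA output later in this section.

```r
# One box of Sepal.Length per species; boxplot() also returns its statistics.
bp <- boxplot(Sepal.Length ~ Species, data = iris,
              xlab = "Species", ylab = "Sepal.Length")

# Overlay the raw observations, jittered horizontally to reduce overlap.
stripchart(Sepal.Length ~ Species, data = iris,
           vertical = TRUE, method = "jitter",
           add = TRUE, pch = 20, col = "grey40")
```

If the boxes sit at clearly different heights, the numerical variable probably depends on the categorical one.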

We can use the ANOVA test to check the association between one numerical variable and one categorical variable with the `aov` function.

ANOVA (AOV) is short for ANalysis Of VAriance.

             Df Sum Sq Mean Sq F value Pr(>F)
Species       2  63.21  31.606   119.3 <2e-16 ***
Residuals   147  38.96   0.265
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
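
The table above is consistent with a one-way ANOVA of `Sepal.Length` on `Species`. A minimal sketch of the call that would produce it (the object name `fit` is an assumption):

```r
# One-way ANOVA: does mean Sepal.Length differ among the three species?
fit <- aov(Sepal.Length ~ Species, data = iris)

# summary() prints the ANOVA table (Df, Sum Sq, Mean Sq, F value, Pr(>F)).
anova_table <- summary(fit)
print(anova_table)
```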


## # Interpretation 解读

• The Df column displays the degrees of freedom for the independent variable (the number of levels in the variable minus 1; here 3 − 1 = 2), and the degrees of freedom for the residuals (the total number of observations minus the number of levels in the independent variable; here 150 − 3 = 147).
Df 列显示自变量的自由度 (变量的水平数减 1，此处为 3 − 1 = 2)，以及残差的自由度 (观测总数减去自变量的水平数，此处为 150 − 3 = 147)。

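• The Sum Sq column displays the sum of squares: for Species, the total variation between the group means and the overall mean; for Residuals, the variation of the observations around their own group means.
Sum Sq 列显示平方和：Species 行是组均值与总体均值之间的总变异，Residuals 行是各观测值围绕其所在组均值的变异。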

• The Mean Sq column is the mean of the sum of squares, calculated by dividing the sum of squares by the degrees of freedom for each parameter.
Mean Sq 列是平方和的平均值，计算方法是将平方和除以每个参数的自由度。

• The F-value column is the test statistic from the F test.
F-Value 列是来自 F 检验的测试统计数据。
This is the mean square of each independent variable divided by the mean square of the residuals.
这是每个自变量的均方除以残差的均方。
The larger the F value, the more likely it is that the variation caused by the independent variable is real and not due to chance.
F 值越大，由自变量引起的变化就越有可能是真实的，而不是偶然的。

• The Pr(>F) column is the p-value of the F-statistic.
Pr (>F) 列是 F 统计量的 p 值。
This shows how likely it is that an F value at least as large as the one calculated from the test would have occurred if the null hypothesis of no difference among group means were true.
这表明，如果各组均值之间没有差异的零假设为真，出现至少与检验计算值一样大的 F 值的可能性有多大。
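
To make the columns concrete, the quantities described above can be recomputed by hand for `Sepal.Length` and `Species`. This is a sketch for illustration only, and all variable names in it are assumptions. In practice `aov` does this work.

```r
y <- iris$Sepal.Length
g <- iris$Species
n <- length(y)     # total number of observations
k <- nlevels(g)    # number of groups (levels of Species)

grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)
group_sizes <- tapply(y, g, length)

# Sum Sq: between-group variation and residual (within-group) variation.
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within  <- sum((y - group_means[as.character(g)])^2)

# Df: levels minus 1, and observations minus levels.
df_between <- k - 1
df_within  <- n - k

# Mean Sq: sum of squares divided by its degrees of freedom.
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F value: ratio of the two mean squares.
f_value <- ms_between / ms_within

# Pr(>F): upper-tail probability of the F distribution.
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(c(ss_between = ss_between, ss_within = ss_within,
        f_value = f_value, p_value = p_value))
```

The results should agree with the `aov` table up to rounding.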

Here, the p-value for `Species` is very low (p < 2e-16, far below 0.001), so the mean `Sepal.Length` appears to differ among the species.

# # Associations between Categorical Variables 分类变量之间的关联

## # Chi-squared Test 卡方检验

We can use the Chi-squared test to determine whether a population follows a specified theoretical distribution (a goodness-of-fit test).
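
As a small illustration of this first use, the sketch below tests whether a die is fair. The counts are hypothetical numbers made up for this example and are not taken from any data set in these notes.

```r
# Hypothetical counts of faces 1..6 in 60 rolls of a die (made-up data).
observed <- c(8, 9, 12, 11, 6, 14)

# Null hypothesis: every face has probability 1/6.
gof <- chisq.test(observed, p = rep(1/6, 6))
print(gof)
```

A large p-value here would mean the counts are compatible with a fair die.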

We can also use the Chi-squared test to check whether two categorical features/attributes are independent.

We need the contingency table one more time.

We want to investigate the independence of `rank` and `sex` in the Salaries data set.

Here the null hypothesis is that these two variables are independent (NOT associated);

the alternative hypothesis is that these two variables are associated.

        rank     discipline yrs.since.phd    yrs.service        sex
AsstProf : 67   A:181      Min.   : 1.00   Min.   : 0.00   Female: 39
AssocProf: 64   B:216      1st Qu.:12.00   1st Qu.: 7.00   Male  :358
Prof     :266              Median :21.00   Median :16.00
                           Mean   :22.31   Mean   :17.61
                           3rd Qu.:32.00   3rd Qu.:27.00
                           Max.   :56.00   Max.   :60.00
    salary
Min.   : 57800
1st Qu.: 91000
Median :107300
Mean   :113706
3rd Qu.:134185
Max.   :231545

            Female Male
AsstProf      11   56
AssocProf     10   54
Prof          18  248

    Pearson's Chi-squared test

data:  contTable
X-squared = 8.5259, df = 2, p-value = 0.01408
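
The test output above can be reproduced from the counts in the contingency table. In this sketch the table is typed in by hand so that it runs without extra packages. The commented line shows how it could be built from the data frame, assuming the `carData` package (which provides `Salaries`) is installed.

```r
# Contingency table of rank by sex, entered column by column.
contTable <- matrix(c(11, 10, 18,     # Female
                      56, 54, 248),   # Male
                    nrow = 3,
                    dimnames = list(rank = c("AsstProf", "AssocProf", "Prof"),
                                    sex  = c("Female", "Male")))

# Alternative, assuming carData is installed:
# contTable <- table(carData::Salaries$rank, carData::Salaries$sex)

# Pearson's Chi-squared test of independence.
test <- chisq.test(contTable)
print(test)

# Counts expected under independence: row total * column total / n.
print(test$expected)
```

Comparing `test$expected` with the observed counts shows where the table departs from independence.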


Since the p-value (0.01408) is less than the significance level of 0.05, we can reject the null hypothesis of independence and conclude that the two variables are associated.

### # A problem with Pearson’s $\chi^2$ 皮尔逊$\chi^2$ 的问题

A problem with Pearson's $\chi^2$ statistic is that its maximum possible value depends on the sample size and the size of the contingency table.

As a result, $\chi^2$ values from tables with different sample sizes or dimensions cannot be compared directly.

To overcome this problem, the coefficient can be standardized to lie between 0 and 1 so that it is independent of the sample size as well as the dimension of the contingency table.

## # Cramer's V (phi) Coefficient 克莱姆系数

Suppose we have an $r \times c$ contingency table. Cramer's V is defined as follows:

$V = \sqrt{\frac{\chi^2}{n \min(r-1,c-1)}},$

where $\chi^2$ is the Chi-squared statistic, $n$ is the sample size, $r$ is the number of rows, and $c$ is the number of columns.
$\chi^2$ 是卡方统计量，$n$ 是样本量，$r$ 是行数，$c$ 是列数。

For the previous example ($\chi^2 = 8.5259$, $n = 397$, $r = 3$, $c = 2$), we have

X-squared
0.1465466
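
The value above follows directly from the formula. A minimal sketch, with the contingency table typed in by hand from the counts shown earlier (the variable names are assumptions):

```r
# Contingency table of rank (rows) by sex (columns).
contTable <- matrix(c(11, 10, 18,     # Female
                      56, 54, 248),   # Male
                    nrow = 3)

chi2 <- chisq.test(contTable)$statistic   # named "X-squared"
n    <- sum(contTable)                    # sample size
r    <- nrow(contTable)                   # number of rows
c_   <- ncol(contTable)                   # number of columns

# Cramer's V = sqrt(chi^2 / (n * min(r - 1, c - 1)))
V <- sqrt(chi2 / (n * min(r - 1, c_ - 1)))
print(V)
```

The result keeps the name `X-squared` because it is computed from the named test statistic, which is why the output is labelled that way.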


We can also use the `cramerV` function in the rcompanion package to calculate Cramer's V.

Cramer V
0.1465
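
A sketch of that call, assuming the rcompanion package is installed. The guard simply skips the call when the package is missing, and the table is again typed in by hand.

```r
# Contingency table of rank (rows) by sex (columns).
contTable <- matrix(c(11, 10, 18,     # Female
                      56, 54, 248),   # Male
                    nrow = 3,
                    dimnames = list(rank = c("AsstProf", "AssocProf", "Prof"),
                                    sex  = c("Female", "Male")))

if (requireNamespace("rcompanion", quietly = TRUE)) {
  # cramerV accepts a contingency table and rounds to 4 digits by default.
  v <- rcompanion::cramerV(contTable)
  print(v)
} else {
  message("Package 'rcompanion' is not installed; skipping cramerV().")
}
```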


The range of Cramer's V value is from 0 to 1.
Cramer’s V 的取值范围是 0 到 1。

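The value we got here is small.

Although the p-value of the Chi-squared test (0.01408) is below 0.05, so the association between `rank` and `sex` is statistically significant, Cramer's V shows that this association is weak.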
