# Outline

  • Continuous and Continuous Variables
    • Pearson's correlation coefficient
  • Categorical and Continuous Variables
    • ANOVA Test
  • Categorical and Categorical Variables
    • Chi-squared Test

# Correlation between Numerical Variables

To investigate the correlation, we can use the pairs function, which draws a scatter plot for every pair of variables.

pairs(iris[,1:4], col = "blue")

# To make it a little fancier (splom comes from the lattice package)
library(lattice)
super.sym <- trellis.par.get("superpose.symbol")
splom(~iris[1:4], groups = Species, data = iris,
      panel = panel.superpose,
      key = list(title = "Three Varieties of Iris",
                 columns = 3, 
                 points = list(pch = super.sym$pch[1:3],
                 col = super.sym$col[1:3]),
                 text = list(c("Setosa", "Versicolor", "Virginica"))))

How to interpret the scatter plot?

To get the Pearson correlation coefficients r, use the cor function:

cor(iris[, 1:4])

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
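Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

As a rule of thumb, the strength of a correlation is judged by the absolute value of r, since r ranges from -1 to 1:

  • |r| < 0.3, weak correlation

  • 0.3 ≤ |r| ≤ 0.7, moderate correlation

  • |r| > 0.7, high correlation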
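The thresholds above can also be applied programmatically. The following is a minimal sketch, assuming the cut-offs are applied to the absolute value of r; the names `r_mat`, `strength`, and `strength_mat` are introduced here for illustration only.

```r
# Correlation matrix of the four numerical iris columns
r_mat <- cor(iris[, 1:4])

# Classify each coefficient by |r| using the rule-of-thumb cut-offs
# (0.3 itself falls in "weak" and 0.7 in "moderate" with these breaks)
strength <- cut(abs(r_mat),
                breaks = c(0, 0.3, 0.7, Inf),
                labels = c("weak", "moderate", "high"),
                include.lowest = TRUE)

# cut() drops the matrix shape, so restore the dimensions and names
strength_mat <- matrix(as.character(strength), nrow = nrow(r_mat),
                       dimnames = dimnames(r_mat))
print(strength_mat)
```

With the coefficients shown above, Petal.Length and Petal.Width fall in the "high" class, while Sepal.Length and Sepal.Width fall in the "weak" class.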

# Association between One Numerical Variable and One Categorical Variable

We can start with side-by-side boxplots, one box per category.

Let's play with the iris data set.

library(ggplot2)
ggplot(iris, aes(x = Species, y = Sepal.Length)) + geom_boxplot()

We can use the ANOVA test, via the aov function, to check the association between one numerical variable and one categorical variable.

ANOVA ( AOV ) is short for ANalysis Of VAriance.

aov1 <- aov(Sepal.Length ~ Species, data = iris)
summary(aov1)
             Df Sum Sq Mean Sq F value Pr(>F)    
Species       2  63.21  31.606   119.3 <2e-16 ***
Residuals   147  38.96   0.265                   
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Interpretation

  • The Df column displays the degrees of freedom for the independent variable (the number of levels in the variable minus 1) and for the residuals (the total number of observations minus the number of levels in the independent variable).
    Here that is 3 - 1 = 2 for Species and 150 - 3 = 147 for the residuals.

  • The Sum Sq column displays the sum of squares.
    For Species, this is the variation between the group means and the overall mean; for the residuals, it is the variation of the observations around their own group means.

  • The Mean Sq column is the mean of the sum of squares, calculated by dividing the sum of squares by the degrees of freedom for each row.

  • The F value column is the test statistic from the F test.
    This is the mean square of the independent variable divided by the mean square of the residuals.
    The larger the F value, the more likely it is that the variation caused by the independent variable is real and not due to chance.

  • The Pr(>F) column is the p-value of the F statistic.
    It shows how likely it is that an F value at least as large as the one calculated would have occurred if the null hypothesis of no difference among group means were true.

Here, the p-value of the Species variable is very low (p < 0.001), so we reject the null hypothesis: the mean Sepal.Length differs among at least some of the species.
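To make the columns of the ANOVA table concrete, the quantities can be recomputed by hand. This is a sketch for illustration, not how aov works internally; all variable names below are introduced here.

```r
# Recompute the one-way ANOVA table for Sepal.Length ~ Species by hand
y <- iris$Sepal.Length
g <- iris$Species

n <- length(y)     # total number of observations (150)
k <- nlevels(g)    # number of groups (3)

grand_mean  <- mean(y)
group_means <- tapply(y, g, mean)
group_sizes <- tapply(y, g, length)

# Between-group sum of squares: group means vs. the overall mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
# Within-group (residual) sum of squares: observations vs. their group mean
ss_within  <- sum((y - group_means[as.character(g)])^2)

# Degrees of freedom, as described in the Df bullet
df_between <- k - 1
df_within  <- n - k

# F = mean square of the factor / mean square of the residuals
f_value <- (ss_between / df_between) / (ss_within / df_within)
# p-value: upper tail of the F distribution
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

c(F = f_value, p = p_value)
```

The sums of squares, degrees of freedom, and F value obtained this way should agree with the Species and Residuals rows of the table printed above.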

# Associations between Categorical Variables

# Chi-squared Test

We can use the Chi-squared test to determine whether a population follows a specified theoretical distribution (a goodness-of-fit test).
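As a brief illustration of this first use, the sketch below tests whether a die is fair. The counts are made up for the example and do not come from any data set used in this document.

```r
# Hypothetical counts of each face in 60 rolls of a die (made-up data)
observed <- c(8, 12, 9, 11, 10, 10)

# Theoretical distribution under the null hypothesis: all faces equally likely
fair <- rep(1 / 6, 6)

# Goodness-of-fit test: the p argument supplies the theoretical probabilities
gof <- chisq.test(observed, p = fair)
gof
```

A large p-value here would mean the observed counts are consistent with the theoretical distribution.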

We can also use the Chi-squared test to check the independence of two categorical features/attributes.

We need the contingency table one more time.

We want to investigate the independence of rank and sex in the Salaries data set.

Here, the null hypothesis is that the two variables are independent (NOT associated);
the alternative hypothesis is that the two variables are associated.

# Load the data
data("Salaries", package = "carData")
summary(Salaries)
        rank     discipline yrs.since.phd    yrs.service        sex     
 AsstProf : 67   A:181      Min.   : 1.00   Min.   : 0.00   Female: 39  
 AssocProf: 64   B:216      1st Qu.:12.00   1st Qu.: 7.00   Male  :358  
 Prof     :266              Median :21.00   Median :16.00               
                            Mean   :22.31   Mean   :17.61               
                            3rd Qu.:32.00   3rd Qu.:27.00               
                            Max.   :56.00   Max.   :60.00               
     salary      
 Min.   : 57800  
 1st Qu.: 91000  
 Median :107300  
 Mean   :113706  
 3rd Qu.:134185  
 Max.   :231545  
# Create contingency table
contTable <- table(Salaries$rank, Salaries$sex)
contTable
            Female Male
  AsstProf      11   56
  AssocProf     10   54
  Prof          18  248
# Conduct Chi-squared Test
chisqtestResult <- chisq.test(contTable)
chisqtestResult
    Pearson's Chi-squared test

data:  contTable
X-squared = 8.5259, df = 2, p-value = 0.01408

Since the p-value (0.01408) is less than the significance level of 0.05, we can reject the null hypothesis of independence and conclude that rank and sex are associated.
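The test statistic can be reproduced by hand from the contingency table. The sketch below re-enters the counts shown above as a plain matrix, so it runs without the carData package; the names `counts`, `expected`, and `x_squared` are introduced here for illustration.

```r
# Contingency table of rank by sex, entered from the counts printed above
counts <- matrix(c(11, 10, 18, 56, 54, 248), nrow = 3,
                 dimnames = list(rank = c("AsstProf", "AssocProf", "Prof"),
                                 sex  = c("Female", "Male")))

# Expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson's statistic: sum over cells of (observed - expected)^2 / expected
x_squared <- sum((counts - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(counts) - 1) * (ncol(counts) - 1)

# p-value: upper tail of the Chi-squared distribution
p_value <- pchisq(x_squared, df, lower.tail = FALSE)

round(c(X_squared = x_squared, df = df, p_value = p_value), 4)
```

The result should match the X-squared, df, and p-value reported by chisq.test above.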

# A Problem with Pearson's $\chi^2$ Coefficient

A problem with Pearson's $\chi^2$ coefficient is that its maximum value depends on the sample size and the size of the contingency table.

Because both vary from one situation to another, $\chi^2$ values from different tables are not directly comparable.

To overcome this problem, the coefficient can be standardized to lie between 0 and 1 so that it is independent of the sample size as well as the dimension of the contingency table.

# Cramer's V (phi) Coefficient

Suppose we have an $r \times c$ contingency table. Cramer's V is defined as follows:

$$V = \sqrt{\frac{\chi^2}{n \min(r-1,\, c-1)}},$$

where $\chi^2$ is the Chi-squared statistic, $n$ is the sample size, $r$ is the number of rows, and $c$ is the number of columns.

From the previous example we have

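n <- nrow(Salaries)                      # sample size
chistats <- chisqtestResult$statistic    # Chi-squared statistic
r <- nrow(contTable)                     # number of rows (3)
c <- ncol(contTable)                     # number of columns (2)
cramerv <- sqrt(chistats / (n * min(r - 1, c - 1)))
cramerv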

We can also use the function cramerV in package rcompanion to calculate Cramer's V value.

# load the rcompanion library
library(rcompanion)
# calculate Cramer's V
cramerV(contTable)

The range of Cramer's V value is from 0 to 1.

The value we got here, about 0.15, is small.

Although the Chi-squared test found a statistically significant association between rank and sex, Cramer's V indicates that the association is weak.
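To see the 0-to-1 range concretely, the sketch below applies the formula to two made-up 2×2 tables, one with perfect association and one with none, and then to the rank-by-sex counts re-entered from the table above. `cramers_v` is a hypothetical helper written for this illustration; it is not the rcompanion function.

```r
# Hypothetical helper implementing V = sqrt(chi^2 / (n * min(r - 1, c - 1)))
cramers_v <- function(tab) {
  # correct = FALSE: use the plain Pearson statistic, even for 2x2 tables
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  unname(sqrt(chi2 / (n * min(nrow(tab) - 1, ncol(tab) - 1))))
}

# Perfect association: each row category maps to exactly one column category
perfect <- matrix(c(30, 0, 0, 30), nrow = 2)
# No association: every row has the same column proportions
none <- matrix(c(20, 20, 10, 10), nrow = 2)
# Rank-by-sex counts, re-entered from the contingency table above
salaries <- matrix(c(11, 10, 18, 56, 54, 248), nrow = 3)

cramers_v(perfect)   # 1
cramers_v(none)      # 0
cramers_v(salaries)
```

The rank-by-sex table lands much closer to the "no association" end of the scale than to the "perfect association" end.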
