# Objective

• Confidence Interval vs. Prediction Interval
• Qualitative Predictors
• Extensions of the Linear Model

# Goodness-of-Fit Test

## Overview

What is a Goodness-of-Fit Test?

It is a test to determine if a population has a specified theoretical distribution.

The test is based on how good a fit we have between the frequency of occurrence of observations in an observed sample and the expected frequencies obtained from the hypothesized distribution.

## Illustrating Example

We consider the tossing of a die.

We hypothesize that the die is honest, which is equivalent to testing the hypothesis that the distribution of outcomes is the discrete uniform distribution

$f(x) = \frac{1}{6}, \qquad x = 1,2, \ldots, 6.$

Suppose that the die is tossed 120 times and each outcome is recorded.

Theoretically, if the die is balanced, we would expect each face to occur 20 times.

The results are given in the following table (reconstructed from the computation below).

Face      1  2  3  4  5  6
Observed 20 22 17 18 19 24
Expected 20 20 20 20 20 20

A goodness-of-fit test between observed and expected frequencies is based on the quantity

$\chi^2 = \sum_{i=1}^k\frac{(o_i-e_i)^2}{e_i},$

where $\chi^2$ is a value of a random variable whose sampling distribution is approximated very closely by the chi-squared distribution with $v = k - 1$ degrees of freedom.

The symbols $o_i$ and $e_i$ represent the observed and expected frequencies, respectively, for the ith cell.

If the observed frequencies are close to the corresponding expected frequencies, the $\chi^2$-value will be small, indicating a good fit.

If the observed frequencies differ considerably from the expected frequencies, the $\chi^2$-value will be large and the fit is poor.

A good fit leads to the acceptance of $H_0$, whereas a poor fit leads to its rejection.

The critical region will, therefore, fall in the right tail of the chi-squared distribution.

For a level of significance equal to $\alpha$, we find the critical value $\chi^2_\alpha$, and then $\chi^2 > \chi^2_\alpha$ constitutes the critical region.

The decision criterion described here should not be used unless each of the expected frequencies is at least equal to 5.

$\chi^2 = \frac{(20-20)^2}{20} + \frac{(22-20)^2}{20} + \frac{(17-20)^2}{20} + \frac{(18-20)^2}{20} + \frac{(19-20)^2}{20} + \frac{(24-20)^2}{20} = 1.7.$

With $\alpha = 0.05$ and $v = 5$ degrees of freedom, the critical value is $\chi^2_{0.05} = 11.0705$ (in R: qchisq(0.95, 5)).

Since $\chi^2 = 1.7 < \chi^2_{0.05}$, we fail to reject $H_0$.

We conclude that there is insufficient evidence that the die is not balanced.

Can we use a built-in test in R to solve it? Yes.
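A call along these lines reproduces the output below (the counts are taken from the worked example above; a sketch):

```r
# Observed frequencies for faces 1-6, from the example above
obs <- c(20, 22, 17, 18, 19, 24)

# Goodness-of-fit test against the discrete uniform distribution f(x) = 1/6
chisq.test(obs, p = rep(1/6, 6))
```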

Chi-squared test for given probabilities

data:  obs
X-squared = 1.7, df = 5, p-value = 0.8889

The $\chi^2$ statistic is exactly the same as what we got from the formula.

As the p-value is much larger than $\alpha = 0.05$, we should not reject $H_0$.

Can we use prop.test?
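Yes. A sketch of the call that matches the output below, treating each face as successes out of 120 tosses:

```r
# Observed counts of each face, each out of 120 tosses;
# test whether the six proportions are equal
obs <- c(20, 22, 17, 18, 19, 24)
prop.test(obs, rep(120, 6))
```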

6-sample test for equality of proportions without continuity correction

data:  obs out of rep(120, 6)
X-squared = 2.04, df = 5, p-value = 0.8436
alternative hypothesis: two.sided
sample estimates:
prop 1    prop 2    prop 3    prop 4    prop 5    prop 6
0.1666667 0.1833333 0.1416667 0.1500000 0.1583333 0.2000000

Note: when the number of groups is greater than 2, the continuity correction is never used. Why?

The continuous chi-squared distribution seems to approximate the discrete sampling distribution of $\chi^2$ very well, provided that the number of degrees of freedom is greater than 1.

In a $2\times 2$ contingency table, where we have only 1 degree of freedom, a correction called Yates' correction for continuity is applied.

The corrected formula then becomes

$\chi^2(\text{corrected}) = \sum_{i=1}^{k}\frac{(|o_i-e_i|-0.5)^2}{e_i}.$

Contingency table
A table of observed frequencies.

The following is a $2 \times 3$ contingency table.

How do we use chisq.test?

IncomeLevel
reform    Low Medium High
For     182    213  203
Against 154    138  110
IncomeLevel
reform      Low Medium  High
For     0.182  0.213 0.203
Against 0.154  0.138 0.110
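The tables above can be constructed and tested along these lines (the object name taxreform matches the output below; a sketch):

```r
# 2 x 3 contingency table of observed frequencies, read off the table above
taxreform <- matrix(c(182, 213, 203,
                      154, 138, 110),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(reform = c("For", "Against"),
                                    IncomeLevel = c("Low", "Medium", "High")))

prop.table(taxreform)   # proportions of the grand total, as printed above
chisq.test(taxreform)   # Pearson's chi-squared test of independence
```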
Pearson's Chi-squared test

data:  taxreform
X-squared = 7.8782, df = 2, p-value = 0.01947

Why is the number of degrees of freedom 2? In general,

$v = (r-1)(c-1),$

where $r$ is the number of rows and $c$ is the number of columns.

We can't use prop.test here, since it requires a table with exactly two columns (counts of successes and failures).

## In-Class Exercise

There are 848 observations in the before2013 dataset, 49 of which have Type.of.Breach == "Hacking/IT Incident" .

For the after2013 dataset, there are 303 observations, 28 of which have Type.of.Breach == "Hacking/IT Incident" .

Can you construct a contingency table and test with chisq.test? What about prop.test?
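A sketch, building the table from the counts stated above (the object name breachTable matches the output below):

```r
# Contingency table from the stated counts: 49 of 848 before 2013 and
# 28 of 303 after 2013 are "Hacking/IT Incident"
breachTable <- matrix(c(49, 848 - 49,
                        28, 303 - 28),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(year = c("before2013", "after2013"),
                                      breachType = c("Hacking/IT Incident", "Others")))

chisq.test(breachTable)   # Yates' correction is applied since df = 1
prop.test(breachTable)    # same X-squared, plus a confidence interval
```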

##             breachType
## year         Hacking/IT Incident Others
##   before2013                  49    799
##   after2013                   28    275
##             breachType
## year         Hacking/IT Incident    Others
##   before2013          0.04257168 0.6941790
##   after2013           0.02432667 0.2389227
##
##  Pearson's Chi-squared test with Yates' continuity correction
##
## data:  breachTable
## X-squared = 3.751, df = 1, p-value = 0.05278
##
##  2-sample test for equality of proportions with continuity correction
##
## data:  breachTable
## X-squared = 3.751, df = 1, p-value = 0.05278
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.073059117  0.003806673
## sample estimates:
##     prop 1     prop 2
## 0.05778302 0.09240924

# Confidence vs. prediction intervals

## Revisit Auto Example

• Auto data set, regression of Y=mpg on X=horsepower .
• What is the predicted mpg associated with a horsepower of $98$? What are the associated 95% confidence and prediction intervals?
1
24.46708
fit      lwr      upr
1 24.46708 23.97308 24.96108
fit     lwr      upr
1 24.46708 14.8094 34.12476
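The three outputs above can be produced with predict(); a sketch assuming the Auto data set from the ISLR package:

```r
library(ISLR)   # provides the Auto data set (assumption)

fit <- lm(mpg ~ horsepower, data = Auto)
new <- data.frame(horsepower = 98)

predict(fit, new)                            # point prediction
predict(fit, new, interval = "confidence")   # 95% confidence interval
predict(fit, new, interval = "prediction")   # 95% prediction interval
```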

Three sorts of uncertainty are associated with the prediction of $Y$ based on $X_1,\dots,X_p$:

1. $\hat\beta_i\approx\beta_i$: the least squares plane is only an estimate of the true regression plane.
• reducible error
2. Assuming a linear model for $f(X)$ is usually an approximation of reality.
• model bias [potentially reducible error?]
• to operate here, we ignore this discrepancy
3. Even if we knew the true $\beta_i$, we would still have no perfect knowledge of $Y$, because of the random error $\epsilon$.
• irreducible error
• how much will $Y$ vary from $\hat Y$?
• we use prediction intervals, which are always wider than confidence intervals

Confidence interval
Quantifies the uncertainty surrounding the average sales over a large number of cities.

For example:

• given that \$100,000 is spent on TV advertising and
• \$20,000 is spent on radio advertising in each city,
• the 95% confidence interval is [10,985, 11,528] .
• We interpret this to mean that 95% of intervals of this form will contain the true value of $f(X)$.

Prediction interval
Can be used to quantify the uncertainty surrounding sales for a particular city.

• Given that \$100,000 is spent on TV advertising and
• \$20,000 is spent on radio advertising in that city,
• the 95% prediction interval is [7,930, 14,580] .
• We interpret this to mean that 95% of intervals of this form will contain the true value of $Y$ for this city.

Note that both intervals are centered at 11,256, but that the prediction interval is substantially wider than the confidence interval, reflecting the increased uncertainty about sales for a given city in comparison to the average sales over many locations.

# Qualitative predictors

Some predictors are not quantitative but are qualitative, taking a discrete set of values.

These are also called categorical predictors or factor variables.

See for example the scatterplot matrix of the credit card data in the following figure.

In addition to the 7 quantitative variables shown, there are four qualitative variables: gender, student (student status), status (marital status), and ethnicity (Caucasian, African American (AA) or Asian).

Example: investigate differences in credit card balance between males and females, ignoring the other variables. We create a new dummy variable

$x_i = \begin{cases} 1 & \mbox{if $i$th person is female} \\ 0 & \mbox{if $i$th person is not female} \end{cases}$

Resulting model:

$y_i=\beta_0+\beta_1x_i +\epsilon_i = \begin{cases} \beta_0+\beta_1+\epsilon_i & \mbox{if $i$th person is female} \\ \beta_0+\epsilon_i & \mbox{if $i$th person is not female} \end{cases}$

Interpretation?

• $\beta_0$ = average $Y$ among non-females
• $\beta_0+\beta_1$ = average $Y$ among females
• $\beta_1$ = average difference in $Y$ between the two groups.

Results for gender model:
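The results table is not reproduced here; a sketch that would generate it, assuming the Credit data set from the ISLR package (with Balance as the response and Gender as the factor):

```r
library(ISLR)   # provides the Credit data set (assumption)

# R encodes the two-level factor Gender as a single dummy variable automatically
summary(lm(Balance ~ Gender, data = Credit))
```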

With more than two levels, we create additional dummy variables.

For example, for the ethnicity variable we create two dummy variables.

The first could be

$x_{i1} = \begin{cases} 1 & \mbox{if $i$th person is Asian} \\ 0 & \mbox{if $i$th person is not Asian} \end{cases}$

and the second could be

$x_{i2} = \begin{cases} 1 & \mbox{if $i$th person is Caucasian} \\ 0 & \mbox{if $i$th person is not Caucasian} \end{cases}$

Then both of these variables can be used in the regression equation, in order to obtain the model

$y_{i}=\beta_{0}+\beta_{1} x_{i 1}+\beta_{2} x_{i 2}+\epsilon_{i}=\left\{\begin{array}{ll} \beta_{0}+\beta_{1}+\epsilon_{i} & \text { if } i \text { th person is Asian } \\ \beta_{0}+\beta_{2}+\epsilon_{i} & \text { if } i \text { th person is Caucasian } \\ \beta_{0}+\epsilon_{i} & \text { if } i \text { th person is AA } \end{array}\right.$

There will always be one fewer dummy variable than the number of levels.

The level with no dummy variable (African American in this example) is known as the baseline.

## Qualitative predictors in R

How do we create dummy variables in R?

We use the data set Salaries from the package carData .
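A minimal sketch that loads the data and produces the summary shown below:

```r
library(carData)   # provides the Salaries data set

# Factors (rank, discipline, sex) are tabulated by level; numeric columns
# get the usual quartile summary plus the mean
summary(Salaries)
```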

rank     discipline yrs.since.phd    yrs.service        sex
AsstProf : 67   A:181      Min.   : 1.00   Min.   : 0.00   Female: 39
AssocProf: 64   B:216      1st Qu.:12.00   1st Qu.: 7.00   Male  :358
Prof     :266              Median :21.00   Median :16.00
Mean   :22.31   Mean   :17.61
3rd Qu.:32.00   3rd Qu.:27.00
Max.   :56.00   Max.   :60.00
salary
Min.   : 57800
1st Qu.: 91000
Median :107300
Mean   :113706
3rd Qu.:134185
Max.   :231545

## Qualitative predictors with two levels

If we use sex as an argument in lm , R will correctly treat sex as a single dummy variable with two categories.

Call:
lm(formula = salary ~ sex, data = Salaries)

Residuals:
Min     1Q Median     3Q    Max
-57290 -23502  -6828  19710 116455

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   101002       4809  21.001  < 2e-16 ***
sexMale        14088       5065   2.782  0.00567 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 30030 on 395 degrees of freedom
Multiple R-squared:  0.01921,   Adjusted R-squared:  0.01673
F-statistic: 7.738 on 1 and 395 DF,  p-value: 0.005667

## Qualitative predictors with more than two levels

What about categorical data with more than two levels?

As rank is a factor with three levels, R automatically creates two dummy variables.

Call:
lm(formula = salary ~ rank, data = Salaries)

Residuals:
Min     1Q Median     3Q    Max
-68972 -16376  -1580  11755 104773

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)      80776       2887  27.976  < 2e-16 ***
rankAssocProf    13100       4131   3.171  0.00164 **
rankProf         45996       3230  14.238  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23630 on 394 degrees of freedom
Multiple R-squared:  0.3943,    Adjusted R-squared:  0.3912
F-statistic: 128.2 on 2 and 394 DF,  p-value: < 2.2e-16

Question: What is the baseline here?

## In-Class Exercise

• Please construct a simple linear regression model to predict salary by discipline .

##
## Call:
## lm(formula = salary ~ discipline, data = Salaries)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -50748 -24611  -4429  19138 113516
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   108548       2227  48.751  < 2e-16 ***
## disciplineB     9480       3019   3.141  0.00181 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29960 on 395 degrees of freedom
## Multiple R-squared:  0.02436,    Adjusted R-squared:  0.02189
## F-statistic: 9.863 on 1 and 395 DF,  p-value: 0.001813

• Please construct a multiple linear regression model to predict salary with yrs.service , rank , and sex .
Hint: You can use + to connect the attributes/features.

##
## Call:
## lm(formula = salary ~ yrs.service + rank + sex, data = Salaries)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -64500 -15111  -1459  11966 107011
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)    76612.8     4426.0  17.310  < 2e-16 ***
## yrs.service     -171.8      115.3  -1.490  0.13694
## rankAssocProf  14702.9     4266.6   3.446  0.00063 ***
## rankProf       48980.2     3991.8  12.270  < 2e-16 ***
## sexMale         5468.7     4035.3   1.355  0.17613
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23580 on 392 degrees of freedom
## Multiple R-squared:    0.4,  Adjusted R-squared:  0.3938
## F-statistic: 65.32 on 4 and 392 DF,  p-value: < 2.2e-16

# Extensions of the linear model

What is wrong with the linear model? It works quite well!

Yes -- but sometimes the (restrictive) assumptions are violated in practice.

Assumption 1: additivity
The relationship between the predictors and response is additive.

The effect of changes in a predictor $X_j$ on the response $Y$ is independent of the values of the other predictors.

Assumption 2: linearity
The relationship between the predictors and response is linear.

The change in the response $Y$ due to a one-unit change in $X_j$ is constant, regardless of the value of $X_j$.

## Removing the additive assumption

Previous analysis of Advertising data: both TV and radio seem associated with sales.

• The linear models that formed the basis for this conclusion:

$sales=\beta_0+ \beta_1 TV + \beta_2 radio + \beta_3 newspaper +\epsilon$

The above model states that the average effect on sales of a one-unit increase in TV is always $\beta_1,$ regardless of the amount spent on radio.

But suppose that spending money on radio advertising actually increases the effectiveness of TV advertising, so that the slope term for TV should increase as radio increases.

In this situation, given a fixed budget of \$100,000, spending half on radio and half on TV may increase sales more than allocating the entire amount to either TV or to radio.

In marketing, this is known as a synergy effect, and in statistics it is referred to as an interaction effect.

• We will now explain how to augment this model by allowing an interaction between radio and TV in predicting sales :

$\begin{aligned} sales &=\beta_0+\beta_1\times TV+\beta_2\times radio +\beta_3 \times (radio\times TV)+\epsilon\\ &= \beta_0+ \textcolor{blue}{(\beta_1+\beta_3\times radio)\times TV} + \beta_2\times radio + \epsilon. \end{aligned}$

### Interpretation

• $\beta_3$ = the increase in the effectiveness of TV advertising associated with a one-unit increase in radio advertising (or vice versa).

Interpretation

• The results in this table suggest that interactions are important.

• The p-value for the interaction term TV $\times$ radio is extremely low, indicating that there is strong evidence for $H_A : \beta_3\ne 0$.

• The $R^2$ for the interaction model is 96.8%, compared to only 89.7% for the model that predicts sales using TV and radio without an interaction term.

## Hierarchy Principle

Sometimes it is the case that an interaction term has a very small p-value, but the associated main effects (in this case, TV and radio) do not.

If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.

The rationale for this principle is that interactions are hard to interpret in a model without main effects --- their meaning is changed.

Specifically, if the model has no main effect terms, the interaction terms also absorb the main effects, which changes their interpretation.

We can also have interactions between qualitative and quantitative variables.

Consider the Credit data set, and suppose that we wish to predict balance using income (quantitative) and student (qualitative).

Without an interaction term, the model takes the form

$\begin{aligned} \text { balance }_{i} & \approx \beta_{0}+\beta_{1} \times \text { income }_{i}+\left\{\begin{array}{ll} \beta_{2} & \text { if } i \text { th person is a student } \\ 0 & \text { if } i \text { th person is not a student } \end{array}\right.\\ &=\beta_{1} \times \text { income }_{i}+\left\{\begin{array}{ll} \beta_{0}+\beta_{2} & \text { if } i \text { th person is a student } \\ \beta_{0} & \text { if } i \text { th person is not a student. } \end{array}\right. \end{aligned}$

With interactions, it takes the form

$\begin{aligned} \text { balance }_{i} & \approx \beta_{0}+\beta_{1} \times \text { income }_{i}+\left\{\begin{array}{ll} \beta_{2}+\beta_{3} \times \text { income }_{i} & \text { if student } \\ 0 & \text { if not student } \end{array}\right.\\ &=\left\{\begin{array}{ll} \left(\beta_{0}+\beta_{2}\right)+\left(\beta_{1}+\beta_{3}\right) \times \text { income }_{i} & \text { if student } \\ \beta_{0}+\beta_{1} \times \text { income }_{i} & \text { if not student } \end{array}\right. \end{aligned}$

For the Credit data, the least squares lines are shown for prediction of balance from income for students and non-students.

Left: the model fit without an interaction between income and student.

Right: the model fit with an interaction term between income and student.
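The two fits could be produced along these lines, assuming the Credit data set from the ISLR package:

```r
library(ISLR)   # provides the Credit data set (assumption)

fit_add <- lm(Balance ~ Income + Student, data = Credit)  # no interaction (left panel)
fit_int <- lm(Balance ~ Income * Student, data = Credit)  # with interaction (right panel)

coef(fit_add)
coef(fit_int)
```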

## Interaction in R

Let's revisit the Auto data set. Here we want to build the following model.

$mpg = \beta_0 + \beta_1 \times horsepower + \beta_2 \times horsepower^2 + \epsilon$

How to do it in R?
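One way that matches the Call shown below is to add the squared term as a new column (the column name horsepowerSquare appears in the output; using I(horsepower^2) directly in the formula is an equivalent alternative):

```r
library(ISLR)   # provides the Auto data set (assumption)

Auto$horsepowerSquare <- Auto$horsepower^2   # quadratic term as a new column
summary(lm(mpg ~ horsepower + horsepowerSquare, data = Auto))
```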

Call:
lm(formula = mpg ~ horsepower + horsepowerSquare, data = Auto)

Residuals:
Min       1Q   Median       3Q      Max
-14.7135  -2.5943  -0.0859   2.2868  15.8961

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)      56.9000997  1.8004268   31.60   <2e-16 ***
horsepower       -0.4661896  0.0311246  -14.98   <2e-16 ***
horsepowerSquare  0.0012305  0.0001221   10.08   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.374 on 389 degrees of freedom
Multiple R-squared:  0.6876,    Adjusted R-squared:  0.686
F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

Actually, if it is an interaction between two different features, we just need to use * .

For example, we want to check the interaction between horsepower and weight .

Call:
lm(formula = mpg ~ horsepower * weight, data = Auto)

Residuals:
Min       1Q   Median       3Q      Max
-10.7725  -2.2074  -0.2708   1.9973  14.7314

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)        6.356e+01  2.343e+00  27.127  < 2e-16 ***
horsepower        -2.508e-01  2.728e-02  -9.195  < 2e-16 ***
weight            -1.077e-02  7.738e-04 -13.921  < 2e-16 ***
horsepower:weight  5.355e-05  6.649e-06   8.054 9.93e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.93 on 388 degrees of freedom
Multiple R-squared:  0.7484,    Adjusted R-squared:  0.7465
F-statistic: 384.8 on 3 and 388 DF,  p-value: < 2.2e-16

Question: Why don't I need + horsepower + weight ? (Hint: in R formulas, horsepower * weight expands to horsepower + weight + horsepower:weight .)

We can construct an interaction regression model with one numeric attribute and one categorical attribute in the same way.

Note: the interaction model leads to better performance only when the two features actually interact.

### In-Class Exercise

• Please construct an interaction regression model to predict salary with yrs.service and yrs.since.phd for the Salaries data set.

##
## Call:
## lm(formula = salary ~ yrs.service * yrs.since.phd, data = Salaries)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -63823 -17292  -2538  13158 107001
##
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)               70155.263   3472.077  20.206  < 2e-16 ***
## yrs.service                1692.446    356.279   4.750 2.85e-06 ***
## yrs.since.phd              2194.289    246.862   8.889  < 2e-16 ***
## yrs.service:yrs.since.phd   -64.617      7.487  -8.630  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25120 on 393 degrees of freedom
## Multiple R-squared:  0.3177, Adjusted R-squared:  0.3125
## F-statistic: 60.99 on 3 and 393 DF,  p-value: < 2.2e-16

• Please construct an interaction regression model to predict salary with yrs.service and rank for the Salaries data set.
Note: rank is a categorical feature.

##
## Call:
## lm(formula = salary ~ yrs.service * rank, data = Salaries)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -65609 -15841  -1047  11177 106585
##
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                78926.6     5445.1  14.495  < 2e-16 ***
## yrs.service                  779.3     1945.2   0.401  0.68891
## rankAssocProf              19667.6     7127.5   2.759  0.00606 **
## rankProf                   50567.8     6318.2   8.004 1.39e-14 ***
## yrs.service:rankAssocProf  -1174.0     1967.4  -0.597  0.55104
## yrs.service:rankProf        -898.6     1949.2  -0.461  0.64505
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23640 on 391 degrees of freedom
## Multiple R-squared:  0.3986, Adjusted R-squared:  0.391
## F-statistic: 51.84 on 5 and 391 DF,  p-value: < 2.2e-16

##
## Call:
## lm(formula = salary ~ yrs.service * yrs.since.phd + rank, data = Salaries)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -53996 -16032  -2237  12039 107358
##
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)               76090.368   3408.690  22.322  < 2e-16 ***
## yrs.service                 479.244    394.043   1.216   0.2246
## yrs.since.phd               761.967    314.188   2.425   0.0158 *
## rankAssocProf              7031.646   5056.227   1.391   0.1651
## rankProf                  36534.191   6057.562   6.031 3.77e-09 ***
## yrs.service:yrs.since.phd   -24.656      9.639  -2.558   0.0109 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23440 on 391 degrees of freedom
## Multiple R-squared:  0.4088, Adjusted R-squared:  0.4012
## F-statistic: 54.07 on 5 and 391 DF,  p-value: < 2.2e-16

# Potential problems

## Common issues and problems

1. Non-linearity of the response-predictor relationships.
2. Correlation of error terms.
3. Non-constant variance of error terms.
4. Outliers.
5. High-leverage points.
6. Collinearity.

# References

1. Ronald E. Walpole, Raymond H. Myers, Sharon L. Myers, and Keying Ye, Probability & Statistics for Engineers & Scientists, 9th Edition, Prentice Hall.

2. Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning: with Applications in R, Chapter 3.

3. Part of these lecture notes is extracted from Prof. Sonja Petrovic's ITMD/ITMS 514 lecture notes.