# Part I: Tests for a cybersecurity data set

Let's revisit the cybersecurity breach report data downloaded on 2015-02-26 from the U.S. Department of Health and Human Services.
From the Office for Civil Rights of the U.S. Department of Health and Human Services, I obtained the following information:

"As required by section 13402(e)(4) of the HITECH Act, the Secretary must post a list of breaches of unsecured protected health information affecting 500 or more individuals.

"Since October 2009 organizations in the U.S. that store data on human health are required to report any incident that compromises the confidentiality of 500 or more patients / human subjects (45 C.F.R. 164.408). These reports are publicly available. Our data set was downloaded from the Office for Civil Rights of the U.S. Department of Health and Human Services, 2015-02-26."

Load this data set and store it as cyberData, using the following code:

cyberData<-read.csv(url("https://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/HHSCyberSecurityBreaches.csv"))

As you know, this data set contains all reports regarding health information data breaches from 2009 to 2015. Let's pretend this is just a sample from the population of all data breaches, related or not to health information.

# Question 1.

Compare the number of individuals affected by data breaches (column Individuals.Affected) in two states, Arkansas (State=="AR") and California (State=="CA").
This can be done by performing a test of difference in means, for example.
Repeat the same test for another pair of states, California ("CA") and Illinois ("IL").

Please note, in order to answer this question completely, you will need to run several lines of code, extract subsets of the data appropriately, run a statistical hypothesis test, and interpret the results. Draw a conclusion. Partial answers to the question are insufficient.

AR <- cyberData[cyberData$State=="AR", ]
CA <- cyberData[cyberData$State=="CA", ]
IL <- cyberData[cyberData$State=="IL", ]

Since the population variances are unknown, we use a t-test to compare the means. Before comparing the means, we use an F test to check whether the two variances are equal:

$$H_0: \sigma_{AR}^2 = \sigma_{CA}^2, \\ H_1: \sigma_{AR}^2 \ne \sigma_{CA}^2$$

# F test
var.test(AR$Individuals.Affected, CA$Individuals.Affected)
	F test to compare two variances

data:  AR$Individuals.Affected and CA$Individuals.Affected
F = 0.00066857, num df = 6, denom df = 127, p-value = 2.814e-09
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.0002664288 0.0032769357
sample estimates:
ratio of variances 
      0.0006685688 

The F test tells us that we should reject $H_0$: the variances are different.
We should therefore use a t-test that does not assume equal variances (Welch's t-test, the default in t.test).
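For reference, the statistic used by Welch's test and its approximate degrees of freedom (the Welch-Satterthwaite formula) can be written as follows. The symbols $\bar{x}_i$, $s_i^2$, and $n_i$ (sample mean, sample variance, and sample size of group $i$) are notation introduced here for illustration.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
```

Because $\nu$ is an approximation based on the two sample variances, the degrees of freedom reported by t.test are generally not an integer.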

$$H_0: \mu_{AR}-\mu_{CA} = 0,\\ H_1: \mu_{AR}-\mu_{CA} \ne 0$$

t.test(AR$Individuals.Affected, CA$Individuals.Affected)
	Welch Two Sample t-test

data:  AR$Individuals.Affected and CA$Individuals.Affected
t = -2.2841, df = 129.71, p-value = 0.02399
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -30145.686  -2161.579
sample estimates:
mean of x mean of y 
  2769.00  18922.63 

As the p-value (0.02399) is smaller than 0.05, we reject $H_0$.
At the 5% level, the mean of Individuals.Affected differs between AR and CA; the sample mean is lower in AR (2769.00) than in CA (18922.63).

$$H_0: \sigma_{CA}^2 = \sigma_{IL}^2, \\ H_1: \sigma_{CA}^2 \ne \sigma_{IL}^2$$

# F test
var.test(CA$Individuals.Affected, IL$Individuals.Affected)
	F test to compare two variances

data:  CA$Individuals.Affected and IL$Individuals.Affected
F = 0.02224, num df = 127, denom df = 56, p-value < 2.2e-16
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.01392852 0.03413975
sample estimates:
ratio of variances 
        0.02223981 

The F test tells us that we should reject $H_0$: the variances are different.
We should again use a t-test that does not assume equal variances (Welch's t-test).

$$H_0: \mu_{CA}-\mu_{IL} = 0,\\ H_1: \mu_{CA}-\mu_{IL} \ne 0$$

t.test(CA$Individuals.Affected, IL$Individuals.Affected)
	Welch Two Sample t-test

data:  CA$Individuals.Affected and IL$Individuals.Affected
t = -0.87104, df = 57.112, p-value = 0.3874
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -203969.48   80308.12
sample estimates:
mean of x mean of y 
 18922.63  80753.32 

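As the p-value (0.3874) is larger than 0.05, we fail to reject $H_0$.
There is no significant evidence that the mean of Individuals.Affected differs between CA and IL. The sample means (18922.63 and 80753.32) are far apart, but the confidence interval for the difference is very wide and includes 0.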
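The two-step procedure used above (F test for the variances, then Welch's t-test for the means) can be wrapped in a small helper. The following is only a sketch: `compare_groups` is a hypothetical name, and the two vectors are simulated rather than taken from cyberData.

```r
# Hypothetical helper: run the F test and Welch's t-test on two numeric
# vectors and collect the main results in one list.
compare_groups <- function(x, y, alpha = 0.05) {
  f_res <- var.test(x, y)  # H0: the two variances are equal
  t_res <- t.test(x, y)    # Welch's test (var.equal = FALSE is the default)
  list(
    f_p_value    = f_res$p.value,
    t_p_value    = t_res$p.value,
    mean_diff_ci = as.numeric(t_res$conf.int),
    reject_means = t_res$p.value < alpha
  )
}

# Simulated right-skewed samples standing in for two states (not real data)
set.seed(1)
state_a <- rlnorm(30, meanlog = 7, sdlog = 1)
state_b <- rlnorm(120, meanlog = 7, sdlog = 2)

res <- compare_groups(state_a, state_b)
str(res)
```

The F test is sensitive to departures from normality, so with strongly skewed data it is best read as a rough guide; Welch's test does not depend on its outcome.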

# Question 2.

Explore the variable Type.of.Breach collected in this data set:

  • What proportion of data entries in cyberData have Type.of.Breach == "Hacking/IT Incident"?
# Count the rows whose breach type is exactly "Hacking/IT Incident"
hackIT <- sum(cyberData$Type.of.Breach == "Hacking/IT Incident")
total <- nrow(cyberData)
prop <- hackIT / total

The proportion of data entries in cyberData with Type.of.Breach == "Hacking/IT Incident" is prop = 77/1151 ≈ 0.067.
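Since a logical vector is treated as 0/1 in arithmetic, the same kind of proportion can also be obtained as the mean of the comparison. A minimal sketch on a made-up vector (`breach_type` is a hypothetical stand-in for cyberData$Type.of.Breach):

```r
# Made-up labels, for illustration only
breach_type <- c("Theft", "Hacking/IT Incident", "Loss",
                 "Hacking/IT Incident, Theft", "Theft")

# Exact matches only: the combined label in position 4 is not counted
prop_exact <- mean(breach_type == "Hacking/IT Incident")
prop_exact  # 1 of 5 entries, i.e. 0.2
```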

  • What are all the different values of Type.of.Breach reported in the data set? How many are hacking/IT incidents?
table(cyberData$Type.of.Breach)
                                       Hacking/IT Incident 
                                                        77 
                                Hacking/IT Incident, Other 
                                                         2 
Hacking/IT Incident, Other, Unauthorized Access/Disclosure 
                                                         1 
                                Hacking/IT Incident, Theft 
                                                         1 
Hacking/IT Incident, Theft, Unauthorized Access/Disclosure 
                                                         3 
       Hacking/IT Incident, Unauthorized Access/Disclosure 
                                                        10 
                                         Improper Disposal 
                                                        42 
                                   Improper Disposal, Loss 
                                                         3 
                            Improper Disposal, Loss, Theft 
                                                         3 
  Improper Disposal, Theft, Unauthorized Access/Disclosure 
                                                         1 
         Improper Disposal, Unauthorized Access/Disclosure 
                                                         2 
                                                      Loss 
                                                        79 
                                               Loss, Other 
                                                         2 
                                        Loss, Other, Theft 
                                                         1 
                                               Loss, Theft 
                                                        15 
                      Loss, Unauthorized Access/Disclosure 
                                                         5 
             Loss, Unauthorized Access/Disclosure, Unknown 
                                                         1 
                                             Loss, Unknown 
                                                         2 
                                                     Other 
                                                        89 
                                              Other, Theft 
                                                         5 
              Other, Theft, Unauthorized Access/Disclosure 
                                                         2 
                     Other, Unauthorized Access/Disclosure 
                                                         7 
                                            Other, Unknown 
                                                         2 
                                                     Theft 
                                                       577 
                     Theft, Unauthorized Access/Disclosure 
                                                        24 
            Theft, Unauthorized Access/Disclosure, Unknown 
                                                         1 
                            Unauthorized Access/Disclosure 
                                                       183 
                           Unauthorized Access/Disclosure  
                                                         1 
                                                   Unknown 
                                                        10 

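There are 29 distinct values of Type.of.Breach. Exactly 77 entries are labeled "Hacking/IT Incident" alone. Several combined labels also include it, such as "Hacking/IT Incident, Other" and "Hacking/IT Incident, Other, Unauthorized Access/Disclosure"; counting those as well gives 77 + 2 + 1 + 1 + 3 + 10 = 94 entries that involve a hacking/IT incident. Note that one of the 29 values, "Unauthorized Access/Disclosure " with a trailing space, duplicates an existing label.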

  • What type of breach is reported in the 748th row of cyberData? How about the 349th row? Was row 349 counted in the proportion of Hacking/IT incident breaches you computed above? Why or why not?
cyberData[748, "Type.of.Breach"]
cyberData[349, "Type.of.Breach"]
table(unlist(strsplit(cyberData$Type.of.Breach, ',')))
[1] "Loss, Theft"
[1] "Hacking/IT Incident, Unauthorized Access/Disclosure"

                           Loss                           Other                           Theft 
                              6                               6                              31 
 Unauthorized Access/Disclosure                         Unknown             Hacking/IT Incident 
                             57                               6                              94 
              Improper Disposal                            Loss                           Other 
                             51                             105                             105 
                          Theft  Unauthorized Access/Disclosure Unauthorized Access/Disclosure  
                            602                             183                               1 
                        Unknown 
                             10 

The type of breach reported in the 748th row of cyberData is "Loss, Theft".
The type of breach in the 349th row is "Hacking/IT Incident, Unauthorized Access/Disclosure".
Row 349 was not counted in the proportion of Hacking/IT incident breaches computed above, because the comparison with == requires an exact match and "Hacking/IT Incident, Unauthorized Access/Disclosure" is not identical to "Hacking/IT Incident".
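To count every report that involves a hacking/IT incident, including combined labels, one option is to match on the substring, or to split the labels and trim whitespace before tabulating. Splitting on "," alone leaves a leading space on every label after the first, which is why " Theft" and "Theft" appear as separate categories in the table above. A minimal sketch on made-up labels (`labels_demo` is a hypothetical vector):

```r
# Made-up labels, for illustration only
labels_demo <- c("Loss, Theft",
                 "Hacking/IT Incident, Unauthorized Access/Disclosure",
                 "Hacking/IT Incident",
                 "Theft")

# Substring match: TRUE for any label that contains the phrase
n_hacking <- sum(grepl("Hacking/IT Incident", labels_demo, fixed = TRUE))

# Split on commas, then strip surrounding whitespace so that
# " Theft" and "Theft" fall into the same category
parts  <- trimws(unlist(strsplit(labels_demo, ",", fixed = TRUE)))
counts <- table(parts)

n_hacking          # 2
counts[["Theft"]]  # 2
```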

  • Perform a hypothesis test on whether there is a difference in proportion of Hacking/IT incidents between the state of Illinois and the state of California. Write your conclusion interpreting the results of the statistical test.
table(IL$Type.of.Breach)
x_IL<-sum(IL$Type.of.Breach=="Hacking/IT Incident")
n_IL<-length(IL$Type.of.Breach)
table(CA$Type.of.Breach)
x_CA<-sum(CA$Type.of.Breach=="Hacking/IT Incident")
n_CA<-length(CA$Type.of.Breach)
x<-c(x_IL,x_CA)
n<-c(n_IL,n_CA)
prop.test(x,n)

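The p-value of prop.test is 0.05505, which is larger than 0.05, so we fail to reject the null hypothesis of equal proportions.
At the 5% level there is no significant evidence of a difference in the proportion of Hacking/IT incidents between Illinois and California, although the p-value is close to the threshold.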
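For two groups, prop.test without continuity correction is equivalent to the pooled two-sample z-test: its X-squared statistic equals the square of z. The sketch below checks this on hypothetical counts (`x_demo` and `n_demo` are made up and are not the IL and CA counts).

```r
# Hypothetical counts: successes and totals in two groups
x_demo <- c(12, 9)
n_demo <- c(60, 130)

# Pooled two-sample z statistic for H0: p1 = p2
p_hat  <- x_demo / n_demo
p_pool <- sum(x_demo) / sum(n_demo)
z <- (p_hat[1] - p_hat[2]) /
  sqrt(p_pool * (1 - p_pool) * sum(1 / n_demo))

# Same test via prop.test; correct = FALSE turns off the continuity correction
res_prop <- prop.test(x_demo, n_demo, correct = FALSE)

c(z_squared = z^2, chi_squared = unname(res_prop$statistic))
```

The call used in the answer keeps the default correct = TRUE, which applies Yates' continuity correction and is therefore somewhat more conservative.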


# Part II: Review of basic concepts in statistical learning

You will spend some time thinking of some real-life applications for statistical learning.

# Question 3.

Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

  1. Weather forecasting. The response is the type of weather, such as rainy, sunny, or cloudy. The predictors can be humidity, visibility, wind speed, pressure, etc. The goal is prediction: earlier meteorological measurements are used to predict future weather.
  2. Diagnosis of cancer in patients. The response is whether or not the patient has cancer. The predictors can be values from a routine blood test, known cancer indicators, etc. The goal is inference: the classification method is used to understand what separates cancer from non-cancer patients and to identify cancer markers.
  3. Wine quality (good or bad). The response is whether the wine is good or bad; the predictors are chroma, acidity, alcohol purity, etc. The goal is prediction: the fitted classifier is used to judge wines of unknown quality.
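As an illustration of a classification fit, the sketch below simulates a binary wine-quality label and fits a logistic regression with glm. The data and the variable names (`acidity`, `alcohol`, `good`) are made up for the example; the same fitted model serves prediction (through the fitted probabilities) and inference (through the coefficient table).

```r
set.seed(42)
n <- 200
wine <- data.frame(
  acidity = rnorm(n, mean = 3.3, sd = 0.2),
  alcohol = rnorm(n, mean = 10.5, sd = 1)
)
# Simulated truth: higher alcohol raises the chance of a "good" wine
wine$good <- rbinom(n, size = 1, prob = plogis(-10.5 + 1.0 * wine$alcohol))

fit_wine <- glm(good ~ acidity + alcohol, data = wine, family = binomial)

# Prediction: classify as good when the fitted probability exceeds 0.5
pred_good <- as.integer(predict(fit_wine, type = "response") > 0.5)
accuracy  <- mean(pred_good == wine$good)

# Inference: coefficient estimates, standard errors, and p-values
coef(summary(fit_wine))
accuracy
```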

# Question 4.

Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

  1. Calories burned by running. The response is the number of calories burned; the predictors are gender, age, weight, BMI, etc. The goal is prediction: estimating the calories burned by a person.
  2. Diamond prices. The response is the price of a diamond; the predictors are carat, cut, color, clarity, etc. The goal is to predict the price of a diamond.
  3. GDP. The response is GDP; the predictors are country, the number of women, the number of children, education level, medical conditions, etc. The goal is to predict the GDP of any country.

# Question 5.

Describe three real-life applications in which cluster analysis might be useful.

  1. Sequencing analysis, such as single-cell sequencing. Clustering reveals the heterogeneity among groups of single cells.
  2. Student achievement. Students can be clustered by their achievements, so that students with similar achievements are grouped together.
  3. Market analysis. The market can be divided into several clusters, and each cluster can then be analyzed with a different method.
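The student example can be sketched with k-means clustering. The scores below are simulated, and the choice of two clusters is an assumption made for the illustration.

```r
set.seed(7)
# Two simulated groups of students with scores in math and reading
scores <- rbind(
  cbind(math = rnorm(25, 60, 5), reading = rnorm(25, 65, 5)),
  cbind(math = rnorm(25, 85, 5), reading = rnorm(25, 88, 5))
)

# k-means with k = 2; nstart reruns the algorithm from several random starts
km <- kmeans(scores, centers = 2, nstart = 20)

table(km$cluster)  # cluster sizes
km$centers         # average scores per cluster
```

No response variable is involved: the clusters are formed from the predictors alone, which is what distinguishes this from classification.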

# Question 6.

What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

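A very flexible approach has low bias and can capture complex, nonlinear relationships, so it can be more accurate. Its disadvantages are higher variance (it is less robust and more prone to overfitting), more parameters to estimate, and results that are harder to interpret.
A less flexible approach is more robust (lower variance) and easier to interpret, but it can be less accurate when the true relationship is far from the assumed form (higher bias).
A more flexible approach is preferred when the sample is large and the relationship is clearly nonlinear, or when prediction is the main goal. A less flexible approach is preferred when the sample is small, the noise is high, or the goal is inference.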
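The trade-off can be seen on simulated data by comparing training and test error for a straight line and a degree-10 polynomial. Everything in this sketch is an assumption chosen for illustration: the true function, the noise level, and the sample sizes.

```r
set.seed(123)
f <- function(x) sin(2 * x)  # assumed true (nonlinear) relationship

train <- data.frame(x = runif(50, 0, 3))
train$y <- f(train$x) + rnorm(50, sd = 0.3)
test <- data.frame(x = runif(500, 0, 3))
test$y <- f(test$x) + rnorm(500, sd = 0.3)

# Mean squared error of a fitted model on a given data set
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

fit_linear   <- lm(y ~ x, data = train)            # less flexible
fit_flexible <- lm(y ~ poly(x, 10), data = train)  # more flexible

rbind(
  linear   = c(train = mse(fit_linear, train),   test = mse(fit_linear, test)),
  flexible = c(train = mse(fit_flexible, train), test = mse(fit_flexible, test))
)
```

The training error of the more flexible model can never exceed that of the linear model, because the linear model is nested in it. The test error carries no such guarantee, which is why flexibility has to be judged on held-out data.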

# Part III: Simple and Multiple Linear Regression

Load the Boston data set:

# import packages
library(MASS)
#load data
data(Boston)

# Question 7:

Construct a simple linear regression of medv on crim, dis, and age, respectively. Based on the output, answer the following questions:

  • Is there a relationship between the predictor and the response?

  • How strong is the relationship between the predictor and the response?

  • Is the relationship between the predictor and the response positive or negative?

  • Based on the RSE and $R^2$, which model will you choose for the simple linear regression? Explain it.

fit1 <- lm(medv ~ crim, data = Boston)
summary(fit1)
fit2 <- lm(medv ~ dis, data = Boston)
summary(fit2)
fit3 <- lm(medv ~ age, data = Boston)
summary(fit3)
summary(fit1)$r.squared
summary(fit2)$r.squared
summary(fit3)$r.squared

# Residual standard error: sqrt(RSS / residual degrees of freedom)
RSE1 <- sigma(fit1)
RSE2 <- sigma(fit2)
RSE3 <- sigma(fit3)
RSE1
RSE2
RSE3

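There is a statistically significant relationship between medv and each of crim, dis, and age: all three slope p-values are below 0.001 (three significance stars). A small p-value shows that a relationship exists, not that it is strong. Strength is better judged by $R^2$ and the RSE, and by that measure each predictor on its own explains only a modest share of the variation in medv.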

The relationship between medv and crim is negative.
The relationship between medv and dis is positive.
The relationship between medv and age is negative.

I will choose the model with crim as the predictor, because it has the highest $R^2$ and the smallest RSE of the three.
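The two criteria can be computed by hand to make their definitions explicit: RSE is the square root of RSS divided by the residual degrees of freedom, and $R^2$ is 1 - RSS/TSS. The sketch below checks both against R's built-in values; `fit_crim` is a hypothetical name used to avoid overwriting the models above.

```r
library(MASS)  # provides the Boston data set

fit_crim <- lm(medv ~ crim, data = Boston)

# RSE = sqrt(RSS / (n - p - 1)); here p = 1 predictor
rss <- sum(residuals(fit_crim)^2)
rse_manual <- sqrt(rss / df.residual(fit_crim))

# R^2 = 1 - RSS / TSS
tss <- sum((Boston$medv - mean(Boston$medv))^2)
r2_manual <- 1 - rss / tss

c(rse_manual = rse_manual, rse_sigma = sigma(fit_crim))
c(r2_manual = r2_manual, r2_summary = summary(fit_crim)$r.squared)
```

Note that the mean of the squared residuals is the training MSE, which divides RSS by n rather than by n - p - 1 and is on the squared scale of medv.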

# Question 8:

Please use all the other features/attributes to construct a linear regression model.

  • Interpret the coefficients of all the attributes. Which attributes are insignificant?

  • Remove the insignificant attributes and construct a new linear regression model

  • Any improvement on the RSE and $R^2$?

fit_all <- lm(medv ~ ., data = Boston)
summary(fit_all)
# Refit without the insignificant predictors indus and age
fit_improve <- lm(medv ~ . - indus - age, data = Boston)
summary(fit_improve)
summary(fit_all)$r.squared
summary(fit_improve)$r.squared

# Residual standard error of each model
RSE1 <- sigma(fit_all)
RSE2 <- sigma(fit_improve)
RSE1
RSE2

The predictors crim, zn, chas, nox, rm, dis, rad, tax, ptratio, black, and lstat are significant, with p-values smaller than 0.05. The predictors indus and age are insignificant, with p-values larger than 0.05.

A positive coefficient means that medv increases with that predictor when the other predictors are held fixed; a negative coefficient means that medv decreases.
For example, the coefficient of crim is -0.108: when crim increases by 1, medv is expected to decrease by 0.108 (in units of $1000, the unit of medv), holding the other predictors fixed. The coefficient of zn is 0.0464: when zn increases by 1, medv is expected to increase by 0.0464.
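The "one-unit increase, other predictors held fixed" reading can be checked numerically: two observations that differ only by one unit of crim have predictions that differ by exactly the crim coefficient. A short sketch, where `fit_demo`, `base`, and `shifted` are hypothetical names:

```r
library(MASS)  # provides the Boston data set

fit_demo <- lm(medv ~ ., data = Boston)

# Two hypothetical suburbs that differ only in crim (by one unit)
base    <- Boston[1, ]
shifted <- base
shifted$crim <- base$crim + 1

diff_pred <- predict(fit_demo, newdata = shifted) -
  predict(fit_demo, newdata = base)

c(prediction_change = unname(diff_pred),
  crim_coefficient  = unname(coef(fit_demo)["crim"]))
```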

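Compared with the simple regressions in Question 7, $R^2$ is much higher in the multiple regression models. Between fit_all and fit_improve there is essentially no difference in $R^2$ or RSE, so fit_improve, the simpler of the two, is the preferred model among all those built.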
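One way to back up the choice between the two nested models is a partial F-test with anova, together with a side-by-side table of fit statistics. This is a sketch of one possible check, not the only one; `fit_full` and `fit_reduced` are hypothetical names for the same two models.

```r
library(MASS)  # provides the Boston data set

fit_full    <- lm(medv ~ ., data = Boston)
fit_reduced <- lm(medv ~ . - indus - age, data = Boston)

# Partial F-test; H0: the coefficients of indus and age are both zero
cmp <- anova(fit_reduced, fit_full)
cmp

# Side-by-side fit statistics
data.frame(
  model  = c("full", "reduced"),
  rse    = c(sigma(fit_full), sigma(fit_reduced)),
  r2     = c(summary(fit_full)$r.squared, summary(fit_reduced)$r.squared),
  adj_r2 = c(summary(fit_full)$adj.r.squared, summary(fit_reduced)$adj.r.squared)
)
```

Plain $R^2$ can never increase when predictors are removed, so the adjusted $R^2$ and the RSE, which both account for the number of parameters, are the more informative columns here.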