Logistic Regression

library(tidyverse)
library(dplyr)
library(ggplot2)

Data

A retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa. There are roughly two controls per case of CHD. Many of the CHD positive men have undergone blood pressure reduction treatment and other programs to reduce their risk factors after their CHD event. In some cases the measurements were made after these treatments. These data are taken from a larger dataset, described in Rousseauw et al, 1983, South African Medical Journal.

sbp systolic blood pressure
tobacco cumulative tobacco (kg) *ldl low densiity lipoprotein cholesterol adiposity
famhist family history of heart disease (Present, Absent)
typea type-A behavior
obesity
alcohol current alcohol consumption
age age at onset
chd response, coronary heart disease

load("data/elem_stat_learn_data/SAheart.RData")
SAheart%>%str()

'data.frame':   462 obs. of  10 variables:
 $ sbp      : int  160 144 118 170 134 132 142 114 114 132 ...
 $ tobacco  : num  12 0.01 0.08 7.5 13.6 6.2 4.05 4.08 0 0 ...
 $ ldl      : num  5.73 4.41 3.48 6.41 3.5 6.47 3.38 4.59 3.83 5.8 ...
 $ adiposity: num  23.1 28.6 32.3 38 27.8 ...
 $ famhist  : Factor w/ 2 levels "Absent","Present": 2 1 2 2 2 2 1 2 2 2 ...
 $ typea    : int  49 55 52 51 60 62 59 62 49 69 ...
 $ obesity  : num  25.3 28.9 29.1 32 26 ...
 $ alcohol  : num  97.2 2.06 3.81 24.26 57.34 ...
 $ age      : int  52 63 46 58 49 45 38 58 29 53 ...
 $ chd      : int  1 1 0 1 1 0 0 1 0 1 ...

The response variable is chd

SAheart<-SAheart%>%mutate(famhist=as.factor(famhist))
SAheart %>%str()

'data.frame':   462 obs. of  10 variables:
 $ sbp      : int  160 144 118 170 134 132 142 114 114 132 ...
 $ tobacco  : num  12 0.01 0.08 7.5 13.6 6.2 4.05 4.08 0 0 ...
 $ ldl      : num  5.73 4.41 3.48 6.41 3.5 6.47 3.38 4.59 3.83 5.8 ...
 $ adiposity: num  23.1 28.6 32.3 38 27.8 ...
 $ famhist  : Factor w/ 2 levels "Absent","Present": 2 1 2 2 2 2 1 2 2 2 ...
 $ typea    : int  49 55 52 51 60 62 59 62 49 69 ...
 $ obesity  : num  25.3 28.9 29.1 32 26 ...
 $ alcohol  : num  97.2 2.06 3.81 24.26 57.34 ...
 $ age      : int  52 63 46 58 49 45 38 58 29 53 ...
 $ chd      : int  1 1 0 1 1 0 0 1 0 1 ...

To see the sample proportion of coronary heart disease

fit_logis<-glm(chd~.,data=SAheart,family = binomial(link="logit"))
summary(fit_logis)


Call:
glm(formula = chd ~ ., family = binomial(link = "logit"), data = SAheart)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -6.1507209  1.3082600  -4.701 2.58e-06 ***
sbp             0.0065040  0.0057304   1.135 0.256374    
tobacco         0.0793764  0.0266028   2.984 0.002847 ** 
ldl             0.1739239  0.0596617   2.915 0.003555 ** 
adiposity       0.0185866  0.0292894   0.635 0.525700    
famhistPresent  0.9253704  0.2278940   4.061 4.90e-05 ***
typea           0.0395950  0.0123202   3.214 0.001310 ** 
obesity        -0.0629099  0.0442477  -1.422 0.155095    
alcohol         0.0001217  0.0044832   0.027 0.978350    
age             0.0452253  0.0121298   3.728 0.000193 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 596.11  on 461  degrees of freedom
Residual deviance: 472.14  on 452  degrees of freedom
AIC: 492.14

Number of Fisher Scoring iterations: 5

Assessment of the model fitting (column sums are actual counts)

library(InformationValue)
InformationValue::confusionMatrix(actuals = SAheart$chd,predictedScores = predict(fit_logis))
InformationValue::sensitivity(actuals = SAheart$chd,predictedScores = predict(fit_logis))
InformationValue::specificity(actuals = SAheart$chd,predictedScores = predict(fit_logis))

Training and Test data

Sometimes we split the data into two groups: training and test, randomly.
The training is to develop the model, weheras, the test data are used for model validation.

SAheart %>%count(chd) %>%mutate(Prop=prop.table(n))%>%group_by(chd)

# A tibble: 2 × 3
# Groups:   chd [2]
    chd     n  Prop
  <int> <int> <dbl>
1     0   302 0.654
2     1   160 0.346

SAheart<-SAheart %>%mutate(ID=row_number())
set.seed(2007)
train<-SAheart%>%group_by(chd)%>%slice_sample(prop = 0.75)
test<-SAheart %>%anti_join(train,by="ID")

We will build the model using the train data

tr_fit<-glm(chd~.,data=train,family = binomial(link = "logit"))

Prediction on the test data can be done as

prd_scr<-predict(tr_fit,newdata=test)

After that, performance of the classification can be checked as

confusionMatrix(actuals = test$chd,predictedScores = prd_scr)
sensitivity(actuals = test$chd,predictedScores = prd_scr)
specificity(actuals = test$chd,predictedScores = prd_scr)