Logistic Regression

library(tidyverse)
library(dplyr)
library(ggplot2)

Data

A retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa. There are roughly two controls per case of CHD. Many of the CHD positive men have undergone blood pressure reduction treatment and other programs to reduce their risk factors after their CHD event. In some cases the measurements were made after these treatments. These data are taken from a larger dataset, described in Rousseauw et al, 1983, South African Medical Journal.

  • sbp systolic blood pressure
  • tobacco cumulative tobacco (kg) *ldl low densiity lipoprotein cholesterol adiposity
  • famhist family history of heart disease (Present, Absent)
  • typea type-A behavior
  • obesity
  • alcohol current alcohol consumption
  • age age at onset
  • chd response, coronary heart disease
load("data/elem_stat_learn_data/SAheart.RData")
SAheart%>%str()
'data.frame':   462 obs. of  10 variables:
 $ sbp      : int  160 144 118 170 134 132 142 114 114 132 ...
 $ tobacco  : num  12 0.01 0.08 7.5 13.6 6.2 4.05 4.08 0 0 ...
 $ ldl      : num  5.73 4.41 3.48 6.41 3.5 6.47 3.38 4.59 3.83 5.8 ...
 $ adiposity: num  23.1 28.6 32.3 38 27.8 ...
 $ famhist  : Factor w/ 2 levels "Absent","Present": 2 1 2 2 2 2 1 2 2 2 ...
 $ typea    : int  49 55 52 51 60 62 59 62 49 69 ...
 $ obesity  : num  25.3 28.9 29.1 32 26 ...
 $ alcohol  : num  97.2 2.06 3.81 24.26 57.34 ...
 $ age      : int  52 63 46 58 49 45 38 58 29 53 ...
 $ chd      : int  1 1 0 1 1 0 0 1 0 1 ...
  • The response variable is chd
SAheart<-SAheart%>%mutate(famhist=as.factor(famhist))
SAheart %>%str()
'data.frame':   462 obs. of  10 variables:
 $ sbp      : int  160 144 118 170 134 132 142 114 114 132 ...
 $ tobacco  : num  12 0.01 0.08 7.5 13.6 6.2 4.05 4.08 0 0 ...
 $ ldl      : num  5.73 4.41 3.48 6.41 3.5 6.47 3.38 4.59 3.83 5.8 ...
 $ adiposity: num  23.1 28.6 32.3 38 27.8 ...
 $ famhist  : Factor w/ 2 levels "Absent","Present": 2 1 2 2 2 2 1 2 2 2 ...
 $ typea    : int  49 55 52 51 60 62 59 62 49 69 ...
 $ obesity  : num  25.3 28.9 29.1 32 26 ...
 $ alcohol  : num  97.2 2.06 3.81 24.26 57.34 ...
 $ age      : int  52 63 46 58 49 45 38 58 29 53 ...
 $ chd      : int  1 1 0 1 1 0 0 1 0 1 ...
  • To see the sample proportion of coronary heart disease
fit_logis<-glm(chd~.,data=SAheart,family = binomial(link="logit"))
summary(fit_logis)

Call:
glm(formula = chd ~ ., family = binomial(link = "logit"), data = SAheart)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -6.1507209  1.3082600  -4.701 2.58e-06 ***
sbp             0.0065040  0.0057304   1.135 0.256374    
tobacco         0.0793764  0.0266028   2.984 0.002847 ** 
ldl             0.1739239  0.0596617   2.915 0.003555 ** 
adiposity       0.0185866  0.0292894   0.635 0.525700    
famhistPresent  0.9253704  0.2278940   4.061 4.90e-05 ***
typea           0.0395950  0.0123202   3.214 0.001310 ** 
obesity        -0.0629099  0.0442477  -1.422 0.155095    
alcohol         0.0001217  0.0044832   0.027 0.978350    
age             0.0452253  0.0121298   3.728 0.000193 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 596.11  on 461  degrees of freedom
Residual deviance: 472.14  on 452  degrees of freedom
AIC: 492.14

Number of Fisher Scoring iterations: 5
  • Assessment of the model fitting (column sums are actual counts)
library(InformationValue)
InformationValue::confusionMatrix(actuals = SAheart$chd,predictedScores = predict(fit_logis))
InformationValue::sensitivity(actuals = SAheart$chd,predictedScores = predict(fit_logis))
InformationValue::specificity(actuals = SAheart$chd,predictedScores = predict(fit_logis))

Training and Test data

  • Sometimes we split the data into two groups: training and test, randomly.
  • The training is to develop the model, weheras, the test data are used for model validation.
SAheart %>%count(chd) %>%mutate(Prop=prop.table(n))%>%group_by(chd) 
# A tibble: 2 × 3
# Groups:   chd [2]
    chd     n  Prop
  <int> <int> <dbl>
1     0   302 0.654
2     1   160 0.346
SAheart<-SAheart %>%mutate(ID=row_number())
set.seed(2007)
train<-SAheart%>%group_by(chd)%>%slice_sample(prop = 0.75)
test<-SAheart %>%anti_join(train,by="ID")
  • We will build the model using the train data
tr_fit<-glm(chd~.,data=train,family = binomial(link = "logit"))
  • Prediction on the test data can be done as
prd_scr<-predict(tr_fit,newdata=test)
  • After that, performance of the classification can be checked as
confusionMatrix(actuals = test$chd,predictedScores = prd_scr)
sensitivity(actuals = test$chd,predictedScores = prd_scr)
specificity(actuals = test$chd,predictedScores = prd_scr)