library(tidyverse)
library(dplyr)
library(ggplot2)
Logistic Regression
Data
A retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa. There are roughly two controls per case of CHD. Many of the CHD positive men have undergone blood pressure reduction treatment and other programs to reduce their risk factors after their CHD event. In some cases the measurements were made after these treatments. These data are taken from a larger dataset, described in Rousseauw et al, 1983, South African Medical Journal.
- sbp systolic blood pressure
- tobacco cumulative tobacco (kg) *ldl low densiity lipoprotein cholesterol adiposity
- famhist family history of heart disease (Present, Absent)
- typea type-A behavior
- obesity
- alcohol current alcohol consumption
- age age at onset
- chd response, coronary heart disease
load("data/elem_stat_learn_data/SAheart.RData")
%>%str() SAheart
'data.frame': 462 obs. of 10 variables:
$ sbp : int 160 144 118 170 134 132 142 114 114 132 ...
$ tobacco : num 12 0.01 0.08 7.5 13.6 6.2 4.05 4.08 0 0 ...
$ ldl : num 5.73 4.41 3.48 6.41 3.5 6.47 3.38 4.59 3.83 5.8 ...
$ adiposity: num 23.1 28.6 32.3 38 27.8 ...
$ famhist : Factor w/ 2 levels "Absent","Present": 2 1 2 2 2 2 1 2 2 2 ...
$ typea : int 49 55 52 51 60 62 59 62 49 69 ...
$ obesity : num 25.3 28.9 29.1 32 26 ...
$ alcohol : num 97.2 2.06 3.81 24.26 57.34 ...
$ age : int 52 63 46 58 49 45 38 58 29 53 ...
$ chd : int 1 1 0 1 1 0 0 1 0 1 ...
- The response variable is
chd
<-SAheart%>%mutate(famhist=as.factor(famhist))
SAheart%>%str() SAheart
'data.frame': 462 obs. of 10 variables:
$ sbp : int 160 144 118 170 134 132 142 114 114 132 ...
$ tobacco : num 12 0.01 0.08 7.5 13.6 6.2 4.05 4.08 0 0 ...
$ ldl : num 5.73 4.41 3.48 6.41 3.5 6.47 3.38 4.59 3.83 5.8 ...
$ adiposity: num 23.1 28.6 32.3 38 27.8 ...
$ famhist : Factor w/ 2 levels "Absent","Present": 2 1 2 2 2 2 1 2 2 2 ...
$ typea : int 49 55 52 51 60 62 59 62 49 69 ...
$ obesity : num 25.3 28.9 29.1 32 26 ...
$ alcohol : num 97.2 2.06 3.81 24.26 57.34 ...
$ age : int 52 63 46 58 49 45 38 58 29 53 ...
$ chd : int 1 1 0 1 1 0 0 1 0 1 ...
- To see the sample proportion of coronary heart disease
<-glm(chd~.,data=SAheart,family = binomial(link="logit"))
fit_logissummary(fit_logis)
Call:
glm(formula = chd ~ ., family = binomial(link = "logit"), data = SAheart)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.1507209 1.3082600 -4.701 2.58e-06 ***
sbp 0.0065040 0.0057304 1.135 0.256374
tobacco 0.0793764 0.0266028 2.984 0.002847 **
ldl 0.1739239 0.0596617 2.915 0.003555 **
adiposity 0.0185866 0.0292894 0.635 0.525700
famhistPresent 0.9253704 0.2278940 4.061 4.90e-05 ***
typea 0.0395950 0.0123202 3.214 0.001310 **
obesity -0.0629099 0.0442477 -1.422 0.155095
alcohol 0.0001217 0.0044832 0.027 0.978350
age 0.0452253 0.0121298 3.728 0.000193 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 596.11 on 461 degrees of freedom
Residual deviance: 472.14 on 452 degrees of freedom
AIC: 492.14
Number of Fisher Scoring iterations: 5
- Assessment of the model fitting (column sums are actual counts)
library(InformationValue)
::confusionMatrix(actuals = SAheart$chd,predictedScores = predict(fit_logis))
InformationValue::sensitivity(actuals = SAheart$chd,predictedScores = predict(fit_logis))
InformationValue::specificity(actuals = SAheart$chd,predictedScores = predict(fit_logis)) InformationValue
Training and Test data
- Sometimes we split the data into two groups:
training
andtest
, randomly. - The
training
is to develop the model, weheras, thetest
data are used for model validation.
%>%count(chd) %>%mutate(Prop=prop.table(n))%>%group_by(chd) SAheart
# A tibble: 2 × 3
# Groups: chd [2]
chd n Prop
<int> <int> <dbl>
1 0 302 0.654
2 1 160 0.346
<-SAheart %>%mutate(ID=row_number())
SAheartset.seed(2007)
<-SAheart%>%group_by(chd)%>%slice_sample(prop = 0.75)
train<-SAheart %>%anti_join(train,by="ID") test
- We will build the model using the
train
data
<-glm(chd~.,data=train,family = binomial(link = "logit")) tr_fit
- Prediction on the
test
data can be done as
<-predict(tr_fit,newdata=test) prd_scr
- After that, performance of the classification can be checked as
confusionMatrix(actuals = test$chd,predictedScores = prd_scr)
sensitivity(actuals = test$chd,predictedScores = prd_scr)
specificity(actuals = test$chd,predictedScores = prd_scr)