Introduction
Applying machine learning to predict patients’ disease status is a rapidly evolving area of biomedicine. Many see machine learning as a useful tool that can complement the efforts of health practitioners, supporting early detection and timely treatment to prevent disease escalation and save patients’ lives.
In this tutorial, which began as a project I submitted for my first-semester bioinformatics theory and application course, I use a support vector machine (SVM) and a random forest, implemented in the R programming language, to predict the kidney disease status of patients. It covers data visualization, data processing, and prediction of a patient’s kidney disease status (0 = not chronic, 1 = chronic).
Data visualization
The data used for this project is accessible at https://github.com/klarrey/Machine-learning-prediction-of-chronic-kidney-diseases-status-using-random-forest-and-SVM/blob/main/ckd_clean.csv.
Let’s load the data and take a quick look at it.
#Load required libraries
library(psych)
library(tidyverse)
#Read the data
ckd <- read.csv("D:/tutorials/project/ckd_clean.csv")
#show all the variables and their type
str(ckd)
#Summarize and generate a pie chart of the proportion of
#chronic and non-chronic patients in the data
dist_class <- c(Chronic = sum(ckd$Class == 1), Not.Chronic = sum(ckd$Class == 0))
piepercent <- round(100 * dist_class / sum(dist_class))
lbls <- paste(piepercent, "%", sep="") # show percentages as slice labels
color <- c("orange","green")
png(file="dist_class2.png")
pie(dist_class, labels=lbls, col = color, main="Pie Chart of Class Distribution")
legend("topright", c("Chronic", "Not Chronic"), cex=1.0,fill=color)
dev.off()
#Generate a bar chart of the summarized data
lbls_bar=c("Chronic","Not Chronic")
color <- c("red","blue")
png(file="dist_class2-bar.png")
barplot(dist_class, names.arg=lbls_bar, col=color, main="Bar Chart of Class Distribution")
legend("topright", c("Chronic", "Not Chronic"), cex=1.0,fill=color)
dev.off()
#load library for generating correlation matrix and plot
library("corrplot")
ckdInt <- ckd[,c(1:5,10:18,25)]
mcor <- cor(ckdInt)
#Generate plot
png(file="corplot2.png")
corrplot(mcor, method="shade", shade.col=NA, tl.col="black", tl.srt=45, addCoef.col= "black", order="AOE")
dev.off()
Support Vector Machine (SVM)
Data processing
The SVM package (e1071) requires all values in the data to be numeric. Let’s recode every variable with a string data type to integer values such as 0, 1, 2, or 3, one integer per distinct value.
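Before recoding, it helps to see exactly which columns are character-typed. The sketch below runs on a tiny hypothetical data frame standing in for ckd (the toy object and its values are my own illustration; the actual column types depend on how the CSV was read):

```r
# Toy stand-in for the ckd data frame (hypothetical values)
toy <- data.frame(Age = c(48, 62), Hypertension = c("yes", "no"))

# List the columns that hold strings and therefore need recoding
char_cols <- names(toy)[sapply(toy, is.character)]
char_cols
```

Running the same two lines on ckd instead of toy lists every variable the recoding below must cover.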
#Replace all characters with numbers
#Recode each binary character variable to 0/1
binary_map <- list(
  Red.Blood.Cells         = c(normal = 0, abnormal = 1),
  Pus.Cell                = c(normal = 0, abnormal = 1),
  Pus.Cell.clumps         = c(notpresent = 0, present = 1),
  Bacteria                = c(notpresent = 0, present = 1),
  Hypertension            = c(no = 0, yes = 1),
  Diabetes.Mellitus       = c(no = 0, yes = 1),
  Coronary.Artery.Disease = c(no = 0, yes = 1),
  Appetite                = c(good = 0, poor = 1),
  Pedal.Edema             = c(no = 0, yes = 1),
  Anemia                  = c(no = 0, yes = 1)
)
for (v in names(binary_map)) {
  ckd[[v]] <- as.integer(binary_map[[v]][ckd[[v]]])
}
#Relabel the outcome so the model predicts named classes
ckd[ckd$Class == 0,]$Class <- "notckd"
ckd[ckd$Class == 1,]$Class <- "ckd"
#Convert the remaining graded variables to integer
ckd$Specific.Gravity <- as.integer(ckd$Specific.Gravity)
ckd$Albumin <- as.integer(ckd$Albumin)
ckd$Sugar <- as.integer(ckd$Sugar)
When developing a machine learning model, it is good practice to split the data into two parts, using the larger part to train the model and the other to test it. This lets us assess the model’s performance and confirm it is reliable before deploying it for the intended task.
Now, let’s divide the data into training (75%) and testing (25%) sets. We will train our model on the training data and later evaluate it on the test data.
#Load required package and split the data into training and test sets
require(caTools)
set.seed(101)
sample = sample.split(ckd$Class, SplitRatio = .75)
train = subset(ckd, sample == TRUE)
test = subset(ckd, sample == FALSE)
#Load the SVM implementation
library(e1071)
#Encode the class labels (column 25) as a factor
kidneyLabel = factor(train[,25], levels = c("ckd", "notckd"))
#Train the svm model on the training data
model <- svm(train[, -25], kidneyLabel, kernel = "linear", scale = TRUE,
type = "C-classification")
#Print a summary of the model
summary(model)
A linear kernel was used to perform the classification task on a data set with two classes.
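The kernel is a tuning choice rather than a fixed requirement. As a quick experiment of my own (reusing the train and kidneyLabel objects built above; "radial" is the RBF kernel offered by e1071, and whether it actually helps on this data is not something the original project tested), one could fit a second model for comparison:

```r
# Fit the same classifier with a radial (RBF) kernel for comparison
model_rbf <- svm(train[, -25], kidneyLabel, kernel = "radial",
                 scale = TRUE, type = "C-classification")
summary(model_rbf)
```

Predicting on the test set with model_rbf, exactly as done for model below, would show whether the extra flexibility pays off here.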
Model testing
Below, we use the test set to evaluate the model and assess its performance.
#Test the model
preds <- predict(model, test[, -25])
#Add the predictions to the test data as a new column called preds and take a look
test$preds <- preds
View(test)
#Generate the confusion matrix
(CFM <- table(test$Class, test$preds))
Assessment of model performance
To assess the performance of a model, accuracy, sensitivity, and specificity are the parameters most often estimated. Below, let’s estimate all three for the model we just developed.
#Model performance, computed from the confusion matrix
#(rows = actual class, columns = predicted class)
(accuracy <- paste(round(sum(diag(CFM)) / sum(CFM) * 100), "%", sep=""))
(sensitivity <- paste(round(CFM["ckd","ckd"] / sum(CFM["ckd",]) * 100), "%", sep=""))
(specificity <- paste(round(CFM["notckd","notckd"] / sum(CFM["notckd",]) * 100), "%", sep=""))
On the test data, the SVM model developed for predicting the kidney disease status of patients recorded a high accuracy of 95%, a fairly good sensitivity of about 82%, and a perfect specificity of 100%.
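Rather than reading counts off the confusion matrix by hand, the three metrics can be wrapped in a small reusable helper. This function is an illustration of my own, not part of the original project; it assumes a 2x2 table with rows as actual and columns as predicted classes, positive class first:

```r
# Compute accuracy, sensitivity and specificity from a 2x2 confusion matrix
# (rows = actual, columns = predicted, positive class in the first row/column)
perf <- function(cfm) {
  tp <- cfm[1, 1]; fn <- cfm[1, 2]   # positives: correctly found / missed
  fp <- cfm[2, 1]; tn <- cfm[2, 2]   # negatives: false alarms / correct
  c(accuracy    = (tp + tn) / sum(cfm),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}

# The SVM counts reported above: of 11 actual ckd cases, 9 were caught and
# 2 missed; all 29 actual notckd cases were classified correctly
cfm <- matrix(c(9, 2, 0, 29), nrow = 2, byrow = TRUE,
              dimnames = list(c("ckd", "notckd"), c("ckd", "notckd")))
round(perf(cfm), 3)
```

The same helper applied to any other 2x2 table, such as the random forest results, gives the corresponding metrics in one call.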
Random Forest
Unlike the SVM package (e1071), the randomForest package requires minimal data processing: it accepts both numeric and string (factor) variables. We will begin by loading the data again. The only change we will make is to convert the Class variable from integer codes (0 and 1) to string labels (notckd and ckd) stored as a factor.
#Read data
ckd <- read.csv("D:/tutorials/project/ckd_clean.csv")
#Replace 0 and 1 with notckd and ckd respectively and convert to factor
ckd[ckd$Class == 0,]$Class <- "notckd"
ckd[ckd$Class == 1,]$Class <- "ckd"
ckd$Class <- as.factor(ckd$Class)
#load package we need to split our data into 2
require(caTools)
set.seed(101) #Set seed
#Split the data into training and test sets
sample = sample.split(ckd$Class, SplitRatio = .75)
train = subset(ckd, sample == TRUE)
test = subset(ckd, sample == FALSE)
#load package for creating the random forest model
library(randomForest)
# Create the model with the training data
model <- randomForest(as.factor(Class) ~., data = train)
# print a summary of the model
print(model)
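One advantage of random forests over SVMs is that they come with a built-in measure of feature importance. The two calls below are standard randomForest functions applied to the model object fitted above (I have not verified which features rank highest on this particular split, so treat the output as exploratory):

```r
# Rank the predictors by how much they contribute to the forest
importance(model)   # mean decrease in Gini impurity per variable
varImpPlot(model)   # the same ranking as a dot chart
```

This can reveal, for example, whether a handful of clinical variables dominate the predictions, which is useful when deciding which measurements matter most in practice.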
#Testing the model with our test data
Class_pred <- predict(model, test[,-25])
#add our prediction to the test data and take a look
test$Class_pred <- Class_pred
View(test)
#Create the confusion matrix
(CFM <- table(test$Class, test$Class_pred))
#Assess model performance on the test data, computed from the confusion matrix
(accuracy <- paste(round(sum(diag(CFM)) / sum(CFM) * 100), "%", sep=""))
(sensitivity <- paste(round(CFM["ckd","ckd"] / sum(CFM["ckd",]) * 100), "%", sep=""))
(specificity <- paste(round(CFM["notckd","notckd"] / sum(CFM["notckd",]) * 100), "%", sep=""))
The random forest model correctly classified every observation in the test data, achieving 100% on all three metrics: accuracy, sensitivity, and specificity.
Conclusion
In this project, we successfully implemented two models, an SVM and a random forest. While the random forest achieved perfect accuracy, sensitivity, and specificity, the SVM model fell slightly short, recording approximately 82% sensitivity, 95% accuracy, and 100% specificity on the test data.
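With only around 40 test observations, a single 75/25 split can flatter or punish a model by chance. One way to probe this, a sketch of my own that repeats the split over several seeds (it assumes the ckd data frame, caTools, and randomForest from the random forest section are still loaded; the exact accuracies will vary):

```r
# Re-run the random forest over several random splits to check stability
accs <- sapply(1:5, function(seed) {
  set.seed(seed)
  s  <- sample.split(ckd$Class, SplitRatio = .75)
  tr <- subset(ckd, s == TRUE)
  te <- subset(ckd, s == FALSE)
  fit <- randomForest(as.factor(Class) ~ ., data = tr)
  mean(predict(fit, te[, -25]) == te$Class)  # accuracy on this split
})
accs
```

If the accuracies stay near 100% across seeds, the perfect test result above is less likely to be a lucky draw.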