Chronic kidney disease (CKD) is a condition in which kidney function is gradually lost over a long period, ranging from months to years. In this project we built several classification models and compared them to predict chronic kidney disease.
While exploring the data we found that column 21 has a lot of missing values, so we removed it from the data set.
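A minimal sketch of this step is shown below (the file name comes from the data-imputation section; the object name ckd and the use of read.csv are assumptions, not the project's exact code):

```r
# Read the raw file; keep every column as character for now, because
# missing values are encoded as "?" (handled in the imputation step below)
ckd <- read.csv("chronic_kidney_disease_full1.csv",
                stringsAsFactors = FALSE, colClasses = "character")

# Count how many "?" markers each column contains
colSums(ckd == "?")

# Column 21 has far too many missing values, so drop it
ckd <- ckd[, -21]
```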
Data imputation
In the chronic_kidney_disease_full1.csv file the missing values are encoded as a ‘?’, so to deal with them we must first convert them to NA.
The data has two types of features: numeric and categorical.
If a feature is numeric, we replace its missing values with the mean of that feature; if it is categorical, we replace them with the mode of that feature.
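A minimal sketch of this imputation, assuming the data frame is called ckd and was read with character columns as above (the mode helper and the list of numeric columns are illustrative, not the project's exact code):

```r
# Convert the "?" markers to NA
ckd[ckd == "?"] <- NA

# Mode of a categorical vector: its most frequent non-missing value
get_mode <- function(x) names(which.max(table(x)))

# Columns that are really numeric (illustrative subset of the CKD features)
numeric_cols <- c("age", "bp", "bgr", "bu", "sc", "hemo")

for (col in names(ckd)) {
  if (col %in% numeric_cols) {
    ckd[[col]] <- as.numeric(ckd[[col]])
    ckd[[col]][is.na(ckd[[col]])] <- mean(ckd[[col]], na.rm = TRUE)  # mean for numeric
  } else {
    ckd[[col]][is.na(ckd[[col]])] <- get_mode(ckd[[col]])            # mode for categorical
    ckd[[col]] <- factor(ckd[[col]])                                 # store as a factor
  }
}
```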
Feature normalization
Because numeric columns whose ranges differ widely can increase the model's error, we applied feature scaling to bring all numeric features onto a comparable scale and reduce this error.
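One common way to do this is min-max scaling, sketched below under the same assumptions as the previous snippet (the project may have used a different scaling method, such as standardization):

```r
# Min-max scaling: map each numeric feature to the [0, 1] range
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

ckd[numeric_cols] <- lapply(ckd[numeric_cols], min_max)
```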
1. KNN
KNN (k-nearest neighbors) is one of the simplest algorithms and can reach very high accuracy on binary classification. It classifies a new data point based on its similarity to existing points.
To apply the KNN model we must choose the number of neighbors k. After calculating the distance between every existing point and the new data point, the algorithm takes the k points closest to the new point and classifies it by majority vote. For example, if 3 of the closest 5 points belong to category A and 2 belong to category B, the new point is classified as category A.
Code
This function builds the model and calculates the accuracy, sensitivity and specificity for each fold, then stores the output in a vector called statistics.
The lapply function returns the output as a list, so to work with it we convert cv_KNN into a data frame.
Then, to display the results, we use the paste function.
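The original listing is not reproduced here; the sketch below follows the description above. The fold assignment, the class package's knn function, and an outcome column called class with levels ckd / notckd are assumptions, not the project's exact code:

```r
library(class)   # provides knn()

ckd$class <- factor(ckd$class)   # make sure the outcome is a factor: ckd / notckd

# knn() needs numeric predictors; code any remaining categorical columns as integers
to_num <- function(x) if (is.numeric(x)) x else as.numeric(factor(x))
X <- data.frame(lapply(ckd[names(ckd) != "class"], to_num))

set.seed(1)
k_folds <- 5
folds <- sample(rep(1:k_folds, length.out = nrow(ckd)))

# Build the model on one fold and return its accuracy, sensitivity and specificity
knn_fold <- function(i) {
  pred   <- knn(train = X[folds != i, ], test = X[folds == i, ],
                cl = ckd$class[folds != i], k = 5)
  actual <- ckd$class[folds == i]
  cm <- table(pred, actual)                            # confusion matrix
  statistics <- c(accuracy    = sum(diag(cm)) / sum(cm),
                  sensitivity = cm["ckd", "ckd"] / sum(cm[, "ckd"]),
                  specificity = cm["notckd", "notckd"] / sum(cm[, "notckd"]))
  statistics
}

# lapply returns a list, so bind the per-fold statistics into a data frame
cv_KNN <- lapply(1:k_folds, knn_fold)
cv_KNN <- as.data.frame(do.call(rbind, cv_KNN))

paste("Mean accuracy:", round(mean(cv_KNN$accuracy), 3))
```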
The results:
2. Decision tree
A decision tree is an algorithm used to classify data. It follows the divide-and-conquer principle: the problem is divided into parts that are solved separately, and the partial solutions are then combined. The tree is built by choosing the best attribute at each step, so that the training set is divided in a way that keeps the tree shallow while the data is still correctly categorized. For more illustration you can check this link.
Code
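The original listing is not shown here; below is a minimal sketch using the rpart package (the package choice is an assumption) and the same fold assignment as in the KNN sketch:

```r
library(rpart)

# Fit a classification tree on one training split and score it on the test split
tree_fold <- function(i) {
  train <- ckd[folds != i, ]
  test  <- ckd[folds == i, ]
  fit   <- rpart(class ~ ., data = train, method = "class")
  pred  <- predict(fit, newdata = test, type = "class")
  mean(pred == test$class)                  # fold accuracy
}

cv_tree <- sapply(1:k_folds, tree_fold)
paste("Mean decision tree accuracy:", round(mean(cv_tree), 3))
```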
The results:
3. Logistic regression
To understand logistic regression, it helps to start from linear regression. As you have seen in linear regression, we used a hypothesis relation to predict the output, but this time we need the predictions to be a binary outcome of either 0 or 1. So we use the same hypothesis with a small modification: we pass it through the sigmoid function.
As seen in the figure above, the sigmoid function maps the output into a range between zero and one. To predict whether the output is one or zero, we need to set a boundary value, which by default equals 0.5: when the output is >= 0.5 it is rounded up to 1, and when it is < 0.5 it is rounded down to 0. For further details please check: Logistic Regression.
Code
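The original listing is not shown here; below is a minimal sketch using glm with the binomial family and the 0.5 boundary described above (the 0/1 encoding of the outcome inside the fold function is an assumption):

```r
# Fit logistic regression on one training split and score it on the test split
logit_fold <- function(i) {
  train <- ckd[folds != i, ]
  test  <- ckd[folds == i, ]
  train$y <- as.integer(train$class == "ckd")               # encode the outcome as 1 / 0
  fit  <- glm(y ~ . - class, data = train, family = binomial)
  prob <- predict(fit, newdata = test, type = "response")   # sigmoid output in (0, 1)
  pred <- ifelse(prob >= 0.5, "ckd", "notckd")              # 0.5 decision boundary
  mean(pred == test$class)
}

cv_logit <- sapply(1:k_folds, logit_fold)
paste("Mean logistic regression accuracy:", round(mean(cv_logit), 3))
```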
The results:
4. Naive Bayes
This method is based on Bayes' theorem from statistics. It makes two main assumptions: first, that every feature is independent of the other features; second, that every feature has the same contribution (weight) as the others. It calculates the probability of each feature individually given the output, together with the probability of the output, and is therefore able to predict the output given the feature values. As mentioned in the logistic regression method, we are going to use the same boundary value. For further details on the algorithm, check Naïve Bayes.
Code
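The original listing is not shown here; below is a minimal sketch using the naiveBayes function from the e1071 package (the package choice is an assumption), again with the same fold assignment:

```r
library(e1071)

# Fit a Naive Bayes classifier on one training split and score it on the test split
nb_fold <- function(i) {
  train <- ckd[folds != i, ]
  test  <- ckd[folds == i, ]
  fit   <- naiveBayes(class ~ ., data = train)
  pred  <- predict(fit, newdata = test)   # class with the highest posterior probability
  mean(pred == test$class)
}

cv_nb <- sapply(1:k_folds, nb_fold)
paste("Mean Naive Bayes accuracy:", round(mean(cv_nb), 3))
```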
The results:
Conclusion:
From the results shown, although each method achieves good accuracy, you still need to study which method suits your data best. In the end, we can now predict chronic kidney disease with high accuracy.