The project is on determining certain diagnostic measures for diabetes. The goal of the data is to identify which diagnostic measures are correlated with diabetes then, training a linear model to identify if a person has diabetes. The dependent variable in the project would be if the person has diabetes and the predictor variables that I will be using are glucose levels, blood pressure levels, skin thickness, insulin, body mass index (BMI), diabetes pedigree function and age. Hence, the problem that I am trying to solve is whether glucose levels, blood pressure levels, skin thickness, insulin, body mass index (BMI), diabetes pedigree function and age shows if a person has diabetes. Suppose the following:
\(\beta_0\) = glucose levels
\(\beta_1\) = blood pressure
levels
\(\beta_2\) = skin thickness
\(\beta_3\) = insulin
\(\beta_4\) = BMI
\(\beta_5\) = diabetes pedigree
function
\(\beta_6\) = age
library(RMySQL)
library(caret)
library(GGally)
library(tidyverse)
library(formatR)
library(shiny)
library(rpart)
library(tidymodels)
In this step, we establish a MySQL connection with the database so that we can make queries later on.
mysqlconnection <- dbConnect(RMySQL::MySQL(),
dbname='final',
host='localhost',
port=3306,
user='root',
password='CMSC398e')
Making a SQL query with the connection that was set up to get the data from the database and storing that data in a R data frame. The data set used for this analysis is obtained from “https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset?datasetId=2527538”
For the query, we are only getting the columns that are used in the analysis which are the dependent variable and the predictor variables. As for cleaning up the data, we handle missing data by simply removing that entry. Missing data are either values that have a value of 0 in any of the \(\beta_i\) values
query <- "
DELETE FROM diabetes WHERE `Glucose` = 0 OR`BloodPressure` = 0 OR `SkinThickness` = 0 OR `Insulin` = 0 OR `BMI` = 0 OR `DiabetesPedigreeFunction` = 0 OR `Age` = 0;
"
result <- dbSendQuery(mysqlconnection, query)
query <- "
SELECT `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, `BMI`, `DiabetesPedigreeFunction`, `Age`, `Outcome` FROM diabetes;
"
result <- dbSendQuery(mysqlconnection, query)
df <- fetch(result)
result <- dbClearResult(result)
head(df)
## Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age
## 1 89 66 23 94 28.1 0.167 21
## 2 137 40 35 168 43.1 2.288 33
## 3 78 50 32 88 31.0 0.248 26
## 4 197 70 45 543 30.5 0.158 53
## 5 189 60 23 846 30.1 0.398 59
## 6 166 72 19 175 25.8 0.587 51
## Outcome
## 1 0
## 2 1
## 3 1
## 4 1
## 5 1
## 6 1
One of the biggest indicator for diabetes would probably be the glucose levels. We can visualize this by looking at the average glucose levels of people with diabetes and people without diabetes.
query <- "
WITH T1 AS (
SELECT `Outcome`, AVG(`Glucose`) OVER(PARTITION BY `Outcome`) AS `AvgGlucose` FROM diabetes GROUP BY `Outcome`
)
SELECT * FROM T1;
"
result <- dbSendQuery(mysqlconnection, query)
df_avg <- fetch(result)
result <- dbClearResult(result)
df_avg
## Outcome AvgGlucose
## 1 0 89
## 2 1 137
We can see that the average glucose levels for people without diabetes is around 89mg/dl while the average glucose levels for people with diabetes is around 137mg/dl which is significantly higher. This would make us believe that having high glucose levels will be a diagnostic measure for diabetes. Now, we would like to compare the top 10 highest glucose levels for people without diabetes and top 10 lowest glucose levels for people with diabetes.
query <- "
WITH T1 AS (
SELECT `Outcome`, `Glucose`, DENSE_RANK() OVER (ORDER BY `Glucose`) AS `Rank` FROM diabetes WHERE `Outcome` = 1
)
SELECT * FROM T1 WHERE `Rank` <= 10;
"
result <- dbSendQuery(mysqlconnection, query)
df_diabetes <- fetch(result)
result <- dbClearResult(result)
df_diabetes
## Outcome Glucose Rank
## 1 1 78 1
## 2 1 80 2
## 3 1 88 3
## 4 1 92 4
## 5 1 93 5
## 6 1 95 6
## 7 1 95 6
## 8 1 97 7
## 9 1 100 8
## 10 1 100 8
## 11 1 100 8
## 12 1 102 9
## 13 1 104 10
## 14 1 104 10
We can see that the 10 highest glucose levels for people without diabetes are [197, 193, 191, 180, 173, 165, 165, 165, 164, 158]. All of these values are significantly higher than the average glucose levels of people with and without diabetes.
query <- "
WITH T1 AS (
SELECT `Outcome`, `Glucose`, RANK() OVER (ORDER BY `Glucose` DESC) AS `Rank` FROM diabetes WHERE `Outcome` = 0
)
SELECT * FROM T1 WHERE `Rank` <= 10;
"
result <- dbSendQuery(mysqlconnection, query)
df_nondiabetes <- fetch(result)
result <- dbClearResult(result)
df_nondiabetes
## Outcome Glucose Rank
## 1 0 197 1
## 2 0 193 2
## 3 0 191 3
## 4 0 180 4
## 5 0 173 5
## 6 0 165 6
## 7 0 165 6
## 8 0 165 6
## 9 0 164 9
## 10 0 158 10
Here, we can see that the 10 lowest glucose levels for people with diabetes are [78, 80, 88, 92, 93, 95, 95, 97, 100, 100]. All these are much lower than the average glucose levels of people with diabetes. Besides that, the top 3 lowest glucose levels for people with diabetes are lower than the average glucose levels of people without diabetes. Hence, we should look at the cumulative distribution for the glucose levels.
query <- "
WITH T1 AS (
SELECT `Outcome`, `Glucose`, ROUND(CUME_DIST() OVER (ORDER BY `Glucose`), 2) AS `CumDist` FROM diabetes WHERE `Outcome` = 0
)
SELECT * FROM T1 WHERE `Glucose` >= 137;
"
result <- dbSendQuery(mysqlconnection, query)
df_nondiabetes <- fetch(result)
result <- dbClearResult(result)
head(df_nondiabetes)
## Outcome Glucose CumDist
## 1 0 137 0.85
## 2 0 139 0.86
## 3 0 139 0.86
## 4 0 139 0.86
## 5 0 139 0.86
## 6 0 140 0.87
Comparing the cumulative distribution of the glucose levels of people without diabetes as a whole to the average glucose level of people with diabetes, 85% of the values are less than or equal to it which is 137mg/dl.
query <- "
WITH T1 AS (
SELECT `Outcome`, `Glucose`, ROUND(CUME_DIST() OVER (ORDER BY `Glucose`), 2) AS `CumDist` FROM diabetes WHERE `Outcome` = 1
)
SELECT * FROM T1 WHERE `Glucose` < 89;
"
result <- dbSendQuery(mysqlconnection, query)
df_nondiabetes <- fetch(result)
result <- dbClearResult(result)
df_nondiabetes
## Outcome Glucose CumDist
## 1 1 78 0.01
## 2 1 80 0.02
## 3 1 88 0.02
On the other hand, comparing the cumulative distribution of the glucose levels of people with diabetes as a whole to the average glucose level of people without diabetes, only around 2% of the values are less than or equal to it which is 89mg/dl.
Using just the average, it is hard to show whether blood glucose levels is a good diagnostic measure for diabetes. This would apply to all the other predictor variables. Therefore, we can use a linear regression to see if there is any correlation between the predictor variables and dependent variable.
The hypothesis for the analysis are
\(H_0\): \(\beta_0 = \beta_1 = \beta_2 = \beta_3 = \beta_4 =
\beta_5 = \beta_6\)
\(H_A\): At least one of these \(\beta_i\) is not 0
We can use some plots to visualize relationships of our variables. We can use a scatter plot to see if there is a relationship between the insulin levels and glucose levels.
plot(df$Insulin, df$Glucose, main = "Scatterplot of Glucose Levels Against Insulin Levels", xlab = "Glucose Levels (mg/dl)", ylab = "Insulin Levels (mIU/L)")
From the scatter plot, it suggests that there might be a linear
relationship between insulin levels and having diabetes. The values of
insulin levels tend to increase as the glucose levels increases.
Besides that, we can use a histogram to see if the distribution of a
variable and if it has any outliers. We will be using a histogram on the
variable, BMI.
hist(df$BMI, main = "Histogram of BMI", xlab = "BMI")
The histogram for BMI tells us that the distribution for BMI in our data is right skewed and there is likely outliers present.
In order to use a linear model to determine if we should reject our null hypothesis or fail to reject it, we would first need to train our model and test it. Therefore, we would need to split our data into a train and test set and we would be using a 80/20 split in our case.
data_split <- df %>% rsample::initial_split(
data = ,
prop = 0.8
)
training_data <- training(data_split)
test_data <- testing(data_split)
Now that we have our training and test data, we are ready to train our linear model
linear_model <- lm(`Outcome` ~ ., data = training_data)
Using our trained linear model, we can make predictions on the test set and see how accurate it’s predictions are
predictions <- predict(linear_model, newdata = test_data)
summary(lm(test_data$Outcome~predictions))
##
## Call:
## lm(formula = test_data$Outcome ~ predictions)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.92421 -0.24556 -0.09199 0.30179 0.83193
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.03783 0.06705 0.564 0.574
## predictions 0.78448 0.14062 5.579 3.46e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3976 on 77 degrees of freedom
## Multiple R-squared: 0.2878, Adjusted R-squared: 0.2786
## F-statistic: 31.12 on 1 and 77 DF, p-value: 3.462e-07
Based on the results, with a significance level of 95% we reject our null hypothesis in favor of our alternative hypothesis.
In conclusion, there is enough evidence to support our claim that at least one of the predictor variables are not a diagnostic measure for diabetes. Although they are not a diagnostic measure for diabetes but some of the measures could still be signs or risk factor for diabetes. More analysis would be needed to prove that that statement.