Introduction

The project is on determining certain diagnostic measures for diabetes. The goal of the data is to identify which diagnostic measures are correlated with diabetes then, training a linear model to identify if a person has diabetes. The dependent variable in the project would be if the person has diabetes and the predictor variables that I will be using are glucose levels, blood pressure levels, skin thickness, insulin, body mass index (BMI), diabetes pedigree function and age. Hence, the problem that I am trying to solve is whether glucose levels, blood pressure levels, skin thickness, insulin, body mass index (BMI), diabetes pedigree function and age shows if a person has diabetes. Suppose the following:

\(\beta_0\) = glucose levels
\(\beta_1\) = blood pressure levels
\(\beta_2\) = skin thickness
\(\beta_3\) = insulin
\(\beta_4\) = BMI
\(\beta_5\) = diabetes pedigree function
\(\beta_6\) = age

Importing the libraries used

library(RMySQL)
library(caret)
library(GGally)
library(tidyverse)
library(formatR)
library(shiny)
library(rpart)
library(tidymodels)

Setting Up MySQL Connection

In this step, we establish a MySQL connection with the database so that we can make queries later on.

mysqlconnection <- dbConnect(RMySQL::MySQL(),
                            dbname='final',
                            host='localhost',
                            port=3306,
                            user='root',
                            password='CMSC398e')

Obtaining and Tidying The Data

Making a SQL query with the connection that was set up to get the data from the database and storing that data in a R data frame. The data set used for this analysis is obtained from “https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset?datasetId=2527538

For the query, we are only getting the columns that are used in the analysis which are the dependent variable and the predictor variables. As for cleaning up the data, we handle missing data by simply removing that entry. Missing data are either values that have a value of 0 in any of the \(\beta_i\) values

query <- "
  DELETE FROM diabetes WHERE `Glucose` = 0 OR`BloodPressure` = 0 OR `SkinThickness` = 0 OR `Insulin` = 0 OR `BMI` = 0 OR `DiabetesPedigreeFunction` = 0 OR `Age` = 0;
"

result <- dbSendQuery(mysqlconnection, query)

query <- "
  SELECT `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, `BMI`, `DiabetesPedigreeFunction`, `Age`, `Outcome`  FROM diabetes;
"

result <- dbSendQuery(mysqlconnection, query)
df <- fetch(result)
result <- dbClearResult(result)
head(df)
##   Glucose BloodPressure SkinThickness Insulin  BMI DiabetesPedigreeFunction Age
## 1      89            66            23      94 28.1                    0.167  21
## 2     137            40            35     168 43.1                    2.288  33
## 3      78            50            32      88 31.0                    0.248  26
## 4     197            70            45     543 30.5                    0.158  53
## 5     189            60            23     846 30.1                    0.398  59
## 6     166            72            19     175 25.8                    0.587  51
##   Outcome
## 1       0
## 2       1
## 3       1
## 4       1
## 5       1
## 6       1

SQL Analysis

One of the biggest indicator for diabetes would probably be the glucose levels. We can visualize this by looking at the average glucose levels of people with diabetes and people without diabetes.

query <- "
  WITH T1 AS (
    SELECT `Outcome`, AVG(`Glucose`) OVER(PARTITION BY `Outcome`) AS `AvgGlucose` FROM diabetes GROUP BY `Outcome`
  )
  SELECT * FROM T1;
"

result <- dbSendQuery(mysqlconnection, query)
df_avg <- fetch(result)
result <- dbClearResult(result)
df_avg
##   Outcome AvgGlucose
## 1       0         89
## 2       1        137

We can see that the average glucose levels for people without diabetes is around 89mg/dl while the average glucose levels for people with diabetes is around 137mg/dl which is significantly higher. This would make us believe that having high glucose levels will be a diagnostic measure for diabetes. Now, we would like to compare the top 10 highest glucose levels for people without diabetes and top 10 lowest glucose levels for people with diabetes.

query <- "
  WITH T1 AS (
    SELECT `Outcome`, `Glucose`, DENSE_RANK() OVER (ORDER BY `Glucose`) AS `Rank` FROM diabetes WHERE `Outcome` = 1
  )
  SELECT * FROM T1 WHERE `Rank` <= 10;
"

result <- dbSendQuery(mysqlconnection, query)
df_diabetes <- fetch(result)
result <- dbClearResult(result)
df_diabetes
##    Outcome Glucose Rank
## 1        1      78    1
## 2        1      80    2
## 3        1      88    3
## 4        1      92    4
## 5        1      93    5
## 6        1      95    6
## 7        1      95    6
## 8        1      97    7
## 9        1     100    8
## 10       1     100    8
## 11       1     100    8
## 12       1     102    9
## 13       1     104   10
## 14       1     104   10

We can see that the 10 highest glucose levels for people without diabetes are [197, 193, 191, 180, 173, 165, 165, 165, 164, 158]. All of these values are significantly higher than the average glucose levels of people with and without diabetes.

query <- "
  WITH T1 AS (
    SELECT `Outcome`, `Glucose`, RANK() OVER (ORDER BY `Glucose` DESC) AS `Rank` FROM diabetes WHERE `Outcome` = 0
  )
  SELECT * FROM T1 WHERE `Rank` <= 10;
"

result <- dbSendQuery(mysqlconnection, query)
df_nondiabetes <- fetch(result)
result <- dbClearResult(result)
df_nondiabetes
##    Outcome Glucose Rank
## 1        0     197    1
## 2        0     193    2
## 3        0     191    3
## 4        0     180    4
## 5        0     173    5
## 6        0     165    6
## 7        0     165    6
## 8        0     165    6
## 9        0     164    9
## 10       0     158   10

Here, we can see that the 10 lowest glucose levels for people with diabetes are [78, 80, 88, 92, 93, 95, 95, 97, 100, 100]. All these are much lower than the average glucose levels of people with diabetes. Besides that, the top 3 lowest glucose levels for people with diabetes are lower than the average glucose levels of people without diabetes. Hence, we should look at the cumulative distribution for the glucose levels.

query <- "
  WITH T1 AS (
    SELECT `Outcome`, `Glucose`, ROUND(CUME_DIST() OVER (ORDER BY `Glucose`), 2) AS `CumDist` FROM diabetes WHERE `Outcome` = 0
  )
  SELECT * FROM T1 WHERE `Glucose` >= 137;
"

result <- dbSendQuery(mysqlconnection, query)
df_nondiabetes <- fetch(result)
result <- dbClearResult(result)
head(df_nondiabetes)
##   Outcome Glucose CumDist
## 1       0     137    0.85
## 2       0     139    0.86
## 3       0     139    0.86
## 4       0     139    0.86
## 5       0     139    0.86
## 6       0     140    0.87

Comparing the cumulative distribution of the glucose levels of people without diabetes as a whole to the average glucose level of people with diabetes, 85% of the values are less than or equal to it which is 137mg/dl.

query <- "
  WITH T1 AS (
    SELECT `Outcome`, `Glucose`, ROUND(CUME_DIST() OVER (ORDER BY `Glucose`), 2) AS `CumDist` FROM diabetes WHERE `Outcome` = 1
  )
  SELECT * FROM T1 WHERE `Glucose` < 89;
"

result <- dbSendQuery(mysqlconnection, query)
df_nondiabetes <- fetch(result)
result <- dbClearResult(result)
df_nondiabetes
##   Outcome Glucose CumDist
## 1       1      78    0.01
## 2       1      80    0.02
## 3       1      88    0.02

On the other hand, comparing the cumulative distribution of the glucose levels of people with diabetes as a whole to the average glucose level of people without diabetes, only around 2% of the values are less than or equal to it which is 89mg/dl.

R Analysis

Using just the average, it is hard to show whether blood glucose levels is a good diagnostic measure for diabetes. This would apply to all the other predictor variables. Therefore, we can use a linear regression to see if there is any correlation between the predictor variables and dependent variable.

The hypothesis for the analysis are

\(H_0\): \(\beta_0 = \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = \beta_6\)
\(H_A\): At least one of these \(\beta_i\) is not 0

We can use some plots to visualize relationships of our variables. We can use a scatter plot to see if there is a relationship between the insulin levels and glucose levels.

plot(df$Insulin, df$Glucose, main = "Scatterplot of Glucose Levels Against Insulin Levels", xlab = "Glucose Levels (mg/dl)", ylab = "Insulin Levels (mIU/L)")

From the scatter plot, it suggests that there might be a linear relationship between insulin levels and having diabetes. The values of insulin levels tend to increase as the glucose levels increases.
Besides that, we can use a histogram to see if the distribution of a variable and if it has any outliers. We will be using a histogram on the variable, BMI.

hist(df$BMI, main = "Histogram of BMI", xlab = "BMI")

The histogram for BMI tells us that the distribution for BMI in our data is right skewed and there is likely outliers present.

In order to use a linear model to determine if we should reject our null hypothesis or fail to reject it, we would first need to train our model and test it. Therefore, we would need to split our data into a train and test set and we would be using a 80/20 split in our case.

data_split <- df %>% rsample::initial_split(
  data = ,
  prop = 0.8
)
training_data <- training(data_split)
test_data <- testing(data_split)

Now that we have our training and test data, we are ready to train our linear model

linear_model <- lm(`Outcome` ~ ., data = training_data)

Using our trained linear model, we can make predictions on the test set and see how accurate it’s predictions are

predictions <- predict(linear_model, newdata = test_data)
summary(lm(test_data$Outcome~predictions))
## 
## Call:
## lm(formula = test_data$Outcome ~ predictions)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.92421 -0.24556 -0.09199  0.30179  0.83193 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.03783    0.06705   0.564    0.574    
## predictions  0.78448    0.14062   5.579 3.46e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3976 on 77 degrees of freedom
## Multiple R-squared:  0.2878, Adjusted R-squared:  0.2786 
## F-statistic: 31.12 on 1 and 77 DF,  p-value: 3.462e-07

Based on the results, with a significance level of 95% we reject our null hypothesis in favor of our alternative hypothesis.

Conclusion

In conclusion, there is enough evidence to support our claim that at least one of the predictor variables are not a diagnostic measure for diabetes. Although they are not a diagnostic measure for diabetes but some of the measures could still be signs or risk factor for diabetes. More analysis would be needed to prove that that statement.