by Yuan Shen Tay
Over the years, the cost of living has risen throughout the country, which raises the question of how much income is needed to sustain a household. The cost of living varies from state to state, and even from county to county, because of differences in housing prices and the cost of basic necessities.
In this tutorial, I will look at income for each county across Maryland. I will analyze the income trend for each county and predict income for Maryland as a whole. I will also check whether there is any correlation between poverty rates and income.
Before starting the analysis, we need to import the libraries that provide the tools for it. The libraries used in this project are:
pandas - pandas provides tools for data analysis and manipulation, mainly the DataFrame
numpy - NumPy is a scientific computing library for working with large multidimensional arrays
matplotlib - Matplotlib is a plotting library that we use to plot and visualize our data
sklearn - scikit-learn is a machine learning library with a large number of models; here we use its linear regression
statsmodels - statsmodels contains functions for estimating statistical models and conducting statistical tests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from statsmodels.formula.api import ols
The first stage of the data life cycle is data collection. The datasets used here come from the Maryland state open data website. The links for the datasets are:
https://opendata.maryland.gov/Demographic/Maryland-Per-Capita-Personal-Income-Constant-2012-/q4mi-9fr9
https://opendata.maryland.gov/Planning/Poverty-Rate-With-Margin-Of-Error-2010-2019/iudf-4y2j
https://opendata.maryland.gov/Demographic/Maryland-Median-Household-Income-By-Year-With-Marg/bvk4-qsxs
The website provides an Application Programming Interface (API) that lets us connect to it directly and download the data as CSV files. The datasets we are using contain the income per capita, poverty rate, and median household income for each county in Maryland.
income_per_capita = pd.read_csv('https://opendata.maryland.gov/resource/q4mi-9fr9.csv')
income_per_capita.head()
date_created | year | maryland | allegany_county | anne_arundel_county | baltimore_city | baltimore_county | calvert_county | caroline_county | carroll_county | ... | kent_county | montgomery_county | prince_george_s_county | queen_anne_s_county | somerset_county | st_mary_s_county | talbot_county | washington_county | wicomico_county | worcester_county | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | September 29, 2020 | 2010 | 52251 | 33436 | 55360 | 39699 | 51519 | 52421 | 36012 | 50397 | ... | 47491 | 74028 | 42782 | 51842 | 26854 | 49021 | 59672 | 38740 | 36230 | 46449 |
1 | September 29, 2020 | 2011 | 53432 | 33891 | 56884 | 40923 | 51530 | 53383 | 37370 | 51574 | ... | 48498 | 76529 | 43336 | 53009 | 26897 | 49897 | 60929 | 39419 | 36134 | 47180 |
2 | September 29, 2020 | 2012 | 53547 | 33946 | 57182 | 40744 | 51982 | 53326 | 38773 | 51859 | ... | 48590 | 76901 | 42842 | 53617 | 26830 | 49300 | 60548 | 39822 | 35419 | 48977 |
3 | September 29, 2020 | 2013 | 52352 | 34049 | 56537 | 41156 | 51151 | 52177 | 39330 | 51601 | ... | 48917 | 72577 | 42140 | 53209 | 27756 | 48499 | 60864 | 39646 | 35649 | 48894 |
4 | September 29, 2020 | 2014 | 53170 | 34808 | 57551 | 42857 | 52331 | 52948 | 39790 | 52960 | ... | 50625 | 72746 | 42425 | 54075 | 28881 | 49133 | 62278 | 40548 | 37043 | 49840 |
5 rows × 27 columns
poverty_rate = pd.read_csv('https://opendata.maryland.gov/resource/iudf-4y2j.csv')
poverty_rate.head()
date_created | year | estimate | maryland | allegany_county | anne_arundel_county | baltimore_city | baltimore_county | calvert_county | caroline_county | ... | kent_county | montgomery_county | prince_george_s_county | queen_anne_s_county | somerset_county | st_mary_s_county | talbot_county | washington_county | wicomico_county | worcester_county | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2020-09-29T00:00:00.000 | 2010 | Poverty Rate | 9.9 | 17.1 | 6.6 | 24.7 | 8.2 | 6.2 | 13.0 | ... | 14.2 | 7.5 | 9.4 | 7.3 | 19.3 | 7.5 | 9.7 | 11.4 | 16.6 | 10.6 |
1 | 2020-09-29T00:00:00.000 | 2010 | MOE | 0.3 | 3.0 | 1.1 | 1.8 | 1.2 | 1.4 | 2.8 | ... | 3.1 | 0.8 | 1.2 | 1.7 | 5.5 | 1.9 | 2.2 | 2.1 | 2.8 | 2.7 |
2 | 2020-09-29T00:00:00.000 | 2011 | Poverty Rate | 10.2 | 19.1 | 6.5 | 24.5 | 9.6 | 6.1 | 13.1 | ... | 13.9 | 6.7 | 9.4 | 8.7 | 26.2 | 8.6 | 10.8 | 11.8 | 17.7 | 13.0 |
3 | 2020-09-29T00:00:00.000 | 2011 | MOE | 0.3 | 3.4 | 1.2 | 1.7 | 1.2 | 1.5 | 2.7 | ... | 3.4 | 0.8 | 1.0 | 1.7 | 6.0 | 1.8 | 2.2 | 1.7 | 2.6 | 2.6 |
4 | 2022-04-08T00:00:00.000 | 2012 | Poverty Rate | 10.4 | 18.1 | 6.3 | 24.5 | 9.7 | 7.0 | 15.7 | ... | 14.0 | 6.6 | 10.3 | 8.2 | 29.6 | 8.4 | 9.7 | 13.7 | 16.7 | 11.1 |
5 rows × 28 columns
median_income = pd.read_csv('https://opendata.maryland.gov/resource/bvk4-qsxs.csv')
median_income.head()
date_created | year | data | maryland | allegany_county | anne_arundel_county | baltimore_city | baltimore_county | calvert_county | caroline_county | ... | kent_county | montgomery_county | prince_george_s_county | queen_anne_s_county | somerset_county | st_mary_s_county | talbot_county | washington_county | wicomico_county | worcester_county | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | September 29, 2020 | 2010 | Income | 68933 | 37083 | 80908 | 38186 | 62300 | 86536 | 55480 | ... | 49017 | 88559 | 69524 | 78503 | 38134 | 81559 | 56806 | 51610 | 47702 | 55492 |
1 | September 29, 2020 | 2010 | MOE | 833 | 2826 | 2311 | 1414 | 2006 | 5064 | 2965 | ... | 4582 | 2710 | 1609 | 5181 | 2747 | 5070 | 3948 | 3327 | 3097 | 3507 |
2 | September 29, 2020 | 2011 | Income | 70075 | 38504 | 82980 | 38478 | 62309 | 88406 | 50809 | ... | 49795 | 92288 | 70114 | 75158 | 35426 | 80943 | 55145 | 52028 | 45788 | 48472 |
3 | September 29, 2020 | 2011 | MOE | 760 | 2693 | 3430 | 1536 | 1728 | 4369 | 4213 | ... | 4603 | 2758 | 1911 | 6363 | 3426 | 2717 | 4929 | 2928 | 3582 | 4653 |
4 | September 29, 2020 | 2012 | Income | 71169 | 38670 | 87083 | 39077 | 62413 | 87215 | 48772 | ... | 49969 | 94365 | 69258 | 79012 | 34454 | 85478 | 61529 | 52604 | 50204 | 55875 |
5 rows × 28 columns
Now that we have collected our data, the next step is to tidy it: filter out everything that is not used in the analysis and handle missing entries.
First, we check each dataset for missing data.
income_per_capita.isna().sum()
date_created              0
year                      0
maryland                  0
allegany_county           0
anne_arundel_county       0
baltimore_city            0
baltimore_county          0
calvert_county            0
caroline_county           0
carroll_county            0
cecil_county              0
charles_county            0
dorchester_county         0
frederick_county          0
garrett_county            0
harford_county            0
howard_county             0
kent_county               0
montgomery_county         0
prince_george_s_county    0
queen_anne_s_county       0
somerset_county           0
st_mary_s_county          0
talbot_county             0
washington_county         0
wicomico_county           0
worcester_county          0
dtype: int64
poverty_rate.isna().sum()
date_created              0
year                      0
estimate                  0
maryland                  0
allegany_county           0
anne_arundel_county       0
baltimore_city            0
baltimore_county          0
calvert_county            0
caroline_county           0
carroll_county            0
cecil_county              0
charles_county            0
dorchester_county         0
frederick_county          0
garrett_county            0
harford_county            0
howard_county             0
kent_county               0
montgomery_county         0
prince_george_s_county    0
queen_anne_s_county       0
somerset_county           0
st_mary_s_county          0
talbot_county             0
washington_county         0
wicomico_county           0
worcester_county          0
dtype: int64
median_income.isna().sum()
date_created              0
year                      0
data                      0
maryland                  0
allegany_county           0
anne_arundel_county       0
baltimore_city            0
baltimore_county          0
calvert_county            0
caroline_county           0
carroll_county            0
cecil_county              0
charles_county            0
dorchester_county         0
frederick_county          0
garrett_county            0
harford_county            0
howard_county             0
kent_county               0
montgomery_county         0
prince_george_s_county    0
queen_anne_s_county       0
somerset_county           0
st_mary_s_county          0
talbot_county             0
washington_county         0
wicomico_county           0
worcester_county          0
dtype: int64
Fortunately, the count of missing entries is 0 for every column, so our data has no missing entries. If there were any, we could call dropna() to drop the rows that contain them. However, dropping is not always the right way to handle missing entries, since it also discards the other values in those rows.
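As a rough illustration of the choice between dropping and filling, the sketch below uses a small made-up table (the names `toy`, `county_a`, and `county_b` are hypothetical and not part of the Maryland data). It compares `dropna()` with linear interpolation along the year axis, which is one possible alternative for yearly series like ours.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only; the real tables have no gaps.
toy = pd.DataFrame(
    {"county_a": [100.0, np.nan, 120.0], "county_b": [50.0, 55.0, np.nan]},
    index=pd.Index([2010, 2011, 2012], name="year"),
)

# Option 1: drop every year that has at least one missing value.
dropped = toy.dropna()

# Option 2: fill gaps by linear interpolation along the year axis.
# A gap at the end of a column is filled with the last known value.
interpolated = toy.interpolate(method="linear")

print(dropped)
print(interpolated)
```

Here dropping keeps only the 2010 row, while interpolation keeps all three years. Which option is appropriate depends on how many values are missing and why.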
Next, we drop the rows and columns that are not used. For the rows, we discard those marked MOE in the poverty rate and median income tables, as they hold the margin of error. For the columns, we only need the year and the value for each county, so we drop all other columns and set the year as the index.
# dropping data from the income per capita table
income_per_capita = income_per_capita.drop(columns=['date_created'])
income_per_capita = income_per_capita.set_index('year')
income_per_capita.head()
year | maryland | allegany_county | anne_arundel_county | baltimore_city | baltimore_county | calvert_county | caroline_county | carroll_county | cecil_county | charles_county | ... | kent_county | montgomery_county | prince_george_s_county | queen_anne_s_county | somerset_county | st_mary_s_county | talbot_county | washington_county | wicomico_county | worcester_county |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2010 | 52251 | 33436 | 55360 | 39699 | 51519 | 52421 | 36012 | 50397 | 39607 | 49941 | ... | 47491 | 74028 | 42782 | 51842 | 26854 | 49021 | 59672 | 38740 | 36230 | 46449 |
2011 | 53432 | 33891 | 56884 | 40923 | 51530 | 53383 | 37370 | 51574 | 40235 | 50714 | ... | 48498 | 76529 | 43336 | 53009 | 26897 | 49897 | 60929 | 39419 | 36134 | 47180 |
2012 | 53547 | 33946 | 57182 | 40744 | 51982 | 53326 | 38773 | 51859 | 40299 | 50023 | ... | 48590 | 76901 | 42842 | 53617 | 26830 | 49300 | 60548 | 39822 | 35419 | 48977 |
2013 | 52352 | 34049 | 56537 | 41156 | 51151 | 52177 | 39330 | 51601 | 40262 | 49016 | ... | 48917 | 72577 | 42140 | 53209 | 27756 | 48499 | 60864 | 39646 | 35649 | 48894 |
2014 | 53170 | 34808 | 57551 | 42857 | 52331 | 52948 | 39790 | 52960 | 40944 | 49208 | ... | 50625 | 72746 | 42425 | 54075 | 28881 | 49133 | 62278 | 40548 | 37043 | 49840 |
5 rows × 25 columns
# dropping data from the poverty rate table
poverty_rate = poverty_rate.loc[poverty_rate['estimate'] == 'Poverty Rate']
poverty_rate = poverty_rate.drop(columns=['date_created', 'estimate'])
poverty_rate = poverty_rate.set_index('year')
poverty_rate.head()
year | maryland | allegany_county | anne_arundel_county | baltimore_city | baltimore_county | calvert_county | caroline_county | carroll_county | cecil_county | charles_county | ... | kent_county | montgomery_county | prince_george_s_county | queen_anne_s_county | somerset_county | st_mary_s_county | talbot_county | washington_county | wicomico_county | worcester_county |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2010 | 9.9 | 17.1 | 6.6 | 24.7 | 8.2 | 6.2 | 13.0 | 5.4 | 10.5 | 6.2 | ... | 14.2 | 7.5 | 9.4 | 7.3 | 19.3 | 7.5 | 9.7 | 11.4 | 16.6 | 10.6 |
2011 | 10.2 | 19.1 | 6.5 | 24.5 | 9.6 | 6.1 | 13.1 | 5.5 | 9.7 | 7.7 | ... | 13.9 | 6.7 | 9.4 | 8.7 | 26.2 | 8.6 | 10.8 | 11.8 | 17.7 | 13.0 |
2012 | 10.4 | 18.1 | 6.3 | 24.5 | 9.7 | 7.0 | 15.7 | 6.3 | 11.9 | 8.6 | ... | 14.0 | 6.6 | 10.3 | 8.2 | 29.6 | 8.4 | 9.7 | 13.7 | 16.7 | 11.1 |
2013 | 10.2 | 18.6 | 7.3 | 22.7 | 9.5 | 6.9 | 16.7 | 6.8 | 9.8 | 8.0 | ... | 14.9 | 7.0 | 9.9 | 8.4 | 28.5 | 8.2 | 10.9 | 12.0 | 16.5 | 13.1 |
2014 | 10.4 | 18.5 | 6.7 | 23.3 | 9.8 | 7.2 | 16.0 | 5.9 | 10.6 | 7.2 | ... | 13.8 | 7.2 | 10.3 | 7.5 | 25.5 | 8.6 | 11.7 | 13.8 | 16.9 | 11.9 |
5 rows × 25 columns
# dropping data from the median income table
median_income = median_income.loc[median_income['data'] == 'Income']
median_income = median_income.drop(columns=['date_created', 'data'])
median_income = median_income.set_index('year')
median_income.head()
year | maryland | allegany_county | anne_arundel_county | baltimore_city | baltimore_county | calvert_county | caroline_county | carroll_county | cecil_county | charles_county | ... | kent_county | montgomery_county | prince_george_s_county | queen_anne_s_county | somerset_county | st_mary_s_county | talbot_county | washington_county | wicomico_county | worcester_county |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2010 | 68933 | 37083 | 80908 | 38186 | 62300 | 86536 | 55480 | 80291 | 61506 | 83078 | ... | 49017 | 88559 | 69524 | 78503 | 38134 | 81559 | 56806 | 51610 | 47702 | 55492 |
2011 | 70075 | 38504 | 82980 | 38478 | 62309 | 88406 | 50809 | 82553 | 61191 | 88575 | ... | 49795 | 92288 | 70114 | 75158 | 35426 | 80943 | 55145 | 52028 | 45788 | 48472 |
2012 | 71169 | 38670 | 87083 | 39077 | 62413 | 87215 | 48772 | 79304 | 62443 | 89203 | ... | 49969 | 94365 | 69258 | 79012 | 34454 | 85478 | 61529 | 52604 | 50204 | 55875 |
2013 | 72482 | 39994 | 85685 | 41988 | 64624 | 91993 | 46015 | 82073 | 64880 | 87577 | ... | 55695 | 97873 | 71682 | 80143 | 36106 | 78274 | 57525 | 55643 | 47536 | 52276 |
2014 | 73851 | 39808 | 86654 | 41895 | 67766 | 92446 | 49573 | 84500 | 62198 | 86703 | ... | 53288 | 97279 | 71904 | 80650 | 38376 | 84686 | 54836 | 54606 | 51927 | 55691 |
5 rows × 25 columns
Lastly, I will combine all the county data into a single table indexed by a MultiIndex of year and county.
# combining all county data
# getting the years that we are analyzing
years = poverty_rate.index
# getting all the counties in Maryland
counties = income_per_capita.columns
counties = counties[1:]
all_data = pd.DataFrame()
index = [years, counties]
index = pd.MultiIndex.from_product(index, names = ['years', 'county'])
per_capita = []
median = []
poverty = []
for year in years:
    # skip the first column ('maryland') so only counties are kept
    per_capita.extend(income_per_capita.loc[year][1:].values)
    median.extend(median_income.loc[year][1:].values)
    poverty.extend(poverty_rate.loc[year][1:].values)
all_data['income_per_capita'] = per_capita
all_data['median_income'] = median
all_data['poverty_rate'] = poverty
all_data = all_data.set_index(index)
all_data.head()
years | county | income_per_capita | median_income | poverty_rate |
---|---|---|---|---|
2010 | allegany_county | 33436 | 37083 | 17.1 |
2010 | anne_arundel_county | 55360 | 80908 | 6.6 |
2010 | baltimore_city | 39699 | 38186 | 24.7 |
2010 | baltimore_county | 51519 | 62300 | 8.2 |
2010 | calvert_county | 52421 | 86536 | 6.2 |
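The same year-by-county layout can also be reached without an explicit loop. The sketch below is one alternative, not the method used above: it reshapes each wide table with `stack()` and joins the results with `pd.concat`. The tables and the helper `to_long` are hypothetical and use made-up numbers.

```python
import pandas as pd

# Toy wide tables (made-up numbers): indexed by year, one column per county.
years = pd.Index([2010, 2011], name="years")
income_wide = pd.DataFrame({"county_a": [100, 110], "county_b": [200, 190]}, index=years)
poverty_wide = pd.DataFrame({"county_a": [12.0, 11.5], "county_b": [6.0, 6.4]}, index=years)


def to_long(wide, value_name):
    """Turn a year-by-county table into a Series indexed by (years, county)."""
    long = wide.stack()
    long.index = long.index.set_names(["years", "county"])
    return long.rename(value_name)


# Each Series becomes one column; rows are aligned on the shared MultiIndex.
combined = pd.concat(
    [to_long(income_wide, "income"), to_long(poverty_wide, "poverty_rate")],
    axis=1,
)
print(combined)
```

Because the rows are aligned on the index rather than on their position in a list, this version does not depend on every table listing the counties in the same order.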
Now that we have tidied up our data, we are ready to analyze and visualize it, which is the next step in the data science pipeline.
To analyze the income trend, I will look at the median household income and income per capita for each county from 2010 to 2019. First, though, we want to see the income trend in Maryland as a whole.
# extracting the income per capita and median income for maryland
maryland_per_capita = income_per_capita['maryland']
maryland_median = median_income['maryland']
# setting the size of the graph
plt.figure(figsize=(10,10))
# plotting the income graph
plt.plot(years, maryland_per_capita, label="Income Per Capita")
plt.plot(years, maryland_median, label="Median Income")
plt.title('Income Graph of Maryland from 2010 to 2019')
plt.xlabel('Year')
plt.ylabel('Income')
plt.legend()
plt.show()
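As we can see, both the income per capita and the median income in Maryland as a whole show an increasing trend. The median household income is also much higher than the income per capita. This makes sense: a household can pool the income of several earners, whereas income per capita divides total income by the whole population, including people who earn nothing. The two series are also not directly comparable in dollar terms, since the per capita dataset is, by its title, expressed in constant 2012 dollars.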
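To make the difference between the two measures concrete, here is a small worked example with four hypothetical households. All numbers are invented for illustration.

```python
from statistics import median

# Each inner list is one hypothetical household; each number is one
# resident's income (0 for children or other non-earners).
households = [[40, 35, 0, 0], [60, 0], [30, 30, 0], [90]]

# Household income is the sum over everyone living in the household.
household_incomes = [sum(h) for h in households]

# Income per capita divides total income by every resident.
population = sum(len(h) for h in households)
per_capita = sum(household_incomes) / population

# Median household income is the middle value of the household totals.
median_household = median(household_incomes)

print(per_capita, median_household)
```

With these made-up numbers the per capita figure is 28.5 while the median household figure is 67.5, which mirrors the gap seen in the Maryland graph.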
Now, we want to visualize the income trend for each county in Maryland to see whether all counties follow the same trend.
# plotting the graph
plt.figure(figsize=(15,10))
for county in counties:
    # select this county's rows from the (years, county) MultiIndex
    median = all_data.xs(county, level='county')['median_income']
    plt.plot(years, median, marker = 'o', label=county)
plt.title('Median Income Graph of Counties from 2010 to 2019')
plt.xlabel('Year')
plt.ylabel('Income')
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.1), ncol = 5)
plt.show()
Based on the median income graph, not all counties follow the same trend across the years. In 2019, Baltimore City, Somerset County, St. Mary's County and Washington County show a decrease. Despite that, every county has a net increase in median income compared to 2010.
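Observations like these can also be checked numerically instead of being read off the graph. The sketch below assumes a year-by-county table and uses made-up values; the names `toy_median`, `county_a`, `county_b`, and `county_c` are hypothetical.

```python
import pandas as pd

# Made-up median incomes for three hypothetical counties.
toy_median = pd.DataFrame(
    {"county_a": [50, 52, 55], "county_b": [80, 86, 84], "county_c": [40, 39, 43]},
    index=pd.Index([2017, 2018, 2019], name="year"),
)

# Year-over-year change; the first row is NaN because it has no predecessor.
yearly_change = toy_median.diff()

# Counties whose value fell in the final year.
last_change = yearly_change.iloc[-1]
fell_last_year = last_change[last_change < 0].index.tolist()

# Net change between the first and the last year of the table.
net_change = toy_median.iloc[-1] - toy_median.iloc[0]
all_net_up = bool((net_change > 0).all())

print(fell_last_year, all_net_up)
```

Applied to the tidy `median_income` table from earlier, the same two checks would list the counties that fell in 2019 and confirm whether every county ended above its 2010 value.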
plt.figure(figsize=(15,10))
for county in counties:
    # select this county's rows from the (years, county) MultiIndex
    per_capita = all_data.xs(county, level='county')['income_per_capita']
    plt.plot(years, per_capita, marker = 'o', label=county)
plt.title('Income Per Capita Graph of Counties from 2010 to 2019')
plt.xlabel('Year')
plt.ylabel('Income')
plt.legend(loc=9, bbox_to_anchor=(0.5, -0.1), ncol = 5)
plt.show()
In the income per capita graph, all counties show an increasing trend, but at different magnitudes. Because the trends differ, we want to visualize the income of every county year by year to see whether the different trends and magnitudes have any impact on the income distribution across Maryland.
# plotting a graph for each year
for year in years:
    # extracting the income per capita and median income for the year
    per_capita = all_data.xs(year, level='years')['income_per_capita']
    median = all_data.xs(year, level='years')['median_income']
    # setting the size of the graph
    plt.figure(figsize=(10,10))
    # plotting the income graph
    plt.plot(counties, per_capita, label="Income Per Capita")
    plt.plot(counties, median, label="Median Income")
    plt.title('Income Graph in ' + str(year))
    plt.xlabel('County')
    plt.xticks(rotation = 90)
    plt.ylabel('Income')
    plt.legend()
    plt.show()
Looking at the counties side by side, the shape of the graph changes little over the years for both the median income and the income per capita. This shows that, despite the increase in income over time, the distribution of income across the counties stays much the same. It also shows that the differences in each county's trend did not have much impact on how income is distributed across counties.
Hence, we can still say that income shows an increasing trend in all counties in Maryland over the years.
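One way to put a number on "the shape stays the same" is to rank the counties by income within each year and correlate the rankings between years. This is an added check, not part of the original analysis, and the table below uses invented values for three hypothetical counties.

```python
import pandas as pd

# Made-up incomes for three hypothetical counties in three years.
toy_income = pd.DataFrame(
    {"county_a": [30, 33, 35], "county_b": [70, 72, 78], "county_c": [50, 55, 54]},
    index=pd.Index([2010, 2014, 2019], name="year"),
)

# Rank counties within each year (1 = lowest income).
ranks = toy_income.rank(axis=1)

# Pearson correlation of ranks is the Spearman rank correlation.
# Transpose so each year is a column, then correlate the years pairwise.
rank_corr = ranks.T.corr()

print(rank_corr.loc[2010, 2019])
```

A value close to 1 between the first and last year means the ordering of counties barely changed, which is what the side-by-side graphs suggest.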
First, we look at the trend of Maryland's poverty rate over the years.
# extracting the maryland poverty rate
maryland_poverty_rate = poverty_rate['maryland']
# setting the size of the graph
plt.figure(figsize=(10,10))
# plotting the poverty rate graph
plt.plot(years, maryland_poverty_rate)
plt.title('Graph of Maryland Poverty Rate from 2010 to 2019')
plt.xlabel('Year')
plt.ylabel('Poverty Rate')
plt.show()
The graph shows a decreasing trend over the years, which suggests that the increase in income may be related to the decrease in poverty rates. Next, we look at the poverty rates across counties in 2019.
# setting the size of the graph
plt.figure(figsize=(10,10))
# plotting the poverty rate graph
plt.plot(counties, poverty_rate.loc[years[-1]][1:])
plt.title('Graph of Poverty Rate by County')
plt.xlabel('County')
plt.xticks(rotation=90)
plt.ylabel('Poverty Rate')
plt.show()
The poverty rates appear to be negatively correlated with income, as the counties with higher incomes have lower poverty rates.
The next phase in the data science pipeline is modeling: techniques such as linear regression, decision trees and k-nearest neighbors give us a predictive model of our data. Using the predictive model, we can carry out hypothesis testing.
In this part, we fit a linear regression to our data and use its equation to predict future income values. A linear model has the equation $y = mx + c$, where $m$ is the slope and $c$ is the intercept.
# using the years (the index) as x and the median income as y
x = maryland_median.index
y = maryland_median
m, c = np.polyfit(x,y, deg=1)
plt.figure(figsize=(10,10))
plt.plot(x, y, 'o', x, m*x + c)
plt.xlabel('Year')
plt.ylabel('Median Income')
plt.title('Predicted Graph of Median Income')
plt.show()
print('The slope, m is ' + str(m) +' and the intercept, c is ' + str(c))
The slope, m is 1933.1151515151494 and the intercept, c is -3818109.2727272687
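Therefore, our equation is $y = 1933.12x - 3818109.27$, where $x$ is the year and $y$ is the predicted median income. Evaluating it for 2022: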
m*2022+c
90649.56363636348
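Using this equation, our predicted median income for this year, 2022, is \$90,649.56. This is an extrapolation three years past the 2010 to 2019 data, so it assumes the linear trend continues.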
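The fit-then-extrapolate step can be sketched on its own with made-up numbers. The version below also centers the years before fitting, which is an optional variation (not what the cell above does) that keeps the intercept at the scale of the data instead of in the millions. The arrays and the helper `predict` are hypothetical.

```python
import numpy as np

# Made-up yearly values, for illustration only.
years = np.array([2010, 2011, 2012, 2013, 2014], dtype=float)
income = np.array([100.0, 103.0, 105.0, 108.0, 110.0])

# Fit y = m * (x - x0) + c_centered, with x0 the mean year.
x0 = years.mean()
m, c_centered = np.polyfit(years - x0, income, deg=1)

# Convert back to the uncentered form y = m * x + c.
c = c_centered - m * x0


def predict(year):
    """Predicted value for a given year from the centered fit."""
    return m * (year - x0) + c_centered


print(m, c, predict(2016))
```

Both forms describe the same line; the centered intercept is simply the fitted value at the mean year, which is easier to interpret than a value at year 0.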
I will be exploring the relationship between poverty rates and income. In order to see if there is a relationship between them, I will be using a linear regression.
Null Hypothesis, $H_0$: There is no relationship between poverty rates and income
Alternative Hypothesis, $H_1$: There is a relationship between poverty rates and income
# creating our model
per_capita_model = linear_model.LinearRegression()
median_model = linear_model.LinearRegression()
x1 = np.array(all_data['income_per_capita']).reshape(len(all_data['income_per_capita']), 1)
x2 = np.array(all_data['median_income']).reshape(len(all_data['median_income']), 1)
y = np.array(all_data['poverty_rate']).reshape(len(all_data['poverty_rate']), 1)
# fitting the data into our model
per_capita_model.fit(x1, y)
median_model.fit(x2, y)
LinearRegression()
Now that we have created our linear models and trained them using the fit() function, we want to visualize their predictions and look at the results.
plt.figure(figsize=(10,10))
plt.plot(all_data['income_per_capita'], all_data['poverty_rate'], 'o')
# using the model to predict the values
predicted = per_capita_model.predict(x1)
plt.plot(all_data['income_per_capita'], predicted)
plt.xlabel('Income per capita')
plt.ylabel('Poverty Rate')
plt.title('Poverty Rate vs Income Per Capita')
plt.show()
results = ols(formula = 'income_per_capita ~ poverty_rate', data = all_data).fit()
results.summary()
Dep. Variable: | income_per_capita | R-squared: | 0.523 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.521 |
Method: | Least Squares | F-statistic: | 260.7 |
Date: | Thu, 12 May 2022 | Prob (F-statistic): | 4.16e-40 |
Time: | 04:23:15 | Log-Likelihood: | -2489.8 |
No. Observations: | 240 | AIC: | 4984. |
Df Residuals: | 238 | BIC: | 4991. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 6.672e+04 | 1174.285 | 56.817 | 0.000 | 6.44e+04 | 6.9e+04 |
poverty_rate | -1519.6825 | 94.115 | -16.147 | 0.000 | -1705.088 | -1334.277 |
Omnibus: | 44.671 | Durbin-Watson: | 1.866 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 64.169 |
Skew: | 1.148 | Prob(JB): | 1.16e-14 |
Kurtosis: | 4.070 | Cond. No. | 29.3 |
# plotting the model and getting the results
plt.figure(figsize=(10,10))
plt.plot(all_data['median_income'], all_data['poverty_rate'], 'o')
# using the model to predict the values
predicted = median_model.predict(x2)
plt.plot(all_data['median_income'], predicted)
plt.xlabel('Median Income')
plt.ylabel('Poverty Rate')
plt.title('Poverty Rate vs Median Income')
plt.show()
results = ols(formula = 'median_income ~ poverty_rate', data = all_data).fit()
results.summary()
Dep. Variable: | median_income | R-squared: | 0.745 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.744 |
Method: | Least Squares | F-statistic: | 697.1 |
Date: | Thu, 12 May 2022 | Prob (F-statistic): | 1.14e-72 |
Time: | 04:23:16 | Log-Likelihood: | -2568.6 |
No. Observations: | 240 | AIC: | 5141. |
Df Residuals: | 238 | BIC: | 5148. |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 1.083e+05 | 1630.796 | 66.420 | 0.000 | 1.05e+05 | 1.12e+05 |
poverty_rate | -3450.8962 | 130.703 | -26.402 | 0.000 | -3708.380 | -3193.413 |
Omnibus: | 10.874 | Durbin-Watson: | 2.037 |
---|---|---|---|
Prob(Omnibus): | 0.004 | Jarque-Bera (JB): | 11.641 |
Skew: | 0.532 | Prob(JB): | 0.00297 |
Kurtosis: | 2.824 | Cond. No. | 29.3 |
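To understand how well the fitted line matches our data, we can use the coefficient of determination, $R^2$, which measures on a scale of 0 to 1 the share of the variance in one variable that the regression explains. Note that the OLS formulas above put income on the left-hand side, while the plotted models predict poverty rate from income; for a regression with a single predictor, $R^2$ is the same in both directions. The $R^2$ for poverty rate and income per capita is about $0.523$, and for poverty rate and median income it is about $0.745$. Median income therefore accounts for roughly three quarters of the variation in poverty rate, a reasonably good fit, while income per capita accounts for only about half.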
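The point that $R^2$ does not depend on which variable is treated as the response in a one-predictor regression can be verified directly. The sketch below uses made-up numbers and a hypothetical helper `r_squared`; it also shows that this $R^2$ equals the squared Pearson correlation.

```python
import numpy as np

# Made-up incomes (in thousands) and poverty rates (in percent).
income = np.array([30.0, 45.0, 50.0, 62.0, 75.0])
poverty = np.array([20.0, 14.0, 12.0, 9.0, 7.0])


def r_squared(x, y):
    """R^2 of the least-squares line that predicts y from x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return 1.0 - residuals.var() / y.var()


r2_poverty_on_income = r_squared(income, poverty)
r2_income_on_poverty = r_squared(poverty, income)

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(income, poverty)[0, 1]

print(r2_poverty_on_income, r2_income_on_poverty, r ** 2)
```

All three printed values agree up to floating-point error, and `r` is negative here because the made-up poverty rates fall as income rises.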
Next, we look at the p-value, which tells us how likely it would be to see a relationship at least this strong if the null hypothesis were true. Typically a significance level of $0.05$ is used, and if the p-value is below it, we reject the null hypothesis. The p-values for both models are displayed as $0.000$ (the F-statistic probabilities are $4.16 \times 10^{-40}$ and $1.14 \times 10^{-72}$), so data like ours would be extremely unlikely if income and poverty rate were unrelated. Hence, we reject the null hypothesis in favor of the alternative: there is statistically significant evidence of a relationship between income and poverty rates. Based on the graphs, the relationship is inverse, with higher income going together with lower poverty rates. The regression shows an association; on its own it does not show that income causes the change in poverty rates.
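To illustrate what a p-value measures, the sketch below runs a permutation test on made-up data. This is a different technique from the t-test reported in the OLS summary: it shuffles one variable many times and counts how often the shuffled correlation is at least as strong as the observed one. It assumes the observations are independent, which is only an approximation for repeated yearly measurements of the same counties.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up incomes (in thousands) and poverty rates (in percent).
income = np.array([30.0, 45.0, 50.0, 62.0, 75.0, 41.0, 58.0, 66.0])
poverty = np.array([20.0, 14.0, 12.0, 9.0, 7.0, 16.0, 11.0, 8.0])

# Correlation actually observed in the data.
observed = np.corrcoef(income, poverty)[0, 1]

# Under the null hypothesis the pairing is arbitrary, so shuffle one side
# and count how often chance alone gives a correlation this strong.
n_permutations = 5000
count = 0
for _ in range(n_permutations):
    shuffled = rng.permutation(poverty)
    if abs(np.corrcoef(income, shuffled)[0, 1]) >= abs(observed):
        count += 1

# Add one to both counts so the estimate is never exactly zero.
p_value = (count + 1) / (n_permutations + 1)

print(observed, p_value)
```

A small `p_value` here plays the same role as the small values in the summary tables: it says the observed relationship would rarely arise if the two variables were unrelated.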
As a result of our analysis, we confirmed that income in Maryland is increasing over time. Each county follows a different trend, with income rising or falling from year to year, but every county ends the period with a higher income than it started with. However, these differences in trend do not change the income distribution in Maryland: counties with higher incomes stay at the top and counties with lower incomes stay at the bottom.
Next, we concluded that there is a correlation between income and poverty rate: the higher the income, the lower the poverty rate of the county. Since the distribution of incomes across Maryland's counties remains the same, the distribution of poverty rates is likely to remain largely unchanged as well.
In the future, I would like to extend these findings by checking whether the increase in income keeps up with the increase in living expenses over the years. I would also like to extend the analysis to the whole US rather than limiting it to Maryland.