The Impact of GDP on Life Expectancy: A Global Analysis

STAT 331 - Final Project

Authors

Kevin Cisneros-Cuevas, Soren Fliegel, Chris Liu, Haoxian Liu

Introduction

Life expectancy is often seen as a key measure of a population’s well-being, but what factors influence how long people live? One widely studied relationship in economics and public health is between life expectancy and Gross Domestic Product (GDP) per capita (Shkolnikov et al., 2019). What is GDP per capita? Imagine a country as a workshop. GDP is like the total value of everything produced in that workshop: all the goods and services created, minus the raw materials it took to make them (Ibezim, 2023). Essentially, it is the grand total of a nation’s economic output, and it is used to measure the overall scale of a country’s economic production. GDP only tells us how big the nation’s economy is, but what if we want to know the economic output per person? Then we divide the country’s total GDP by its total population. The result tells us how much, on average, each citizen shares in that prosperity. So, does a higher average income translate to longer lives for a country’s citizens? This study explores that relationship using data from Gapminder, which has historical and projected figures on life expectancy and GDP per capita for 196 countries. Our goal is to determine whether wealthier nations consistently experience longer lifespans, and to assess the accuracy of GDP per capita as a predictor of life expectancy.

1. The Data

The first dataset to explore is life expectancy, assembled by “gapminder.org” from various sources, with data ranging from 1800 to 2019 and projections until 2100 from the United Nations. Each observation is the number of years a newborn would be expected to live (given the mortality rates at the time) in a given year and country. There are 196 listed countries, including Taiwan, Vatican City, and Palestine. We chose to include these in the data, highlighting Taiwan and Palestine in particular, just as our sources did, even though their statehood is disputed. Next, we will explore GDP (Gross Domestic Product) per capita, also assembled by “gapminder.org”. GDP per capita, like life expectancy, can be a way to compare the average standard of living across countries, here through how much economic output the average person creates. However, because the data are viewed in aggregate, GDP per capita does not account for inequality and differences that may arise from social stratification. Each observation is the GDP per capita of a country in a given year, expressed in inflation-adjusted 2017 international dollars (which are equivalent to USD). This dataset includes blended data from 1800 to 2019, with projections again to 2100.

Hypothesized Relationship between the Variables

The relationship between GDP per capita and life expectancy has already been examined in other studies and is known as the Preston curve. According to research from the Centre for Economic Policy Research (CEPR), higher income levels usually lead to an increase in life expectancy. We expect to see this relationship in our data, although other factors such as healthcare policies or political stability can also influence the relationship between our two variables.

1.1 Cleaning the Data

The life expectancy table covers the years 1800 to 2100, which is a large range. Some countries in this table have missing values for multiple years, which is undesirable if we want to identify accurate trends. We could forward-fill or backward-fill the missing data, but imputation might skew our results, so we instead removed the countries with many missing values. We treated any country with more than 20% missing values as unreliable (20% is our threshold for reliable and representative data that minimizes the bias introduced by imputation). As a result, 10 countries were dropped from the dataset, including Andorra, Dominica, and St. Kitts and Nevis. The large amount of missing data for these countries is likely due to their small size, which may mean a limited statistical infrastructure that does not allow them to collect data regularly.
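The threshold rule described above can be sketched in a few lines. This is only an illustrative sketch using the Python standard library: the country names, the values, and the helper names `missing_fraction` and `drop_sparse_countries` are hypothetical and are not taken from the project’s data or code.

```python
# Hypothetical wide-format data: country -> yearly life expectancy (None = missing).
life_expectancy = {
    "Country A": [30.1, 30.4, None, 31.0, 31.2],  # 20% missing -> kept
    "Country B": [None, None, 28.9, None, 29.5],  # 60% missing -> dropped
    "Country C": [35.0, 35.2, 35.1, 35.6, 36.0],  # 0% missing -> kept
}

MISSING_THRESHOLD = 0.20  # drop countries with MORE than 20% missing values


def missing_fraction(values):
    """Return the share of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def drop_sparse_countries(data, threshold=MISSING_THRESHOLD):
    """Split countries into those kept and those dropped by the threshold."""
    kept = {c: v for c, v in data.items() if missing_fraction(v) <= threshold}
    dropped = sorted(set(data) - set(kept))
    return kept, dropped


cleaned, dropped = drop_sparse_countries(life_expectancy)
print("kept:", sorted(cleaned))
print("dropped:", dropped)
```

A country sitting exactly at 20% missing is kept in this sketch, matching the “more than 20%” wording of the rule.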

2. Linear Regression

2.1 Data Visualization

1. The relationship between GDP per Capita and Life Expectancy

Above is a scatter plot depicting the relationship between GDP per capita and life expectancy for the year 2017, in 2017 USD; each dot is a country. The year 2017 was chosen because of its recency and because all currency values were adjusted to that year. The two dots in the bottom left represent Lesotho and the Central African Republic, both of which were suffering from unrest in 2017. The dots furthest to the right represent Luxembourg, Qatar, and Singapore; while they have the highest wealth per citizen, they don’t necessarily have the longest life expectancy. We see a positive relationship between the two variables. At first, as GDP per capita increases, life expectancy increases dramatically. Then, as GDP per capita increases further, towards the bound of our graph, the positive relationship is less clear and there are diminishing returns in terms of life expectancy. This pattern resembles the graph of a logarithm. The graph shows an intensely unequal world, with GDP per capita varying greatly across countries at different levels of development and wealth. Additionally, we see an unequal spread of life expectancy even among wealthier countries, where wealth inequality and differing healthcare systems likely come into play. Another reason for choosing a recent year is that it lets us see the full spread of the pattern from poor countries to very wealthy countries, as this inequality was not present to the same degree in 1800.

2. How has this relationship changed over time?

The animation above shows the same graph as before (GDP per capita vs. life expectancy), but over the years 1800-2017. Its value comes from letting us see both the increasing gap between countries’ wealth levels (and, as a result, their life expectancies) and the rise in overall life expectancy across the globe. This chart reinforces the idea presented above of ever-increasing global wealth inequality, with some countries exploding in wealth and others staying nearly the same, all from relatively similar starting points in the year 1800. The animation also shows that the data follow the logarithmic pattern with moderate strength.

2.2 Linear Regression

Before running the linear regression, we summarized the data to one x value and one y value per country (i.e., the mean GDP per capita and mean life expectancy per country) over the years 1939 to 1945. This simplifies the regression model and ensures each country is represented consistently. We decided to investigate 1939 to 1945 because these are the years in which the Second World War occurred. The linear regression contains two variables of interest: average GDP per capita for each country during the Second World War (explanatory variable: $X$) and average life expectancy for each country during the Second World War (response variable: $Y$).

The estimated linear regression model is represented by this equation:

$\hat{y}_{i} = 35.321 + 0.0022 x_{i}$

$\hat{y}_{i}$ represents the predicted average life expectancy of the $i$th country, and $x_{i}$ is that country’s average GDP per capita. The intercept, $b_{0} = 35.321$, represents the estimated life expectancy when the average GDP per capita is zero. The slope, $b_{1} = 0.0022$, represents the estimated change in life expectancy (in years) for every one-dollar increase in GDP per capita, measured in constant 2017 international dollars (PPP-adjusted).
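To make the two steps concrete (averaging each country over 1939 to 1945, then fitting a straight line), here is a minimal sketch using only the Python standard library. The records, country names, and the helper names `country_means` and `fit_simple_ols` are hypothetical; the closed-form least-squares formulas are the standard ones for simple linear regression and are not necessarily how the original analysis was coded.

```python
from statistics import mean

# Hypothetical long-format records: (country, year, gdp_per_capita, life_expectancy).
records = [
    ("Country A", 1939, 1000.0, 38.0), ("Country A", 1945, 1200.0, 37.0),
    ("Country B", 1939, 3000.0, 42.0), ("Country B", 1945, 3400.0, 43.0),
    ("Country C", 1939, 6000.0, 49.0), ("Country C", 1945, 6400.0, 51.0),
    ("Country C", 1950, 9000.0, 60.0),  # outside the window, so it is ignored
]


def country_means(rows, start=1939, end=1945):
    """Average GDP per capita and life expectancy per country over [start, end]."""
    grouped = {}
    for country, year, gdp, life in rows:
        if start <= year <= end:
            grouped.setdefault(country, []).append((gdp, life))
    return {
        c: (mean(g for g, _ in vals), mean(l for _, l in vals))
        for c, vals in grouped.items()
    }


def fit_simple_ols(x, y):
    """Return (intercept b0, slope b1) of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


means = country_means(records)
x = [gdp for gdp, _ in means.values()]
y = [life for _, life in means.values()]
b0, b1 = fit_simple_ols(x, y)
print(f"y_hat = {b0:.3f} + {b1:.4f} x")
```

Using the coefficients reported above rather than this toy fit, a country with an average GDP per capita of 5,000 international dollars would have a predicted life expectancy of 35.321 + 0.0022 × 5000 = 46.321 years.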

Based on this plot of our linear regression model, the relationship between average GDP per capita and average life expectancy generally follows a positive linear trend. For this reason, we did not apply a log transformation.

2.3 Model Fit

Table 1: Variability of the Regression Model
Source of variation	Estimated variance (σ̂²)
Fitted	43.29
Residual	51.17
Response	94.46

Table 1 shows the estimated variances of the predicted life expectancy (fitted values), the residuals, and the actual life expectancy (response values). First, the variance of the response values represents the total amount of variation in life expectancy across observations. Second, the variance of the fitted values captures how much of the variability in life expectancy is explained by GDP per capita. Third, the residual variance represents the unexplained variability, that is, the portion of life expectancy variation that GDP per capita does not account for.
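The three variances in Table 1 add up (43.29 + 51.17 = 94.46), which is the usual decomposition for a least-squares line with an intercept. The sketch below illustrates that property on made-up numbers; the data are hypothetical, and all three variances use the same n - 1 denominator (an assumption chosen here so that the pieces sum exactly).

```python
from statistics import mean, variance

# Hypothetical per-country averages: GDP per capita and life expectancy (years).
x = [800.0, 1500.0, 2500.0, 4000.0, 6500.0, 9000.0]
y = [33.0, 41.0, 39.0, 47.0, 46.0, 58.0]

# Closed-form least-squares fit of y = b0 + b1 * x.
x_bar, y_bar = mean(x), mean(y)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Sample variances (n - 1 denominator) of each component.
var_fitted = variance(fitted)
var_residual = variance(residuals)
var_response = variance(y)

print(f"fitted:   {var_fitted:.2f}")
print(f"residual: {var_residual:.2f}")
print(f"response: {var_response:.2f}")
print(f"R^2:      {var_fitted / var_response:.4f}")
```

Because the residuals of a least-squares fit average to zero and are uncorrelated with the fitted values, the fitted and residual variances sum to the response variance, up to floating-point rounding.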

Assessing the Proportion of Variability Explained by the Model

To determine the proportion of the variability in the response values that is accounted for by our model, we first calculate R², which is the ratio of the fitted variance to the response variance:

$R^{2} = \frac{\hat{\sigma}_{\text{Fitted}}^{2}}{\hat{\sigma}_{\text{Response}}^{2}}$

Based on the result, our model explains about 45.83% of the variability in life expectancy using GDP per capita. This means that 54.17% of the variability remains unexplained. With an R² of 45.83%, the quality of our model is moderate. This suggests that the model is useful but not highly predictive. Although it gives us insight into the relationship between economic prosperity and life expectancy, it lacks other factors needed for a highly accurate prediction. In other words, GDP per capita is not the sole determining factor for a person’s life expectancy. There are other things to consider as well, such as healthcare access, education, environmental conditions, and government policies.
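As a check, plugging the Table 1 values into the R² formula reproduces both percentages:

$R^{2} = \frac{43.29}{94.46} \approx 0.4583, \qquad 1 - R^{2} = \frac{51.17}{94.46} \approx 0.5417$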

3. Simulation

3.1 Visualizing Simulations from the Model

Above is a side-by-side comparison showing the relationship between GDP per capita and life expectancy for the years 1939-1945. The observed life expectancy is on the right, and the data simulated from our model are on the left. There are several key similarities and differences between the visualizations. In both, the positive relationship is maintained, although it is more clearly visible in our simulated data because of an outlier in the observed data (the low dots on the right). The outlier, far below the predictive line, may have been experiencing unique economic conditions as a result of the war. Otherwise, the simulated data are visibly similar to the actual data, with a large cluster of countries with GDP per capita between 0–5000 and life expectancy between 25–50. These countries were likely developing or heavily impacted by World War II, which would have held back both life expectancy and GDP growth. As mentioned earlier, our model does a reasonable job of capturing the relationship between the two variables. However, real-world events introduced complexities that it could not perfectly account for.
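One common way to simulate data from a fitted regression is to add Normal noise, with the residual standard deviation, to each fitted value. The sketch below assumes that approach; the data, the seed, the n - 2 denominator for the residual standard deviation, and the helper name `simulate_response` are illustrative assumptions rather than details confirmed by the report.

```python
import math
import random
from statistics import mean

# Hypothetical per-country averages: GDP per capita and life expectancy (years).
x = [800.0, 1500.0, 2500.0, 4000.0, 6500.0, 9000.0]
y = [33.0, 41.0, 39.0, 47.0, 46.0, 58.0]

# Least-squares fit of y = b0 + b1 * x.
x_bar, y_bar = mean(x), mean(y)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

# Residual standard deviation, assuming the usual n - 2 degrees of freedom.
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
residual_sd = math.sqrt(sse / (len(x) - 2))


def simulate_response(fitted_values, sd, rng):
    """Draw one simulated response per observation: fitted value + Normal noise."""
    return [f + rng.gauss(0.0, sd) for f in fitted_values]


rng = random.Random(331)  # fixed seed so the sketch is reproducible
simulated = simulate_response(fitted, residual_sd, rng)
for xi, obs, sim in zip(x, y, simulated):
    print(f"GDP {xi:7.0f}: observed {obs:5.1f}, simulated {sim:5.1f}")
```

Plotting `simulated` against `x` next to `y` against `x` gives the kind of side-by-side comparison described above.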

Simulated Vs. Observed Relationship

We would expect that if the regression model is a good model for life expectancy, then the simulated data should look similar to the observed data. Based on comparing the shape of our scatterplots, the simulated data looks quite similar to what was observed. There aren’t any substantial differences between the two. Now, let’s plot the relationship between the simulated and observed life expectancy for a closer look.

If the simulated data were identical to the observed data, the data points would lie directly on the dashed blue line, indicating a perfect fit. Points above the dashed line are cases where the predicted life expectancy is greater than the observed life expectancy, and points below it are cases where the predicted life expectancy is less than the observed life expectancy. In this case, there are roughly the same number of overestimates and underestimates. Since the points are not extremely close to the line, we conclude that there is a “moderate” relationship between the observed and simulated values. There is still some unexplained variation, meaning other factors that the model does not capture might be influencing life expectancy. This makes sense, since we previously found in the model fit statistics that our model explains only about 45.83% (R² ≈ 0.4583) of the variability in life expectancy using GDP per capita.

3.2 Generating Multiple Predictive Checks

Previously, we conducted a single simulation to compare our model’s predictions with the observed data. However, a single simulation provides only one possible set of predicted life expectancies. This is not sufficient, because real-world data are influenced by random variation and uncertainty, which we need to account for as well. To better evaluate our model’s performance, we chose to generate 1500 simulated datasets. Using 1500 simulations provides a good balance between computational efficiency and statistical reliability. Too few simulations may produce an inaccurate picture of the model’s performance, while a much larger number would take longer to run and might not change the results in any meaningful way.

Regressing Each Simulated Dataset on the Observed Data

Once we have generated all 1500 simulations, we iterate over the simulated datasets and fit a regression model relating each one to the observed data. Doing this allows us to assess the variability in predictions, determine whether our model systematically overestimates or underestimates life expectancy, and evaluate its overall reliability. The goal is to ensure that our conclusions are not based on a single simulated outcome but instead account for the range of possible outcomes.

Extracting the R² Values

To determine how the model performed across the many simulated datasets, we look at the R² value from each regression, which essentially tells us how “close” the simulated values and the observed values are. R² ranges from 0 to 1, where a value near 1 indicates that the simulated values closely track the observed values, whereas a value near 0 indicates that the relationship is extremely weak or absent.
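The whole predictive-check loop (simulate, regress, extract R²) can be sketched as follows. This is a hedged illustration on hypothetical data using only the Python standard library: the seed, the n - 2 residual standard deviation, and the helper names `fit_simple_ols` and `r_squared` are assumptions. In simple linear regression R² equals the squared correlation, so it is the same whichever of the two variables is treated as the response.

```python
import math
import random
from statistics import mean

N_SIMULATIONS = 1500

# Hypothetical per-country averages: GDP per capita and life expectancy (years).
x = [800.0, 1500.0, 2500.0, 4000.0, 5200.0, 6500.0, 7800.0, 9000.0]
y = [33.0, 41.0, 39.0, 47.0, 44.0, 46.0, 55.0, 58.0]


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
        (xi - x_bar) ** 2 for xi in x
    )
    return y_bar - b1 * x_bar, b1


def r_squared(x, y):
    """R^2 of the simple regression of y on x (the squared correlation)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)


# Fit the model once on the observed data.
b0, b1 = fit_simple_ols(x, y)
fitted = [b0 + b1 * xi for xi in x]
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
residual_sd = math.sqrt(sse / (len(x) - 2))

# Simulate many datasets and record how well each one tracks the observed data.
rng = random.Random(331)  # fixed seed for reproducibility
r2_values = []
for _ in range(N_SIMULATIONS):
    simulated = [f + rng.gauss(0.0, residual_sd) for f in fitted]
    r2_values.append(r_squared(simulated, y))

print(f"min={min(r2_values):.3f} mean={mean(r2_values):.3f} max={max(r2_values):.3f}")
```

The list `r2_values` is what would then be plotted as a histogram to show the distribution of simulated R² values.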

Plotting the Distribution of Simulated R² Values

Once we have extracted the R² values for all of the simulations, we can plot their distribution to observe the range in model performance. The distribution of these R² values provides insight into how well our assumed model produces data similar to what was observed.

Looking at the distribution, the simulated datasets have R² values between approximately 0.08 and 0.36. On average, our simulated data account for about 21.2% of the variability in the observed life expectancy. This is close to the square of the model’s own R² (0.4583² ≈ 0.21), which is roughly what we would expect when independent noise is added to the fitted values. The data simulated under this statistical model are therefore low to moderately similar to what was observed. The model captures some of the patterns in the data but leaves a significant portion of the variability unexplained. The fact that most R² values are relatively low (with none near 1) indicates that the model does not do a strong job of explaining the variation in life expectancy. This outcome is not entirely surprising, because life expectancy is influenced by a multitude of factors such as education, health, environmental conditions, and government policies. Relying solely on GDP per capita as a predictor is likely insufficient to capture all of these influences. Overall, while our model generates data that reflect some trends in the observed life expectancy, its limited explanatory power suggests that additional explanatory variables are needed to better predict life expectancy.

Conclusion

Our findings reveal that GDP per capita and life expectancy exhibit a positive correlation, meaning that as GDP per capita increases, life expectancy tends to increase as well. However, GDP per capita alone is not a strong predictor of life expectancy. The R² values from our statistical model and simulations suggest that while GDP per capita captures some trends, it leaves a significant portion of the variability in life expectancy unexplained. This means that GDP per capita should not be the only thing we look at when assessing how long people live. Other factors, such as public health, education, environmental conditions, and government policies, may play crucial roles in determining longevity.¹

References

Gapminder. (n.d.). Download the data. https://www.gapminder.org/data/

Shkolnikov, V. M., Andreev, E. M., Tursun-Zade, R., & Leon, D. A. (2019). Patterns in the relationship between life expectancy and gross domestic product in Russia in 2005–15: a cross-sectional analysis. The Lancet Public Health, 4(4), e181–e188. https://doi.org/10.1016/s2468-2667(19)30036-2

Ibezim, C. (2023, July 10). Exploring the Relationship between GDP and Life Expectancy in 6 Countries. Medium. https://medium.com/@ibezimchike/exploring-the-relationship-between-gdp-and-life-expectancy-in-6-countries-a91a2bb118a5

Footnotes

¹ All code to reproduce these analyses is available at https://github.com/soren-fliegel/stat331_final_project