This 16-week introductory course offers a foundation in computer science and data science, equipping students with essential programming, algorithmic thinking, and problem-solving skills. Using Python as the primary programming language, the course covers fundamental programming constructs, data structures, file handling, and an introduction to data visualization techniques. By applying these skills to real-world datasets and problems, students gain practical experience and develop interdisciplinary connections. Designed for students without prior programming experience, the course aims to create a supportive learning environment that inspires students to further hone their computational abilities and critical thinking skills.
Note: This module focuses on the technical aspects and hands-on exercises related to spatial interpolation and data-driven decision-making. It builds upon the foundation established in previous lectures, where a guest speaker from the Department of Philosophy provided an introduction to ethical considerations in computer science applications.
Acknowledgment: This module was co-developed by Colton Harper and Zachariah Wrublewski.
In this module, students work with an interactive Python notebook that guides them through the process of designing, coding, and evaluating a method to estimate US poverty data using spatial interpolation techniques. The primary focus is on exploring the ethical considerations involved in designing and implementing data-driven solutions, as well as understanding the implications of these decisions on various stakeholders.
Identifying and addressing the needs of direct and indirect stakeholders
Interpolation content and illustrations adapted from: https://gisgeography.com/inverse-distance-weighting-idw-interpolation/
All the utility functions and classes used in this notebook have been organized and stored in a separate Python file called poverty_interpolation_utils.py. This file contains the following classes: DataDownloader, CountyData, CountyPlotter, CensusData, SamplingMethods, IDWInterpolation, ErrorCalculator, and ErrorVisualizer.
You can find this file in the same directory as this notebook. If you need to modify any of the existing classes or functions, you can do so directly in the poverty_interpolation_utils.py file. Make sure to save your changes in the file before running the notebook again.
Alternatively, if you prefer to create new functions or modify existing ones directly in the notebook, you can do so by adding new code cells and defining your functions there. Remember to import any necessary modules or classes in the notebook as needed.
By keeping the utility functions and classes in a separate file, the notebook stays more organized, clean, and focused.
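For example, if you edit poverty_interpolation_utils.py while the kernel is already running, one way to pick up your changes without restarting it is to reload the module (a minimal sketch; restarting the kernel and re-running the imports works just as well):
# Reload the utilities module so the notebook sees your latest edits
import importlib
import poverty_interpolation_utils
importlib.reload(poverty_interpolation_utils)
from poverty_interpolation_utils import SamplingMethods  # re-import whatever names you use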
import pandas as pd
import plotly.express as px
import numpy as np
import math
import requests
from urllib.request import urlopen
import json
import os
import matplotlib.pyplot as plt
from poverty_interpolation_utils import DataDownloader, CountyData, CountyPlotter, CensusData, SamplingMethods, IDWInterpolation, ErrorCalculator, ErrorVisualizer
censusVar = 'DP03_0120PE'
Background:
Approach Overview:
Use Case/Implications:
Our Goal: Design a method to sample 50% of Nebraska counties to obtain the percentage of households in poverty. Then, use spatial interpolation methods to estimate the poverty rate in the remaining (unsampled) 50% of Nebraska counties.
Call to Action: As developers, we need to understand the elements of our design that can impact the performance of our method. Let's explore some essential design considerations and their tradeoffs.
General Approach:
Though there are numerous poverty indicators, we'll focus on a single measure for the scope of this project.
We will use the variable DP03_0120PE, which represents county-level data for the percentage of families and people whose income in the past 12 months was below the poverty level. Specifically, it corresponds to impoverished households with related children of the householder under 18 years. Let's start by importing the census data.
We will fetch the ACS data for 2015 and 2020.
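For context, here is a hedged sketch of what fetching this variable directly from the public ACS 5-year Data Profile API could look like. The endpoint, parameters, and parsing shown here are assumptions for illustration; the actual download logic lives in the CensusData class in poverty_interpolation_utils.py and may differ.
import requests
import pandas as pd

year = "2020"
url = f"https://api.census.gov/data/{year}/acs/acs5/profile"
params = {
    "get": "NAME,DP03_0120PE",  # county name plus the poverty variable
    "for": "county:*",          # every county...
    "in": "state:31",           # ...in Nebraska (FIPS state code 31)
}
rows = requests.get(url, params=params).json()   # first row of the response is the header
df = pd.DataFrame(rows[1:], columns=rows[0])
df["DP03_0120PE"] = pd.to_numeric(df["DP03_0120PE"], errors="coerce")
print(df.head())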
# Instantiate the Classes
data_downloader = DataDownloader()
county_data = CountyData(data_downloader)
census_data = CensusData(county_data)
county_plotter = CountyPlotter()
# Download the necessary data using the instances:
data_downloader.download_fips_data()
data_downloader.download_county_centers()
county_plotter.download_geojson()
# Get Census Datasets
censusData2015 = census_data.getCensusPovertyDataByYear_ne("2015")
censusData2020 = census_data.getCensusPovertyDataByYear_ne("2020")
censusData2020.head()
| | DP03_0120PE | state | county | fips_id | Latitude | Longitude | countyName | stateName |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.2 | 31 | 179 | 31179 | 42.210746 | -97.126243 | Wayne County | Nebraska |
| 1 | 3.2 | 31 | 089 | 31089 | 42.459287 | -98.784766 | Holt County | Nebraska |
| 2 | 3.4 | 31 | 081 | 31081 | 40.877145 | -98.021943 | Hamilton County | Nebraska |
| 3 | 3.6 | 31 | 039 | 31039 | 41.915865 | -96.788517 | Cuming County | Nebraska |
| 4 | 3.9 | 31 | 165 | 31165 | 42.483806 | -103.742605 | Sioux County | Nebraska |
Let's take a quick look at Nebraska's poverty data for 2015.
We will visualize the data as a heatmap, where counties with high poverty rates (25%+) will appear in bright yellow-green or green. In contrast, counties with low poverty rates (5% and below) will be represented in dark blue.
import plotly.express as px
import plotly.io as pio
from IPython.display import display, Image
print("Percent of Houses Under the Poverty Line Across Counties in Nebraska")
fig = county_plotter.plotCountyData_ne(censusData2015)
img_bytes = pio.to_image(fig, format="png") # Convert the figure to an image in memory (PNG format)
Image(img_bytes)
Percent of Houses Under the Poverty Line Across Counties in Nebraska
Remember, we can only survey (sample from) 50% of Nebraskan counties. Which ones should we sample? Let's consider a few approaches.
Sample the 50% of counties that previously had the highest poverty rates in the last survey
Representative Sample
There are many more ways to sample the data, many of which will likely lead to better results than the above sampling methods. We encourage you to think of some additional sampling methods.
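For instance, here is one hypothetical alternative you could define yourself; it is not part of the provided SamplingMethods class, and it only assumes the Longitude column shown in the data above:
import pandas as pd

def getHalfCounties_geographicStride(censusData: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sampling method: order counties west-to-east by longitude
    and keep every other one, so the sample is spread across the state
    rather than clustered in one region."""
    ordered = censusData.sort_values("Longitude").reset_index(drop=True)
    return ordered.iloc[::2]

# Usage (same input/output shape as the provided sampling methods):
# sample_4 = getHalfCounties_geographicStride(censusData2020)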
# Instantiate the `SamplingMethods` class:
sampling_methods = SamplingMethods()
# Modify the code to use the class instance and its methods:
sample_1 = sampling_methods.getHalfCounties_random(censusData2020)
sample_2 = sampling_methods.getHalf_highestPovCounties(censusData2015, censusData2020)
sample_3 = sampling_methods.get25PercLowestPov_25PercHighestPov(censusData2015, censusData2020)
Let's take a look at one of the sample sets plotted on a map. Here, we'll consider sample_1, our random sample:
from IPython.display import Image, display
fig = county_plotter.plotCountyData_ne(sample_1)
img_bytes = pio.to_image(fig, format="png") # Convert the figure to an image in memory (PNG format)
Image(img_bytes)
Try changing sample_1 to one of the other two samples.
We have sampled 50% of the counties in Nebraska; the remaining 50% must be estimated. We can do so (with varying degrees of error) using interpolation methods.
Interpolation
from IPython.display import Image, display
print("Linear Interpolation")
display(Image("https://gisgeography.com/wp-content/uploads/2016/05/Linear-Interpolation-2.png"))
Linear Interpolation
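As a quick numeric complement to the figure above, here is a minimal one-dimensional example of linear interpolation (illustrative only; it does not use the notebook's utility classes):
import numpy as np

# Known values at x = 0 and x = 10; estimate the value at x = 4 along the line between them.
known_x = [0, 10]
known_y = [2.0, 12.0]
estimate = np.interp(4, known_x, known_y)   # 2.0 + (4 / 10) * (12.0 - 2.0)
print(estimate)                             # 6.0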
Spatial interpolation is a similar method, but applied to higher dimensional data.
Examples where it makes sense to apply spatial interpolation methods include estimating quantities such as rainfall, elevation, or temperature at locations that were not directly measured.
There are many spatial interpolation algorithms to choose from, e.g., inverse distance weighting (IDW), kriging, and splines. IDW is a simple and commonly used algorithm, and it is the only one we will consider in this lab.
How IDW Works:
IDW estimates the value at an unknown point by computing a weighted average of the known points around it: the farther away a known point is, the less it contributes to the estimate.
There are two main settings in IDW that you can change to improve your estimates: the number of neighbors considered and the power.
The image below illustrates a plane with four known values (in red) and a value we want to interpolate (in purple). In this illustration, the estimate of the unknown point will be some weighted average of 3 of its closest neighbors. We can vary the number of neighbors we consider in pursuit of better results.
print("Inverse Distance Weighting with 3 points")
display(Image("https://gisgeography.com/wp-content/uploads/2016/05/IDW-3Points.png"))
Inverse Distance Weighting with 3 points
Power Setting: 1
Immediately below, you'll find an illustration where an unknown point is estimated using a 3-neighbor IDW with a power of 1. The image that follows illustrates the same data and settings, except the power is set to 2.
While it doesn't make a huge impact in this case, you can see that with power = 1 the estimated value is lower, because the points farther away have more influence. In the latter illustration, with power = 2, the estimated value is greater, because the closest point contributes more to the weighted average.
print("Illustration of spatial interpolation with a power of 1")
display(Image("https://gisgeography.com/wp-content/uploads/2016/05/IDW-Power1.png"))
print("Illustration of spatial interpolation with a power of 2")
display(Image("https://gisgeography.com/wp-content/uploads/2016/05/IDW-Power2.png"))
print("Inverse Distance Weighting formula")
display(Image("https://gisgeography.com/wp-content/uploads/2016/05/idw-formula.png"))
Illustration of spatial interpolation with a power of 1
Illustration of spatial interpolation with a power of 2
Inverse Distance Weighting formula
Formula: You only need to understand this at a conceptual level, so feel free to ignore this formula. If you would like to develop a stronger mathematical intuition, here is the general formula for IDW.
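To make the formula concrete, here is a minimal sketch of the IDW calculation in plain NumPy. It assumes the distances from the unknown point to the known points have already been computed, and it is illustrative only, not the notebook's interpolate() method:
import numpy as np

def idw_estimate(known_values, distances, power=2, num_neighbors=3):
    """Weight each known value by 1 / distance**power, keep only the
    num_neighbors closest points, and return the weighted average."""
    known_values = np.asarray(known_values, dtype=float)
    distances = np.asarray(distances, dtype=float)
    nearest = np.argsort(distances)[:num_neighbors]   # indices of the closest points
    weights = 1.0 / distances[nearest] ** power
    return np.sum(weights * known_values[nearest]) / np.sum(weights)

# Illustrative numbers: the closest point has value 12; the two farther points have value 10.
print(idw_estimate([12.0, 10.0, 10.0], [350, 750, 850], power=1))  # ~11.07
print(idw_estimate([12.0, 10.0, 10.0], [350, 750, 850], power=2))  # ~11.44 -- the closer point dominates more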
Notice that we use a custom function, interpolate(), to which we provide the census data, our sampled data, the IDW power, and the number of neighbors. In this example, our sample is sample_1, a random sample of 50% of the counties in Nebraska. Our power is set to 2, and the number of neighbors, numNeighbors, is set to consider the nearest 5 neighbors.
# Instantiate the `IDWInterpolation` class:
idw_interpolation = IDWInterpolation()
# Interpolate the values
sample_1_interp = idw_interpolation.interpolate(censusData2020, sample_1, power=2, numNeighbors=5)
# Plot the sampled and interpolated data
fig = idw_interpolation.plotSampleWithInterp(sample_1, sample_1_interp, county_plotter)
img_bytes = pio.to_image(fig, format="png") # Convert the figure to an image in memory (PNG format)
Image(img_bytes)
There is no straightforward answer to this. We can consider some common measures, but we may want to develop a measure more customized for our context. In general, our performance measure is going to be some function of the error between our poverty rate estimates and the actual value.
Recall the General Goal: To develop a map with estimated values that may eventually be used to inform how funds are distributed to people in poverty.
We want our performance measure to align with this aim. We can change elements of our method design, and check the performance using our performance measure. We can iteratively adapt the design and optimize to find what we think will be the 'best' design.
Performance Measures We'll Consider:
- Average Error (where error = actualValue - estimatedValue)
- Average Absolute Error
- Mean Squared Error (MSE) and Root Mean Squared Error (RMSE)
- Percent of estimates within a given error bound (e.g., within 3 percentage points)
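Conceptually, each of these measures can be computed from the per-county errors, as in the sketch below (the provided ErrorCalculator class may differ in details such as rounding and output format):
import numpy as np

def summarize_errors(actual, estimated, percent_bound=3):
    """Compute the performance measures listed above from paired actual and
    estimated poverty rates (illustrative sketch only)."""
    actual = np.asarray(actual, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    errors = actual - estimated                      # actualValue - estimatedValue
    return {
        "Average Error": errors.mean(),
        "Average Absolute Error": np.abs(errors).mean(),
        "Mean Squared Error": (errors ** 2).mean(),
        "Root Mean Squared Error": np.sqrt((errors ** 2).mean()),
        "Percent Under Error Threshold": f"{100 * (np.abs(errors) < percent_bound).mean():.2f}%",
    }

# Example: one estimate within the 3-point bound, one outside it
print(summarize_errors([10.0, 20.0], [8.0, 26.0]))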
# Get the real values that correspond to the interpolated values
sample_1_interp_withActual = idw_interpolation.getRealvaluesGivenInterpolated(censusData2020, sample_1_interp)
# Calculate the error
error_calculator = ErrorCalculator()
percentBound = 3
errors = error_calculator.getErrors(sample_1_interp_withActual, percentBound, printErrors=True)
Average Error: -1.4350797291601014
Average Absolute Error: 4.832253826027726
Mean Squared Error: 39.79716386439208
Root Mean Squared Error: 6.308499335372247
Percent Predicted With Smaller than a 3% error: 36.17%
result = error_calculator.getErrorsByQuartile(sample_1_interp_withActual, percentBound, printErrors=True)
quartile_errors = result['quartile_errors']
poverty_ranges = result['poverty_ranges']
quartile_errors
ERROR FOR QUARTILE #1 -- 3.2% poverty to 6.7% poverty:
ERROR FOR QUARTILE #2 -- 6.8% poverty to 10.8% poverty:
ERROR FOR QUARTILE #3 -- 10.9% poverty to 15.3% poverty:
ERROR FOR QUARTILE #4 -- 15.4% poverty to 36.0% poverty:
{'Quartile 1 Errors': {'Average Error': 5.036136112119926,
                       'Average Absolute Error': 5.036136112119926,
                       'Mean Squared Error': 26.95467124810375,
                       'Root Mean Squared Error': 5.191788829305729,
                       'Percent Under Error Threshold': '0.0%'},
 'Quartile 2 Errors': {'Average Error': 0.40718513287355773,
                       'Average Absolute Error': 2.4841513297567284,
                       'Mean Squared Error': 9.767625895584974,
                       'Root Mean Squared Error': 3.125320126896599,
                       'Percent Under Error Threshold': '80.0%'},
 'Quartile 3 Errors': {'Average Error': -2.2133597558727853,
                       'Average Absolute Error': 2.973863697072576,
                       'Mean Squared Error': 12.979338252317518,
                       'Root Mean Squared Error': 3.6026848671952307,
                       'Percent Under Error Threshold': '61.538%'},
 'Quartile 4 Errors': {'Average Error': -8.598379593196103,
                       'Average Absolute Error': 8.598379593196103,
                       'Mean Squared Error': 106.7169158677671,
                       'Root Mean Squared Error': 10.330387982441275,
                       'Percent Under Error Threshold': '8.333%'}}
error_visualizer = ErrorVisualizer() # Create an instance of ErrorVisualizer class
error_visualizer.plot_error_barchart(quartile_errors, poverty_ranges)
Based on example quartile errors provided below, we can interpret the performance of the interpolation method and sampling techniques on the poverty data. Here's an example interpretation of the errors and insights that can help you with the exercises:
Quartile 1 (3.2% to 7.6% poverty): High errors, indicating lower accuracy for counties with lower poverty rates.
Quartile 2 (8.5% to 10.9% poverty): Lower errors, suggesting better performance for counties with mid-range poverty rates.
Quartile 3 (11.0% to 13.4% poverty): Low errors, indicating good performance for counties with mid-range poverty rates.
Quartile 4 (13.6% to 36.0% poverty): High errors, suggesting lower accuracy for counties with higher poverty rates.
Sampling method (the getHalfCounties_random function): may not be representative of the entire dataset, leading to an uneven distribution and affecting interpolation accuracy.
Interpolation method (the standard_idw function): the choice of power and the number of nearest neighbors can influence accuracy.
The interpolation method performs well for mid-range poverty rates (Quartiles 2 and 3) but has higher errors for lower and higher poverty rates (Quartiles 1 and 4). Possible reasons:
These design decisions have implications:
In this section, you will work on improving the interpolation method settings and the sampling techniques. You will also discuss and analyze the ethical implications, development aspects, and stakeholder considerations of your approach.
Exercise 1: Interpolation Methods and Sampling Techniques
Exercise 2: Identifying Stakeholders and Evaluating Impacts
Exercise 3: Ethical Considerations and Development Aspects
Exercise 4: Balancing Trade-offs and Navigating Dilemmas
Collaborative Activity: Group Discussion and Presentation
# Interpolate the values
power = 2 # consider changing this value and observe and interpret any changes in performance
numNeighbors = 5 # consider changing this value and observe and interpret any changes in performance
sample = sample_2 # consider changing this value and observe and interpret any changes in performance
sample_2_interp = idw_interpolation.interpolate(censusData2020, sample, power, numNeighbors)
# Plot the sampled and interpolated data
fig = idw_interpolation.plotSampleWithInterp(sample_2, sample_2_interp, county_plotter)
img_bytes = pio.to_image(fig, format="png") # Convert the figure to an image in memory (PNG format)
Image(img_bytes)
# Get the real values that correspond to the interpolated values
sample_2_interp_withActual = idw_interpolation.getRealvaluesGivenInterpolated(censusData2020, sample_2_interp)
# Calculate the error
percentBound = 3
overallErrors = pd.DataFrame.from_dict(error_calculator.getErrors(sample_2_interp_withActual, percentBound), orient="index", columns=["Sample Results"])
errors_by_quartile_result = error_calculator.getErrorsByQuartile(sample_2_interp_withActual, percentBound)
# Flatten the nested {quartile: {measure: value}} results into a single indexed series for display
errorsByQuartile = pd.DataFrame.from_dict(
    {(key, sub_key): value
     for key, sub_dict in errors_by_quartile_result['quartile_errors'].items()
     for sub_key, value in sub_dict.items()},
    orient="index"
).unstack().droplevel(0, axis=0)
overallErrors
| | Sample Results |
|---|---|
| Average Error | 1.561561 |
| Average Absolute Error | 4.43028 |
| Mean Squared Error | 29.053239 |
| Root Mean Squared Error | 5.390106 |
| Percent Under Error Threshold | 36.956% |
errorsByQuartile
(Quartile 1 Errors, Average Error)                    7.468757
(Quartile 1 Errors, Average Absolute Error)           7.468757
(Quartile 1 Errors, Mean Squared Error)               61.144233
(Quartile 1 Errors, Root Mean Squared Error)          7.819478
(Quartile 1 Errors, Percent Under Error Threshold)    0.0%
(Quartile 2 Errors, Average Error)                    2.739632
(Quartile 2 Errors, Average Absolute Error)           3.076405
(Quartile 2 Errors, Mean Squared Error)               14.758129
(Quartile 2 Errors, Root Mean Squared Error)          3.841631
(Quartile 2 Errors, Percent Under Error Threshold)    58.333%
(Quartile 3 Errors, Average Error)                    -0.502834
(Quartile 3 Errors, Average Absolute Error)           2.377943
(Quartile 3 Errors, Mean Squared Error)               11.941773
(Quartile 3 Errors, Root Mean Squared Error)          3.455687
(Quartile 3 Errors, Percent Under Error Threshold)    72.727%
(Quartile 4 Errors, Average Error)                    -3.139077
(Quartile 4 Errors, Average Absolute Error)           4.880194
(Quartile 4 Errors, Mean Squared Error)               29.617114
(Quartile 4 Errors, Root Mean Squared Error)          5.442161
(Quartile 4 Errors, Percent Under Error Threshold)    16.666%
dtype: object
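For Exercise 1, rather than changing power, numNeighbors, and the sample one value at a time, you may find it helpful to sweep over several combinations and compare a single summary measure. A minimal sketch (assuming, as in the output above, that getErrors returns a dictionary containing a 'Root Mean Squared Error' entry):
# Try several power / neighbor combinations on the current sample and rank them by RMSE
results = []
for p in [1, 2, 3]:                      # IDW power values to try
    for k in [3, 5, 8]:                  # numbers of neighbors to try
        interp = idw_interpolation.interpolate(censusData2020, sample, power=p, numNeighbors=k)
        interp_withActual = idw_interpolation.getRealvaluesGivenInterpolated(censusData2020, interp)
        metrics = error_calculator.getErrors(interp_withActual, percentBound)
        results.append({"power": p, "numNeighbors": k,
                        "RMSE": metrics["Root Mean Squared Error"]})

pd.DataFrame(results).sort_values("RMSE")  # lowest RMSE first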
Assignment: Census Bureau Brief and Recommendations
In your lab groups, review the existing data and findings from your spatial interpolation exercises for estimating poverty rates in Nebraska. Given the scope of this lab, we did not extend the analysis to the entire United States, so base this assignment on your findings from Nebraska.
Write a short policy brief (1-2 pages) that:
a. Summarizes your findings from the exercises, focusing on the methods used, the ethical considerations, and the development implications.
b. Discusses the challenges and limitations of using spatial interpolation methods for estimating poverty rates in Nebraska.
c. Presents recommendations for using spatial interpolation methods, sampling techniques, and performance measures in Nebraska, along with justifications for their suitability.
d. Reflects on the trade-offs, dilemmas, and potential consequences of implementing your recommendations in the context of Nebraska.
Include a one-page executive summary of your policy brief at the beginning, summarizing the key points and recommendations.
Submit your policy brief along with any code you modified or developed during the exercises.