This 16-week introductory course offers a foundation in computer science and data science, equipping students with essential programming, algorithmic thinking, and problem-solving skills. Using Python as the primary programming language, the course covers fundamental programming constructs, data structures, file handling, and an introduction to data visualization techniques. By applying these skills to real-world datasets and problems, students gain practical experience and develop interdisciplinary connections. Designed for students without prior programming experience, the course aims to create a supportive learning environment that inspires students to further hone their computational abilities and critical thinking skills.

**Note:** This module focuses on the technical aspects and hands-on exercises related to spatial interpolation and data-driven decision-making. It builds upon the foundation established in previous lectures, where a guest speaker from the Department of Philosophy provided an introduction to ethical considerations in computer science applications.

**Acknowledgment:** This module was co-developed by Colton Harper and Zachariah Wrublewski.

In this module, students work with an interactive Python notebook that guides them through the process of designing, coding, and evaluating a method to estimate US poverty data using spatial interpolation techniques. The primary focus is on exploring the ethical considerations involved in designing and implementing data-driven solutions, as well as understanding the implications of these decisions on various stakeholders.

- Enhance students' understanding of the ethical considerations in using spatial interpolation methods for estimating poverty data in real-world scenarios.
- Encourage students to think critically about the potential consequences of their methodological choices for various stakeholders, including direct and indirect ones.
- Reinforce students' ability to apply spatial interpolation techniques while considering the ethical dimensions of their work.

- What ethical considerations should be taken into account when using spatial interpolation methods to estimate poverty data?
- How do the choices made in designing a method impact various stakeholders, both directly and indirectly?
- How can students balance the need for accurate estimations with the ethical implications of their work?

- Ethical dimensions in spatial interpolation and data-driven decision-making

Identifying and addressing the needs of direct and indirect stakeholders

- Evaluating trade-offs and potential consequences of methodological choices in a design/development context
- Balancing accuracy, fairness, and transparency in data-driven solutions
- Developing a responsible and informed approach to the design and implementation of algorithms and data analysis methods
- Navigating ethical dilemmas and complexities in computer science applications

Interpolation content and illustrations adapted from: https://gisgeography.com/inverse-distance-weighting-idw-interpolation/

All the utility functions and classes used in this notebook have been organized and stored in a separate Python file called `poverty_interpolation_utils.py`. This file contains the following classes: `DataDownloader`, `CountyData`, `CountyPlotter`, `CensusData`, `SamplingMethods`, `IDWInterpolation`, `ErrorCalculator`, and `ErrorVisualizer`.

You can find this file in the same directory as this notebook. If you need to modify any of the existing classes or functions, you can do so directly in the `poverty_interpolation_utils.py` file. Make sure to save your changes in the file before running the notebook again.

Alternatively, if you prefer to create new functions or modify existing ones directly in the notebook, you can do so by adding new code cells and defining your functions there. Remember to import any necessary modules or classes in the notebook as needed.

Keeping the utility functions and classes in a separate file makes the notebook a bit more organized, clean, and focused.

In [1]:

```
import pandas as pd
import plotly.express as px
import numpy as np
import math
import requests
from urllib.request import urlopen
import json
import os
import matplotlib.pyplot as plt
from poverty_interpolation_utils import DataDownloader, CountyData, CountyPlotter, CensusData, SamplingMethods, IDWInterpolation, ErrorCalculator, ErrorVisualizer
censusVar = 'DP03_0120PE'
```

**Background:**

- The year is 2018, and you have been hired as a consultant for the US Census Bureau. The American Community Survey (ACS), which typically comes out every five years, is facing budget constraints that prevent a full-scale survey. As a result, the Census Bureau can only survey 50% of US counties for poverty data.
- Your task is to provide the Census Bureau with suggestions on how they should conduct the survey and recommend a method to supplement the data to best estimate the percent of families in poverty in the counties that were not selected to be surveyed.

**Approach Overview:**

- Determine the poverty indicator to be used in the survey.
- Select the counties to be surveyed (limited to 50% of the total counties).
- Use spatial interpolation methods to estimate the poverty rate in the unsampled counties.
- Design a performance measure to evaluate the accuracy of the estimations and the implications of the chosen method.
- Optimize the method to improve performance while considering ethical implications.

**Use Case/Implications:**

- The US Census Bureau will disseminate the results to donors and government officials, who will use the information to develop policies to aid areas with higher poverty rates. The Census Bureau has not provided any additional information regarding the use of the results.

**Our Goal:** Design a method to sample 50% of Nebraska counties to obtain the percentage of households in poverty. Then, use spatial interpolation methods to estimate the poverty rate in the remaining (unsampled) 50% of Nebraska counties.

**Call to Action:**
As developers, we need to understand the elements of our design that can impact the performance of our method. Let's explore some essential design considerations and their tradeoffs.

**General Approach:**

- Identify the poverty indicator for the survey
- Select the counties to survey (50% of total counties)
- Estimate poverty values for unsampled counties using spatial interpolation
- Design a performance measure to evaluate our estimations
- Optimize our method to improve performance

Though there are numerous poverty indicators, we'll focus on a single measure for the scope of this project.

We will use the following variable:

`DP03_0120PE` represents the county-level data for the percentage of families and people whose income in the past 12 months is below the poverty level. Specifically, it corresponds to impoverished households with related children of the householder under 18 years.

- This variable is more closely related to child poverty in Nebraska.
- More information about this variable can be found here.

Let's start by importing the census data.

We will fetch the ACS data for 2015 and 2020.

- We will sample from the ACS 2020 data once we decide on a sampling method to test. Then, we can compare the interpolated values with the actual values.
- We can use the ACS 2015 data to inform some of our design decisions, such as choosing a more informed sampling method.
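As a rough sketch of what fetching this variable involves, the request URL below follows the public ACS 5-year data profile API (the notebook's `CensusData` class wraps the real request logic; the helper names here are ours, and the sample response values are made up for illustration):

```python
import pandas as pd

def build_acs_url(year: str, variable: str, state_fips: str = "31") -> str:
    """Build a request URL for the ACS 5-year data profile endpoint.
    Illustrative helper -- DP-prefixed variables live under /profile."""
    base = f"https://api.census.gov/data/{year}/acs/acs5/profile"
    return f"{base}?get=NAME,{variable}&for=county:*&in=state:{state_fips}"

def rows_to_frame(rows):
    """The Census API returns a list of lists; the first row holds column names."""
    return pd.DataFrame(rows[1:], columns=rows[0])

url = build_acs_url("2020", "DP03_0120PE")
# A (hypothetical) response for two counties has this shape:
sample_response = [
    ["NAME", "DP03_0120PE", "state", "county"],
    ["Wayne County, Nebraska", "3.2", "31", "179"],
    ["Holt County, Nebraska", "3.2", "31", "089"],
]
df = rows_to_frame(sample_response)
```

Requesting `url` with `requests.get(url).json()` would return rows in the same list-of-lists shape as `sample_response`.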

In [2]:

```
# Instantiate the Classes
data_downloader = DataDownloader()
county_data = CountyData(data_downloader)
census_data = CensusData(county_data)
county_plotter = CountyPlotter()
```

In [3]:

```
# Download the necessary data using the instances:
data_downloader.download_fips_data()
data_downloader.download_county_centers()
county_plotter.download_geojson()
```

In [4]:

```
# Get Census Datasets
censusData2015 = census_data.getCensusPovertyDataByYear_ne("2015")
censusData2020 = census_data.getCensusPovertyDataByYear_ne("2020")
censusData2020.head()
```

Out[4]:

| | DP03_0120PE | state | county | fips_id | Latitude | Longitude | countyName | stateName |
|---|---|---|---|---|---|---|---|---|
| 0 | 3.2 | 31 | 179 | 31179 | 42.210746 | -97.126243 | Wayne County | Nebraska |
| 1 | 3.2 | 31 | 089 | 31089 | 42.459287 | -98.784766 | Holt County | Nebraska |
| 2 | 3.4 | 31 | 081 | 31081 | 40.877145 | -98.021943 | Hamilton County | Nebraska |
| 3 | 3.6 | 31 | 039 | 31039 | 41.915865 | -96.788517 | Cuming County | Nebraska |
| 4 | 3.9 | 31 | 165 | 31165 | 42.483806 | -103.742605 | Sioux County | Nebraska |

Let's take a quick look at Nebraska's poverty data for 2015.

We will visualize the data as a heatmap, where counties with high poverty rates (25%+) will appear in bright yellow-green or green. In contrast, counties with low poverty rates (5% and below) will be represented in dark blue.

In [5]:

```
import plotly.express as px
import plotly.io as pio
from IPython.display import display, Image
print("Percent of Houses Under the Poverty Line Across Counties in Nebraska")
fig = county_plotter.plotCountyData_ne(censusData2015)
img_bytes = pio.to_image(fig, format="png") # Convert the figure to an image in memory (PNG format)
Image(img_bytes)
```

Percent of Houses Under the Poverty Line Across Counties in Nebraska

Out[5]:

Remember, we can only survey (sample from) 50% of Nebraskan counties. Which ones should we sample? Let's consider a few approaches.

**Random Sampling**

- This method is very straightforward. We can consider sampling any random 50% of counties in Nebraska.
- Can we do better than random sampling? Well, we'll have to think about it and test it out.
- We have the 2015 data, which seems like it might correlate somewhat with the 2020 data. Perhaps we could reference the 2015 data to make an informed decision on which counties to sample.

**Sample the 50% of counties that previously had the highest poverty rates in the last survey**

- The data from our methods will be used to distribute resources to help people in poverty. So, what if we identify the counties from 2015 that had the highest poverty rates and use those counties as our samples?

**Representative Sample**

- Let's also try sampling in a way where we try to get counties that have the highest poverty rates and the lowest poverty rates.
- Based on the 2015 data, we can sample 25% of the counties that have the highest poverty rates.
- We can also sample the 25% of counties that have the lowest poverty rates.

There are many more ways to sample the data, many of which will likely lead to better results than the above sampling methods. We encourage you to think of some additional sampling methods.

- Can you identify any better sampling methods?
- Can you identify other sampling methods you should clearly steer clear of?
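The three sampling ideas above can be sketched with plain pandas. These are hypothetical helpers, not the `SamplingMethods` implementation; we assume both yearly DataFrames share a `fips_id` key and a `DP03_0120PE` poverty-rate column:

```python
import pandas as pd

def sample_random_half(df2020, seed=0):
    """Randomly survey 50% of counties."""
    return df2020.sample(frac=0.5, random_state=seed)

def sample_highest_half(df2015, df2020, col="DP03_0120PE"):
    """Survey the 50% of counties with the highest 2015 poverty rates."""
    n = len(df2015) // 2
    top_ids = df2015.nlargest(n, col)["fips_id"]
    return df2020[df2020["fips_id"].isin(top_ids)]

def sample_representative(df2015, df2020, col="DP03_0120PE"):
    """Survey the 25% highest- and 25% lowest-poverty counties from 2015."""
    n = len(df2015) // 4
    ids = pd.concat([df2015.nlargest(n, col), df2015.nsmallest(n, col)])["fips_id"]
    return df2020[df2020["fips_id"].isin(ids)]
```

Each helper selects *which* counties to survey using 2015 data, then pulls those counties' rows from the 2020 data, mirroring the way `sample_2` and `sample_3` take both datasets as arguments.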

In [6]:

```
# Instantiate the `SamplingMethods` class:
sampling_methods = SamplingMethods()
# Modify the code to use the class instance and its methods:
sample_1 = sampling_methods.getHalfCounties_random(censusData2020)
sample_2 = sampling_methods.getHalf_highestPovCounties(censusData2015, censusData2020)
sample_3 = sampling_methods.get25PercLowestPov_25PercHighestPov(censusData2015, censusData2020)
```

Let's take a look at one of the sample set values plotted on a map.

Here, we'll consider `sample_1`, our random sample:

In [7]:

```
from IPython.display import Image, display
fig = county_plotter.plotCountyData_ne(sample_1)
img_bytes = pio.to_image(fig, format="png") # Convert the figure to an image in memory (PNG format)
Image(img_bytes)
```

Out[7]:

**Compare `sample_1` to one of the other two samples.**

We have sampled 50% of the counties in Nebraska. The remaining 50% of counties we must estimate. We can do so (with varying degrees of error) using interpolation methods.

**Interpolation**

- When you are given a set of known values, interpolation helps you estimate the unknown values.
- You may find an illustration of a simple case of "linear" interpolation below, where the red dots are the known values, and we are trying to estimate a value in between.

In [8]:

```
from IPython.display import Image, display
print("Linear Interpolation")
display(Image("https://gisgeography.com/wp-content/uploads/2016/05/Linear-Interpolation-2.png"))
```

Linear Interpolation

Spatial interpolation is a similar method, but applied to higher dimensional data.

Examples where it makes sense to apply spatial interpolation methods include:

- Estimating the rainfall in various neighborhoods when you have some rainfall data of the surrounding neighborhoods.
- It makes sense to apply spatial interpolation here because the data is spatially correlated. That is, it’s more likely to rain 1 meter away compared to 500 meters away.

There are many spatial interpolation algorithms to choose from, e.g., IDW, kriging, spline, etc.

IDW is a simple spatial interpolation algorithm that is often used. We will only consider IDW for this lab.

**How IDW Works:**

IDW allows you to estimate one point by conducting a weighted average of the neighboring points around it. The farther away a point is, the less it contributes to the estimate.

There are two main settings to IDW that you can change to improve your estimations.

**Number of Neighbors**

- The number of neighbors to consider in your estimation. The default tends to be about 5 neighbors.

**Power**

- The power governs how much neighboring points of different distances impact the value of the estimated point.
- A lower power allows farther points to have a higher impact on the estimated point.
- A higher power makes neighbors farther away impact the estimated point less.
- The default power tends to be 2.
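The weighted average described above can be sketched in a few lines of numpy. This is a minimal illustration, not the notebook's `IDWInterpolation` class (which works with real county coordinates):

```python
import numpy as np

def idw_estimate(known_xy, known_vals, target_xy, power=2, num_neighbors=5):
    """Estimate the value at target_xy as a distance-weighted average of
    its nearest known neighbors. Assumes target_xy is not a known point
    (a zero distance would divide by zero)."""
    known_xy = np.asarray(known_xy, dtype=float)
    known_vals = np.asarray(known_vals, dtype=float)
    dists = np.linalg.norm(known_xy - np.asarray(target_xy, dtype=float), axis=1)
    nearest = np.argsort(dists)[:num_neighbors]   # indices of closest points
    weights = 1.0 / dists[nearest] ** power       # closer points weigh more
    return float(np.sum(weights * known_vals[nearest]) / np.sum(weights))

# Toy data: a high value nearby (distance 1) and lower values farther out.
pts = [(0, 1), (0, 3), (0, 6)]
vals = [10.0, 4.0, 2.0]
est_p1 = idw_estimate(pts, vals, (0, 0), power=1, num_neighbors=3)
est_p2 = idw_estimate(pts, vals, (0, 0), power=2, num_neighbors=3)
```

With `power=1` the estimate is pulled down by the farther, lower-valued points; with `power=2` the nearby high value dominates, so `est_p2 > est_p1`, matching the behavior described above.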

The image below illustrates a plane with four known values (in red) and a value we want to interpolate (in purple). In this illustration, the estimate of the unknown point will be some weighted average of 3 of its closest neighbors. We can vary the number of neighbors we consider in pursuit of better results.

In [9]:

```
print("Inverse Distance Weighting with 3 points")
display(Image("https://gisgeography.com/wp-content/uploads/2016/05/IDW-3Points.png"))
```

Inverse Distance Weighting with 3 points

**Power Setting: 1**

Immediately below, you'll find an illustration where an unknown point is estimated using a 3-neighbor IDW with a power of 1. The image that follows illustrates the same data and settings, except the power is set to 2.

While it doesn't make a huge impact in this case, you can see that in the case of `power = 1`, the estimated value is lower because the points farther away are having more impact.

In the latter illustration, when `power = 2`, the estimated value is greater because the point closest to it is contributing more to the weighted average.

In [10]:

```
print("Illustration of spatial interpolation with a power of 1")
display(Image("https://gisgeography.com/wp-content/uploads/2016/05/IDW-Power1.png"))
print("Illustration of spatial interpolation with a power of 2")
display(Image("https://gisgeography.com/wp-content/uploads/2016/05/IDW-Power2.png"))
print("Inverse Distance Weighting formula")
display(Image("https://gisgeography.com/wp-content/uploads/2016/05/idw-formula.png"))
```

Illustration of spatial interpolation with a power of 1

Illustration of spatial interpolation with a power of 2

Inverse Distance Weighting formula

**Formula:**
You only need to understand this at a conceptual level, so feel free to ignore this formula. If you would like to develop a stronger mathematical intuition, here is the general formula for IDW.

Notice that we use a custom function, `interpolate()`, to which we provide the census data, our sampled data, the IDW power, and the number of neighbors.

In this example, our sample is `sample_1`, a random sample of 50% of the counties in Nebraska. Our `power` is set to 2, and the number of neighbors, `numNeighbors`, is set to consider the nearest 5 neighbors.

In [12]:

```
# Instantiate the `IDWInterpolation` class:
idw_interpolation = IDWInterpolation()
# Interpolate the values
sample_1_interp = idw_interpolation.interpolate(censusData2020, sample_1, power=2, numNeighbors=5)
# Plot a sampled and interpolated data
fig = idw_interpolation.plotSampleWithInterp(sample_1, sample_1_interp, county_plotter)
img_bytes = pio.to_image(fig, format="png") # Convert the figure to an image in memory (PNG format)
Image(img_bytes)
```

Out[12]:

There is no straightforward answer to this. We can consider some common measures, but we may want to develop a measure more customized for our context. In general, our performance measure is going to be some function of the error between our poverty rate estimates and the actual value.

**Recall the General Goal:** To develop a map with estimated values that may eventually be used to inform how funds are distributed to people in poverty.

We want our performance measure to align with this aim. We can change elements of our method design, and check the performance using our performance measure. We can iteratively adapt the design and optimize to find what we think will be the 'best' design.

**Performance Measures We'll Consider:**

**Average Percent Error**

- In general, the error of an estimate is just: `actualValue - estimatedValue`
- The average percent error is just the average of these errors.
- **Limitation:** a key limitation of this method is that we can report a 0% error even when we have large errors. Consider the case where we overestimate one value by 10% and underestimate another by 10%. The average error of these two estimates would come out to 0%.

**Average Absolute Error**

- Average absolute error takes the average of the absolute value of the percent error. This way, we can see the average magnitude of error.
- **Limitation:** a huge error counts just about as much as a small error.

**Mean Squared Error**

- The mean squared error squares each of the errors and then averages them.
- Since the error is squared before it is averaged, large errors penalize the performance score by a lot more than small errors do.
- **Limitation:** the mean squared error is a relative measure and only really makes sense when you are comparing the mean squared errors of two models.

**Root Mean Squared Error**

- This method is the square root of the mean squared error
- This is often used in machine learning

**Binary Classification of Error Below a Threshold**

- This measure calculates the percentage of estimations whose absolute error is less than a certain value. (e.g., what percent of our estimations have less than a 3% error?)

**Errors 1-4 by Poverty Quartile** (lowest poverty, "middle-low" poverty, "middle-high" poverty, and highest poverty)

- We want to make sure we have high-accuracy estimates particularly for households in high poverty, since resource distribution decisions may be made on this basis.
- It may make sense for us to look at the error measures among subgroups. This measure computes the error measures above, but does so for each poverty quartile.
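The measures above can be sketched with numpy and pandas. These are hypothetical helpers, not the notebook's `ErrorCalculator`, but they produce the same family of numbers:

```python
import numpy as np
import pandas as pd

def error_measures(actual, estimated, threshold=3.0):
    """Compute the error measures described above for paired values."""
    actual = np.asarray(actual, dtype=float)
    errors = actual - np.asarray(estimated, dtype=float)
    return {
        "Average Error": errors.mean(),                  # signed errors can cancel
        "Average Absolute Error": np.abs(errors).mean(),
        "Mean Squared Error": (errors ** 2).mean(),
        "Root Mean Squared Error": np.sqrt((errors ** 2).mean()),
        "Percent Under Error Threshold": 100.0 * (np.abs(errors) < threshold).mean(),
    }

def measures_by_quartile(df, actual_col, est_col, threshold=3.0):
    """Split counties into poverty quartiles by actual rate, score each group."""
    quartile = pd.qcut(df[actual_col], 4, labels=[1, 2, 3, 4])
    return {q: error_measures(g[actual_col], g[est_col], threshold)
            for q, g in df.groupby(quartile, observed=True)}

# The cancellation pitfall: a +10 and a -10 error average to zero,
# while the absolute and squared measures still expose the problem.
m = error_measures([20.0, 20.0], [10.0, 30.0])
```

Here `m["Average Error"]` is 0 even though every individual estimate is off by 10, while the average absolute error and RMSE both report 10.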

In [13]:

```
# Get the real values that correspond to the interpolated values
sample_1_interp_withActual = idw_interpolation.getRealvaluesGivenInterpolated(censusData2020, sample_1_interp)
# Calculate the error
error_calculator = ErrorCalculator()
percentBound = 3
errors = error_calculator.getErrors(sample_1_interp_withActual, percentBound, printErrors=True)
```

In [14]:

```
result = error_calculator.getErrorsByQuartile(sample_1_interp_withActual, percentBound, printErrors=True)
quartile_errors = result['quartile_errors']
poverty_ranges = result['poverty_ranges']
quartile_errors
```

ERROR FOR QUARTILE #1 -- 3.2% poverty to 6.7% poverty:

ERROR FOR QUARTILE #2 -- 6.8% poverty to 10.8% poverty:

ERROR FOR QUARTILE #3 -- 10.9% poverty to 15.3% poverty:

ERROR FOR QUARTILE #4 -- 15.4% poverty to 36.0% poverty:

Out[14]:

{'Quartile 1 Errors': {'Average Error': 5.036136112119926, 'Average Absolute Error': 5.036136112119926, 'Mean Squared Error': 26.95467124810375, 'Root Mean Squared Error': 5.191788829305729, 'Percent Under Error Threshold': '0.0%'}, 'Quartile 2 Errors': {'Average Error': 0.40718513287355773, 'Average Absolute Error': 2.4841513297567284, 'Mean Squared Error': 9.767625895584974, 'Root Mean Squared Error': 3.125320126896599, 'Percent Under Error Threshold': '80.0%'}, 'Quartile 3 Errors': {'Average Error': -2.2133597558727853, 'Average Absolute Error': 2.973863697072576, 'Mean Squared Error': 12.979338252317518, 'Root Mean Squared Error': 3.6026848671952307, 'Percent Under Error Threshold': '61.538%'}, 'Quartile 4 Errors': {'Average Error': -8.598379593196103, 'Average Absolute Error': 8.598379593196103, 'Mean Squared Error': 106.7169158677671, 'Root Mean Squared Error': 10.330387982441275, 'Percent Under Error Threshold': '8.333%'}}

In [15]:

```
error_visualizer = ErrorVisualizer() # Create an instance of ErrorVisualizer class
error_visualizer.plot_error_barchart(quartile_errors, poverty_ranges)
```

Based on the example quartile errors provided below, we can interpret the performance of the interpolation method and sampling techniques on the poverty data. Here's an example interpretation of the errors and insights that can help you with the exercises:

**Quartile 1 (3.2% to 7.6% poverty):** High errors, indicating lower accuracy for counties with lower poverty rates.

- Average Absolute Error: 6.73%
- Root Mean Squared Error: 7.14%
- Predictions within 3% error threshold: 8.33%

**Quartile 2 (8.5% to 10.9% poverty):** Lower errors, suggesting better performance for counties with mid-range poverty rates.

- Average Absolute Error: 2.24%
- Root Mean Squared Error: 2.85%
- Predictions within 3% error threshold: 81.82%

**Quartile 3 (11.0% to 13.4% poverty):** Low errors, indicating good performance for counties with mid-range poverty rates.

- Average Absolute Error: 1.79%
- Root Mean Squared Error: 2.3%
- Predictions within 3% error threshold: 83.33%

**Quartile 4 (13.6% to 36.0% poverty):** High errors, suggesting lower accuracy for counties with higher poverty rates.

- Average Absolute Error: 6.34%
- Root Mean Squared Error: 8.89%
- Predictions within 3% error threshold: 41.67%

**Sampling Method:** Random selection (the `getHalfCounties_random` function). May not be representative of the entire dataset, leading to uneven distribution and affecting interpolation accuracy.

**Interpolation Method:** Inverse Distance Weighting (the `standard_idw` function). The choice of power and the number of nearest neighbors can influence accuracy.

The interpolation method performs well for mid-range poverty rates (Quartiles 2 and 3) but has higher errors for lower and higher poverty rates (Quartiles 1 and 4). Possible reasons:

- Random sampling method might not provide a representative sample across poverty levels, leading to insufficient data for accurate interpolation.
- The choice of power and the number of nearest neighbors in IDW method might not be optimal for all poverty levels, causing bias in estimates.

These design decisions have implications:

- Inaccurate poverty estimates can impact resource allocation and support for people in need, causing disparities in assistance.
- Random sampling method might not capture spatial patterns of poverty effectively, leading to less accurate interpolation.
- The choice of power and the number of nearest neighbors in IDW method can influence trade-offs between accuracy, fairness, and transparency, with ethical implications.

In this section, you will work on improving the interpolation method settings and the sampling techniques. You will also discuss and analyze the ethical implications, development aspects, and stakeholder considerations of your approach.

**Exercise 1: Interpolation Methods and Sampling Techniques**

- Modify the code below and try different interpolation method settings on different samples. Record your results for comparison.
- Experiment with various sampling methods and analyze their performance using the performance measures discussed earlier.

**Exercise 2: Identifying Stakeholders and Evaluating Impacts**

- Identify the direct and indirect stakeholders impacted by your methodological choices.
- Analyze the potential consequences of your choices on these stakeholders, considering issues like accuracy, fairness, and transparency.

**Exercise 3: Ethical Considerations and Development Aspects**

- Discuss the ethical implications of using different sampling methods and interpolation techniques when estimating poverty rates.
- Reflect on the potential consequences of inaccurate estimates on the distribution of resources to people in poverty, and the implications for sustainable development goals.

**Exercise 4: Balancing Trade-offs and Navigating Dilemmas**

- Evaluate the trade-offs between accuracy, fairness, and transparency in your methodological choices.
- Explore potential solutions to ethical dilemmas you encounter in designing your methods, considering the needs and concerns of various stakeholders.

**Collaborative Activity: Group Discussion and Presentation**

- Form small groups to discuss your findings from the exercises above. Share your insights and debate the pros and cons of different models, sampling methods, and performance measures in the context of ethical considerations and development impacts.
- Each group will present their most promising model, sampling method, and performance measure, along with their reasoning and a discussion of ethical considerations, stakeholder implications, and development aspects.

In [18]:

```
# Interpolate the values
power = 2 # consider changing this value and observe and interpret any changes in performance
numNeighbors = 5 # consider changing this value and observe and interpret any changes in performance
sample = sample_2 # consider changing this value and observe and interpret any changes in performance
sample_2_interp = idw_interpolation.interpolate(censusData2020, sample, power, numNeighbors)
# Plot a sampled and interpolated data
fig = idw_interpolation.plotSampleWithInterp(sample_2, sample_2_interp, county_plotter)
img_bytes = pio.to_image(fig, format="png") # Convert the figure to an image in memory (PNG format)
Image(img_bytes)
```

Out[18]: