Estimating populations accurately is crucial for urban planning, disaster response, and resource management. Traditional methods of population estimation often rely on census data, which can be time-consuming and expensive to collect. However, advancements in satellite imagery and data science offer new avenues for rapid and cost-effective population estimation.
One such approach involves using nighttime lights data from NASA. The rationale is simple: regions with more intense nighttime lights are typically more populated. This method can provide timely and detailed insights into population distribution, especially in areas where census data might be outdated or unavailable.
In this R Markdown document, we will explore how to use nighttime lights data to estimate populations. Our approach involves several key steps:
Data Preparation: We will start by preparing the data, which includes reading and cleaning shapefiles, integrating cell tower data, and transforming spatial data into a consistent coordinate system.
Model Building: Next, we will define and train several models, including OLS, BART, SVM, Random Forest, and XGBoost, to predict population based on the nighttime lights data and other relevant predictors.
Model Evaluation: We will evaluate the performance of these models using metrics such as RMSE and compare them against the LandScan dataset from Oak Ridge National Laboratory (ORNL), which serves as a baseline.
Visualization and Analysis: Finally, we will visualize the performance of our models across different countries and save the results for further analysis.
By the end of this document, we aim to demonstrate the feasibility and effectiveness of using nighttime lights data for population estimation, providing a valuable tool for policymakers and researchers alike.
# Set JVM memory before loading rJava/bartMachine -- java.parameters is read
# when the JVM starts, so it must come first. BART will need plenty of RAM.
options(java.parameters = "-Xmx50g")
require(rJava)
require(bartMachine)
set_bart_machine_num_cores(25)
## bartMachine now using 25 cores.
library(xtable)
library(sf)
library(tidyverse)
library(randomForest)
library(e1071) # For SVM
library(tools) # For toTitleCase
library(xgboost)
# Accumulator for per-country results, filled in the main loop below
all_data <- c()
To get started, we need a robust toolkit of model building functions. These functions will help us train various models, assess their performance, and fine-tune them to ensure our predictions are as accurate as possible. Whether you’re a seasoned data scientist or a curious beginner, the functions we cover here will provide a solid foundation for your work with nighttime lights data.
First up, we’ll look at how to calculate key performance metrics like RMSE (Root Mean Squared Error) and MAPE (Mean Absolute Percentage Error). These metrics will help us understand how well our models are performing. Then, we’ll explore several different models, including BART (Bayesian Additive Regression Trees), SVM (Support Vector Machine), Random Forest, and XGBoost. Each of these models has its strengths and weaknesses, and by comparing their performance, we can choose the best tool for the job.
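For reference, writing $\hat{y}_i$ for a prediction and $y_i$ for the observed value across $n$ test observations, the two metrics are:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}, \qquad \mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$

RMSE is expressed in the same units as the outcome (people, in our case), while MAPE expresses the average error as a percentage of the truth.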
Finally, we’ll discuss how to generate informative plots and descriptive statistics to visualize and summarize our data. These visual tools are crucial for interpreting our results and communicating them effectively to others.
Let’s dive in and start building some powerful models! While we are at it, let’s create some plotting functions as well!
# Function to calculate RMSE
calculate_rmse <- function(predictions, actual) {
  sqrt(mean((predictions - actual)^2, na.rm = TRUE))
}

# Function to calculate MAPE (undefined when actual == 0)
calculate_mape <- function(predictions, actual) {
  mean(abs((actual - predictions) / actual), na.rm = TRUE) * 100
}
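A quick sanity check of these helpers on made-up numbers:

preds <- c(100, 250, 400)
actual <- c(110, 240, 380)
calculate_rmse(preds, actual) # sqrt(mean(c(100, 100, 400))) = ~14.1
calculate_mape(preds, actual) # ~6.2 percent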
# Function for BART model RMSE
calculate_bart_rmse <- function(predictors, controls, train_data, test_data, dv) {
  X_train <- train_data[, c(predictors, controls)]
  X_test <- test_data[, c(predictors, controls)]
  model <- bartMachine(X = X_train, y = train_data[, dv])
  predictions <- predict(model, new_data = X_test)
  calculate_rmse(predictions, test_data[, dv])
}
# Function for SVM model RMSE
calculate_svm_rmse <- function(predictor, controls, train_data, test_data, dv) {
  control_terms <- paste(controls, collapse = " + ")
  formula <- as.formula(paste(dv, "~", predictor, "+", control_terms))
  model <- svm(formula, data = train_data)
  predictions <- predict(model, newdata = test_data)
  calculate_rmse(predictions, test_data[, dv])
}

# Function for Random Forest model RMSE
calculate_rf_rmse <- function(predictor, controls, train_data, test_data, dv) {
  control_terms <- paste(controls, collapse = " + ")
  formula <- as.formula(paste(dv, "~", predictor, "+", control_terms))
  model <- randomForest(formula, data = train_data)
  predictions <- predict(model, newdata = test_data)
  calculate_rmse(predictions, test_data[, dv])
}
# Function for XGBoost model RMSE with Cross-Validation
calculate_xgboost_rmse <- function(predictors, controls, train_data, test_data, dv) {
  # Prepare data for xgboost (using DMatrix)
  data_matrix <- xgb.DMatrix(data = as.matrix(train_data[, c(predictors, controls)]),
                             label = train_data[, dv])
  test_matrix <- xgb.DMatrix(data = as.matrix(test_data[, c(predictors, controls)]),
                             label = test_data[, dv])
  # Parameters for the XGBoost model with DART booster
  params <- list(
    booster = "dart",
    objective = "reg:squarederror",
    eta = 0.1,
    max_depth = 10,        # Increased depth
    subsample = 0.5,
    colsample_bytree = 0.5,
    sample_type = "uniform",
    normalize_type = "tree",
    rate_drop = 0.1        # Dropout rate
  )
  # Cross-validation to find the optimal number of rounds
  cv_results <- xgb.cv(
    params = params,
    data = data_matrix,
    nrounds = 500,
    nfold = 5,             # Number of folds in CV
    metrics = "rmse",
    early_stopping_rounds = 10
  )
  best_nrounds <- cv_results$best_iteration
  # Train the final model on the full training set
  final_model <- xgb.train(params = params, data = data_matrix, nrounds = best_nrounds)
  # Make predictions on the test set
  predictions <- predict(final_model, test_matrix)
  calculate_rmse(predictions, test_data[, dv])
}
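Before we reach the real shapefiles, here is a hypothetical call showing how these helpers fit together; the data frame and column names below are placeholders, not the actual dataset:

# Hypothetical usage on simulated data (placeholder names)
set.seed(1)
toy <- data.frame(pop = rpois(200, 500),
                  lights = rnorm(200, 10, 2),
                  elev = runif(200, 0, 1000))
idx <- sample(nrow(toy), 140)
calculate_rf_rmse("lights", controls = "elev",
                  train_data = toy[idx, ], test_data = toy[-idx, ], dv = "pop")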
# Function to automatically generate and format plots
generate_plots <- function(data, country_name) {
  # Make sure the output directory exists before saving
  dir.create("Plots", showWarnings = FALSE)
  # Exclude geometry column and identify variable names
  variables <- setdiff(names(data), "geometry")
  # Generate a plot for each variable
  for (variable in variables) {
    # Format the title
    title <- gsub("_", " ", variable)
    title <- tools::toTitleCase(title)
    full_title <- paste(title, "for", country_name)
    # Generate the plot using tidy evaluation
    p <- ggplot(data) +
      geom_sf(aes(fill = log(!!sym(variable)))) +
      scale_fill_viridis_c(na.value = "lightgrey") +
      labs(title = full_title, fill = paste("Log", title)) +
      theme_minimal()
    # Save the plot
    plot_filename <- paste0("Plots/", country_name, "_", variable, ".pdf")
    ggsave(plot_filename, p, width = 10, height = 7)
  }
}
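A hypothetical call, once a country's sf object is in hand (the country name here is purely illustrative):

# Writes one PDF per non-geometry column into Plots/
generate_plots(shape, "Ghana")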
# Function to calculate and save descriptive statistics as a LaTeX table
calculate_and_save_descriptive_stats <- function(data, country_name) {
  # Ensure data is a dataframe
  if (!is.data.frame(data)) {
    stop("Input must be a dataframe")
  }
  # Select only numeric columns
  numeric_data <- data[, sapply(data, is.numeric), drop = FALSE]
  # Calculate statistics (one row per variable)
  stats <- data.frame(
    Variable = colnames(numeric_data),
    Mean = apply(numeric_data, 2, mean, na.rm = TRUE),
    Median = apply(numeric_data, 2, median, na.rm = TRUE),
    SD = apply(numeric_data, 2, sd, na.rm = TRUE),
    Min = apply(numeric_data, 2, min, na.rm = TRUE),
    Max = apply(numeric_data, 2, max, na.rm = TRUE)
  )
  rownames(stats) <- NULL # Remove row names
  # Replace underscores with spaces in variable names
  stats$Variable <- gsub("_", " ", stats$Variable)
  # Convert to LaTeX table using xtable
  latex_table <- xtable(stats, caption = paste("Descriptive Statistics for", country_name))
  # Make sure the output directory exists, then create the filename
  dir.create("Tables", showWarnings = FALSE)
  table_filename <- paste0("Tables/Descriptive_Stats_for_", country_name, ".tex")
  # Use sink to redirect output to file
  sink(table_filename)
  print(latex_table, include.rownames = FALSE, floating = TRUE, hline.after = c(-1, 0, nrow(stats)))
  sink()
  # Notify user about file creation
  message("Descriptive statistics saved as LaTeX table in: ", table_filename)
}
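As a quick illustration, a hypothetical call on a made-up two-column data frame writes Tables/Descriptive_Stats_for_Testland.tex:

# Hypothetical call with made-up values
calculate_and_save_descriptive_stats(
  data.frame(Population = c(120, 380, 95), Night_Lights = c(3.2, 7.8, 1.1)),
  "Testland"
)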
Now that we have our model-building functions, let’s see how we can apply them to estimate populations using nighttime lights data from NASA.
Data Preparation:
We start by listing all the folders containing shapefiles in our dataset. Each folder represents a different country. We read these shapefiles and select the relevant columns for analysis, renaming them for clarity.
Handling missing or invalid data is crucial. We replace negative values with NA and filter out extreme values that are likely errors. We also transform the spatial data to a common coordinate system (WGS 84) for consistency.
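In miniature, and with toy values, the cleaning rules applied in the code below look like this (the 65,000 cutoff matches the threshold used there):

# Toy illustration: negatives become NA, implausibly large values become NA
toy <- data.frame(Population = c(120, -1, 70000))
toy %>%
  mutate(Population = if_else(Population < 0, NA_real_, Population)) %>%
  mutate(Population = replace(Population, Population > 65000, NA))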
Cell Tower Data Integration:
Next, we incorporate cell tower data, which provides additional information about human activity. By converting the cell tower data into a spatial format, we can join it with our shapefiles to count the number of cell towers in each geographic area. We expect this count to be an important predictor of economic activity.
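A minimal sketch of the point-in-polygon counting pattern, on toy geometry rather than the real towers:

# Toy example: count points falling inside a unit-square polygon
poly <- st_sf(ID = 1, geometry = st_sfc(st_polygon(list(rbind(
  c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)))), crs = 4326))
pts <- st_as_sf(data.frame(longitude = c(0.5, 0.2, 2), latitude = c(0.5, 0.8, 2)),
                coords = c("longitude", "latitude"), crs = 4326)
st_join(pts, poly) %>%      # attach each point's polygon ID (NA if outside)
  st_drop_geometry() %>%
  count(ID, name = "Cell_Towers")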
Data Splitting and Model Training:
Once our data is ready, we split it into training and testing sets: we fit our models on the training set and evaluate their performance on the testing set. We define the dependent variable (population) and a set of predictors (such as nighttime lights data, elevation, and precipitation).
We use various models to estimate population from these predictors, including OLS regression, BART, SVM, Random Forest, and XGBoost. Comparing their performance helps us choose the best model for our specific needs.
Model Evaluation:
To assess the accuracy of our models, we calculate the Root Mean Squared Error (RMSE) for each one. RMSE gives us an idea of how close our predictions are to the actual population values. We also compare our models’ performance to a baseline provided by the LandScan data, a well-established population dataset.
Visualizing and Saving Results:
Finally, we generate plots to visualize our results and save descriptive statistics and model performance metrics in LaTeX tables. These visual and tabular summaries are crucial for interpreting our findings and communicating them effectively.
By following these steps, we transform raw satellite images and auxiliary data into actionable insights about population distribution.
# List folders in 'gpw-v4_Shapefiles'
folders <- list.dirs("gpw-v4_Shapefiles", full.names = TRUE, recursive = FALSE)
for (folder in folders) {
country_name <- basename(folder)
shapefile_path <- file.path(folder, "intersection.shp")
# Read shapefile
shape <- st_read(shapefile_path) %>%
  select(
    gpw.v4.pop,
    output_mea,
    output_sig,
    landscan.g,
    bio_6,
    bio_12,
    matches("^Elevation"), # Selects any column that starts with 'Elevation'
    GLOBCOVER_,
    Albedo_dat
  ) %>%
  rename(Mean_stack = output_mea, LandScan = landscan.g,
         Population = gpw.v4.pop, Sigma_stack = output_sig,
         GLOBCOVER = GLOBCOVER_, Min_Temp = bio_6,
         Precipitation = bio_12, Albedo = Albedo_dat,
         Elevation = matches("^Elevation")) %>%
  # Negative values flag missing data; recode them as NA
  mutate(across(c(Population, LandScan, Albedo), ~if_else(. < 0, NA_real_, .))) %>%
  # Extreme values above 65000 are likely errors; recode them as NA too
  mutate(across(where(is.numeric), ~replace(., . > 65000, NA))) %>%
  na.omit()
# Transform data to WGS 84 (EPSG:4326)
shape <- st_transform(shape, 4326)
# Construct the file path using sprintf()
file_path <- sprintf("OpenSource_Cell_data/%s/%s_celltower_coordinates.csv", country_name, country_name)
# Read the CSV file using the constructed file path
cell_towers <- read.csv(file_path)
# Convert the cell tower data frame to a spatial object
# (the coordinate columns are named 'latitude' and 'longitude')
cell_towers_sf <- st_as_sf(cell_towers, coords = c("longitude", "latitude"), crs = st_crs(shape))
# Give each polygon a stable ID for joining counts back
shape <- shape %>%
  mutate(ID = 1:n())
points_in_shapes <- st_join(cell_towers_sf, shape)
# Count the number of points in each shape
count_per_shape <- points_in_shapes %>%
  group_by(ID) %>%
  summarise(Cell_Towers = n()) %>%
  st_drop_geometry() %>%
  as.data.frame()
# Merge the count back
shape <- shape %>%
  left_join(count_per_shape, by = "ID")
# Polygons with no towers get NA from the join; recode them as 0
shape <- shape %>%
  mutate(Cell_Towers = replace_na(Cell_Towers, 0))
# Count the number of observations
num_observations <- nrow(shape)
# Generate plots (modify your plot function to save plots)
generate_plots(shape, country_name)
# Data preparation for model
st_geometry(shape) <- NULL
# 70/30 train/test split (consider set.seed() beforehand for reproducibility)
train_indices <- sample(nrow(shape), size = floor(0.7 * nrow(shape)))
train_set <- shape[train_indices, ]
test_set <- shape[-train_indices, ]
# Define the dependent variable and predictors
dvs <- "Population"
predictors <- c("Sigma_stack", "Mean_stack")
controls <- c("GLOBCOVER", "Min_Temp", "Precipitation", "Elevation", "Albedo", "Cell_Towers")
# Store RMSEs
rmse_results <- list(Land = list(), OLS = list(), BART = list(), XGBoost = list(), SVM = list(), RF = list())
for (predictor in predictors) {
  # OLS Model with polynomial terms
  print(paste(predictor, "Starting"))
  control_terms <- paste(controls, collapse = " + ")
  formula <- as.formula(paste(dvs, "~ poly(", predictor, ", 2) + ", control_terms))
  model <- lm(formula, data = train_set)
  predictions <- predict(model, newdata = test_set)
  rmse_results$OLS[[predictor]] <- calculate_rmse(predictions, test_set[, dvs])
  # LandScan baseline: a simple linear fit of Population on LandScan
  model <- lm(Population ~ poly(LandScan, 1), data = train_set)
  predictions <- predict(model, newdata = test_set)
  rmse_results$Land[[predictor]] <- calculate_rmse(predictions, test_set[, dvs])
  print(paste(predictor, "OLS Complete"))
  print(paste(predictor, "BART Starting"))
  # BART Model
  rmse_results$BART[[predictor]] <- calculate_bart_rmse(predictor, controls, train_set, test_set, dvs)
  print(paste(predictor, "BART Complete"))
  # XGBoost Model
  rmse_results$XGBoost[[predictor]] <- calculate_xgboost_rmse(predictor, controls, train_set, test_set, dvs)
  print(paste(predictor, "XGBoost Complete"))
  # SVM Model
  rmse_results$SVM[[predictor]] <- calculate_svm_rmse(predictor, controls, train_set, test_set, dvs)
  print(paste(predictor, "SVM Complete"))
  # Random Forest Model
  rmse_results$RF[[predictor]] <- calculate_rf_rmse(predictor, controls, train_set, test_set, dvs)
  print(paste(predictor, "RF Complete"))
}
# Split the data frame
landscan_df <- data.frame(LandScan = rmse_results$Land)
other_models_df <- cbind(OLS = rmse_results$OLS, BART = rmse_results$BART, GBM = rmse_results$XGBoost, SVM = rmse_results$SVM, RF = rmse_results$RF)
# Format row names
formatted_row_names <- gsub("_", " ", rownames(other_models_df))
formatted_row_names <- tools::toTitleCase(formatted_row_names)
rownames(landscan_df) <- "LandScan"
rownames(other_models_df) <- formatted_row_names
# Generate LaTeX tables
caption_landscan <- paste("LandScan for", country_name, "- Number of observations:", num_observations)
caption_other_models <- paste("Out of Sample RMSEs for Population for", country_name, "- Number of observations:", num_observations)
latex_landscan <- xtable(landscan_df, caption = caption_landscan)
latex_other_models <- xtable(other_models_df, caption = caption_other_models, label = paste("table:rmse_", country_name, sep = ""))
# Save RMSE table as LaTeX
table_filename <- paste0("Tables/RMSE_for_", country_name, ".tex")
sink(table_filename)
# Begin table environment
cat("\\begin{table}[ht]\n")
# Add a single caption for the entire table
full_caption <- paste("Out of Sample RMSEs for Population and LandScan for", country_name, "- Number of observations:", num_observations)
cat("\\caption{", full_caption, "}\n")
# Print LandScan table
cat("\\begin{minipage}{.5\\linewidth}\n")
print(latex_landscan, include.rownames = TRUE, floating = FALSE, hline.after = NULL, comment = FALSE)
cat("\\end{minipage}%\n")
# Print Other Models table
cat("\\begin{minipage}{.5\\linewidth}\n")
print(latex_other_models, include.rownames = TRUE, floating = FALSE, hline.after = NULL, comment = FALSE)
cat("\\end{minipage}\n")
# End table environment
cat("\\end{table}\n")
sink()
csv <- cbind(other_models_df, LandScan = as.vector(landscan_df), Stack = row.names(other_models_df), Country = country_name, N = num_observations)
csv_path <- paste0("Tables/RMSE_for_", country_name, ".csv")
write.csv(csv, csv_path)
results <- data.frame(OLS = unlist(rmse_results$OLS), BART = unlist(rmse_results$BART), GBM = unlist(rmse_results$XGBoost), SVM = unlist(rmse_results$SVM), RF = unlist(rmse_results$RF), Country = country_name, N = num_observations, LandScan = landscan_df[1,1], Stack = formatted_row_names)
calculate_and_save_descriptive_stats(shape, country_name)
all_data <- rbind(all_data, results)
}
With our models trained and evaluated, it’s time to visualize the performance across different countries. The goal is to compare the Root Mean Squared Error (RMSE) values for various models and see how they stack up against the LandScan baseline.
Data Transformation:
We begin by combining all the data into a single dataframe and converting it to a long format for easier plotting. This involves filtering out any outliers (like Portugal in this case) and restructuring the data so that each row represents a model’s RMSE value for a specific country.
LandScan Baseline:
For context, we include the LandScan RMSE values as horizontal dashed lines in our plots. This allows us to see how our models compare to this established baseline.
Creating the Plot:
We use ggplot2 to create a bar plot of the RMSE values for each model, with different shades representing the stack types. The plot is faceted by country, making it easy to compare model performance across regions. Setting the angle and hjust parameters in the theme() function keeps the axis labels readable even when they are rotated.
Saving the Plot:
Finally, we save our plot as a PDF file, ensuring that it is properly formatted for sharing and presentation.
By visualizing the model performance in this way, we can easily identify which models perform best in different countries and how they compare to the LandScan baseline. This comprehensive analysis provides valuable insights into the effectiveness of using nighttime lights data for population estimation.
df <- all_data
row.names(df) <- NULL
# Transform data to long format
df_long <- df |>
  filter(Country != "Portugal") |>
  pivot_longer(cols = c(OLS, BART, GBM, SVM, RF), names_to = "Model", values_to = "Value")
# Convert 'Model' to a factor with levels in the desired order
df_long$Model <- factor(df_long$Model, levels = c("OLS", "BART", "GBM", "SVM", "RF"))
# Create a new data frame for LandScan values by country
landscan_values <- df |>
  filter(Country != "Portugal") |>
  dplyr::select(Country, LandScan) |>
  distinct(Country, LandScan)
# Plot with LandScan horizontal lines for each country
p2 <- ggplot(df_long, aes(x = Model, y = Value, fill = Stack)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_hline(data = landscan_values, aes(yintercept = LandScan), color = "black", linetype = "dashed") +
  scale_fill_manual(values = c("lightgrey", "darkgrey")) +
  facet_wrap(~ Country) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  labs(title = "Model Performance by Country", x = "Model", y = "RMSE")
p2
# Save the final figure as a PDF for sharing (filename is illustrative)
ggsave("Plots/Model_Performance_by_Country.pdf", p2, width = 10, height = 7)
Our analysis shows that the models we developed perform comparably to the established LandScan dataset from Oak Ridge National Laboratory (ORNL) in estimating populations using nighttime lights data from NASA. This outcome demonstrates the potential of using satellite imagery for population estimation, providing a viable alternative to traditional methods.
Key Takeaways:
- Models trained on nighttime lights plus auxiliary predictors (elevation, precipitation, land cover, cell towers) achieve out-of-sample RMSEs comparable to the LandScan baseline.
- Flexible methods such as BART, Random Forest, and XGBoost can capture relationships between lights and population that a simple linear fit would miss.
- Open data sources offer a timely, low-cost complement to census collection where census data are outdated or unavailable.
Future Directions:
While our models perform well, there is always room for improvement. Future work could focus on:
- Enhancing Data Quality: Acquiring higher-resolution data and improving data preprocessing techniques to reduce noise and improve accuracy.
- Model Optimization: Experimenting with additional model architectures and hyperparameter tuning to further enhance performance.
- Expanding Data Sources: Integrating more diverse data sources, such as socioeconomic indicators or high-resolution satellite imagery, to provide a more comprehensive view of population distribution.
In conclusion, this project demonstrates the feasibility and effectiveness of using nighttime lights data for population estimation. Our models perform on par with the LandScan dataset, offering a valuable tool for urban planning, disaster response, and resource management. By continuing to refine and expand upon this work, we can further enhance our ability to make informed decisions based on satellite imagery.
Daniel K. Baissa