Estimating populations accurately is crucial for urban planning, disaster response, and resource management. Traditional methods of population estimation often rely on census data, which can be time-consuming and expensive to collect. However, advancements in satellite imagery and data science offer new avenues for rapid and cost-effective population estimation.

One such approach involves using nighttime lights data from NASA. The rationale is simple: regions with more intense nighttime lights are typically more populated. This method can provide timely and detailed insights into population distribution, especially in areas where census data might be outdated or unavailable.

In this R Markdown document, we will explore how to use nighttime lights data to estimate populations. Our approach involves several key steps:

  1. Data Preparation: We will start by preparing the data, which includes reading and cleaning shapefiles, integrating cell tower data, and transforming spatial data into a consistent coordinate system.

  2. Model Building: Next, we will define and train several models, including OLS, BART, SVM, Random Forest, and XGBoost, to predict population based on the nighttime lights data and other relevant predictors.

  3. Model Evaluation: We will evaluate the performance of these models using metrics such as RMSE and compare them against the LandScan dataset from Oak Ridge National Laboratory (ORNL), which serves as a baseline.

  4. Visualization and Analysis: Finally, we will visualize the performance of our models across different countries and save the results for further analysis.

By the end of this document, we aim to demonstrate the feasibility and effectiveness of using nighttime lights data for population estimation, providing a valuable tool for policymakers and researchers alike.

Environment Configuration and Library Import

options(java.parameters = "-Xmx50g") # For BART. I imagine we will need some RAM; set this before loading rJava so the JVM gets the larger heap
require(rJava)
require(bartMachine)
set_bart_machine_num_cores(25)
## bartMachine now using 25 cores.
library(xtable)
library(sf)
library(tidyverse)
library(randomForest)
library(e1071) # For SVM
library(tools) # For toTitleCase
library(xgboost)

all_data <- c() # Container for per-country results; rows are appended at the end of each loop iteration


Model Building Functions

To get started, we need a robust toolkit of model building functions. These functions will help us train various models, assess their performance, and fine-tune them to ensure our predictions are as accurate as possible. Whether you’re a seasoned data scientist or a curious beginner, the functions we cover here will provide a solid foundation for your work with nighttime lights data.

First up, we’ll look at how to calculate key performance metrics like RMSE (Root Mean Squared Error) and MAPE (Mean Absolute Percentage Error). These metrics will help us understand how well our models are performing. Then, we’ll explore several different models, including BART (Bayesian Additive Regression Trees), SVM (Support Vector Machine), Random Forest, and XGBoost. Each of these models has its strengths and weaknesses, and by comparing their performance, we can choose the best tool for the job.
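Concretely, with $y_i$ the observed population for unit $i$, $\hat{y}_i$ the model's prediction, and $n$ the number of test observations, these are the standard definitions (and what the helper functions below implement):

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}, \qquad
\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|.
$$

Lower is better for both; MAPE is scale-free, which makes it easier to compare across countries, but it blows up when the observed value is near zero.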

Finally, we’ll discuss how to generate informative plots and descriptive statistics to visualize and summarize our data. These visual tools are crucial for interpreting our results and communicating them effectively to others.

Let’s dive in and start building some powerful models! While we are at it, let’s create some plotting functions as well!

# Function to calculate RMSE
calculate_rmse <- function(predictions, actual) {
  sqrt(mean((predictions - actual) ^ 2, na.rm = TRUE))
}

# Function to calculate MAPE (Mean Absolute Percentage Error)
calculate_mape <- function(predictions, actual) {
  mean(abs((actual - predictions) / actual), na.rm = TRUE) * 100
}
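
# Quick sanity check of the two metrics on toy numbers (purely illustrative):
# the predictions miss by 10, 10, and 50, so the RMSE is 30 and the MAPE is roughly 8.1 percent.
calculate_rmse(c(100, 250, 400), c(110, 240, 450))
calculate_mape(c(100, 250, 400), c(110, 240, 450))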


# Function for BART model RMSE
calculate_bart_rmse <- function(predictors, controls, train_data, test_data, dv) {
  X_train <- train_data[, c(predictors, controls)]
  X_test <- test_data[, c(predictors, controls)]
  model <- bartMachine(X = X_train, y = train_data[, dv])
  predictions <- predict(model, new_data = X_test)
  calculate_rmse(predictions, test_data[, dv])
}

# Function for SVM model RMSE
calculate_svm_rmse <- function(predictor, controls, train_data, test_data, dv) {
  control_terms <- paste(controls, collapse = " + ")
  formula <- as.formula(paste(dv, "~", predictor, "+", control_terms))
  model <- svm(formula, data = train_data)
  predictions <- predict(model, newdata = test_data)
  calculate_rmse(predictions, test_data[, dv])
}

# Function for Random Forest model RMSE
calculate_rf_rmse <- function(predictor, controls, train_data, test_data, dv) {
  control_terms <- paste(controls, collapse = " + ")
  formula <- as.formula(paste(dv, "~", predictor, "+", control_terms))
  model <- randomForest(formula, data = train_data)
  predictions <- predict(model, newdata = test_data)
  calculate_rmse(predictions, test_data[, dv])
}



# Function for XGBoost model RMSE with Cross-Validation
calculate_xgboost_rmse <- function(predictors, controls, train_data, test_data, dv) {
  # Prepare data for xgboost (using DMatrix)
  data_matrix <- xgb.DMatrix(data = as.matrix(train_data[, c(predictors, controls)]), label = train_data[, dv])
  test_matrix <- xgb.DMatrix(data = as.matrix(test_data[, c(predictors, controls)]), label = test_data[, dv])

  # Parameters for the XGBoost model with DART booster
  params <- list(
    booster = "dart",
    objective = "reg:squarederror",
    eta = 0.1,
    max_depth = 10,  # Increased depth
    subsample = 0.5,
    colsample_bytree = 0.5,
    sample_type = "uniform",
    normalize_type = "tree",
    rate_drop = 0.1  # Dropout rate
  )

  # Cross-validation to find the optimal number of rounds
  cv_results <- xgb.cv(
    params = params, 
    data = data_matrix, 
    nrounds = 500,
    nfold = 5, # Number of folds in CV
    metrics = "rmse",
    early_stopping_rounds = 10
  )

  best_nrounds <- cv_results$best_iteration

  # Train the final model on the full training set
  final_model <- xgb.train(params = params, data = data_matrix, nrounds = best_nrounds)

  # Make predictions on the test set
  predictions <- predict(final_model, test_matrix)
  calculate_rmse(predictions, test_data[, dv])
}
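
# Usage sketch (illustrative only): every RMSE helper above shares the same signature
# (predictor(s), a vector of controls, a train/test split, and the dependent variable's name)
# and returns an out-of-sample RMSE. The data here are simulated; the real split and
# predictors are constructed in the country loop further below.
sketch_data <- data.frame(
  Population  = rpois(200, lambda = 500),
  Mean_stack  = rnorm(200),
  Elevation   = rnorm(200),
  Cell_Towers = rpois(200, lambda = 3)
)
sketch_idx   <- sample(nrow(sketch_data), size = floor(0.7 * nrow(sketch_data)))
sketch_train <- sketch_data[sketch_idx, ]
sketch_test  <- sketch_data[-sketch_idx, ]
calculate_rf_rmse("Mean_stack", c("Elevation", "Cell_Towers"), sketch_train, sketch_test, "Population")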



# Function to automatically generate and save a choropleth plot for each variable
generate_plots <- function(data, country_name) {
  # Exclude geometry column and identify variable names
  variables <- setdiff(names(data), c("geometry"))

  # Generate a plot for each variable
  for (variable in variables) {
    # Format the title
    title <- gsub("_", " ", variable)
    title <- tools::toTitleCase(title)
    full_title <- paste(title, "for", country_name)

    # Generate the plot using tidy evaluation
    p <- ggplot(data) +
      geom_sf(aes(fill = log(!!sym(variable)))) +
      scale_fill_viridis_c(na.value = "lightgrey") +
      labs(title = full_title, fill = paste("Log", title)) +
      theme_minimal()

    # Save the plot
    plot_filename <- paste0("Plots/", country_name, "_", variable, ".pdf")
    ggsave(plot_filename, p, width = 10, height = 7)
  }
}



# Modified Function to calculate and save descriptive statistics as a LaTeX table
calculate_and_save_descriptive_stats <- function(data, country_name) {
  # Ensure data is a dataframe
  if (!is.data.frame(data)) {
    stop("Input must be a dataframe")
  }

  # Select only numeric columns
  numeric_data <- data[, sapply(data, is.numeric)]

  # Calculate summary statistics for each numeric column
  stats <- data.frame(
    Variable = names(numeric_data),
    Mean   = apply(numeric_data, 2, mean,   na.rm = TRUE),
    Median = apply(numeric_data, 2, median, na.rm = TRUE),
    SD     = apply(numeric_data, 2, sd,     na.rm = TRUE),
    Min    = apply(numeric_data, 2, min,    na.rm = TRUE),
    Max    = apply(numeric_data, 2, max,    na.rm = TRUE)
  )
  rownames(stats) <- NULL  # Remove row names

  # Replace underscores with spaces in variable names
  stats$Variable <- gsub("_", " ", stats$Variable)

  # Convert to LaTeX table using xtable
  latex_table <- xtable(stats, caption = paste("Descriptive Statistics for", country_name))

  # Create filename for the table
  table_filename <- paste0("Tables/Descriptive_Stats_for_", country_name, ".tex")

  # Use sink to redirect output to file
  sink(table_filename)
  print(latex_table, include.rownames = FALSE, floating = TRUE, hline.after = c(-1, 0, nrow(stats)))
  sink()

  # Notify user about file creation
  message("Descriptive statistics saved as LaTeX table in: ", table_filename)
}

Processing Nighttime Lights Data

Now that we have our model-building functions, let’s see how we can apply them to estimate populations using nighttime lights data from NASA.

Data Preparation:

We start by listing all the folders containing shapefiles in our dataset. Each folder represents a different country. We read these shapefiles and select the relevant columns for analysis, renaming them for clarity.

Handling missing or invalid data is crucial. We replace negative values with NA and filter out extreme values likely to be errors. We also transform the spatial data to a common coordinate system (WGS 84) for consistency.

Cell Tower Data Integration:

Next, we incorporate cell tower data, which provides additional information about human activity. By converting the cell tower data into a spatial format, we can join it with our shapefiles and count the number of cell towers in each geographic area. We expect this count to be a useful predictor of population and economic activity.
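The per-country loop later in this document does exactly this; as a minimal sketch of the pattern (assuming, as in the real data, that the towers CSV has longitude and latitude columns and that the polygons carry an ID column), the counting boils down to a spatial join followed by a tally:

towers_sf <- st_as_sf(cell_towers, coords = c("longitude", "latitude"), crs = st_crs(shape))

# Attach the polygon attributes to each tower, then count towers per polygon ID
tower_counts <- st_join(towers_sf, shape) |>
  st_drop_geometry() |>
  count(ID, name = "Cell_Towers")

# Polygons with no towers get a count of zero
shape <- left_join(shape, tower_counts, by = "ID") |>
  mutate(Cell_Towers = replace_na(Cell_Towers, 0))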

Data Splitting and Model Training:

Once our data is ready, we split it into training (70%) and testing (30%) sets. The training set is used to fit our models, while the testing set evaluates their out-of-sample performance. We define population as the dependent variable and use the nighttime lights statistics (the mean and sigma stacks) as the main predictors, with elevation, precipitation, minimum temperature, land cover, albedo, and cell tower counts as controls.
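One small note: the split in the loop below calls sample() without setting a seed, so the exact train/test partition will differ between runs. If you want reproducible results, a seed can be set first (a minimal sketch; the set.seed(2024) line is an optional addition, not something the loop itself includes):

set.seed(2024)  # optional: makes the 70/30 split reproducible
train_indices <- sample(nrow(shape), size = floor(0.7 * nrow(shape)))
train_set <- shape[train_indices, ]
test_set  <- shape[-train_indices, ]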

We use various models to estimate population from these predictors, including OLS regression, BART, SVM, Random Forest, and XGBoost. Comparing their performance helps us choose the best model for our specific needs.

Model Evaluation:

To assess the accuracy of our models, we calculate the Root Mean Squared Error (RMSE) for each one. RMSE gives us an idea of how close our predictions are to the actual population values. We also compare our models’ performance to a baseline provided by the LandScan data, a well-established population dataset.

Visualizing and Saving Results:

Finally, we generate plots to visualize our results and save descriptive statistics and model performance metrics in LaTeX tables. These visual and tabular summaries are crucial for interpreting our findings and communicating them effectively.

By following these steps, we transform raw satellite images and auxiliary data into actionable insights about population distribution.

# List folders in 'gpw-v4_Shapefiles'
folders <- list.dirs("gpw-v4_Shapefiles", full.names = TRUE, recursive = FALSE)

for (folder in folders) {
  country_name <- basename(folder)
  shapefile_path <- file.path(folder, "intersection.shp")
  
  # Read shapefile and select / rename the relevant columns
  shape <- st_read(shapefile_path) %>%
    select(
      gpw.v4.pop,
      output_mea,
      output_sig,
      landscan.g,
      bio_6,
      bio_12,
      matches("^Elevation"),  # Selects any column that starts with 'Elevation'
      GLOBCOVER_,
      Albedo_dat
    ) %>%
    rename(Mean_stack = output_mea, LandScan = landscan.g,
           Population = gpw.v4.pop, Sigma_stack = output_sig,
           GLOBCOVER = GLOBCOVER_, Min_Temp = bio_6,
           Precipitation = bio_12, Albedo = Albedo_dat,
           Elevation = matches("^Elevation")) %>%
    # Negative values are invalid, and values above 65,000 are treated as data errors
    mutate(across(c(Population, LandScan, Albedo), ~ if_else(. < 0, NA_real_, .))) %>%
    mutate(across(where(is.numeric), ~ replace(., . > 65000, NA))) %>%
    na.omit()
  
  # Transform data to WGS 84 (EPSG:4326)
  shape <- st_transform(shape, 4326)
  
    # Construct the file path using sprintf()
  file_path <- sprintf("OpenSource_Cell_data/%s/%s_celltower_coordinates.csv", country_name, country_name)
  
  # Read the CSV file using the constructed file path
  cell_towers <- read.csv(file_path)
  
  # Convert the cell tower data frame to a spatial object
  # (the CSV stores coordinates in 'longitude' and 'latitude' columns)
  cell_towers_sf <- st_as_sf(cell_towers, coords = c("longitude", "latitude"), crs = st_crs(shape))
  
  shape <- shape |> 
    mutate(ID = 1:n())
  
  
  points_in_shapes  <- st_join(cell_towers_sf, shape)
  
  # Count the number of points in each shape
  count_per_shape <- points_in_shapes %>%
    group_by(ID) %>%
    summarise(Cell_Towers = n()) |> 
    st_drop_geometry() |> 
    as.data.frame()
  
  # Merge the count back 
  shape <- shape %>%
    left_join(count_per_shape, by = "ID")
  
  # Using mutate() and replace_na()
  shape <- shape %>%
    mutate(Cell_Towers = replace_na(Cell_Towers, 0))

  # Count the number of observations
  num_observations <- nrow(shape)

  # Generate plots (modify your plot function to save plots)
  generate_plots(shape, country_name)
  
  # Data preparation for model
  st_geometry(shape) <- NULL
  train_indices <- sample(nrow(shape), size = floor(0.7 * nrow(shape)))
  train_set <- shape[train_indices, ]
  test_set <- shape[-train_indices, ]

  # Define dependent variables and predictors
  dvs <- "Population"
  predictors <- c("Sigma_stack", "Mean_stack")
  controls <- c("GLOBCOVER", "Min_Temp", "Precipitation", "Elevation", "Albedo", "Cell_Towers")

  # Store RMSEs
  rmse_results <- list(Land = list(), OLS = list(), BART = list(), XGBoost = list(), SVM = list(), RF = list())

for (predictor in predictors) {
  # OLS Model with Polynomial terms 
  print(paste(predictor, "Starting"))
  control_terms <- paste(controls, collapse = " + ")
  formula <- as.formula(paste(dvs, "~ poly(", predictor, ", 2) + ", control_terms))
  model <- lm(formula, data = train_set)
  predictions <- predict(model, newdata = test_set)
  rmse_results$OLS[[predictor]] <- calculate_rmse(predictions, test_set[, dvs])
  
  # LandScan baseline: a simple linear fit of Population on LandScan
  model <- lm(Population ~ poly(LandScan, 1), data = train_set)
  predictions <- predict(model, newdata = test_set)
  rmse_results$Land[[predictor]] <- calculate_rmse(predictions, test_set[, dvs])

  print(paste(predictor, "OLS Complete"))

  print(paste(predictor, "BART Starting"))
  # BART Model
  rmse_results$BART[[predictor]] <- calculate_bart_rmse(predictor, controls, train_set, test_set, dvs)
    print(paste(predictor, "BART Complete"))

    # XGBoost Model
  rmse_results$XGBoost[[predictor]] <- calculate_xgboost_rmse(predictor, controls, train_set, test_set, dvs)
  print(paste(predictor, "XGBoost Complete"))
    

  # SVM Model
  rmse_results$SVM[[predictor]] <- calculate_svm_rmse(predictor, controls, train_set, test_set, dvs)
  
  print(paste(predictor, "SVM Complete"))


  # Random Forest Model
  rmse_results$RF[[predictor]] <- calculate_rf_rmse(predictor, controls, train_set, test_set, dvs)

  print(paste(predictor, "RF Complete"))

}

 # Split the data frame
landscan_df <- data.frame(LandScan = rmse_results$Land)
other_models_df <- cbind(OLS = rmse_results$OLS, BART = rmse_results$BART, GBM = rmse_results$XGBoost, SVM = rmse_results$SVM, RF = rmse_results$RF)


# Format row names
formatted_row_names <- gsub("_", " ", rownames(other_models_df))
formatted_row_names <- tools::toTitleCase(formatted_row_names)
rownames(landscan_df) <- "LandScan"
rownames(other_models_df) <- formatted_row_names

# Generate LaTeX tables
caption_landscan <- paste("LandScan for", country_name, "- Number of observations:", num_observations)
caption_other_models <- paste("Out of Sample RMSEs for Population for", country_name, "- Number of observations:", num_observations)

latex_landscan <- xtable(landscan_df, caption = caption_landscan)
latex_other_models <- xtable(other_models_df, caption = caption_other_models, label = paste("table:rmse_", country_name, sep = ""))

# Save RMSE table as LaTeX
table_filename <- paste0("Tables/RMSE_for_", country_name, ".tex")
sink(table_filename)

# Begin table environment
cat("\\begin{table}[ht]\n")

# Add a single caption for the entire table
full_caption <- paste("Out of Sample RMSEs for Population and LandScan for", country_name, "- Number of observations:", num_observations)
cat("\\caption{", full_caption, "}\n")

# Print LandScan table
cat("\\begin{minipage}{.5\\linewidth}\n")
print(latex_landscan, include.rownames = TRUE, floating = FALSE, hline.after = NULL, comment = FALSE)
cat("\\end{minipage}%\n")

# Print Other Models table
cat("\\begin{minipage}{.5\\linewidth}\n")
print(latex_other_models, include.rownames = TRUE, floating = FALSE, hline.after = NULL, comment = FALSE)
cat("\\end{minipage}\n")

# End table environment
cat("\\end{table}\n")

sink()


csv <- cbind(other_models_df, LandScan = as.vector(landscan_df), Stack = row.names(other_models_df), Country = country_name, N = num_observations)
csv_path <- paste0("Tables/RMSE_for_", country_name, ".csv")
write.csv(csv, csv_path)

results <- data.frame(OLS = unlist(rmse_results$OLS), BART = unlist(rmse_results$BART), GBM = unlist(rmse_results$XGBoost), SVM = unlist(rmse_results$SVM), RF = unlist(rmse_results$RF), Country = country_name, N = num_observations, LandScan = landscan_df[1,1], Stack = formatted_row_names)

calculate_and_save_descriptive_stats(shape, country_name)


all_data <- rbind(all_data, results)
}

Final Visualization and Analysis

With our models trained and evaluated, it’s time to visualize the performance across different countries. The goal is to compare the Root Mean Squared Error (RMSE) values for various models and see how they stack up against the LandScan baseline.

Data Transformation:

We begin by combining all the data into a single dataframe and converting it to a long format for easier plotting. This involves filtering out any outliers (like Portugal in this case) and restructuring the data so that each row represents a model’s RMSE value for a specific country.

LandScan Baseline:

For context, we include the LandScan RMSE values as horizontal dashed lines in our plots. This allows us to see how our models compare to this established baseline.

Creating the Plot:

We use ggplot2 to create a bar plot of the RMSE values for each model, with different shades representing the stack types (the mean and sigma nighttime lights stacks). The plot is faceted by country, making it easy to compare model performance across regions. Rotating the x-axis labels (via the angle and hjust arguments in theme()) keeps them readable.

Saving the Plot:

Finally, we save our plot as a PDF file, ensuring that it is properly formatted for sharing and presentation.

By visualizing the model performance in this way, we can easily identify which models perform best in different countries and how they compare to the LandScan baseline. This comprehensive analysis provides valuable insights into the effectiveness of using nighttime lights data for population estimation.

df <- all_data

row.names(df) <- NULL

# Transform data to long format
df_long <- df |> 
  filter(Country != "Portugal") |>
  pivot_longer( cols = c(OLS, BART, GBM, SVM, RF), names_to = "Model", values_to = "Value")

# Convert 'Model' to a factor with levels in the desired order
df_long$Model <- factor(df_long$Model, levels = c("OLS", "BART", "GBM", "SVM", "RF"))

# Create a new data frame for LandScan values by country
landscan_values <- df  |> 
  filter(Country != "Portugal") |>
  # filter(Country != "Philippines") |>
  dplyr::select(Country, LandScan) %>% 
  distinct(Country, LandScan)

# Plot with LandScan horizontal lines for each country
p2 <- ggplot(df_long, aes(x = Model, y = Value, fill = Stack)) + 
  geom_bar(stat = "identity", position = "dodge") +
  geom_hline(data = landscan_values, aes(yintercept = LandScan), color = "black", linetype = "dashed") +
  scale_fill_manual(values = c("lightgrey", "darkgrey")) +
  facet_wrap(~ Country) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(title = "Model Performance by Country", x = "Model", y = "RMSE")

p2

Conclusion

Our analysis shows that the models we developed perform comparably to the established LandScan dataset from Oak Ridge National Laboratory (ORNL) in estimating populations using nighttime lights data from NASA. This outcome demonstrates the potential of using satellite imagery for population estimation, providing a viable alternative to traditional methods.

Key Takeaways:

  1. Model Performance:
    • The RMSE values for our models are in line with those from the LandScan dataset, indicating that our approach is robust and reliable.
    • Different models, including OLS, BART, SVM, Random Forest, and XGBoost, show varying levels of effectiveness across different countries, highlighting the importance of model selection based on specific regional characteristics.
  2. Data Integration:
    • Incorporating additional data, such as cell tower locations, enhances the accuracy of our population estimates, demonstrating the value of integrating multiple data sources.
    • Handling missing and extreme values effectively is crucial for maintaining the quality and reliability of our models.
  3. Visualization and Interpretation:
    • Visualizing model performance across different countries provides clear insights into where our models excel and where improvements are needed.
    • The comparison with LandScan baseline values serves as a useful benchmark for assessing the effectiveness of our models.

Future Directions:

While our models perform well, there is always room for improvement. Future work could focus on:

    • Enhancing Data Quality: acquiring higher-resolution data and improving preprocessing techniques to reduce noise and improve accuracy.
    • Model Optimization: experimenting with additional model architectures and hyperparameter tuning to further enhance performance.
    • Expanding Data Sources: integrating more diverse data sources, such as socioeconomic indicators or high-resolution satellite imagery, to provide a more comprehensive view of population distribution.

In conclusion, this project demonstrates the feasibility and effectiveness of using nighttime lights data for population estimation. Our models perform on par with the LandScan dataset, offering a valuable tool for urban planning, disaster response, and resource management. By continuing to refine and expand upon this work, we can further enhance our ability to make informed decisions based on satellite imagery.

Daniel K Baissa
