Statistical Modeling with Logistic Regression

Getting Started

This exercise is going to rely on simulated data, so you don’t need to join a repository. You’ll just need to get our credentials and log in to a new RStudio session. I’m using simulation because one of the best ways to test whether you’ve specified your model correctly is to simulate the process with known coefficients. If your statistical model returns those values, then, assuming you’ve simulated a reasonable generating process, you should be able to trust that the model you fit is actually doing what you want it to.
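
To make that idea concrete before we build the full spatial example, here is a minimal sketch of the simulate-then-recover loop using only base R, a single made-up predictor, and hypothetical coefficients (-1 and 2); the estimates from glm should land close to those values.

Code
# Toy sketch of simulate-then-recover (not part of the exercise itself)
set.seed(1)
x <- rnorm(1000)                       # a made-up predictor
p <- plogis(-1 + 2 * x)                # known intercept (-1) and slope (2)
y <- rbinom(1000, size = 1, prob = p)  # simulate binary outcomes
coef(glm(y ~ x, family = binomial))    # estimates should be near -1 and 2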

Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(terra)
terra 1.8.70

Attaching package: 'terra'

The following object is masked from 'package:tidyr':

    extract
Code
library(geodata)
library(tidyterra)

Attaching package: 'tidyterra'

The following object is masked from 'package:stats':

    filter

Simulating data

Let’s imagine we’re developing a species distribution model for a species in Costa Rica. We’ll use the elevation data and bioclim variables from the geodata package to give us some spatial predictors. To keep this tractable (for now), we’ll just use the annual mean temperature and annual precipitation from bioclim. We subset them using the [[]] notation, then resample them to make sure everything is aligned (you might remember that the elevation data is higher resolution than the climate data). We also use raster math to convert the temperature to Celsius (see the geodata help for why this is necessary). Finally, we combine the data into one stack and plot each layer to make sure we’ve got what we expected.

Code
elevation <- geodata::elevation_30s(country = "CRI", 
                                    path = tempdir())
climate <- geodata::worldclim_country(country = "CRI", 
                                      var = "bio", 
                                      res = 0.5, 
                                      path = tempdir())


# bio1 = Annual Mean Temperature
# bio12 = Annual Precipitation
temp <- climate[[1]]      # bio1
precip <- climate[[12]]   # bio12

temp_aligned <- resample(temp, elevation)
precip_aligned <- resample(precip, elevation)

# Stack the predictors
predictors <- c(elevation, temp_aligned, precip_aligned)
names(predictors) <- c("elevation", "temperature", "precipitation")

predictors$temperature <- predictors$temperature / 10

plot(predictors)
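
If you want to convince yourself that the resampling worked, a quick check along these lines (one possible sketch, using terra’s compareGeom and res) will confirm that the layers now share the elevation grid.

Code
# Sanity check: the resampled layers should match the elevation grid
compareGeom(elevation, temp_aligned)    # TRUE if extent, resolution, and CRS match
compareGeom(elevation, precip_aligned)
res(predictors)                         # resolution of the combined stack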

Now that we’ve got our predictors, it’s time to generate some random observations!! We start by using set.seed to ensure we’re all starting from the same place. We then use spatSample to create a random sample of points within our study area and extract the predictor values at those points. Finally, we set up a tibble with each observation and the relevant predictor values and drop any NAs that may have arisen from our random sample.

Code
set.seed(123)
n_points <- 500

# Sample random locations from the study area
sample_points <- spatSample(elevation, size = n_points, 
                           method = "random", na.rm = TRUE, 
                           xy = TRUE, values = FALSE)

# Extract environmental values at these points
env_values <- terra::extract(predictors, sample_points)

species_data <- tibble(
  x = sample_points[,1],
  y = sample_points[,2],
  elevation = env_values$elevation,
  temperature = env_values$temperature,
  precipitation = env_values$precipitation
) %>%
  drop_na()  # Remove any NA values

Now comes the fun part - specifying the TRUE regression coefficient values. These are arbitrary, and you can/should change them to explore how the coefficient values and correlations in the data might affect your fit. We first calculate the linear predictor (the model without the link function; logit_p) by multiplying each variable by its appropriate beta and adding them together. Then we use plogis (the inverse of the logit link) to convert the linear predictor from the log-odds scale to the probability scale (remember the logit link!). Finally, we use the rbinom function to pull a random draw for each location with probability equal to our prob estimate. You can see how many presences and absences we ended up with, and the proportion of points that are occupied, by using summarize.

Code
# DEFINE THE TRUE COEFFICIENTS (these are what you should recover!)
true_beta0 <- -5        # intercept
true_beta1 <- 0.002     # elevation effect
true_beta2 <- 0.15      # temperature effect  
true_beta3 <- 0.0005    # precipitation effect

# Calculate the linear predictor (log-odds) and probability
species_data <- species_data %>%
  mutate(
    logit_p = true_beta0 + 
              true_beta1 * elevation + 
              true_beta2 * temperature + 
              true_beta3 * precipitation,
    prob = plogis(logit_p),
    presence = rbinom(n(), size = 1, prob = prob)
  )

# Check the prevalence
species_data %>%
  summarize(
    prevalence = mean(presence),
    n_present = sum(presence),
    n_absent = sum(presence == 0)
  )
# A tibble: 1 × 3
  prevalence n_present n_absent
       <dbl>     <int>    <int>
1      0.176        88      412

We can take a look at these points overlaid on the elevation raster to get a sense of where they are. We can also reshape the data using pivot_longer to look at the distributions of each variable within the presence and absence categories.

Code
ggplot() +
  geom_spatraster(data = elevation) +
  scale_fill_viridis_c(name = "Elevation (m)", alpha = 0.7) +
  geom_point(data = species_data, 
             aes(x = x, y = y, color = factor(presence)),
             size = 1, alpha = 0.6) +
  scale_color_manual(values = c("0" = "blue", "1" = "red"),
                     labels = c("Absent", "Present"),
                     name = "Species") +
  labs(title = "Species Distribution") +
  theme_minimal() +
  theme(legend.position = "right")

Code
# Compare environmental values between presence and absence
# Reshape data for easier plotting
plot_data <- species_data %>%
  select(presence, elevation, temperature, precipitation) %>%
  pivot_longer(cols = c(elevation, temperature, precipitation),
               names_to = "variable",
               values_to = "value") %>%
  mutate(
    presence = factor(presence, labels = c("Absent", "Present")),
    variable = factor(variable, 
                     levels = c("elevation", "temperature", "precipitation"),
                     labels = c("Elevation (m)", "Temperature (°C)", "Precipitation (mm)"))
  )

ggplot(plot_data, aes(x = presence, y = value, fill = presence)) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~variable, scales = "free_y", ncol = 3) +
  scale_fill_manual(values = c("Absent" = "blue", "Present" = "red")) +
  labs(title = "Environmental Conditions by Presence/Absence",
       x = "Species Presence",
       y = "Value") +
  theme_minimal() +
  theme(legend.position = "none")

Fit the Model

Now that we’ve got observations with predictor values and presence/absence assigned according to our known betas, we can test how well our model does using a simple call to glm. The first argument specifies the model formula and can be read as “Presence is a function of elevation + temperature + precipitation”. The second argument, data, is just the data frame where all of these variables live. The final argument, family, tells R which likelihood to use to estimate the parameters (and which link function to use). Once we’ve got the model, I just make a little summary table to compare the estimated values to our true specified values.

Code
model <- glm(presence ~ elevation + temperature + precipitation,
             data = species_data,
             family = binomial(link = "logit"))

# Model summary
summary(model)

Call:
glm(formula = presence ~ elevation + temperature + precipitation, 
    family = binomial(link = "logit"), data = species_data)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)   11.3927247 11.2600485   1.012    0.312
elevation     -0.0013694  0.0022711  -0.603    0.547
temperature   -5.5589898  4.0816069  -1.362    0.173
precipitation  0.0001829  0.0002774   0.659    0.510

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 465.27  on 499  degrees of freedom
Residual deviance: 351.08  on 496  degrees of freedom
AIC: 359.08

Number of Fisher Scoring iterations: 5
Code
# Compare estimated coefficients to true values
tibble(
  Parameter = c("Intercept", "Elevation", "Temperature", "Precipitation"),
  True_Value = c(true_beta0, true_beta1, true_beta2, true_beta3),
  Estimated = coef(model),
  Difference = Estimated - True_Value,
  Pct_Error = (Difference / True_Value) * 100
)
# A tibble: 4 × 5
  Parameter     True_Value Estimated Difference Pct_Error
  <chr>              <dbl>     <dbl>      <dbl>     <dbl>
1 Intercept        -5      11.4       16.4         -328. 
2 Elevation         0.002  -0.00137   -0.00337     -168. 
3 Temperature       0.15   -5.56      -5.71       -3806. 
4 Precipitation     0.0005  0.000183  -0.000317     -63.4
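
The estimates land well away from the true values here. One quick way to start diagnosing why (as the parting thoughts below discuss, correlated predictors make recovery hard) is to look at how correlated the predictors are with one another; this is just a sketch using base R’s cor().

Code
# How correlated are the predictors with each other?
species_data %>%
  select(elevation, temperature, precipitation) %>%
  cor() %>%
  round(2)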

Parting Thoughts

This is (obviously) a simple example, but you can use it to explore how sample size, correlation among your predictors, and correlation among your parameter estimates affect your ability to recover the true values. This is important information to have because it can help you gauge how realistic your expectations for your own data might be. If the model never returns the correct values unless you have 10x more observations, you’ll know something about your sampling needs. If correlated data always causes the model to fail, you’ll have to think about ways to eliminate that correlation (maybe a PCA?). Simulation is a valuable and (in my opinion) underutilized tool. Now you’ve got the template to begin doing it!!
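
As a starting point for that kind of exploration, here is one possible way to wrap the sampling, simulation, and fitting steps into a single function so you can rerun the whole pipeline at different sample sizes or with different betas. The function name (simulate_fit) and its structure are just a suggestion, and it assumes the elevation and predictors objects from above are still in your environment.

Code
# A sketch of a reusable simulate-and-fit pipeline (illustrative only)
simulate_fit <- function(n_points, betas = c(-5, 0.002, 0.15, 0.0005)) {
  # Sample locations, pull predictor values, and simulate presence/absence
  pts <- spatSample(elevation, size = n_points, method = "random",
                    na.rm = TRUE, xy = TRUE, values = FALSE)
  sim <- terra::extract(predictors, pts) %>%
    as_tibble() %>%
    drop_na() %>%
    mutate(
      logit_p  = betas[1] + betas[2] * elevation +
                 betas[3] * temperature + betas[4] * precipitation,
      presence = rbinom(n(), size = 1, prob = plogis(logit_p))
    )
  # Fit the logistic regression and return the estimated coefficients
  fit <- glm(presence ~ elevation + temperature + precipitation,
             data = sim, family = binomial(link = "logit"))
  coef(fit)
}

# Compare parameter recovery at two sample sizes
simulate_fit(500)
simulate_fit(5000)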