Spatial Predictions and Model Assessment

Code
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(terra)
terra 1.8.70

Attaching package: 'terra'

The following object is masked from 'package:tidyr':

    extract
Code
library(geodata)
library(tidyterra)

Attaching package: 'tidyterra'

The following object is masked from 'package:stats':

    filter
Code
library(caret)
Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift

Getting Started

This exercise is going to rely on simulated data, so you don’t need to join a repository. You’ll just need your credentials to log in to a new RStudio session. We’re going to keep using a simulation so that you know what the true values are and can explore some of the different ways of assessing model fit.

Get the Data

We’ll use the same approach we used before to download data for Costa Rica using the geodata package. This should (hopefully) look familiar to you.

Code
set.seed(123)
n_points <- 1000

# Download elevation and bioclimatic layers for Costa Rica
elevation <- geodata::elevation_30s(country = "CRI", 
                                    path = tempdir())
climate <- geodata::worldclim_country(country = "CRI", 
                                      var = "bio", 
                                      res = 0.5, 
                                      path = tempdir())
names(climate) <- str_extract(names(climate), "bio_\\d+")
names(elevation) <- "elev"

climate_df <- as.data.frame(climate, na.rm = TRUE)

# Pick 8 climate layers at random and align them to the elevation grid
climate_subset <- climate[[sample(1:nlyr(climate), size = 8)]]
climate_aligned <- resample(climate_subset, elevation)

# Stack the predictors
predictors <- c(elevation, climate_aligned)

plot(predictors)

Generate a sample

Next we generate a sample of observations and build our data frame (using spatSample and extract). You’ve done this before too.

Code
# Sample random locations from the study area
sample_points <- spatSample(elevation, size = n_points, 
                           method = "random", na.rm = TRUE, 
                           xy = TRUE, values = FALSE)

# Extract environmental values at these points
env_values <- terra::extract(predictors, sample_points)

species_data <- tibble(
  x = sample_points[,1],
  y = sample_points[,2]) %>% 
  bind_cols(env_values) %>% 
  drop_na()  # Remove any NA values

species_data_scl <- species_data %>% 
  mutate(across(colnames(env_values), ~ as.numeric(scale(.x))))

Specify predictors

Here we’re going to simplify things a little from before. Rather than allowing for correlations between predictors, we’ll just draw a handful of \(\beta\)s from a uniform distribution. Things are already tough enough for the logistic regression given that so many of these variables are likely to be correlated (we’ll return to that shortly).

Code
K <- nlyr(predictors)


known_betas <- runif(n = K, min = -3, max = 3)

Simulate Outcomes

Now we simulate outcomes and check to make sure we don’t end up with too many 1’s or 0’s. We use plogis as a means of converting the linear predictor to the probability scale. If you wanted to simulate different kinds of outcomes, you’d need to think through the appropriate link function.
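
As a quick aside, here’s a minimal sketch (not part of the exercise pipeline) showing that plogis is just the inverse-logit, plus a hypothetical example of how a different link would work for a count outcome:

Code
# plogis() is the inverse-logit: it maps any linear predictor onto (0, 1)
plogis(0)                               # 0.5
all.equal(plogis(2), 1 / (1 + exp(-2))) # TRUE

# Hypothetical (not used here): for a count outcome you'd use the log link,
# i.e. exp() on the linear predictor and rpois() for the random draw:
# lambda <- exp(0.5 + linpred)
# counts <- rpois(length(lambda), lambda)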

Code
species_data_preds <- species_data_scl %>%
  mutate(
    across(
      all_of(colnames(env_values)),
      # multiply each (scaled) predictor by its matching "true" coefficient
      ~ .x * known_betas[which(colnames(env_values) == cur_column())]
    )
  ) %>% 
  mutate(linpred = rowSums(across(all_of(colnames(env_values)))),
         prob = plogis(1 + linpred),  # note: the simulated intercept is 1
         presence = rbinom(n(), size = 1, prob = prob)) 


# Check the prevalence
species_data_preds %>%
  summarize(
    prevalence = mean(presence),
    n_present = sum(presence),
    n_absent = sum(presence == 0)
  )  
# A tibble: 1 × 3
  prevalence n_present n_absent
       <dbl>     <int>    <int>
1      0.525       525      475

Fit Logistic Regression

We use the vars and form objects to keep this as flexible as possible (in case you end up with different predictors). It will also help us later when we fit competing models against our initial “saturated” model.

Code
vars <- species_data_preds %>%
  select(colnames(env_values)) %>%
  names()

form <- reformulate(vars, response = "presence")
form
presence ~ elev + bio_15 + bio_14 + bio_3 + bio_10 + bio_2 + 
    bio_6 + bio_11 + bio_5
Code
model <- glm(form ,
             data = species_data_preds,
             family = binomial(link = "logit"))

# Model summary
summary(model)

Call:
glm(formula = form, family = binomial(link = "logit"), data = species_data_preds)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.0729     0.2243   4.783 1.73e-06 ***
elev          2.4877     1.9792   1.257 0.208790    
bio_15        0.7154     0.1896   3.773 0.000162 ***
bio_14       11.5890     6.9602   1.665 0.095902 .  
bio_3         0.6806     1.0870   0.626 0.531235    
bio_10       -4.8985     4.5918  -1.067 0.286064    
bio_2         1.5170     3.7487   0.405 0.685713    
bio_6        -0.5996    13.1232  -0.046 0.963557    
bio_11       -5.4615     7.8298  -0.698 0.485477    
bio_5         0.6402    12.1687   0.053 0.958044    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1383.79  on 999  degrees of freedom
Residual deviance:  252.21  on 990  degrees of freedom
AIC: 272.21

Number of Fisher Scoring iterations: 8
Code
# Compare estimated coefficients to true values
tibble(
  Parameter = c("Intercept", colnames(env_values)),
  True_Value = c(0, known_betas),
  Estimated = coef(model),
  Difference = Estimated - True_Value,
  Pct_Error = (Difference / True_Value) * 100
)
# A tibble: 10 × 5
   Parameter True_Value Estimated Difference Pct_Error
   <chr>          <dbl>     <dbl>      <dbl>     <dbl>
 1 Intercept     0          1.07        1.07     Inf  
 2 elev         -1.10       2.49        3.58    -327. 
 3 bio_15        2.69       0.715      -1.97     -73.4
 4 bio_14       -0.0585    11.6        11.6   -19922. 
 5 bio_3         2.84       0.681      -2.16     -76.0
 6 bio_10        2.08      -4.90       -6.97    -336. 
 7 bio_2         2.88       1.52       -1.36     -47.3
 8 bio_6        -2.32      -0.600       1.72     -74.1
 9 bio_11       -1.12      -5.46       -4.35     389. 
10 bio_5         2.67       0.640      -2.03     -76.0

You can see from the results that we didn’t do a particularly good job of recovering the true values. Maybe more importantly, it would appear that the model didn’t converge particularly well (given the size of several of the coefficients). Remember that for logistic regression, values outside roughly \([-5, 5]\) on the log-odds scale translate to probabilities that are essentially 0 or 1, respectively.
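
If you want to convince yourself of that rule of thumb, a one-line check is enough: log-odds of -5 and 5 already correspond to probabilities of roughly 0.007 and 0.993.

Code
# log-odds of +/- 5 are already essentially probabilities of 0 and 1
plogis(c(-5, 5))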

Cohen’s Likelihood Ratio

While we can see that there are some pathologies with our model, the goal here is to decide how well this model fits the data. Because this is a logistic regression, we can’t use the standard \(R^2\) statistic. Instead we’ll look at two different Pseudo-\(R^2\) values.

To calculate Cohen’s Likelihood Ratio, we can just use elements provided with the model object, namely the null.deviance and the deviance. Remember that a lower residual deviance is better (i.e., the model’s deviance approaches that of a hypothetical perfect model), so values of this ratio closer to 1 indicate a better fit. Here, we see that the value is well above 0 but less than 1. It’s hard to know on its own whether this is good (or suspiciously high), so we’ll also take a look at the Cox and Snell.
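
Written out, the statistic computed below is just the proportional reduction in deviance relative to the null (intercept-only) deviance:

\[
R^2_{\text{LR}} = \frac{D_{\text{null}} - D_{\text{model}}}{D_{\text{null}}}
\]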

Code
(model$null.deviance - model$deviance)/model$null.deviance
[1] 0.8177366

Cox and Snell

To calculate the Cox and Snell estimate, we need another model. In this case, instead of a hypothetically perfect model, we want a null model. This model tells us what we can learn without including any predictors. If our predictors carry information, then the logLik of our model should be higher than that of the null. If you’re uncertain about this, experiment with running exp() on both positive and negative numbers: the more negative the argument, the smaller the value exp() returns and the larger the result of the equation (because we’re subtracting it from 1). Here, we can see that the model isn’t terrible, but it’s not particularly good either (because we’re a ways away from 1).
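
In symbols, the quantity computed below is (with \(\ell\) the log-likelihood, \(L\) the likelihood, and \(n\) the number of observations):

\[
R^2_{\text{CS}} = 1 - \exp\!\left(\frac{2\,(\ell_{\text{null}} - \ell_{\text{model}})}{n}\right) = 1 - \left(\frac{L_{\text{null}}}{L_{\text{model}}}\right)^{2/n}
\]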

Code
model_null <-  glm(presence ~ 1, 
                     family=binomial(link="logit"),
                     data=species_data_preds)

1 - exp(2*(logLik(model_null)[1] - logLik(model)[1])/nobs(model))
[1] 0.6774762

…But There’s a Problem

Our assessments of fit suggest the model isn’t great, but it’s not terrible either. That said, we know (from the magnitude of some of the effects and their comparison to the true values) that the model doesn’t seem to have converged particularly well. This is probably because there’s a lot of underlying correlation between the values in our predictors. We can verify that quickly using cor.

Code
cor(env_values)
              elev     bio_15       bio_14      bio_3       bio_10        bio_2
elev    1.00000000 -0.1201272  0.027257116  0.6529647 -0.992729314  0.059331843
bio_15 -0.12012717  1.0000000 -0.890350092 -0.3843959  0.204882300  0.614834848
bio_14  0.02725712 -0.8903501  1.000000000  0.3041069 -0.099955394 -0.589917227
bio_3   0.65296468 -0.3843959  0.304106944  1.0000000 -0.685935715  0.145746938
bio_10 -0.99272931  0.2048823 -0.099955394 -0.6859357  1.000000000  0.008952614
bio_2   0.05933184  0.6148348 -0.589917227  0.1457469  0.008952614  1.000000000
bio_6  -0.99103441  0.1022025 -0.003175772 -0.6805260  0.987166682 -0.145459137
bio_11 -0.99391043  0.1946638 -0.093459285 -0.6728810  0.999432952  0.002568271
bio_5  -0.95985646  0.3384505 -0.229050227 -0.6787117  0.982646015  0.186577246
              bio_6       bio_11      bio_5
elev   -0.991034411 -0.993910426 -0.9598565
bio_15  0.102202506  0.194663770  0.3384505
bio_14 -0.003175772 -0.093459285 -0.2290502
bio_3  -0.680526032 -0.672880982 -0.6787117
bio_10  0.987166682  0.999432952  0.9826460
bio_2  -0.145459137  0.002568271  0.1865772
bio_6   1.000000000  0.988496605  0.9421546
bio_11  0.988496605  1.000000000  0.9803267
bio_5   0.942154645  0.980326739  1.0000000

Removing correlated variables

The caret package gives us an easy way to identify (and potentially get rid of) correlated variables using the findCorrelation function. You’ll see that we’ve got at least 5 variables that could be dropped if we don’t want correlations to create challenges for our model. We subset our vector of predictor names (colnames(env_values)) to remove the variables in our high_corr vector (using subset with !). Then we can fit this new, reduced model.

Code
cormat <- cor(species_data_preds[,colnames(env_values)], use = "pairwise.complete.obs")
high_corr <- findCorrelation(cormat, cutoff = 0.7, names=TRUE)

sub_vars <- subset(colnames(env_values), !(colnames(env_values) %in% high_corr))

vars_sub <- species_data_preds %>%
  select(all_of(sub_vars)) %>%
  names()

form_sub <- reformulate(vars_sub, response = "presence")
form_sub
presence ~ elev + bio_14 + bio_3 + bio_2
Code
model_sub <- glm(form_sub ,
             data = species_data_preds,
             family = binomial(link = "logit"))

# Model summary
summary(model_sub)

Call:
glm(formula = form_sub, family = binomial(link = "logit"), data = species_data_preds)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.9049     0.2060   4.394 1.12e-05 ***
elev          1.2245     0.2228   5.495 3.91e-08 ***
bio_14       33.0633     4.9865   6.631 3.34e-11 ***
bio_3         0.4280     0.1088   3.935 8.31e-05 ***
bio_2         1.6344     0.1645   9.936  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1383.79  on 999  degrees of freedom
Residual deviance:  274.03  on 995  degrees of freedom
AIC: 284.03

Number of Fisher Scoring iterations: 8
Code
# Compare estimated coefficients to true values
results_sub <- tibble(
  Parameter = c("(Intercept)", colnames(env_values)),
  True_Value = c(0, known_betas)) 
results_sub$Estimated <- coef(model_sub)[results_sub$Parameter]  

results_sub %>% 
  mutate(
  Difference = Estimated - True_Value,
  Pct_Error = (Difference / True_Value) * 100
)
# A tibble: 10 × 5
   Parameter   True_Value Estimated Difference Pct_Error
   <chr>            <dbl>     <dbl>      <dbl>     <dbl>
 1 (Intercept)     0          0.905      0.905     Inf  
 2 elev           -1.10       1.22       2.32     -212. 
 3 bio_15          2.69      NA         NA          NA  
 4 bio_14         -0.0585    33.1       33.1    -56652. 
 5 bio_3           2.84       0.428     -2.41      -84.9
 6 bio_10          2.08      NA         NA          NA  
 7 bio_2           2.88       1.63      -1.24      -43.2
 8 bio_6          -2.32      NA         NA          NA  
 9 bio_11         -1.12      NA         NA          NA  
10 bio_5           2.67      NA         NA          NA  

Assessing Fit

Now, we know this model is not properly specified (because our simulation used all the variables), but maybe the fit statistics will help us figure out which model we might want to use (if any). We’ll calculate the Likelihood Ratio and the Cox and Snell statistic.

Code
(model_sub$null.deviance - model_sub$deviance)/model_sub$null.deviance
[1] 0.8019731
Code
1 - exp(2*(logLik(model_null)[1] - logLik(model_sub)[1])/nobs(model_sub))
[1] 0.6703636

Unfortunately, this doesn’t settle much: both the Likelihood Ratio and the Cox and Snell values are slightly lower than for the full model, which suggests a (marginally) worse in-sample fit for the subset model.

Comparing Our Models

We can tell from this quick run-through that none of the models fits the data particularly well. But maybe we’d like to know which one does the best job. We can do that using the AIC function and providing all of the models we’d like to compare.

Code
AIC(model, model_sub, model_null)
           df       AIC
model      10  272.2149
model_sub   5  284.0284
model_null  1 1385.7933

Based on this, it would appear that the full model has the best theoretical predictive qualities. That is, the log-likelihood for the full model is substantially larger than that of the other two models, and it’s large enough to overcome the penalty for having twice as many predictors as the subset model. That said, the full model didn’t converge cleanly, so we should interpret these results cautiously.
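
If you want to see where those numbers come from, AIC is just \(2k - 2\ell\), where \(k\) is the number of estimated parameters and \(\ell\) is the maximized log-likelihood. A quick sketch (reusing the objects above) for the full model:

Code
# AIC by hand for the full model: 2k - 2 * logLik, with k = number of coefficients
k <- length(coef(model))
-2 * logLik(model)[1] + 2 * k
# ...which should match
AIC(model)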

Compare Model Implications

Our various tests have been pretty ambiguous - there’s no clear “best choice” among the models. One last check we might do is to examine the implications of the predictions from each model (as this is where instability becomes most obvious). We can do that easily using the predict method included with terra.

Code
scale_stack <- scale(predictors)
pred_model <- terra::predict(
  object = scale_stack,
  model = model,
  type = "response"
)

pred_sub <- terra::predict(
  object = scale_stack,
  model = model_sub,
  type = "response"
)

pred_null <- terra::predict(
  object = scale_stack,
  model = model_null,
  type = "response"
)

pred_stacks <- c(pred_model, pred_sub, pred_null)

names(pred_stacks) <- c("Full", "Subset", "Null")

ggplot() +
  geom_spatraster(data = pred_stacks) +
  facet_wrap(~lyr) +
  scale_fill_viridis_c(na.value = "transparent") +
  labs(title = "Faceted SpatRaster Layers", fill = "Value") +
  theme_bw()

The plots help clarify two important things: 1) the predictions from the intercept-only model are (as expected) constant and relatively high (given that there’s no extra information in the model), and 2) the saturated model imposes a near-complete separation between presences and absences, which would imply we’ve perfectly recovered the data-generating process (unlikely). The results from the subset model seem the most plausible, despite some of the diagnostics suggesting it was worse than the full model.
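
One rough way to quantify that impression (a sketch, not part of the original workflow; the 0.05/0.95 cutoffs are arbitrary) is to ask what share of cells each surface pushes toward a probability near 0 or 1:

Code
# Share of cells with predicted probabilities near 0 or 1, by model
pred_vals <- terra::as.data.frame(pred_stacks, na.rm = TRUE)
sapply(pred_vals, function(p) mean(p < 0.05 | p > 0.95))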

Takeaways

The key things I want you to remember are: 1) single tests or diagnostics are (almost) never enough; 2) each step of this process is (differently) sensitive to violations of assumptions; and 3) fully interrogating a model is an iterative process.