--- title: "Intro to spind" author: "Sam Levin, Gudrun Carl, Ingolf Kuehn" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Intro to spind} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- `spind` is a package dedicated to removing the spectre of spatial autocorrelation in your spatial models. It contains many of the tools you need to make predictions, assess model performance, and conduct multimodel inference for 2-D gridded data sets using methods that are robust to spatial autocorrelation. The theory underlying the use of GEEs, WRMs, and many of the other tools in this package is covered elsewhere in the literature, and for the purposes of this vignette, we assume that you have already read those papers. If you haven't, citations are included in the footnotes of this vignette as well as in the documentation of each function. We also assume that you have a working knowledge of how to use _R_. This vignette will focus on demonstrating how to utilize this package to create a robust model from spatially referenced data and subsequently assess its accuracy. Along the way, we will use a couple different data sets to examine how these functions work and investigate how one might use them to create a spatially robust model. This particular demonstration will focus on species distribution models (hereafter referred to as SDMs), but this general framework can be applied to any data set that is spatially structured (e.g. economic, sociological). ## Generalized Estimating Equations (GEEs) for species distribution modeling [^1] This package utilizes the functions already written for GEEs from the packages `gee`[^2] and `geepack`[^3] and adapts them for easy use in the context of an SDM. Let's start with a fairly simple GEE using the simulated `musdata` data set included in the package. Before we get started though, note that `GEE` requires that **predictor variables be continuous variables**. [^1]: Carl G & Kuehn I, 2007. Analyzing Spatial Autocorrelation in Species Distributions using Gaussian and Logit Models, Ecol. Model. 207, 159 - 170] [^2]: Carey, V. J., 2006. Ported to R by Thomas Lumley (versions 3.13, 4.4, version 4.13)., B. R. gee: Generalized Estimation Equation solver. R package version 4.13-11. [^3]: Yan, J., 2004. geepack: Generalized Estimating Equation Package. R package version 0.2.10. ```{r Library call, echo=FALSE} library(spind) library(ggplot2) knitr::opts_chunk$set(collapse = TRUE, comment = "#>") ``` ```{r GEE Data Infiling, eval=FALSE} data(musdata) data(carlinadata) # Examine the structure to familiarize yourself with the data ?musdata head(musdata) ?carlinadata head(carlinadata) ``` ```{r GEE Example, fig.width=7.15,fig.height=5} # Next, fit a simple GEE and view the output coords <- musdata[ ,4:5] mgee <- GEE(musculus ~ pollution + exposure, family = "poisson", data = musdata, coord = coords, corstr = "fixed", scale.fix = FALSE) summary(mgee, printAutoCorPars = TRUE) plot(mgee) predictions <- predict(mgee, newdata = musdata) # you can modify the plot itself by extracting it from the plot object and # treating it as any other ggplot object. library(ggplot2) my_plot <- mgee$plot # more of a base-R graphic feel my_plot + theme(plot.background = element_rect(fill = NA, color = 'black', size = 1.25)) ``` As you can see, this package includes methods for `plot`, `summary` and `predict`. These are useful in evaluating model fit and autocorrelation of residuals compared to a non-spatial model (in this case, a GLM with the same family as the GEE). The `plot.GEE` method generates a plot of the autocorrelation of the residuals from the fitted `GEE` and a `GLM` of the same family for comparison. Note that a QIC (Quasi-information criterion) score is reported as opposed to AIC. This is calculated based on the method described in Hardin & Hilbe[^4]^,^[^5] and is implemented using the function `qic.calc`. Also note that trying to fit GEEs with `corstr = "fixed"` to large data sets (i.e. number of observations is approximately `sqrt(.Machine$integer.max)`) will produce errors as the resulting variance-covariance matrices will be too large to be handled in _R_ (you may well run into problems before this point depending on how much RAM you have available). This is where fitting clustered models can come in handy, as they work with smaller, more manageable matrices. These can be specified by changing the `corstr` to either `"quadratic"` or `"exchangeable"`. [^4]: Hardin, J.W. & Hilbe, J.M. (2003) Generalized Estimating Equations. Chapman and Hall, New York. [^5]: Barnett et al. Methods in Ecology & Evolution 2010, 1, 15-24. ## Wavelet Revised Models (WRMs) [^6] Next, we'll examine the other main model that is introduced in this package - the Wavelet Revised Model. These are implemented using wavelet transforms from the `waveslim` package.[^7] Let's start with a fairly simple WRM using the same `musdata` data set as above. As with `GEE`, `WRM` also requires that predictor variables are continuous. [^6]: Carl, G., Kuehn, I. (2010): A wavelet-based extension of generalized linear models to remove the effect of spatial autocorrelation. Geographical Analysis 42 (3), 323 - 337 [^7]: Whitcher, B. (2005) Waveslim: basic wavelet routines for one-, two- and three-dimensional signal processing. R package version 1.5. ```{r WRM Example, fig.width = 7.15, fig.height = 5} mwrm <- WRM(musculus ~ pollution + exposure, family = "poisson", data = musdata, coord = coords, level = 1) plot(mwrm) summary(mwrm) predictions <- predict(mwrm, newdata = musdata) ``` `WRM` has many of the same features as `GEE`. The `plot.WRM` allows you to visually examine the autocorrelation of residuals from a GLM of the same error family as your WRM. Methods for `predict` and `summary` allow you to examine outputs from the model using the same code as you might use for a GLM. However, note that this reports an AIC score, rather than a QIC score as in the GEE. You can extract and modify the `ggplot` output from `WRM` as well using the `plot` slot in the `WRM` class. #### Other features specific to WRMs `WRM` has a number of other model-specific functions that you may find useful in diagnosing model fit and understanding your results. For example, you might want to plot the variance and/or covariance of each of your variables as a function of `level`. The `covar.plot` function allows you to visually examine the wavelet relationships from your model. You can select whether to plot the variance or covariance using `plot` argument which accepts `"var"` or `"covar"` as inputs. Note that we are going to switch to the `carlinadata` data set now. ```{r Covar.plot Example, fig.width = 7.15, fig.height = 5} coords <- carlinadata[ ,4:5] wave_covariance <- covar.plot(carlina.horrida ~ aridity + land.use - 1, data = carlinadata, coord = coords, wavelet = "d4", wtrafo = 'modwt', plot = 'covar') wave_variance <- covar.plot(carlina.horrida ~ aridity + land.use - 1, data = carlinadata, coord = coords, wavelet = "d4", wtrafo = 'modwt', plot = 'var') wave_variance$result wave_covariance$result # view plots side by side library(gridExtra) grid.arrange(wave_variance$plot, wave_covariance$plot) ``` You may also want to view the smooth components of your wavelets at different `scale`s. For this, we offer the `upscale` function. `upscale` allows you to visually examine your data at a number of different levels of `scale` which controls the resolution of the grid cells in your data set.[^8] It also offers the option to adjust padding settings so you can see how that influences your smooth components. The default padding level is the mean value of your input vector, but it can be easily switched using the `pad` argument, which works the same way as in the other `WRM` functions. A quick example below using `carlinadata` data set. **NOTE:** The `color.maps` argument is soft-deprecated from `spind >= v2.2`. [^8]: Carl G, Doktor D, Schweiger O, Kuehn I (2016) Assessing relative variable importance across different spatial scales: a two-dimensional wavelet analysis. Journal of Biogeography 43: 2502-2512. ```{r Upscale Example, fig.width = 7.15, fig.height = 7} upscale(carlinadata$land.use, coord = coords, pad = mean(carlinadata$land.use)) ``` ## Multi-model inference with GEEs and WRMs `spind` provides a couple of frameworks for conducting multi-model inference analyses and some helper functions to assist you when examining the results. The first that we'll examine here is the `step.spind` function, which implements step-wise model selection. The process is loosely based on `MASS::stepAIC` and `stats::step`, but is specific to classes `GEE` and `WRM`. For GEEs, `step.spind` uses models with the lowest QIC scores to determine what the next step will be. For WRMs, you have the option of using AIC or AICc (AIC corrected for small sample sizes) using the logical `AICc` argument. Currently, the function only supports backwards model selection. In other words, you have to start with all of the variables in your model formula and remove them in a stepwise fashion. We hope to add forward model selection methods shortly. Additionally, `step.spind` is written to always respect the hierarchy of variables in the model and currently the user cannot directly override this. For example, `step.spind` would not remove `race` while retaining `I(race^2)`. We may change that in the future, but it will remain like this at least until the next major release. Currently, it recognizes polynomial variables by matching variable names located inside of `I(var^some_power)` and interaction terms by searching for `var1:var2` in the model terms. If you want to use a higher order polynomial variable and are not worried about the variable hierarchy, you can create a separate variable (i.e. `race_2`) and use that in the model. We'll go through an example of `step.spind` using a GEE on the `birthwt` data set in the `MASS` package below. The data in `birthwt` aren't at all related to SDMs and are not spatially structured, but we hope that in using this data set, we will demonstrate how this function can work with many types of data sets. ```{r Step.spind Example} # For demonstration only. We are artificially imposing a grid structure # on data that is not actually spatial data library(MASS) data(birthwt) x <- rep(1:14, 14) y <- as.integer(gl(14, 14)) coords <- cbind(x[-(190:196)], y[-(190:196)]) formula <- formula(low ~ age + lwt + race + smoke + ftv + bwt + I(race^2)) mgee <- GEE(formula, family = "gaussian", data = birthwt, coord = coords, corstr = "fixed",scale.fix = TRUE) mwrm <- WRM(formula, family = "gaussian", data = birthwt, coord = coords, level = 1) ssgee <- step.spind(mgee, birthwt) sswrm <- step.spind(mwrm, birthwt, AICc = TRUE) best.mgee <- GEE(ssgee$model, family = "gaussian", data = birthwt, coord = coords, corstr = "fixed",scale.fix = TRUE) best.wrm <- WRM(sswrm$model, family = "gaussian", data = birthwt, coord = coords, level = 1) summary(best.mgee, printAutoCorPars = FALSE) summary(best.wrm) ``` Additionally, we offer multimodel inference tools for GEEs and WRMs which are loosely based on the `MuMIn` package. These are implemented in `mmiWMRR` and `mmiGEE`. They enable you to examine the effect that the grid resolution and variable selection have on the resulting regressions, and then select the appropriate model for subsequent analyses. Note that `mmiWMRR` has two more arguments than `mmiGEE` that must be specified. ```{r mmi... example} # Example for WRMs data(carlinadata) coords <- carlinadata[ ,4:5] wrm <- WRM(carlina.horrida ~ aridity + land.use, family = "poisson", data = carlinadata, coord = coords, level = 1, wavelet = "d4") ms1 <- scaleWMRR(carlina.horrida ~ aridity + land.use, family = "poisson", data = carlinadata, coord = coords, scale = 1, wavelet = 'd4', trace = FALSE) mmi <- mmiWMRR(wrm, data = carlinadata, scale = 1, detail = TRUE) # Example for GEEs library(MASS) data(birthwt) # impose an artificial (not fully appropriate) grid structure x <- rep(1:14, 14) y <- as.integer(gl(14, 14)) coords <- cbind(x[-(190:196)], y[-(190:196)]) formula <- formula(low ~ race + smoke + bwt) mgee <- GEE(formula, family = "gaussian", data = birthwt, coord = coords, corstr = "fixed", scale.fix = TRUE) mmi <- mmiGEE(mgee, birthwt) ``` Finally, we offer one further model selection procedure specific to WRMs. `rvi.plot` uses `mmiWMRR` and creates a plot of the relative importance of each explanatory variable as a function of the resolution of the grid (in other words, as a function of the `scale` argument in `mmiWMRR`). It will also print the resulting model selection tables to the console. ```{r RVI.plot Example, fig.width=7.15, fig.height=5} data(carlinadata) coords <- carlinadata[ ,4:5] rvi <- rvi.plot(carlina.horrida ~ aridity + land.use, family = "poisson", data = carlinadata, coord = coords, maxlevel = 4, detail = TRUE, wavelet = "d4") rvi$rvi rvi$plot ``` ## Goodness of fit and model performance You may also find that a model not implemented by this package works best for your data. We've implemented some spatially corrected accuracy measures that you can use to assess goodness of model fit. The first two of these are categorized according to whether or not their outputs are dependent on the chosen threshold and first appeared in `spind v1.0` [^9]. `th.dep` (threshold dependent) and `th.indep` (threshold independent) are designed to work on any number of model types, all you need is a set of actual values, predictions, and their associated coordinates. We'll use the `hook` data set to see how these work. [^9]: Carl G, Kuehn I (2017) Spind: a package for computing spatially corrected accuracy measures. Ecography 40: 675-682. doi: 10.1111/ecog.02593 ```{r GOF data, eval = FALSE} data(hook) # Familiarize yourself with the data ?hook head(hook) ``` ```{r Spatial Indices Example, fig.width = 7.15, fig.height = 5} df <- hook[ ,1:2] coords <- hook[ ,3:4] # Threshold dependent metrics th.dep.indices <- th.dep(data = df, coord = coords, spatial = TRUE) # Confusion Matrix th.dep.indices$cm # Kappa statistic th.dep.indices$kappa # Threshold independent metrics th.indep.indices <- th.indep(data = df, coord = coords, spatial = TRUE) # AUC th.indep.indices$AUC # TSS th.indep.indices$TSS # AUC plot th.indep.indices$plot ``` Finally, many analyses require a calculation of spatial autocorrelation. To that end, we include the function `acfft` (AutoCorrelation Fast Fourier Transform) to calculate spatial autocorrelation using Moran's I statistic. While many other packages include functions to perform this analysis, ours provides improved efficiency by harnessing the power of fast Fourier transforms which reduce the time needed to compute the statistic. A quick example below using a GLM and the `musdata` data set. ```{r ACFFT example} coords <- musdata[ ,4:5] mglm <- glm(musculus ~ pollution + exposure, family = "poisson", data = musdata) ac <- acfft(coords, resid(mglm, type = "pearson"), lim1 = 0, lim2 = 1, dmax = 10) ac ``` Note that you can adjust the number of distance bins to examine in `acfft` using the `dmax` argument. The default is 10. Finally, you can adjust the size of the distance bins by adjusting the values of `lim1` and `lim2` to meet your needs. ## Wrapping up Hopefully, you are now ready to utilize GEEs and WRMs to conquer the world of spatial modeling. However, if this vignette has not served its purpose and you still have questions about how to use these tools, please let us know. Of course, no package is complete without bugs and we are always trying to improve our code. If you find any bugs that need squashing, have suggestions for additional functionality or improvements to existing functionality (or this vignette), please don't hesitate to contact us[^10]. [^10]: Contact email - or visit the Github repo and create an issue at .