## R Tools

Below are some useful R scripts I’ve written for handling biodiversity data and species distribution/ecological niche modeling. If you use any of them, all I ask is you please let me know. You can install two packages which contain most of the functions below (plus many more) directly from GitHub! Just follow these directions:

    install.packages('devtools') # if you haven't done this already
    library(devtools)
    install_github('adamlilith/omnibus')
    install_github('adamlilith/legendary')
    install_github('adamlilith/enmSdm')
    # you may also need to install the "sp", "dismo", "raster", and other
    # packages (all from CRAN), depending on the functions you wish to use
    library(omnibus)
    library(legendary)
    library(enmSdm)

Below are some of the functions in these packages (plus a few more).

*art*: Aligned rank transformation for non-parametric ANOVA [package omnibus]

To date, you cannot perform a non-parametric ANOVA on two or more factors unless one of the factors is random, and even that only works for two factors (if you have three, you’re out of luck). Alternatively, non-normal data can be transformed to meet normality assumptions and analyzed in a standard ANOVA framework. This script implements the aligned rank transform for up to 4 factors. You can also download the stand-alone program ARTool by Jacob Wobbrock et al. See their short article for a succinct explanation. Note: By default the script uses the mean as the measure of centrality. Other metrics can be used, but the mean is likely the best (Peterson 2002).
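To give a flavor of what the transform does, here is a bare-bones sketch in base R for two crossed factors in a balanced design. It is not the code in *art*, and the toy data and all object names are made up for illustration. The idea is to strip out every effect except the one of interest, rank what is left, and run an ordinary ANOVA on the ranks.

```r
# Toy data: two crossed factors with a skewed (exponential) response.
set.seed(1)
d <- expand.grid(A = c("a1", "a2"), B = c("b1", "b2", "b3"), rep = 1:5)
d$y <- rexp(nrow(d)) + ifelse(d$A == "a2", 1, 0)

# Means at each level of the design (ave() defaults to the mean).
grand    <- mean(d$y)
meanA    <- ave(d$y, d$A)
meanB    <- ave(d$y, d$B)
cellMean <- ave(d$y, d$A, d$B)

# Step 1: residuals after removing all estimated effects.
res <- d$y - cellMean

# Step 2: estimated effect for each term of interest.
effA  <- meanA - grand
effB  <- meanB - grand
effAB <- cellMean - meanA - meanB + grand

# Step 3: align (residual + effect of interest), then rank.
alignedA <- res + effA
d$rankA  <- rank(alignedA)
d$rankB  <- rank(res + effB)
d$rankAB <- rank(res + effAB)

# Step 4: full-factorial ANOVA on each set of ranks; read off only the
# row for the term the data were aligned for.
anova(lm(rankA ~ A * B, data = d))["A", ]
anova(lm(rankB ~ A * B, data = d))["B", ]
anova(lm(rankAB ~ A * B, data = d))["A:B", ]
```

Each response gets its own alignment and its own ANOVA, so a two-factor design needs three model fits rather than one.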

*trainMaxEnt*: Calibrate Maxent regularization parameter using AICc (uses Maxent 3.3.3k) [package enmSdm]

*trainMaxNet*: Calibrate Maxent regularization parameter using AICc (uses Maxent 3.4.1+ — also known as “Maxnet”) [package enmSdm]

These functions follow Warren & Seifert’s algorithm for calibrating the master regularization parameter (beta) using AICc. They also test all possible combinations of feature classes.

Special note: These scripts do not use rasters to do the AICc calculation, unlike what is proposed in Warren & Seifert. Using the rasters takes a *long* time and, if you’re using non-random background sites anyway, is inappropriate. This important tidbit was gleaned from a personal communication from Dan Warren cited in Wright, A.N., Hijmans, R.J., Schwartz, M.W., and Shaffer, H.B. 2015. Multiple sources of uncertainty affect metrics for ranking conservation risk under climate change. *Diversity and Distributions* 21:111-122.
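For the curious, a site-based AICc calculation might look roughly like the sketch below. This is my reading of the general approach, not the code inside *trainMaxEnt* or *trainMaxNet*. The function name, its arguments, and the rule for returning `NA` when there are too many parameters are all assumptions made for illustration. Raw predictions are rescaled to sum to 1 across presence and background sites (instead of across every raster cell), and the log-likelihood is summed over presences.

```r
# Hypothetical helper: AICc from model predictions at sites (no rasters).
#   predPres: raw predictions at presence sites
#   predBg:   raw predictions at background sites
#   k:        number of model parameters (e.g., non-zero coefficients)
aiccSketch <- function(predPres, predBg, k) {
  n <- length(predPres)
  # The small-sample correction is undefined when k >= n - 1.
  if (k >= n - 1) return(NA_real_)
  # Rescale so predictions sum to 1 across all sites.
  total <- sum(predPres, predBg)
  logLik <- sum(log(predPres / total))
  -2 * logLik + 2 * k + (2 * k * (k + 1)) / (n - k - 1)
}

# Usage with made-up predictions: lower AICc is better.
set.seed(1)
aiccSketch(predPres = runif(30, 0.4, 1), predBg = runif(1000), k = 5)
```

In practice you would compute this for each candidate value of beta and each feature-class combination, then keep the model with the lowest AICc.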

*contBoyce*: Calculate the Continuous Boyce Index [package enmSdm]

The CBI is a measure of model accuracy like AUC, but specifically designed for cases where one has no true absences. See Boyce et al. (2002) for the Boyce Index and Hirzel et al. (2006) for the continuous version, which this function calculates.
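Here is a stripped-down sketch of the idea in base R. It is not the *contBoyce* code, and the function name, argument names, and defaults are assumptions for illustration only. A window slides across the range of predicted values; in each window the proportion of presences (P) is divided by the proportion of background sites (E); and the index is the Spearman rank correlation between the P/E ratio and the window midpoint.

```r
# Hypothetical helper: continuous Boyce index from predictions at
# presence sites (pres) and background sites (bg).
cbiSketch <- function(pres, bg, numBins = 101, binWidth = 0.1) {
  lo <- min(pres, bg)
  hi <- max(pres, bg)
  width <- binWidth * (hi - lo)
  # Left edges of overlapping windows spanning the prediction range.
  starts <- seq(lo, hi - width, length.out = numBins)
  pe <- vapply(starts, function(s) {
    P <- mean(pres >= s & pres <= s + width)
    E <- mean(bg >= s & bg <= s + width)
    if (E == 0) NA_real_ else P / E
  }, numeric(1))
  mids <- starts + width / 2
  keep <- !is.na(pe)
  cor(mids[keep], pe[keep], method = "spearman")
}

# Usage with made-up predictions: presences skewed toward high values.
set.seed(1)
presPred <- rbeta(500, 4, 1)
bgPred <- runif(5000)
cbiSketch(presPred, bgPred)
```

Values near 1 mean presences are increasingly over-represented as predictions rise, values near 0 mean the model is no better than random, and negative values mean the model has things backwards.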

Takes a network object created in the *network* package and returns a network with all connected vertices removed (i.e. just those nodes with no edges). If the network is fully saturated and allows loops, returns an empty network object. Otherwise, if it is fully saturated, the function returns a network with a single vertex. Tip: Sometimes coercing a set of geographic points into a network (with edges defined by some minimum distance) then applying a function is faster than the *geoThin* function.

*elimCellDups*: Eliminate duplicate points in a raster cell [package enmSdm]

Takes a data frame with records that have coordinates, overlays it with a raster, and returns a data frame with just one record per cell.
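The logic is simple enough to sketch without any spatial packages. The version below is not *elimCellDups* itself: it assumes a regular grid defined by a lower-left corner and a cell size, and the function and argument names are invented for illustration. It works out which cell each record falls in and keeps the first record per cell.

```r
# Hypothetical helper: keep one record per grid cell.
#   xmin, ymin: lower-left corner of the grid
#   res:        cell size (same units as the coordinates)
onePerCell <- function(df, xmin, ymin, res, lon = "lon", lat = "lat") {
  col <- floor((df[[lon]] - xmin) / res)
  row <- floor((df[[lat]] - ymin) / res)
  # duplicated() flags the second and later records in each cell.
  df[!duplicated(data.frame(col, row)), , drop = FALSE]
}

# Usage: the first two records share a cell, so only one survives.
recs <- data.frame(
  species = c("sp1", "sp1", "sp1"),
  lon = c(0.2, 0.7, 1.4),
  lat = c(0.3, 0.9, 0.1)
)
thinned <- onePerCell(recs, xmin = 0, ymin = 0, res = 1)
thinned
```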

*geoFold*: Assign “geographic k-folds” to sites [package enmSdm]

Divides sites into *k* groups such that there is as little spatial overlap between each group as possible.
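One way to get spatially separated groups, sketched in base R, is to cluster sites on their pairwise distances and cut the tree into *k* groups. This is an assumption about how one might do it, not a description of what *geoFold* does internally, and the function name is made up. It also uses plain Euclidean distance on the coordinates, which is only reasonable over small areas or for projected coordinates.

```r
# Hypothetical helper: assign sites to k spatial groups by clustering.
foldSketch <- function(xy, k) {
  tree <- hclust(dist(xy), method = "complete")
  cutree(tree, k = k)
}

# Usage: two well-separated clumps of sites end up in different folds.
set.seed(1)
sites <- rbind(
  cbind(rnorm(10, 0, 0.1), rnorm(10, 0, 0.1)),
  cbind(rnorm(10, 5, 0.1), rnorm(10, 5, 0.1))
)
folds <- foldSketch(sites, k = 2)
table(folds)
```

Note that clustering like this does nothing to balance the number of sites per fold, which you may care about for cross-validation.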

*geoThin*: Thin geographic points so that none are within a given distance of one another [package enmSdm]

Thins geographic points such that none are within a user-defined distance of one another. Points with the greatest number of neighbors are removed first. If ties exist among these, the point closest to the geographic center of all points is removed first. See also *geoThinApprox()* below for a faster but random version.

*geoThinApprox*: Thin geographic points so that none are within a given distance of one another [package enmSdm]

Thins geographic points such that none are within a user-defined distance of one another. Points with the greatest number of neighbors are removed first. If ties exist among these, a point is removed at random. See also *geoThin()* above for a slower but deterministic version.
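The random-tie-break flavor of thinning can be sketched in a few lines of base R. This is not the code in *geoThinApprox*: the function name is made up, and it uses Euclidean distance on the raw coordinates and recomputes the full distance matrix on every pass, so treat it as a picture of the logic only.

```r
# Hypothetical helper: drop points until none are within minDist
# of one another.
thinSketch <- function(xy, minDist) {
  keep <- seq_len(nrow(xy))
  repeat {
    d <- as.matrix(dist(xy[keep, , drop = FALSE]))
    diag(d) <- Inf
    # Number of neighbors closer than minDist for each remaining point.
    nNeigh <- rowSums(d < minDist)
    if (all(nNeigh == 0)) break
    # Remove the point with the most neighbors; break ties at random.
    worst <- which(nNeigh == max(nNeigh))
    toss <- if (length(worst) > 1) sample(worst, 1) else worst
    keep <- keep[-toss]
  }
  xy[keep, , drop = FALSE]
}

# Usage: the first two points are 0.1 apart, so one of them is dropped.
set.seed(1)
pts <- rbind(c(0, 0), c(0.1, 0), c(5, 5))
thinnedPts <- thinSketch(pts, minDist = 1)
thinnedPts
```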

**trainBrt, trainCrf, trainGam, trainGlm, trainLars, trainMaxEnt, trainMaxNet, trainNs, trainRf: Calibrate boosted regression trees, conditional random forests, generalized additive models, generalized linear models, least angle regression (with interactions and higher-order terms), Maxent (older and newer versions), natural splines, and random forests [in package enmSdm]**

These functions are wrappers for model-specific functions like *glm()* or *maxent()* that implement “best-practices” calibration, depending on the algorithm (e.g., AICc-based model selection for GAM, GLM, and Maxent, deviance reduction for BRTs, and so on).

*yearFromDate*: Returns a year from a messy date [package omnibus]

Have you ever had a list of dates all in different formats like “2012-01-29”, “Nov 23, 1973”, “12 Nov 18”, and so on? This script takes that list of dates and returns the year in which each occurred. When millennium and century cannot be inferred, a dummy value of “99” is prepended to the output (e.g., “71” becomes “9971”).
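A much-reduced sketch of the idea, in base R, is below. It is not *yearFromDate*: the function name is invented, and it handles only two cases, namely a four-digit year appearing anywhere in the string and a bare two-digit year. Anything else, including ambiguous dates like “12 Nov 18”, comes back as `NA`.

```r
# Hypothetical helper: pull a year out of messy date strings.
yearSketch <- function(x) {
  out <- rep(NA_integer_, length(x))
  # Case 1: a four-digit year starting with 1 or 2.
  m4 <- regexpr("[12][0-9]{3}", x)
  has4 <- m4 > 0
  out[has4] <- as.integer(regmatches(x, m4))
  # Case 2: a bare two-digit year; prepend the dummy century "99".
  two <- !has4 & grepl("^[0-9]{2}$", x)
  out[two] <- as.integer(paste0("99", x[two]))
  out
}

yearSketch(c("2012-01-29", "Nov 23, 1973", "71"))
# returns 2012 1973 9971
```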

*replaceDiacritics*: Remove diacritics

Attempts to replace diacritically marked characters with unmarked characters (e.g., à, á, â, and ã all simply become “a”). Note that this won’t actually replace all characters, but it tries! Useful for matching names between different sources, some with and some without diacritics.
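The simplest way to do this kind of thing in base R is a character-for-character lookup with `chartr()`, sketched below. This is not the *replaceDiacritics* code; the function name and the (very short) lookup table are made up for illustration, and it assumes a UTF-8 session.

```r
# Hypothetical helper: swap a handful of marked characters for plain ones.
stripSketch <- function(x) {
  # à á â ã è é ê ñ ò ó ô õ, written as escapes to avoid encoding trouble
  marked <- "\u00e0\u00e1\u00e2\u00e3\u00e8\u00e9\u00ea\u00f1\u00f2\u00f3\u00f4\u00f5"
  unmarked <- "aaaaeeenoooo"
  chartr(marked, unmarked, x)
}

stripSketch("S\u00e3o Tom\u00e9")
# returns "Sao Tome"
```

Characters missing from the lookup table pass through unchanged, which is why any approach like this only ever replaces some of them.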

## Others’ Useful Packages

### Biodiversity: Data Access

auk: Access eBird data

helminthR: London Natural History Museum helminth parasite database

rcites: Data on species protected by CITES or CMS (Convention on Migratory Species)

rebird: Access eBird data

rredlist: IUCN Red List

rgbif: Access GBIF data

spocc: For downloading data from GBIF, BISON, AntWeb, and others.

### Biodiversity: Data Cleaning

biogeo: For cleaning biodiversity data.

spoccutils: Light cleaning and visualization of biodiversity data.

### Built Environments

osmdata: Access Open Street Map data

stplanr: Transport planning

### Climate: Data Access

clifro: New Zealand National Climate Database

GSODR: Global Surface Summary of the Day (GSOD) Weather Data

### ENM/SDM

dismo: The *must-have* package

iSdm: Invasive species distribution and niche modeling

### GIS Data (General)

FedData: For downloading GIS data from several US government data sources (CRAN, GitHub)

weathercan: Download weather station data from Environment and Climate Change Canada (ECCC)

### Graphics

Data-to-Viz taxonomy of graphics

viridis palette

### Phylogenetics

brranching: Obtain phylogenies

taxa: Data structures for taxonomies

### Programming

rmarkdown book for free!

### Remote Sensing: Data Access

rLandsat: LANDSAT data

MODIStsp: MODIS time series

smapr: NASA’s Soil Moisture Active-Passive data

### Statistics

ggeffects: Calculate marginal effects

“Uber” plot for evaluating models in one set of commands