Maximizing the information inherent in museum and herbarium specimen collections

Guazuma tomentosa, an herbarium specimen

Problem #1: We typically infer the distribution of a species, but the evidence of distribution (herbarium or museum data) is collected in a non-standardized manner.

Problem #2: Museum/herbarium data often contain “false positives,” that is, specimens that have been misidentified.

Problem #3: Lack of evidence of occurrence is not evidence of absence… false absences obscure the true distribution of a species.

Problem #4: A lot of herbarium/museum data are georeferenced only to large geopolitical units (e.g., states or provinces). Discarding these data ignores possible occurrences in these regions.

Solution: The enmSdmBayesLand (esBL) software for the R Statistical Environment addresses these problems by:

  • Allowing collector-level covariates to correct for non-systematic sampling
  • Estimating the probability that occurrences at a site are actually misidentifications, without the need for a “gold standard” data set
  • Estimating the probability of presence and absence
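
To see why a lack of records is not the same as absence, the following minimal sketch applies Bayes' rule to a single site. It uses standard occupancy-model reasoning and is not necessarily the exact formulation used in the package. The function name and the values of psi (prior probability of occupancy), p (chance that any one collected record is the focal species, if present), and N (number of records collected at the site) are all hypothetical.

```r
# Probability that a site is occupied given zero detections among N records.
# psi: prior probability of occupancy (hypothetical value)
# p:   per-record detection probability if the species is present (hypothetical value)
prob_present_given_no_detection <- function(psi, p, N) {
  miss <- (1 - p)^N  # chance of missing the species in all N records
  psi * miss / (psi * miss + (1 - psi))
}

psi <- 0.5
p <- 0.02

# The more records collected without a detection, the less plausible presence becomes.
prob_present_given_no_detection(psi, p, N = c(2, 50, 500))
```

With only a few records, the probability of presence stays close to the prior; it drops toward zero only when many records have been collected without finding the species.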

How do I install enmSdmBayesLand?

In R:

install.packages('devtools') # if you haven't already

What kind of data does enmSdmBayesLand use?

esBL requires data on detections of a target species plus detections of “background” species which represent an index of search effort. The idea is that if the focal species were present, it would have a high chance of being collected among the “background” records.
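
As a rough illustration of this kind of data, the toy example below stores one row per site with a count of focal-species records and a count of background records. The counts are invented and the column names are hypothetical; this is not the package's required input format.

```r
# Toy data: one row per site (e.g., a county). All values are invented.
records <- data.frame(
  site       = c("A", "B", "C", "D"),
  detections = c(3, 0, 0, 12),    # records of the focal species
  effort     = c(40, 250, 2, 90)  # records of background species (search effort)
)

# Naive detection rate: focal records per background record.
records$rate <- records$detections / records$effort

# Sites B and C both have zero detections, but B was searched far more
# heavily, so its zero is stronger evidence of absence than C's.
records[records$detections == 0, c("site", "effort")]
```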

Records are typically represented in a spatial format (e.g., a shapefile or SpatialPolygons object, a raster, or a data frame that can be coerced to one of these formats).

What kind of outputs does esBL produce?

esBL returns an object of the same type as the input (i.e., a raster, or a shapefile/SpatialPolygons object). The output contains the input data plus estimates of site-level occupancy, detection, and the probability of misidentification. All estimates are provided as point estimates (medians) plus spread (lower and upper limits of the highest posterior density interval). Examples:
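
For readers unfamiliar with these summaries, here is a minimal base-R sketch of how a median and a highest posterior density (HPD) interval can be computed from posterior draws. The function `hpd_interval` is a hypothetical helper written for this illustration, and the draws are simulated stand-ins, not output from esBL.

```r
# Shortest interval containing `prob` of the draws (an empirical HPD interval).
hpd_interval <- function(draws, prob = 0.95) {
  sorted <- sort(draws)
  n <- length(sorted)
  k <- ceiling(prob * n)        # number of draws inside the interval
  starts <- seq_len(n - k + 1)  # candidate indices for the lower limit
  widths <- sorted[starts + k - 1] - sorted[starts]
  best <- which.min(widths)     # the narrowest candidate interval
  c(lower = sorted[best], upper = sorted[best + k - 1])
}

set.seed(1)
psi_draws <- rbeta(4000, 8, 2)  # simulated stand-in for posterior draws of occupancy

median(psi_draws)        # point estimate
hpd_interval(psi_draws)  # lower and upper HPD limits
```

Unlike an equal-tailed interval, the HPD interval is the narrowest range holding the requested share of the draws, which matters for skewed posteriors such as probabilities near 0 or 1.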

The range of Andropogon gerardii
The enmSdmBayesLopod model juxtaposes records of a species (top left) with sampling intensity (bottom left) to estimate the range (main panel). The model can use poorly georeferenced specimens (this species has only 90 accurately georeferenced records, but the model uses 5300 records).

Can esBL incorporate spatial autocorrelation?

Yes, spatial autocorrelation can be incorporated in the detection and/or occupancy parts of the model. Examples:

Top left: Variable detectability.
Top right: Constant detectability.
Bottom left: Variable detection with conditional autoregression in occupancy.
Bottom right: Constant detection with conditional autoregression in occupancy.
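
A conditional autoregressive term lets each site borrow information from its neighbors, so the model needs to know which sites are adjacent. The sketch below builds a rook-neighbor adjacency matrix for a small grid in base R. The function `grid_adjacency` is a hypothetical helper for illustration only; how esBL itself derives neighbors is not shown here.

```r
# Rook-neighbor adjacency matrix for an nr x nc grid of cells,
# numbered column by column (R's default matrix order).
grid_adjacency <- function(nr, nc) {
  rows <- rep(seq_len(nr), times = nc)
  cols <- rep(seq_len(nc), each = nr)
  # Cells are neighbors when they are exactly one step apart
  # horizontally or vertically (Manhattan distance of 1).
  dist <- abs(outer(rows, rows, "-")) + abs(outer(cols, cols, "-"))
  (dist == 1) * 1
}

A <- grid_adjacency(3, 3)
rowSums(A)  # number of neighbors per cell
```

In a 3 x 3 grid, corner cells have two neighbors, edge cells three, and the center cell four.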

What is the current status of enmSdmBayesLand?

esBL is currently being ported from Stan to JAGS/NIMBLE. Although Stan was somewhat faster, the port will improve stability and make future fixes quicker to deliver.

A how-to

The following assumes that you have R installed on your computer and running.

install.packages('devtools') # if you haven't already


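# load the package (assumes it is installed under this name)
library(enmSdmBayesLand)

# plot detections, scaled to [0, 1], as shades of gray
# (assumes the example data set `andropogon` is available once the package is loaded)
detections <- andropogon$detections
detections <- detections / max(detections, na.rm=TRUE)
# sites with missing detections get no fill instead of an invalid color name
cols <- ifelse(is.na(detections), NA, paste0('gray', round(100 * detections)))
plot(andropogon, col=cols)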

# convert to LOPOD object
lopod <- shapeToLopod(x = andropogon, effort = 'poaceae',
  detections = 'detections', adj = TRUE, keepFields = TRUE)

# calibrate model
# (using small burn-in and sample values to make it fast)
model <- trainBayesLand(lopod, varP = TRUE, q = NULL,
  pmin = 0, CAR = TRUE, nChains = 2, warmup = 100, sampling = 200, cores = 2)

plot(model, params='psi', cols='blues')

The image shows the estimated probability of occurrence (we used a larger number of burn-in and sample iterations than in the example code to make a better estimate).