Archive for the ‘Nichey’ Category

NSF Advances in Biological Informatics

Friday, March 2nd, 2018

Awesome news! We were just informed that the National Science Foundation will fund our proposal to use pollen, genetic, and distributional data to estimate the spatial dynamics of how trees migrated poleward after the last glacial maximum.  This is a collaborative project with Sean Hoban (Morton Arboretum), Andria Dawson (Mount Royal University), John Robinson (Michigan State University), and Allan Strand (College of Charleston). We will be hiring two postdocs over the 3 years of the grant. The first position will be based at The Morton Arboretum near Chicago, Illinois and the second at Michigan State University.


bayesLopod: Species distribution modeling with “messy” data

Thursday, March 1st, 2018

Collectively, biodiversity databases represent over a billion specimens and sightings of species.  Unfortunately, quite often 60-90% or more of that data does not meet the standards necessary for biogeographic analysis: coordinates are missing or blatantly wrong, dates are missing, and some identifications can be questionable.  Typically this data is discarded before analysis, even though it represents hundreds, perhaps thousands, of years of person-work in collection and curation.  More importantly, many of the discarded records are probably critical to understanding historical distributions and environmental tolerances of species because they represent the only known collections from a given location or from the edge of a range. Many of these ostensibly “unusable” records are historical and yet very valuable.  Indeed, using historical records indicative of pre-anthropogenic range contractions yields better estimates of species’ environmental tolerances.

We are excited to announce Version 1.01 of bayesLopod, a Bayesian modeling framework that can use vaguely-georeferenced specimen records and estimate the probability that a record falling outside the body of the distribution was incorrectly identified. bayesLopod is an R package that relies on Stan, a Bayesian coding language (which you do not need to know!) that approximates posteriors much faster than BUGS or JAGS.  The input is a points file, a raster, or a shapefile, with detections and some background estimate of sampling effort.  The output is in the same data format and provides an estimate of the probability of occupancy and the probability that a sampling unit (e.g., a raster cell) contains an incorrectly-identified record.  We hope this tool can help conservation biogeographers better address pressing questions about Earth’s biodiversity.

Available on CRAN and GitHub.

The range of Andropogon gerardii

The bayesLopod model juxtaposes records of a species (top left) with sampling intensity (bottom left) to estimate the range (main panel). The model can utilize badly georeferenced specimens (this species has only 90 accurately-referenced records but the model uses 5300 records).



Upscaling biodiversity

Tuesday, January 23rd, 2018

Our long-awaited paper on predicting country-scale biodiversity from small plots is out! Of 19 “upscaling” techniques, the most successful method was able to predict total plant richness in the United Kingdom with <10% error, though few techniques were able to recreate the shape of the actual species-area relationship.

Kunin, W.E., Harte, J., He, Fangliang, Hui, C., Jobe, R.T., Ostling, A., Polce, C., Šizling, A., Smith, A.B., Smith, Krister, Smart, S.M., Storch, D., Tjørve, E., Ugland, K-I., Ulrich, W., and Varma, V.  Accepted.  Up-scaling biodiversity: Estimating the species-area relationship from small samples.  Ecological Monographs. doi: 10.1002/ecm.1284




Phenotypic distribution modeling

Tuesday, November 14th, 2017

Our latest paper in Global Change Biology on modeling intraspecific phenotypic variation has gotten great press!  Combined, the news outlets covering our research reach ~78 million people and included The San Francisco Chronicle, The Seattle Times, US News and World Report, The Topeka Capital Journal, The Manhattan Mercury, and numerous other regional newspapers, radio stations (e.g., KWMU 90.7), TV stations (e.g., KWCH12), and science news websites (e.g., Science News Online)!

Smith, A.B., Alsdurf, J., Knapp, M. and Johnson, L.C.  2017.  Phenotypic distribution models corroborate species distribution models: A shift in the role and prevalence of a dominant prairie grass in response to climate change.  Global Change Biology 23:4365-4375. doi: 10.1111/gcb.13666

Change in biomass of Andropogon gerardii due to climate change


Climate paths and climate change communication

Tuesday, February 28th, 2017

Climate path of St. Louis, Missouri, USA

How can we communicate global warming to local audiences (= everybody who lives in a place)? Recently I made a poster showing the locations whose current climate resembles the future climate of St. Louis.

But how did I know where to locate the “future” St. Louis climatically?  By running species distribution models in “reverse”.  First, I created a set of 100 points to represent St. Louis (I gave them all the exact same coordinates–it’s false sample size inflation, but it doesn’t matter much since to a first approximation St. Louis is a point–and it does allow me to use more complex fitting features in Maxent).

Second, I associated these points with the climate layers I have for the 2070s (once for each emissions scenario).

I then trained a Maxent model using this future climate data, then projected it back to the present.

Finally, I calculated the geographic center of gravity of all cells using the predicted suitability as weights.  The center of gravity is the average location of the “future” climate of St. Louis!  I found I got slightly better (= more intuitive) results by thresholding first, then using suitability values above the threshold as weights. I also found I got better results when using only mean annual temperature and precipitation, rather than all 19 WorldClim variables.

This procedure is fairly simple and takes advantage of the fact that 1) “species” distribution models are not just for species, and 2) the output of a SDM (or whatever you want to call them) is really just an index of similarity of a multivariate space (= climate layers) at a set of points (= presences) and another set of points (= all grid cells in the layer to which you’re projecting).
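A minimal sketch of the center-of-gravity step (in Python for illustration, with an invented Gaussian suitability surface; the grid, peak location, and 0.2 threshold are all hypothetical):

```python
import numpy as np

# Hypothetical 1-degree grid of present-day suitability for the "future
# St. Louis" climate, peaking near 95W, 32N (values invented for illustration).
lon, lat = np.meshgrid(np.arange(-110, -80), np.arange(25, 50))
suit = np.exp(-(((lon + 95) / 5.0) ** 2 + ((lat - 32) / 5.0) ** 2))

# Threshold first, then use suitability above the threshold as weights
# for the geographic center of gravity.
w = np.where(suit >= 0.2, suit, 0.0)
cog_lon = (lon * w).sum() / w.sum()
cog_lat = (lat * w).sum() / w.sum()
```

The weighted centroid lands at the suitability peak here because the surface is symmetric; with a real Maxent prediction the weights come from the projected raster instead.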

I’ll be trying the poster out at the Missouri Botanical Garden’s upcoming Science Open House–hopefully it will spark some conversation!

Which is worse for biodiversity, a dollar of beef or gasoline?

Thursday, November 10th, 2016

Which produces more climate change, consumption of gasoline or beef? We have a good idea about the answer to this question.  But now ask, which displaces more biodiversity?  We have no idea–until now.  Just today our article on biodiversity impacts of economic consumption was released in Conservation Letters.  Spearheaded by Justin Kitzes and chaperoned by John Harte, this analysis considers all the direct and indirect impacts of consumption across the entire world economic system.  For example, agriculture directly displaces biodiversity, but so does the insurance industry through its need for paper, transportation, energy, and so on.  Personally, I am surprised by the impact of eating rice versus, say, buying paper–the Earth would be much better off if we could digest the latter!

Kitzes, J., Berlow, E., Conlisk, E., Erb, K., Iha, K., Martinez, N., Newman, E.A., Plutzar, C., Smith, A.B., and Harte, J.  In press.  Consumption-based conservation targeting: Linking biodiversity loss to upstream demand through a global wildlife footprint.  Conservation Letters.

A Perfect Storm of Threats

Wednesday, November 2nd, 2016
Number of rare plant species threatened by recreation

Number of rare plant species threatened by recreation

Just out: a new analysis by Haydee Hernández-Yáñez, 7 other students at the University of Missouri-Saint Louis, and myself on the threats that affect all known rare plants of the US! This is a reprise of the analysis by David Wilcove and colleagues from 1998.  We already got coverage on NPR and Inside Science!

Not appearing in the analysis is the spatial aspect (see image to the right), which we decided to drop near the end because the article was getting too long.  Still, I’m hoping this will become something else on its own!

Hernández-Yáñez, H., Kos, J.T., Bast, M.D., Griggs, J.L., Hage, P.A., Killian, A., Whitmore, M.B., Loza, M. L., Smith, A.B.  2016.  A systematic assessment of threats affecting the rare plants of the United States.  Biological Conservation 203:260-267.

Importing NLDAS and GLDAS data into R

Monday, August 29th, 2016


OK, I just spent the entire day obtaining and learning how to import the NLDAS and GLDAS data into R.  This could have been made a lot simpler with better metadata descriptions and “readme” files placed where they are needed.  In any case, I’m posting this to save anyone else wanting to use this data some precious time.  In case you didn’t know (I didn’t until yesterday), the NASA Land Data Assimilation Systems (NLDAS) and Global Land Data Assimilation Systems (GLDAS) are measured/interpolated and/or modeled climate and land surface variables for essentially the conterminous US (NLDAS) at 0.125 deg resolution or the world (GLDAS) at 1 deg resolution.  There are a lot of variables of interest, including the basic set of min/max air temperature and precipitation, plus snowfall, soil temperature, LAI, albedo, incoming/net shortwave and longwave radiation, etc.  The data is available in sets representing calculations at 3-hr intervals, monthly intervals, or averages for each month across the given time period.  Most of the temporal extents of these models cover 1979 to the present.

Both NLDAS and GLDAS have versions 1 and 2, the latter being newer and more sophisticated.  Both have also been run with 3 land surface models: Mosaic, Noah, and VIC–but wait, there’s a fourth, SAC, which is not described in the ReadMe file for NLDAS2.  There are also “FORA” and “FORB” data sets placed alongside the three models with little explanation as to what they are.  They contain the forcing variables used by the land surface models.  Note that the forcing variables remain unchanged across the three land surface models, so if you want a variable in the forcing set you can just get it from FORA or FORB.

Obtaining the data

If you want the entire hourly dataset, it will take a long time to download since each model set contains tens of thousands of files.  The monthly dataset is only a few thousand files; the monthly averages only 24 (one raster file plus one XML file per month)–these have the word “climatology” in their data set names.

There are many ways to get the data, including using wget, which snatches files from a list of links but whose help is written for Unix/Linux, which is a girl I used to know.  You can also get subsets of the hourly data using the Simple Subset Wizard (search for “NLDAS” or “GLDAS”).  The SSW can also export the files in NetCDF format, which obviates the stuff below, but I found the SSW did not always give me the full set of results.  Hourly/monthly data are available from NASA’s Mirador (same search) or GES DISC.  I used the latter, then DownloadThemAll, a Firefox plugin.  This still took a lot of clicking, but not nearly as much as if I had done it manually.  (You’ve also got this badly documented FTP site.)

Extracting the data

So… the problem is that G/NLDAS files are stored in GRIB format, akin to a raster brick, but with no metadata on layer identity that is automatically imported into R or ArcMap when the raster is loaded.  The XML file that comes with each GRIB file has a list of variables, but they are not in the order in which they appear when imported into R. Likewise, when imported into R the metadata that should come with a GRIB file is not associated with the file contents, so you are left with a long series of rasters with many meaningless numbers.


1. In R, load the GRIB file and convert it to a raster brick:

library(rgdal) # provides readGDAL
library(raster) # provides brick
grib <- readGDAL('<filename of GRIB file>') # read GRIB file
grib <- brick(grib) # convert to raster brick
grib # notice the brick has N layers

2. Download wgrib.

3. Open a command (DOS) window and navigate to the folder with wgrib.  In Windows you can get a DOS window by pressing the Windows key then typing “cmd”.

4. Issue “wgrib <filename with no spaces>”.  The output will show a table with variable names and attributes for each layer. You will need to copy the GRIB file into the same folder as wgrib, or put it into a folder with no spaces in its name or in any of its parent folders.  There is probably a way around this…

5. Consult the metadata file “README.NLDAS2.pdf” from the G/NLDAS website and see Table 4a therein.  Find the “Short Name” of the variable you want.

6. Now look for that variable name in the DOS command window. The output from wgrib will show you the layer number of that variable.  Remember this number… call it “x”.

7. Back in R:

myLayer <- grib[[x]] # the layer you want

NB This seems to work for every variable except TSOIL (soil temperature) which for the file I experimented with has 3 such layers.  I am guessing these pertain to the three soil layers for the particular land surface model.  There was also a band named “var255” at the end, which had what seem like meaningful values of some variable.

Note that the layer you want may not be in the same place across land surface models–i.e., LAI may be layer x in one and y in another.
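For what it’s worth, the lookup in steps 4-6 can be scripted.  Here is a Python sketch that parses wgrib’s colon-delimited inventory (record number in the first field, variable short name in the fourth; the sample lines below are invented for illustration, so check them against your own wgrib output):

```python
def wgrib_inventory_bands(inventory_text):
    """Map each GRIB variable short name to its band number(s) in the file."""
    bands = {}
    for line in inventory_text.strip().splitlines():
        fields = line.split(":")
        rec, short_name = int(fields[0]), fields[3]
        bands.setdefault(short_name, []).append(rec)
    return bands

# Hypothetical wgrib inventory for a G/NLDAS file:
sample = """1:0:d=16010100:TMP:sfc:anl
2:120:d=16010100:TSOIL:0-10 cm down:anl
3:240:d=16010100:TSOIL:10-40 cm down:anl"""
```

Note how a variable like TSOIL maps to multiple bands, matching the multiple soil layers mentioned above.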

As my professor said to me once, if things don’t go well for you at least make it better for the next person.

Weighing the importance of scale

Monday, August 1st, 2016

I just finished an exciting read: Schweiger & Beierkuhnlein’s study on how well temperature predicts the distribution of 19 vascular plants across 3 spatial scales (ranging from ~<1 m to 1000s of km).  Overall, they find that the same optimum temperature is observed regardless of scale (weak scale dependence).  Nonetheless, they also find that the maximum probability of occurrence increases with grain size (strong scale dependence), which they interpret to mean that temperature is a more important driver of distribution at coarse scales.

Cool stuff!  I’ve been specifically wondering about this for a while, and this seems to be the first test thereof.

But what does it really mean?  First, I should say that their analysis was based on extracting metrics from the modeled response curves, not the response curves per se–and to my eye the curves for any particular species seem very different across scales even after correcting for differences in height (their Figs. 1 and S1).  I would have liked to see a statistical comparison of the shapes of curves.

But let’s let that lie and think about what they found. In a nutshell, their results are predicted by the Eltonian noise hypothesis which posits that abiotic drivers like temperature will be more important at coarse scales while biotic drivers will create “noise” in distribution at fine scales–noise that will be generally imperceptible at coarse scales. They infer this from the fact that maximum probability of presence increases with coarseness of grain (i.e., when predicting presence at fine grains the maximum probability will be low).  Ergo, temperature is a stronger predictor of presence at coarse scales.

While I can’t refute this observation on face value, I do wonder if the maximum probability of occurrence that they estimated at coarse grains is more than expected by chance based on combining probabilities of presence at fine grains.  Consider, for example, a simple situation where the “coarse” spatial domain (of area A) is composed of 2 fine-grain domains (each of area A/2).  Also assume that the probabilities of presence in the two finer domains are p1 and p2.  Assuming independence between the two fine-grain domains, the probability of presence in the coarse domain will be 1 – (1 – p1)(1 – p2). For a simple case, assume that p1 = p2:

Fine- vs Coarse-Scale Pr(Occupancy)

Probability of occurrence at coarse scale as a function of probability of occurrence at fine scales

We can see that except in the extreme cases of p1 = p2 = 0 or 1, coarse-scale probability of occurrence is always higher than fine-scale probability of occurrence.  So the relevant question is “Does the increased probability of occurrence at coarse scales exceed what we’d expect by chance given that the coarse domain is composed of fine domains?”  If so, only then can we say that temperature is a more important determinant of distribution at coarse scales.  And that is what I would take as verification of this particular prediction of the Eltonian noise hypothesis.
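The combination rule above is trivial to check numerically; a quick sketch in Python:

```python
# Occupancy of a coarse cell composed of two independent fine cells:
# the species is present at the coarse scale if it is present in at
# least one of the fine cells.
def coarse_prob(p1, p2):
    return 1.0 - (1.0 - p1) * (1.0 - p2)

# For any 0 < p < 1 the coarse-scale probability exceeds the fine-scale
# one, so some increase with grain size is expected from aggregation alone.
for p in (0.1, 0.5, 0.9):
    print(p, coarse_prob(p, p))
```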


Schweiger, A.H. and Beierkuhnlein, C.  2016.  Scale dependence of temperature as an abiotic driver of species’ distributions.  Global Ecology and Biogeography 25:1013-1021.  DOI: 10.1111/geb.12463

Species distribution models not for species

Wednesday, July 20th, 2016
SDMs - Not just for species

Mathematically these are all the same.

Have you ever seen the “number of people who drowned by falling into a swimming pool versus films starring Nicolas Cage” model?  You might also know it as “linear regression.”  Have you ever seen a species distribution model?  By calling it thus we make the same limiting semantic complexification as in the first case.

This post is not about the debate over whether we should be calling it species distribution modeling or ecological niche modeling (notice the participle form of each term–adding an ing refers to the act of modeling).  I’m talking about calling them species distribution models.

To put it bluntly, the underlying mathematics of an SDM doesn’t care whether it’s depicting a species or anything else.  In fact, there are numerous published applications of species-less “SDMs”.

I even once met a person who uses Maxent to locate opportune places for underwater archaeology!

The fundamental commonality that allows all of these phenomena to be modeled by “SDM” algorithms is the nature of the response data–either unary (i.e., just presences) or binary (presences and absences).  Ergo, if you can describe a pattern with unary or binary data, you can also likely apply an “SDM”.
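As a toy illustration of that point (invented data, plain logistic regression in Python rather than any SDM package):

```python
import numpy as np

# Invented presence/absence data: a binary response driven by one
# environmental covariate (temperature). Nothing here is species-specific.
rng = np.random.default_rng(42)
temp = rng.uniform(0.0, 30.0, 500)
presence = (rng.uniform(size=500) < 1.0 / (1.0 + np.exp(-(temp - 15.0)))).astype(float)

# Fit an ordinary logistic regression by gradient ascent.
z = (temp - temp.mean()) / temp.std()      # standardize the covariate
X = np.column_stack([np.ones_like(z), z])
beta = np.zeros(2)
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    beta += 0.1 * X.T @ (presence - p) / len(z)

# A positive slope means occurrence probability rises with temperature:
# the same binary-regression machinery an "SDM" uses, no species required.
slope = beta[1]
```

Swap the binary column for shipwreck sites, disease cases, or anything else unary/binary, and the model neither knows nor cares.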

So feel free to refer to “my species distribution model” to reference your particular model of a species’ distribution.  But don’t let the moniker box you into thinking they’re just for species!