A general workflow for acquiring and cleaning biodiversity data

Adam B. Smith | Kelley Erickson | Stephen Murphy
Missouri Botanical Garden | April 2020

Below is a generalized workflow for cleaning specimen data acquired from biodiversity portals. We have numbered the steps, but the order can often be changed. In our experience, the process is highly iterative: the more you work with a dataset, the more issues you find, so you will often have to go back several steps to correct something that was missed initially.

  1. Discover/contact relevant data resources. See Data Portals – Global Change Conservation
  1. Download data
    • Decide on search terms.  Species may have several names due to taxonomic name revision or disagreements.  See resources at Data Portals – Global Change Conservation.
    • If given the option, download all available data (not a “simple” version).
    • Check the integrity of the download: do manual and automated downloads return the same records and fields?
    • Check for improper line/column wrapping (this often occurs when records contain web addresses, and reading the file into R does not throw an error).
    • Save your raw data. Never work on this file directly; work only on copies. Record the download date and the DOI, if available.
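One way to catch improper line/column wrapping is to compare the number of fields in every row against the header. The sketch below is a minimal illustration in Python using only the standard library; the function name `find_misaligned_rows` and the sample records are hypothetical, and it assumes a delimited text download with a header row.

```python
import csv
import io

def find_misaligned_rows(text, delimiter=","):
    """Return (expected_field_count, [(line_number, field_count), ...])
    for rows whose field count differs from the header's."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    expected = len(header)
    bad = []
    for row in reader:
        if len(row) != expected:
            # reader.line_num is the line of the source text just consumed
            bad.append((reader.line_num, len(row)))
    return expected, bad

# Hypothetical download in which a stray line break after a web address
# split record 2 across two lines.
sample = (
    "id,species,url,country\n"
    "1,Quercus alba,http://example.org/1,USA\n"
    "2,Quercus rubra,http://example.org/2\n"
    ",USA\n"
    "3,Quercus alba,http://example.org/3,USA\n"
)
expected, bad = find_misaligned_rows(sample)
print(expected, bad)
```

Any row reported here deserves a look in the raw file before the data are used further.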
  2. Join datasets from different databases (optional; can be done later)
    1. Create a crosswalk table to join records from different databases
      • Example: Column “location” in dataset A corresponds to column “locality” in dataset B
      • Ensure that the crosswalked columns have the same data type (a mismatch arises when, e.g., one is a character and the other a numeric value). If needed, create new fields for “reconciled” data representing the same kind of information.
    2. Use the crosswalk to join the datasets
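The crosswalk step can be sketched as a small lookup table that maps each reconciled field to its source column in every dataset, together with a target data type. This is only an illustration in Python: the column names `yr` and `year`, the `harmonize` helper, and the records are hypothetical, while the “location”/“locality” pair follows the example above.

```python
# Crosswalk: reconciled field name -> source column in each dataset.
# All names besides "location"/"locality" are hypothetical.
CROSSWALK = {
    "locality": {"A": "location", "B": "locality"},
    "year": {"A": "yr", "B": "year"},
}
# Target data type for each reconciled field.
FIELD_TYPES = {"locality": str, "year": int}

def harmonize(record, source):
    """Copy one record into the reconciled fields, coercing data types."""
    out = {"source": source}
    for field, columns in CROSSWALK.items():
        raw = record.get(columns[source])
        if raw is None or raw == "":
            out[field] = None  # keep missing values explicit
        else:
            out[field] = FIELD_TYPES[field](raw)
    return out

# Dataset A stores the year as text; dataset B stores it as a number.
dataset_a = [{"location": "Site 1", "yr": "1998"}]
dataset_b = [{"locality": "Site 2", "year": 2004}]

combined = ([harmonize(r, "A") for r in dataset_a]
            + [harmonize(r, "B") for r in dataset_b])
print(combined)
```

Keeping a `source` field in the joined table makes it possible to trace each record back to its original database later.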
  3. Scan for invalid/missing species names and correct
    • Be aware that species may have changed names or specimens may be misidentified. Always check the original specimen if possible.
    • Use a taxonomic name resolution service (TNRS) to assign “accepted” names to species.  See resources at Data Portals – Global Change Conservation.
  4. Flag records for undesirable cases (e.g., cultivated/captive or invasive, depending on your intended use).
    • Suggestions for keyword searches for “unnatural” specimens: cultivate*, captive, planted, zoo*, garden*, experiment*, horticultur*, greenhouse*, hothouse*, purchase*, bought, pot, cage*
    • Note that animals won’t often be collected at a garden, but sometimes plants can be collected at a zoo.
    • Also search for these terms in other relevant languages. For example, if some of the specimens occur in a Spanish-speaking country, search for jardin* (garden), etc.
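The keyword search above can be sketched with regular expressions, treating a trailing `*` as “any word ending”. This is a minimal Python illustration; the helper names `keyword_to_regex` and `flag_unnatural` are hypothetical, and which text fields to search is left to the user.

```python
import re

# Keywords from the list above, plus the Spanish example.
KEYWORDS = ["cultivate*", "captive", "planted", "zoo*", "garden*",
            "experiment*", "horticultur*", "greenhouse*", "hothouse*",
            "purchase*", "bought", "pot", "cage*", "jardin*"]

def keyword_to_regex(keyword):
    """Turn 'garden*' into a prefix match and 'pot' into a whole-word match."""
    if keyword.endswith("*"):
        return r"\b" + re.escape(keyword[:-1]) + r"\w*"
    return r"\b" + re.escape(keyword) + r"\b"

PATTERN = re.compile("|".join(keyword_to_regex(k) for k in KEYWORDS),
                     re.IGNORECASE)

def flag_unnatural(text):
    """Return the sorted, lower-cased matches found in a free-text field."""
    return sorted({m.group(0).lower() for m in PATTERN.finditer(text or "")})

print(flag_unnatural("Cultivated in the botanical garden"))
print(flag_unnatural("Growing in a pothole beside the road"))
```

Matches are flags for manual review, not automatic exclusions: a prefix such as `zoo*` also matches words like “zoology”. Accented spellings (e.g., jardín) need their own pattern or diacritics removed beforehand.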
  5. Clean collection dates. Do any fall outside expectation (e.g., collections apparently from the future)?
    • Common issues:
      • Dates missing a century: ’18 or 12/31/18
      • Unclear month/day/year: 12/5/2000, 11/12/13
      • Dates of collection outside a collector’s lifespan or time of activity
    • Look in other fields for information relevant to date of collection. In some cases you may be able to identify a general time period of collection from the dates the collector was active.
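Some of the common date issues can be flagged automatically before manual review. The Python sketch below handles only slash-delimited dates and is an illustration under stated assumptions: the function name `date_issues` and its labels are hypothetical, the reference date is passed in explicitly, and month/day/year order is assumed solely for the “future” check.

```python
import re
from datetime import date

# Matches strings such as 12/31/18 or 12/5/2000.
DATE_RE = re.compile(r"^(\d{1,2})/(\d{1,2})/(\d{2}|\d{4})$")

def date_issues(text, today):
    """Return a list of issue labels for a slash-delimited date string."""
    match = DATE_RE.match(text.strip())
    if not match:
        return ["unparsed"]
    first, second = int(match.group(1)), int(match.group(2))
    year = match.group(3)
    issues = []
    if len(year) == 2:
        issues.append("missing century")
    # If both leading numbers could be a month, the order is ambiguous.
    if 1 <= first <= 12 and 1 <= second <= 12 and first != second:
        issues.append("ambiguous day/month")
    if len(year) == 4:
        try:
            # Assumption: month/day/year order, used only for this check.
            parsed = date(int(year), first, second)
        except ValueError:
            issues.append("invalid as month/day/year")
        else:
            if parsed > today:
                issues.append("in the future")
    return issues

reference = date(2020, 4, 1)
for text in ["12/31/18", "12/5/2000", "11/12/13", "1/15/2031", "’18"]:
    print(text, date_issues(text, reference))
```

Dates outside a collector’s period of activity can be checked the same way once the active years are known from other fields or sources.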
  6. Clean existing longitude/latitude values
    1. Common issues:
      • Zero coordinates (0, 0) used for “not recorded”
      • Swapped coordinates: -78.235 used for longitude when it was actually latitude
      • Missing negative signs: 78.235 longitude when it should have been -78.235
      • Mistaken decimal placement: 7.8253 versus 78.253
    2. Plot the occurrences… do they make sense?
      • Remove/correct occurrences that do not
      • Do occurrences near a coast actually fall on land/water (depending on what is appropriate for the species)?  Spatial polygons used to represent land/water can be different, so an occurrence that falls on the coast in one can fall in the water in another.
    3. Check if occurrences with coordinates fall within the stated country/state/county.
      • You will probably need to correct country/state/county names before they can be matched.
      • Watch out for diacritics disrupting the matching of names (e.g., éóñŠ); in R, the function iconv can convert text between encodings.
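Several of the coordinate issues listed in this step can be screened with simple rules. The Python sketch below is illustrative only: the function names and the expected extent are hypothetical, and the suggested “fixes” are candidates to verify against the locality description, not corrections to apply blindly. The last function shows one way to strip diacritics before matching place names.

```python
import unicodedata

def coordinate_issues(lon, lat):
    """Flag coordinates that are zero or outside valid ranges."""
    issues = []
    if lon == 0 and lat == 0:
        issues.append("zero coordinates")
    if not -180 <= lon <= 180:
        issues.append("longitude out of range")
    if not -90 <= lat <= 90:
        issues.append("latitude out of range")
    return issues

def in_extent(lon, lat, extent):
    """extent = (min_lon, max_lon, min_lat, max_lat)."""
    min_lon, max_lon, min_lat, max_lat = extent
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

def suggest_fixes(lon, lat, extent):
    """For a point outside the expected extent, list the simple
    transformations that would move it inside."""
    if in_extent(lon, lat, extent):
        return []
    candidates = {
        "swap lon/lat": (lat, lon),
        "negate longitude": (-lon, lat),
        "negate latitude": (lon, -lat),
        "longitude x 10": (lon * 10, lat),
    }
    return [name for name, (x, y) in candidates.items()
            if in_extent(x, y, extent)]

def strip_diacritics(name):
    """Remove combining marks so that 'Québec' can match 'Quebec'."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical expected extent (roughly eastern North America).
EXTENT = (-100, -60, 25, 50)
print(coordinate_issues(0, 0))
print(suggest_fixes(78.235, 38.5, EXTENT))
print(suggest_fixes(38.5, -78.235, EXTENT))
print(strip_diacritics("éóñŠ"))
```

The expected extent should be generous; a record outside it is not necessarily wrong, only worth a second look.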
  7. Georeference records missing coordinates and calculate coordinate uncertainties
  8. Flag duplicate records
    • Depending on your analysis, “duplicates” can be:
      • Occurrences of the same species in the same raster cell
      • Actual duplicates (e.g., the same species collected by the same collector on the same date and sent to different herbaria; this is common only among plants). Note that they may not have the same species name or longitude/latitude assigned to them, since once they are separated each specimen has its own history.
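Both kinds of “duplicates” can be flagged with the same routine by changing the key that defines a match. The Python sketch below is a minimal illustration: the records, field names, and the 0.5-degree cell size are hypothetical, and the grid is a simple longitude/latitude lattice standing in for whatever raster the analysis uses. Because true duplicates may carry different names or coordinates, the second key uses only collector and date.

```python
import math

def cell_key(lon, lat, cell_size):
    """Index of the grid cell containing a point (cell_size in degrees)."""
    return (math.floor(lon / cell_size), math.floor(lat / cell_size))

def flag_duplicates(records, key_func):
    """Return indices of records whose key was already seen
    (the first record with each key is not flagged)."""
    seen = set()
    duplicates = []
    for i, rec in enumerate(records):
        key = key_func(rec)
        if key in seen:
            duplicates.append(i)
        else:
            seen.add(key)
    return duplicates

# Hypothetical records.
records = [
    {"species": "Quercus alba", "collector": "J. Doe",
     "date": "1998-05-01", "lon": -90.4, "lat": 38.47},
    {"species": "Quercus alba", "collector": "J. Doe",
     "date": "1998-05-01", "lon": -90.3, "lat": 38.48},
    {"species": "Quercus alba", "collector": "A. Roe",
     "date": "2004-06-12", "lon": -90.2, "lat": 38.46},
    {"species": "Quercus alba", "collector": "A. Roe",
     "date": "2010-07-03", "lon": -91.75, "lat": 37.10},
]

# Same species in the same 0.5-degree cell.
same_cell = flag_duplicates(
    records,
    lambda r: (r["species"],) + cell_key(r["lon"], r["lat"], 0.5))

# Same collector on the same date (candidate specimen duplicates).
same_event = flag_duplicates(records, lambda r: (r["collector"], r["date"]))

print(same_cell, same_event)
```

Records are flagged rather than removed, so the decision about which copy to keep can be made later.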
  9. Discard records that are too imprecise, cannot be adequately dated, identified, etc.
    • Note: We recommend doing this last as you often encounter relevant information in fields as you go through each step. Discarding earlier risks missing this kind of information.
  10. Repeat!  Invariably, you will encounter erroneous records as you become more familiar with your data. It’s almost impossible to do a reliable cleaning the first time through.

If you clean specimen data and discover errors, consider contacting the original data provider (i.e., the museum/herbarium where the specimen is housed, not a database aggregator, which draws data from other databases). Often they are pleased to receive corrections!

Finally, please cite the data providers (both aggregators and primary databases). After all, the records often cumulatively represent hundreds of years of collecting and curation effort. Citation helps those institutions demonstrate their usefulness when requesting funding.