The rapid rise of online biodiversity collections repositories like GBIF and citizen science initiatives like iNaturalist has enabled remarkable advances in evolutionary biology, biogeography, and conservation. Nonetheless, straightforward analysis of specimen and observational data is frequently encumbered by inaccuracies. As a result, these data must always be “cleaned” before use, but the steps of a cleaning procedure, and the common but hidden pitfalls along the way, can be challenging to navigate.
In this webinar we provide entry points for going from ‘download to data’ for all kinds of specimen/observational data (animal, plant, fungi, etc.), including:
- Online biodiversity specimen repositories and differences between them, including tips and tricks for downloading biodiversity data
- A reproducible, transparent workflow for ensuring:
- Taxonomic integrity (taxonomic name resolution)
- Temporal accuracy (dates, collectors)
- A full account of geographic location uncertainty (georeferencing, coordinate precision versus uncertainty)
- How to report these procedures in publications
Download the Slides
R script for accessing GeoLocate
Answers to questions (please see below)
A general workflow for acquiring and cleaning data
Hidden issues with data downloaded from GBIF (or maybe any data in DarwinCore format)
Hidden issues with data downloaded from iDigBio
Answers to questions asked during the webinar
We have taken the liberty of editing some of the questions and grouping similar ones together. We’ve also categorized them under the headings below. Thanks for the great questions!
May I share this webinar?
Yes, please do! The recording is now available.
Are there any guidelines for best practices in specimen data digitization that a researcher could use when working with a museum collection that has not been digitized?
Yes, there certainly are! It’s very helpful to check them out since many of the “mistakes” made early on have been worked out. iDigBio has many resources, as it is the main institutional effort (in the US) for digitization.
I was wondering if there is some workshop or detailed R script that explains step-by-step the basic process for cleaning a biodiversity dataset.
We are aware of some workshops offered in the past that may be offered again (Europe, general). The University of California’s Museum of Vertebrate Zoology also occasionally offers georeferencing workshops. Please also see the answer to the next question.
Many thanks for sharing your valuable knowledge! I’ve been working with this type of data for a few years now and it’s great to see some of my own observations compiled and shared in one place. Please include R scripts within the compiled resources whenever you can… GitHub repository possibly? Thanks again!
Thank you for sharing your data with scientists! R scripts for data cleaning tend to be very specific to a particular dataset because of the idiosyncrasies found in each one. However, it can certainly be helpful to see some examples (here and here).
Are there data portals for particular countries or regions (e.g., Pakistan)?
Yes, most herbaria and museums have their own databases, which may or may not be online. Many countries and regions have an aggregator database which feeds data into GBIF (the country/region database is called a GBIF “node” in this case). Please see resources at http://www.earthskysea.org/biodiversity-data-portals, though please note this is not (yet) a definitive list!
How do I access a particular database using R? Can I do it in RStudio?
Many of the major databases have their own R package for accessing data. For example, the package rgbif can be used to access data from GBIF, and the package spocc allows you to connect to several different databases at once (including GBIF). And yes, anything you can do in R you can do in RStudio. However, we have found that in some cases these packages do not download all available fields or have problems handling entries in some cells (especially cells with web addresses). So it’s a good idea to compare column names and record counts between a manual download and a package-mediated download.
You mentioned that it was preferable to manually download occurrences from GBIF. Can you describe how to do it for multiple species (>100)?
If you’re looking for all records of a particular taxon (e.g., all plants in a particular area), you can do a higher-taxon search (e.g., all Rosaceae or all Sciuridae). Note that the file returned will often be very large. When opening it in R using read.table or read.csv, not all lines may be read in, and R won’t tell you this with a warning or error! Instead, use fread in the data.table package, which handles these files reliably. See slide #23 in the webinar for more details.
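A minimal sketch of the silent-truncation problem, using a made-up tab-delimited file in which one field contains an unescaped double quote (a common situation in real locality fields):

```r
library(data.table)  # provides fread

# Toy GBIF-style tab-delimited file; the stray double quote in the
# locality field mimics a common problem in real downloads.
f <- tempfile(fileext = ".txt")
writeLines(c(
  "id\tspecies\tlocality",
  "1\tQuercus alba\t5\" post near trailhead",  # unescaped quote
  "2\tQuercus rubra\tridge top",
  "3\tQuercus velutina\tcreek bed"
), f)

# read.delim treats the quote as opening a quoted field and swallows
# the remaining lines into it, at most with a warning.
base_in <- suppressWarnings(read.delim(f, sep = "\t"))
nrow(base_in)  # fewer than 3 rows, with no error

# fread recognizes that the quote is not valid quoting and recovers
# all three records.
dt_in <- fread(f)
nrow(dt_in)    # 3
```

Comparing nrow() of the two results (and against the record count reported by the portal) is a quick sanity check for any large download.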
Do the R packages for downloading data from GBIF automatically give the Darwin Core Archive format?
No, they don’t, which is why we recommend doing the manual download.
One thing that happens a lot for me (at least at the state level) is where a species-level ID (Maianthemum canadense) is stored in the same table as a subspecies/variety-level ID (Maianthemum canadense var. canadense). If this is the only variety/subspecies in the state, then the two names are essentially synonymous. However, sometimes there are two accepted varieties/subspecies and the specimen was only ID’d to the species level. Do you have any tips/tricks/resources for addressing this problem?
The only solution we can think of is to go back to the original specimen and see if you can identify it to subspecies/variety yourself or ask someone else to do it. Often specimens lacking subspecific identification in a database simply haven’t been identified at a finer resolution. This said, for mammals location is often used as an identifier for subspecies (e.g., “If you found it in XYZ county, it must be subspecies ABC.”).
Given the errors that have been pointed out through the presentations, I wonder if the advantages of using aggregators (i.e., time saved) are offset by the errors they retain, and whether, after all, going back to the original collections is not such a bad idea.
GBIF and most other aggregators don’t do any data cleaning, although they sometimes flag data that seems erroneous. Plus, not all fields are transferred from primary to aggregator databases. So in these cases, it would probably be better to go back to the primary databases. The disadvantage is that you will have to spend much more time searching each database. The primary databases will also have more up-to-date data, as they typically feed data to aggregators on a schedule (e.g., every few months).
Do you have databases you recommend for a particular taxon (e.g., insects)?
Other than the data portals listed at http://www.earthskysea.org/biodiversity-data-portals, we are aware of none for particular taxa. Try looking for museums/herbaria in the area in which you are interested (e.g., country/province). Many smaller institutions won’t have an online database but will be willing to share data if you contact them.
Darwin Core and data formats
Where can I learn more about what the fields in DarwinCore mean?
Please see the glossary at: https://dwc.tdwg.org/
Do you have any methods or tips on how to convert data from DarwinCore to a different format (e.g., BRAHMS) without much loss of time and information?
In general, we recommend creating a “crosswalk” table that matches each column in one dataset to the corresponding column in the other. You can then use this crosswalk to join the two tables manually or with a script. In R, the function combineDf in the omnibus package may be of assistance (full disclosure: this is a package made and maintained by Adam, a presenter in the webinar).
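As an illustration of the idea (the target column names here are made up and do not reflect the real BRAHMS schema), a crosswalk can be as simple as a named vector used to rename matched columns before merging:

```r
# Hypothetical source table with DarwinCore-style columns
dwc <- data.frame(
  scientificName   = c("Carex picta", "Carex albursina"),
  decimalLatitude  = c(39.2, 38.9),
  decimalLongitude = c(-86.5, -87.1)
)

# Crosswalk: DarwinCore name -> target name (targets are placeholders)
crosswalk <- c(
  scientificName   = "taxon",
  decimalLatitude  = "lat",
  decimalLongitude = "long"
)

# Apply the crosswalk by renaming every column that appears in it
matched <- names(dwc) %in% names(crosswalk)
names(dwc)[matched] <- crosswalk[names(dwc)[matched]]
names(dwc)  # "taxon" "lat" "long"
```

Keeping the crosswalk as a table (rather than renaming columns by hand) makes the conversion reproducible and easy to report in a methods section.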
I would like to know please if karyology (e.g., chromosome number) and plant cytogenetics has a role in cleaning data in biodiversity studies.
Yes, genetic and cytogenetic evidence (e.g., chromosome counts) can be used to determine a specimen’s identity in uncertain cases.
What is the most exact reference for corroborating the accepted taxonomic names of plants, since synonyms are a big problem?
Unfortunately, for most (all?) taxa there is no single nomenclatural authority… it somewhat depends on who worked on the taxon last. You can use a taxonomic name resolution service (several are listed here) to find synonymous names and, in most cases, the most agreed-upon name. Some TNRSs score names according to the number of times they appear in various databases, with the most common assumed to be “accepted.”
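The core of name resolution can be sketched offline with a local synonymy table; treat the names below as placeholders, since a real workflow would pull the synonymy from a TNRS rather than hard-coding it:

```r
# Hypothetical local synonymy table; a real analysis would obtain this
# from a taxonomic name resolution service instead.
synonymy <- data.frame(
  name     = c("Acer saccharophorum", "Acer saccharum", "Acer nigrum"),
  accepted = c("Acer saccharum",      "Acer saccharum", "Acer nigrum")
)

# Names as they appear in the occurrence records
records <- c("Acer saccharophorum", "Acer nigrum", "Acer saccharum")

# Map every recorded name onto its accepted equivalent
resolved <- synonymy$accepted[match(records, synonymy$name)]
resolved  # synonyms collapse onto the accepted names
```

Names with no match come back as NA, which is a useful flag: those records need manual checking (misspellings, orthographic variants) rather than silent dropping.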
Are there *unique* identifiers of the specimens (herbarium sheets) in herbarium aggregator sites? I am concerned about duplicating specimen information, particularly when using aggregator data portals.
Each gathering a collector makes is assigned a collector number, which is shared across all duplicate specimens made from it, so in theory you should be able to identify which specimens are duplicates. In the Darwin Core format, the recordNumber field contains this collector number. Duplicates all share the same collection number, but one herbarium may enter it as “7601” while another enters it as “Gillis 7601”, including the collector name. Of course, there are also instances where the collection number does not make it into the digitized data at all! This is when it becomes useful to look at some of the other data columns, such as recordedBy, year, month, stateProvince, and county, to find specimens of the same species collected at the same time, in the same place, by the same collector.
It’s useful to remember that once duplicates are created and sent to other institutions, they each have their own separate trajectory. Some of them may be renamed, georeferenced after the fact to different coordinates, etc. It’s really messy!
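A rough way to flag likely duplicates, grouping on collector number, year, and species (column names follow Darwin Core; the records and the prefix-stripping rule are a simplistic sketch, not a robust parser):

```r
# Toy occurrence table with one duplicate pair entered two different ways
occ <- data.frame(
  recordedBy     = c("Gillis", "W.T. Gillis", "Smith"),
  recordNumber   = c("7601", "Gillis 7601", "15"),
  year           = c(1969, 1969, 1980),
  scientificName = c("Coccoloba uvifera", "Coccoloba uvifera", "Pinus rigida")
)

# Strip any collector-name prefix so "Gillis 7601" matches "7601"
occ$numClean <- sub(".*\\s", "", occ$recordNumber)

# Records sharing collector number, year, and species are likely duplicates
key <- paste(occ$numClean, occ$year, occ$scientificName)
occ$isDuplicated <- duplicated(key) | duplicated(key, fromLast = TRUE)
occ[occ$isDuplicated, ]  # the two "7601" records
```

Flagging rather than deleting is deliberate: apparent duplicates may have been re-identified or re-georeferenced independently, so it is worth inspecting them before deciding which record to keep.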
Georeferencing and spatial error
When reporting “inexact” or estimated coordinates in the literature (derived from a locality description), should square brackets be used? Is this a convention?
No, square brackets indicating inexact coordinates are particular to TROPICOS, the Missouri Botanical Garden’s database. Typically the degree of inaccuracy is expressed as a number representing the radius of a circle around the coordinate where the specimen was likely located. In DarwinCore format the name of this field is coordinateUncertaintyInMeters.
How is the coordinate uncertainty column populated/calculated?
Typically, via the point-radius method that we covered briefly. See Wieczorek, J., Guo, Q., and Hijmans, R.J. 2004. The point-radius method for georeferencing locality descriptions and calculating associated uncertainty. International Journal for Geographical Information Science 18:745-767.
Is there a standard way to calculate uncertainty for records where lat/long is estimated using centroids of countries/states/counties/etc.?
Yes, typically the coordinate uncertainty value is calculated so it encompasses the entire county/state/country (i.e., from the centroid to the farthest border).
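A planar sketch of that calculation for a toy rectangular county (real georeferencing would use geodesic distances and actual boundary polygons, e.g., via GeoLocate or the point-radius calculator):

```r
# Toy county boundary: a 20 km x 10 km rectangle, coordinates in meters
county <- data.frame(
  x = c(0, 20000, 20000, 0),
  y = c(0, 0,     10000, 10000)
)

# Centroid of the vertices (a simplification; real centroids are
# computed from the polygon's area, not its vertices)
cx <- mean(county$x)
cy <- mean(county$y)

# Uncertainty radius = distance from the centroid to the farthest vertex,
# so the circle encompasses the entire county
uncertaintyM <- max(sqrt((county$x - cx)^2 + (county$y - cy)^2))
round(uncertaintyM)  # 11180 m for this rectangle
```

The resulting value is what would go into coordinateUncertaintyInMeters for a record georeferenced only to this county.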
Is there a cut-off on the size of the uncertainty for removing collections, or is it best just to remove them all? What is an acceptable uncertainty value for a point to be used in ENM?
Generally, you want the potential error in coordinates to be small enough that most records fall into the cell of the environmental data in which they are purported to occur. For example, if your environmental data are at 800-m resolution, you might choose a small threshold (200 m?) above which coordinates become too uncertain to be used.
This said, if a particular location has high spatial autocorrelation in its environmental factors (e.g., the environment doesn’t change much as you move), then high levels of spatial uncertainty can be acceptable. The usdm package and associated paper explain more. You might also consider using a model that accounts for positional error.
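Applying such a threshold is then a simple filter on coordinateUncertaintyInMeters; the records and the 200-m cutoff below are only an illustration for ~800-m grids, and how to treat records with no stated uncertainty is a judgment call:

```r
# Toy records; NA means the database reports no uncertainty at all
occ <- data.frame(
  species = c("A", "B", "C", "D"),
  coordinateUncertaintyInMeters = c(50, 200, 5000, NA)
)

threshold <- 200  # example cutoff for ~800-m environmental grids

# Keep records whose positional uncertainty is at or below the threshold;
# records lacking uncertainty are dropped here, but could instead be
# set aside for manual georeferencing
keep <- !is.na(occ$coordinateUncertaintyInMeters) &
  occ$coordinateUncertaintyInMeters <= threshold
occ[keep, ]  # species A and B survive
```

Reporting the threshold, and how many records it removed, is exactly the kind of detail that belongs in the methods section of a publication.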
Question about georeferencing: is there a standard executable threshold for uncertainty when incorporating climate data? I have a dataset that includes county names and I’m wondering if georeferencing to the center of the county is acceptable.
In these cases you could use the average value of each environmental variable across each county. Advantages and costs of doing so are discussed in (also see answer to the above question):
Park, D.S. and Davis, C.C. 2017. Implications and alternatives of assigning climate data to geographical centroids. Journal of Biogeography 44:2188-2198.
Collins, S.D., Abbott, J.C., and McIntyre, N.E. 2017. Quantifying the degree of bias from using county-scale data in species distribution modeling: Can increasing sample size or using county-averaged environmental data reduce distributional overprediction? Ecology and Evolution 7:6012-6022.
Pender, J.E., Hipp, A.L., Hahn, M., Kartesz, J., Nishino, M., and Starr, J.R. 2019. How sensitive are climatic niche inferences to distribution data sampling? A comparison of Biota of North America Program (BONAP) and Global Biodiversity Information Facility (GBIF) datasets. Ecological Informatics 54:100991.
I would like to hear more about methods for automated georeferencing.
To develop SDMs using remote sensing and/or climate data, what would be the best way to proceed to “match” as much as possible the acquisition date of satellite imagery with that of observations obtained, e.g. from GBIF when working on observations collected over long periods of time (e.g. decades). Or would it be unfeasible, especially in the case of remote sensing data?
First, a bit of context: traditionally, and still today, it has been standard practice to use climate “normals” averaged across a 30-yr timespan. However, use of weather data (i.e., annual/monthly/daily) is becoming more frequent as it becomes available. For land cover, the answer probably depends on how quickly the land cover changes. If a species depends on a seasonal or sporadic aspect of habitat (e.g., duration of green-up, fire, etc.), then it is probably best to get remote sensing data that matches that temporal resolution (or at least before, perhaps during, and after), or to calculate a frequency (e.g., number of fires in the last 10 yr; average rate of green-up across 5 yr). It also depends on how quickly anthropogenic activity is changing land cover: faster change requires shorter temporal periods to characterize. Finally, it depends on how sensitive your species is to changes in land cover.
Is the GeoLocate tool still current? Is the software still being used?
Yes! Geolocate is still maintained and updated, primarily by folks at the Yale Peabody Museum.
How well does GeoLocate perform outside of North America?
Geolocate works well across the globe! When it fails, it can be helpful to look at other sites, including Google Earth, Google Maps (which can give different results from Google Earth), Wikipedia, and other sites.
Comment from a viewer: GeoLocate also seems to work in French… “6 km Sud ouest de Yaoundé” (“6 km southwest of Yaoundé”)!
Why are geoinformation systems (QGIS, ArcGIS) not used for georeferencing and spatial analysis?
They definitely are! We highlighted a few popular tools that are specifically set up to handle georeferencing issues, but any mapping software (or even physical maps) can be used for georeferencing.
While creating a database for a collection at my college, I came across some records that have no coordinates but do have a locality description. Some of these localities are known to locals but don’t appear on any map. What can be done under these circumstances?
Maps and gazetteers vary in their inclusion of different place names/features. I would try searching for your place name on geonames.org, Google Earth, openstreetmap.org, etc. As long as you can find reliable spatial data for a place name, you should be able to georeference it. If this doesn’t work, and you feel that the place name is fairly well agreed upon by locals (e.g., “the bridge between Abbers’ and Beebers’ place”), it’s certainly valid to georeference that location, taking into account uncertainty about how large the area is and how people conceptualize it.
Say you want to know how vertebrate occupancy responds to the presence of an invasive plant. What should you do when you combine two different plant data sources into one distribution map, e.g., one that is iNaturalist data with higher uncertainty that has regional and national records, and another that was collected by scientists and has very high resolution but over a much smaller range (say, only within 10 km of a coastline)? Do you subsample the higher-resolution dataset to lower its resolution to match the iNaturalist data?
Generally, downscaling data from a coarse resolution to a fine resolution (e.g., in your example, from the coarse regional data to the fine-scale dataset) is not advisable because you risk overestimating the occurrence of the species in the smaller area. We aren’t aware of a general solution to what you’re posing, although recent data integration methods may be of help. FYI, in regard to NatureServe data in particular, if the data you’re mentioning are their proprietary county- or watershed-level data, then their data agreement specifically prohibits attempts to downscale them. They do this to protect locations of threatened species.
Questions we could not answer because they are outside our area of expertise:
How can I preserve aquatic plants for a long time?
Sorry, this is a great question, but it is outside our areas of expertise, as we are users but not producers of biodiversity specimen data. Typically plants (even aquatic plants) are dried and pressed onto archival paper.