Working with GBIF data

The Global Biodiversity Information Facility (GBIF) is a biodiversity data aggregator: it is intended to collate the specimen and observation data held in all of the world’s primary databases. Although it has yet to achieve that goal, it is the first stop for many searches for biodiversity data. However, we have found several hidden issues with data downloaded from GBIF, whether through R packages like rgbif or spocc or through a manual download. The fixes we have found are listed below. (See also working with iDigBio data!)

If you’re doing a manual download, you will receive several files. The one with specimen/observational data is this one:

  1. In R: The R packages do not download the full Darwin Core archive version of the data.
    • As a result, you will not receive all of the data needed to clean the records reliably.
    • Solution: For now, do a manual download.
  2. In R: If a manually downloaded file has more than ~500,000 lines and you open it with read.table or read.csv, the file will appear to open, but only the first ~500,000 or so lines are read, with no warning or error!
    • Solution: Use the fread function in the data.table package. You can then save the result as a binary object using the save function, and it will load properly the next time you need it.
  3. In Excel: If the file has too many rows, Excel will open it with a warning that the excess rows are not displayed.
    • Solution: Split the file into smaller ones (with, say, only 500,000 records each) using a program like PilotEdit Lite.
  4. Manual or R download, opened in R: Web addresses in some fields often cause weird “wrapping” behavior in which cell values are shifted over and column names are incorrect. Here’s an example:
    • The name of the column on the right should be “recordedBy”, not “http…unknown.org.nick”.
    • Solution: Open the file in Excel or a program like PilotEdit Lite and save it in CSV format. Then open it using read.table, read.csv, or (for large files) fread (data.table package). The fread function will “heal” the bad lines, but it may also discard lines when doing so. To determine how many were lost, compare the number of records GBIF says are in the download with the number you actually get after using fread. If you know how to get the “lost” records back, please let us know!
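
The fread-based steps above (read the file, check how many records arrived, save a binary copy) can be sketched in R as follows. This is a minimal sketch, not a tested recipe for every download. It assumes the occurrence file is tab-delimited, and it turns off quote handling (quote = ""), which is our own suggestion and may or may not prevent the shifted-column problem in your file. The helper names read_gbif and count_lines and the small demo file are hypothetical and exist only for illustration. For a real download, replace the line count with the record count GBIF reports.

```r
library(data.table)

# Read a tab-delimited occurrence file. quote = "" makes fread treat quote
# characters as ordinary text (an assumption about what suits GBIF exports).
read_gbif <- function(path) {
  fread(path, sep = "\t", quote = "", header = TRUE, encoding = "UTF-8")
}

# Count the lines in a file in chunks, so large files need not fit in memory.
count_lines <- function(path, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0L
  repeat {
    n_read <- length(readLines(con, n = chunk_size, warn = FALSE))
    if (n_read == 0L) break
    total <- total + n_read
  }
  total
}

# Hypothetical stand-in for a real download: three records, with URLs in one
# field and a stray quote character in another.
demo_file <- tempfile(fileext = ".txt")
writeLines(c(
  "gbifID\trecordedBy\treferences",
  "1\tA. Smith\thttp://example.org/a?x=1",
  "2\tB. \"Bee Jones\thttp://example.org/b",
  "3\tC. Lee\thttp://example.org/c"
), demo_file)

occ <- read_gbif(demo_file)

# Expected record count. Here it is lines minus the header, which assumes no
# field contains a line break; for real data, use the count GBIF reports.
expected <- count_lines(demo_file) - 1L
n_lost <- expected - nrow(occ)
message("Records read: ", nrow(occ), "; records lost: ", n_lost)

# Save a binary copy so the next session can use load() instead of re-parsing.
rda_file <- tempfile(fileext = ".rda")
save(occ, file = rda_file)
load(rda_file)
```

If n_lost is greater than zero, some lines were merged or discarded during reading, and the column names and the last few rows are worth inspecting before any cleaning.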