Introduction

Combining large datasets from several sources requires careful harmonization of taxon names. Non-standardized, incorrect, ambiguous, or synonymous taxonomic names can lead to unreliable results and conclusions. The bdc package includes tools to help standardize the names of major taxonomic groups (e.g., animals and plants) by comparing scientific names against one of ten taxonomic databases. The taxonomic harmonization process borrows heavily from the taxadb package (Norman et al. 2020), which contains functions for querying millions of taxonomic names in a fast, automated, and consistent way using high-quality, locally stored taxonomic databases.

The functions used to harmonize names (bdc_clean_names and bdc_query_names_taxadb) contain several additions to the taxadb package, including tools for:

  1. Cleaning and parsing scientific names;
  2. Resolving misspelled names or variant spellings using a fuzzy matching application;
  3. Converting nomenclatural synonyms to currently accepted names;
  4. Flagging ambiguous results;
  5. Reporting the results.

The taxonomic harmonization is based on one taxonomic database chosen by the user. taxadb makes available the following taxonomic sources:

Abbreviation   Taxonomic database
col            Catalogue of Life
fb             FishBase
itis           Integrated Taxonomic Information System
iucn           International Union for Conservation of Nature’s Red List
gbif           Global Biodiversity Information Facility
ncbi           National Center for Biotechnology Information
ott            OpenTree Taxonomy
tpl            The Plant List
slb            SeaLifeBase
wd             Wikidata

⚠️IMPORTANT:

Note that some taxonomic databases may be temporarily unavailable in taxadb. Check ?bdc_query_names_taxadb for a list of available taxonomic databases.

Installation

Check here how to install the bdc package.

Reading the database

Read the database created in the Pre-filter module of the bdc package. It is also possible to read any dataset containing the fields required to run the package (more details here).

database <-
  readr::read_csv(here::here("Output/Intermediate/01_prefilter_database.csv"))

Clean and parse species names

Improperly formatted scientific names usually cannot be matched with valid names. To solve this issue, we developed the bdc_clean_names function, which unifies the writing style of scientific names. This optimizes taxonomic queries by increasing the probability of finding matching names. This tool is used to:

  1. Flag and remove family names pre-pended to species names;
  2. Flag and remove qualifiers denoting uncertain or provisional status of taxonomic identification (e.g., confer, species, affinis, among others);
  3. Flag and remove infraspecific terms (e.g., variety [var.], subspecies [subsp.], forma [f.], and their spelling variations);
  4. Standardize names (i.e., capitalize only the first letter of the genus name and remove extra whitespace);
  5. Parse names (i.e., separate author, date, and annotations from the taxon name).

parse_names <-
  bdc_clean_names(sci_names = database$scientificName, save_outputs = FALSE)

#> >> Family names prepended to scientific names were flagged and removed from 0 records.
#> >> Terms denoting taxonomic uncertainty were flagged and removed from 1 records.
#> >> Other issues, capitalizing the first letter of the generic name, replacing empty names by NA, and removing extra spaces, were flagged and corrected or removed from 0 records.
#> >> Infraspecific terms were flagged and removed from 1 records.
#>
#> >> Scientific names were cleaned and parsed. Check the results in 'Output/Check/02_clean_names.csv'.

An example of bdc_clean_names output.
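
To make the cleaning steps listed above more concrete, here is a minimal base R sketch of the same kind of processing. It is not the bdc_clean_names implementation: the helper clean_name_sketch, its regular expressions, and its output column names are all hypothetical, and it covers only a few common qualifiers and rank markers.

```r
# Simplified, hypothetical cleaner; NOT the bdc_clean_names implementation.
clean_name_sketch <- function(x) {
  # Collapse runs of whitespace and trim both ends
  cleaned <- trimws(gsub("\\s+", " ", x, perl = TRUE))

  # Qualifiers of uncertain identification (cf., aff., sp., spp.)
  uncertain_rx <- "(^|\\s)(cf|aff|spp|sp)\\.?(?=\\s|$)"
  # Infraspecific rank markers (var., subsp., ssp., f.)
  infra_rx <- "(^|\\s)(var\\.?|subsp\\.?|ssp\\.?|f\\.)(?=\\s|$)"

  # Flag before removing, so that the flags describe the input names
  has_uncertainty <- grepl(uncertain_rx, cleaned, perl = TRUE, ignore.case = TRUE)
  has_infraspecific <- grepl(infra_rx, cleaned, perl = TRUE, ignore.case = TRUE)

  # Remove the flagged terms
  cleaned <- gsub(uncertain_rx, "", cleaned, perl = TRUE, ignore.case = TRUE)
  cleaned <- gsub(infra_rx, "", cleaned, perl = TRUE, ignore.case = TRUE)
  cleaned <- trimws(cleaned)

  # Capitalize only the first letter of the name
  cleaned <- tolower(cleaned)
  substr(cleaned, 1, 1) <- toupper(substr(cleaned, 1, 1))

  data.frame(
    input = x,
    has_uncertainty = has_uncertainty,
    has_infraspecific = has_infraspecific,
    cleaned = cleaned
  )
}

result <- clean_name_sketch(
  c("  myrcia  cf. splendens", "Poa annua var. reptans", "OCOTEA sp.")
)
result
```

The sketch flags terms before removing them so that the flags describe the original input. Family names, authorship, and dates are deliberately left out to keep the example short.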

Let’s merge the parsed names with the complete database. As the column “scientificName” is in the same order in both tables (i.e., parse_names and database), we can append the parsed names to the database. Only the columns “names_clean” and “.uncer_terms” will be used in the downstream analyses. But don’t worry, you can check the full results of the name-parsing process in “Output/Check/02_parsed_names.qs”.

parse_names <-
  parse_names %>%
  dplyr::select(.uncer_terms, names_clean)

database <- dplyr::bind_cols(database, parse_names)
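
Binding columns is only safe when both tables have the same number of rows in the same order. The following base R sketch shows one way to verify that assumption before appending; the toy tables original and parsed are hypothetical stand-ins for database and parse_names, and in the real workflow the check would have to run before scientificName is dropped from parse_names.

```r
# Toy stand-ins (hypothetical data) for `database` and `parse_names`
original <- data.frame(
  scientificName = c("Poa annua var. reptans", "Ocotea sp.")
)
parsed <- data.frame(
  scientificName = c("Poa annua var. reptans", "Ocotea sp."),
  names_clean = c("Poa annua reptans", "Ocotea")
)

# Stop early if the two tables are not aligned row by row
stopifnot(
  nrow(original) == nrow(parsed),
  identical(original$scientificName, parsed$scientificName)
)

# Append only the column needed downstream
merged <- cbind(original, parsed["names_clean"])
merged
```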

Names harmonization

The taxonomic harmonization is based on one of the taxonomic authorities listed above. It starts by creating a local database: the taxadb package downloads, extracts, and imports the taxonomic database chosen by the user. The download may take some time, depending on the internet connection.

⚠️IMPORTANT: If you have problems downloading the databases, consider removing previous versions of the taxonomic databases using fs::dir_delete(taxadb:::taxadb_dir())

query_names <- bdc_query_names_taxadb(
  sci_name            = database$names_clean,
  replace_synonyms    = TRUE, # replace synonyms by accepted names?
  suggest_names       = TRUE, # try to find a candidate name for misspelled names?
  suggestion_distance = 0.9, # distance between the searched and suggested names
  db                  = "gbif", # taxonomic database
  rank_name           = "Plantae", # a taxonomic rank
  rank                = "kingdom", # name of the taxonomic rank
  parallel            = FALSE, # should parallel processing be used?
  ncores              = 2, # number of cores to be used in the parallelization process
  export_accepted     = FALSE # save names linked to multiple accepted names
)

#> A total of 0 NA was/were found in sci_name.
#>
#> 115 names queried in 3.1 minutes
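
To give an intuition for how a threshold such as suggestion_distance = 0.9 can accept or reject candidate names, here is a small base R sketch. It assumes a similarity score defined as one minus the edit distance divided by the length of the longer name; this text does not state the exact metric used by bdc, so treat the formula, the misspelled query, and the candidate list as illustrative assumptions.

```r
# Hypothetical misspelled query and candidate names
query <- "Myrcia splendans"
candidates <- c("Myrcia splendens", "Myrcia splendida", "Miconia splendens")

# Generalized Levenshtein distance between the query and each candidate
edit_distance <- drop(utils::adist(query, candidates))

# Assumed similarity score in [0, 1]; 1 means identical strings
similarity <- 1 - edit_distance / pmax(nchar(query), nchar(candidates))

# Keep the best candidate only if it reaches the threshold
threshold <- 0.9
best <- which.max(similarity)
suggested <- if (similarity[best] >= threshold) candidates[best] else NA_character_

data.frame(candidates, edit_distance, similarity)
suggested
```

Under this assumed score, a single wrong letter in a 16-character name still clears the 0.9 threshold, whereas names that differ by two or more letters do not.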

Merge the results of the taxonomic harmonization process with the original database. Before that, let’s rename the column containing the original scientific names to “verbatim_scientificName”. From now on, “scientificName” corresponds to the verified names (resulting from the name harmonization process). As the columns “original_search” (in query_names) and “names_clean” are identical, only the first will be kept.

database <-
  database %>%
  dplyr::rename(verbatim_scientificName = scientificName) %>%
  dplyr::select(-names_clean) %>%
  dplyr::bind_cols(., query_names)

Report

The report is based on the column “notes”, which contains the results of the name harmonization process. The notes can be grouped into two categories: accepted names, and names with a taxonomic issue or warning that need further inspection. Accepted names are returned as “valid” in the column “Description”. The report can be saved automatically if save_report = TRUE.

report <-
  bdc_create_report(data = database,
                    database_id = "database_id",
                    workflow_step = "taxonomy",
                    save_report = FALSE)

report
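
For a quick look at the raw values behind such a report, the notes column can also be tallied by hand. This is a base R sketch on toy data, not the output of bdc_create_report; the table toy and the note value "not found" are placeholders chosen for illustration.

```r
# Toy data (hypothetical); only "accepted" is taken from the text above
toy <- data.frame(
  scientificName = c("Poa annua", "Ocotea", "Myrcia splendens", "Xxx yyy"),
  notes = c("accepted", "accepted", "accepted", "not found")
)

# Count records per note and express the counts as percentages
note_counts <- as.data.frame(table(notes = toy$notes), responseName = "n_records")
note_counts$percent <- 100 * note_counts$n_records / sum(note_counts$n_records)
note_counts
```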

Unresolved names

It is also possible to filter out records with a taxonomic status other than “accepted”. Such records can potentially be resolved manually.

unresolved_names <-
  bdc_filter_out_names(data = database,
                       col_name = "notes",
                       taxonomic_status = "accepted",
                       opposite = TRUE)
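
The effect of the opposite argument can be pictured as splitting the table into two complementary parts. The base R sketch below assumes that notes are compared as whole strings; the matching rules inside bdc_filter_out_names are not described here and may be broader, and the toy table and the note value "not found" are hypothetical.

```r
# Toy data (hypothetical)
toy <- data.frame(
  id = 1:4,
  notes = c("accepted", "not found", "accepted", "not found")
)

# Assumed rule: a record is resolved when its note is exactly "accepted"
is_accepted <- toy$notes == "accepted"

resolved <- toy[is_accepted, ]     # analogous to opposite = FALSE
unresolved <- toy[!is_accepted, ]  # analogous to opposite = TRUE

# The two parts never overlap and together cover every record
stopifnot(nrow(resolved) + nrow(unresolved) == nrow(toy))
unresolved
```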

Save the table containing unresolved names

unresolved_names %>%
  readr::write_csv(., here::here("Output/Check/02_unresolved_names.csv"))

Filtering the database

It is possible to remove records with unresolved or invalid names to get a ‘clean’ database. However, to ensure that all records will be evaluated in all the data quality tests (i.e., tests of the taxonomic, spatial, and temporal modules of the package), potentially erroneous or suspect records will be removed in the final module of the package.

# output <-
#   bdc_filter_out_names(
#     data = database,
#     col_name = "notes",
#     taxonomic_status = "accepted",
#     opposite = FALSE
#   )

Saving the database

You can use qs::qsave() instead of write_csv() to save a large database in a compressed format.

# use qs::qsave() to save the database in a compressed format and then qs::qread() to load the database
database %>%
  readr::write_csv(.,
            here::here("Output", "Intermediate", "02_taxonomy_database.csv"))