The combination of large datasets from several sources requires careful harmonization of taxon names. Non-standardized, incorrect, ambiguous, or synonymous taxonomic names can lead to unreliable results and conclusions. The bdc package includes tools to help standardize the names of major taxonomic groups (e.g., animals and plants) by comparing scientific names against one of 10 taxonomic databases. The taxonomic harmonization process borrows heavily from the taxadb package (Norman et al. 2020), which contains functions for querying millions of taxonomic names in a fast, automated, and consistent way using high-quality, locally stored taxonomic databases.
The functions used to harmonize names (bdc_clean_names and bdc_query_names_taxadb) contain several additions to the taxadb package, including tools for:
The taxonomic harmonization is based on one taxonomic database chosen by the user. taxadb makes available the following taxonomic sources:
| Abbreviation | Taxonomic database |
|---|---|
| col | Catalogue of Life |
| itis | Integrated Taxonomic Information System |
| iucn | International Union for Conservation of Nature’s Red List |
| gbif | Global Biodiversity Information Facility |
| ncbi | National Center for Biotechnology Information |
| tpl | The Plant List |
Note that some taxonomic databases may be temporarily unavailable in taxadb. Check ?bdc_query_names_taxadb for a list of available taxonomic databases.
See the bdc package documentation for installation instructions.
Improperly formatted scientific names usually cannot be matched with valid names. To solve this issue, we developed bdc_clean_names, which unifies the writing style of scientific names. This optimizes taxonomic queries by increasing the probability of finding matching names. This tool is used to:
parse_names <-
  bdc_clean_names(sci_names = database$scientificName, save_outputs = FALSE)
#> >> Family names prepended to scientific names were flagged and removed from 0 records.
#> >> Terms denoting taxonomic uncertainty were flagged and removed from 1 records.
#> >> Other issues, capitalizing the first letter of the generic name, replacing empty names by NA, and removing extra spaces, were flagged and corrected or removed from 0 records.
#> >> Infraspecific terms were flagged and removed from 1 records.
#>
#> >> Scientific names were cleaned and parsed. Check the results in 'Output/Check/02_clean_names.csv'.
An example of bdc_clean_names output.
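To give a feel for the kinds of corrections reported in the console messages, here is a minimal base R sketch. It is not the implementation used by bdc_clean_names: the helper `clean_name_sketch()`, its regular expression, and the output column `uncertainty_flagged` are assumptions made for illustration, and only two uncertainty terms ("cf." and "aff.") are handled. The helper also assumes the input contains no NA values.

```r
# Hypothetical helper (not part of bdc): a rough approximation of name cleaning.
clean_name_sketch <- function(x) {
  # Illustrative pattern for two terms denoting taxonomic uncertainty.
  uncertainty_pattern <- "\\b(cf|aff)\\.?(\\s|$)"

  # Flag names that contain an uncertainty term, then remove the term.
  flagged <- grepl(uncertainty_pattern, x, ignore.case = TRUE, perl = TRUE)
  x <- gsub(uncertainty_pattern, " ", x, ignore.case = TRUE, perl = TRUE)

  # Collapse repeated whitespace and trim both ends.
  x <- trimws(gsub("\\s+", " ", x, perl = TRUE))

  # Capitalize the first letter of the generic name.
  x <- paste0(toupper(substr(x, 1, 1)), substr(x, 2, nchar(x)))

  # Replace empty names by NA.
  x[!nzchar(x)] <- NA_character_

  data.frame(
    names_clean = x,
    uncertainty_flagged = flagged,
    stringsAsFactors = FALSE
  )
}

cleaned <- clean_name_sketch(c("myrcia  cf. splendens", " Eugenia uniflora "))
cleaned
```

The real function covers many more cases (family names, infraspecific terms, authorship), so treat this only as a picture of the general idea.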
Let’s merge the parsed names with the complete database. Because the column ‘scientificName’ is in the same order in both tables (i.e., parse_names and database), we can append the parsed names to the database. Only the columns “names_clean” and “.uncert_terms” are used in the downstream analyses, but you can check the full results of the name-parsing process in “Output/Check/02_parsed_names.qs”.
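The merge described here can be sketched in base R as follows. The two toy data frames, their values, and the extra `database_id` column are made up for illustration; the column names “names_clean” and “.uncert_terms” are taken from the text. The vignette's own code may use different tooling for this step.

```r
# Toy stand-ins for the real objects (illustrative values only).
database <- data.frame(
  database_id = c("id_1", "id_2"),
  scientificName = c("Myrcia cf. splendens", "Eugenia uniflora"),
  stringsAsFactors = FALSE
)

parse_names <- data.frame(
  scientificName = c("Myrcia cf. splendens", "Eugenia uniflora"),
  .uncert_terms = c(FALSE, TRUE), # illustrative flag values
  names_clean = c("Myrcia splendens", "Eugenia uniflora"),
  stringsAsFactors = FALSE
)

# Appending by position is only safe if both tables are in the same order.
stopifnot(identical(database$scientificName, parse_names$scientificName))

# Keep only the two columns used downstream and append them to the database.
database <- cbind(database, parse_names[, c("names_clean", ".uncert_terms")])
database
```

Checking the row order explicitly before binding columns guards against silently misaligned records.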
The taxonomic harmonization is based on one of the taxonomic authorities mentioned above. It starts by creating a local database: the taxadb package downloads, extracts, and imports the taxonomic database specified by the user. The download may take some time, depending on the internet connection.
⚠️ IMPORTANT: If you have a problem downloading databases, please consider removing the previous versions of taxonomic databases using
query_names <- bdc_query_names_taxadb(
  sci_name = database$names_clean,
  replace_synonyms = TRUE, # replace synonyms by accepted names?
  suggest_names = TRUE, # try to find a candidate name for misspelled names?
  suggestion_distance = 0.9, # distance between the searched and suggested names
  db = "gbif", # taxonomic database
  rank_name = "Plantae", # name of the taxon used to filter the database
  rank = "kingdom", # taxonomic rank of rank_name
  parallel = FALSE, # should parallel processing be used?
  ncores = 2, # number of cores to be used in the parallelization process
  export_accepted = FALSE # save names linked to multiple accepted names
)
#> A total of 0 NA was/were found in sci_name.
#>
#> 115 names queried in 3.1 minutes
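To illustrate how a threshold such as suggestion_distance = 0.9 can act on misspelled names, here is a minimal sketch of fuzzy matching in base R. It assumes a similarity defined as one minus the Levenshtein distance divided by the length of the longer name; this section does not state the exact measure bdc uses, so the formula, the helpers `name_similarity()` and `suggest_name()`, and the reference names are all assumptions.

```r
# Hypothetical helper: similarity in [0, 1] derived from the edit distance.
name_similarity <- function(query, candidates) {
  edits <- as.vector(utils::adist(query, candidates))
  1 - edits / pmax(nchar(query), nchar(candidates))
}

# Hypothetical helper: return the best candidate only if it is similar enough.
suggest_name <- function(query, candidates, min_similarity = 0.9) {
  similarity <- name_similarity(query, candidates)
  best <- which.max(similarity)
  if (similarity[best] >= min_similarity) candidates[best] else NA_character_
}

# Illustrative reference names standing in for a taxonomic database.
reference <- c("Myrcia splendens", "Eugenia uniflora", "Psidium guajava")

close_match <- suggest_name("Myrcia splendes", reference) # one letter missing
no_match <- suggest_name("Myrcia sp", reference)          # too different
```

Under these assumptions, a higher threshold accepts only near-identical spellings, while a lower one returns more suggestions at the cost of more false matches.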
Next, merge the results of the taxonomic harmonization process with the original database. Before that, let’s rename the column containing the original scientific names to “verbatim_scientificName”. From now on, “scientificName” corresponds to the verified names (resulting from the name harmonization process). As the columns “original_search” in “query_names” and “names_clean” in “database” are identical, only the first will be kept.
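A minimal base R sketch of this renaming and merging step is given below. The toy data frames and their values are illustrative, and `query_names` is reduced to three columns; the real query result contains more. The vignette's own code may differ.

```r
# Toy stand-ins for the real objects (illustrative values only).
database <- data.frame(
  database_id = c("id_1", "id_2"),
  scientificName = c("Myrcia cf. splendens", "Eugenia uniflora"),
  names_clean = c("Myrcia splendens", "Eugenia uniflora"),
  stringsAsFactors = FALSE
)

query_names <- data.frame(
  original_search = c("Myrcia splendens", "Eugenia uniflora"),
  scientificName = c("Myrcia splendens", "Eugenia uniflora"),
  notes = c("accepted", "accepted"),
  stringsAsFactors = FALSE
)

# The two columns must match row by row before one of them is dropped.
stopifnot(identical(database$names_clean, query_names$original_search))

# Keep the original names under a new column name.
names(database)[names(database) == "scientificName"] <- "verbatim_scientificName"

# Drop the redundant column and append the query results.
database$names_clean <- NULL
database <- cbind(database, query_names)
database
```

Renaming first avoids ending up with two columns called “scientificName” after the bind.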
The report is based on the column “notes”, which contains the results of the name harmonization process. The notes can be grouped into two categories: accepted names and names with a taxonomic issue or warning that need further inspection. Accepted names are returned as “valid” in the column “Description”. The report can be automatically saved if save_report = TRUE.
report <-
  bdc_create_report(
    data = database,
    database_id = "database_id",
    workflow_step = "taxonomy",
    save_report = FALSE
  )

report
It is also possible to filter out records with a taxonomic status other than “accepted”. Such records can potentially be resolved manually.
unresolved_names <-
  bdc_filter_out_names(
    data = database,
    col_name = "notes",
    taxonomic_status = "accepted",
    opposite = TRUE
  )
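The effect of the opposite argument can be pictured with a small base R sketch. The helper `filter_by_status()` is hypothetical and assumes an exact match on the status string; the package function may match notes differently. The status strings other than “accepted” are illustrative only.

```r
# Hypothetical records; status strings are illustrative only.
records <- data.frame(
  scientificName = c("Myrcia splendens", NA, "Eugenia uniflora", NA),
  notes = c("accepted", "notFound", "accepted", "multipleAccepted"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of bdc): keep rows whose status matches,
# or, with opposite = TRUE, keep the rows whose status does not match.
filter_by_status <- function(data, col_name, status, opposite = FALSE) {
  matches <- data[[col_name]] %in% status
  if (opposite) {
    data[!matches, , drop = FALSE]
  } else {
    data[matches, , drop = FALSE]
  }
}

unresolved <- filter_by_status(records, "notes", "accepted", opposite = TRUE)
resolved <- filter_by_status(records, "notes", "accepted")
```

In this sketch, `unresolved` holds the records needing manual inspection and `resolved` holds the rest, so the two subsets together cover the whole table.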
Save the table containing unresolved names
It is possible to remove records with unresolved or invalid names to get a ‘clean’ database. However, to ensure that all records will be evaluated in all the data quality tests (i.e., tests of the taxonomic, spatial, and temporal modules of the package), potentially erroneous or suspect records will be removed in the final module of the package.
# output <-
#   bdc_filter_out_names(
#     data = database,
#     col_name = "notes",
#     taxonomic_status = "accepted",
#     opposite = FALSE
#   )
You can use qs::qsave() instead of write_csv to save a large database in a compressed format.