Download lineage information for species names and taxids with R

Short R example on how to download the complete taxonomic lineage information for a number of NCBI taxids.

Introduction

I have always struggled with how I can efficiently get the lineage information for an NCBI taxid. I have downloaded the complete taxonomy files from NCBI and tried to parse it locally. I can tell you this has been painful. Recently I had to retrieve lineage information for many thousands of species names and taxids so I invested some time into searching for a better solution to this problem. Although there are several R package accessing the NCBI API no single package could give me the desired output. I ended up using three R packages myTAI, taxize and plyr. Here is the workflow:

Prerequisites

First we need to install and load the required packages

install.packages(c("taxize","myTAI","plyr")
library(taxize)
library(myTAI)
library(plyr)

Download taxon summary

Next we have to download all the taxonomy summary information for each element of a given vector containing taxids. This will retrieve a summary dataframe containing the taxon name, its rank and taxonid. It is only necessary to do this if you start from taxids. If you have species names you can skip this step.

taxon_summary <- ncbi_get_taxon_summary(id = taxids)

Note: You may consider getting an ENTREZ key to be able to send more requests to the NCBI API. Look [here] (https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/) on how to do this. You can add the key by supplying the argument key="<your_key> to the ncbi_get_taxon_summary function.

Get complete lineage information

We can now use this summary dataframe to retrieve the complete lineage information using myTAI. Here we create an empty list which will contain the lineage information for each individual query. For each query we create a dataframe with the lineage information. It is important that this dataframe has named columns, since we need them later when we combine the list into a single dataframe. I also added a Sys.sleep(0) command since NCBI restricts access to its API when it receives to my requests.

df_list <- list()
for (i in 1:nrow(taxon_summary)){
tax  <- taxonomy(organism = taxon_summary[i,]$name, db = "ncbi",output = "classification")
df <- data.frame(lapply(tax$name, function(x) data.frame(x)))
colnames(df) <- tax$rank
df_list[[i]] <- df
Sys.sleep(0.5)
}

Combining the output into a single dataframe

Once this is done we can combine the list of dataframes into a single dataframe. Missing values will be filled with <NA> automatically, all with the magic of dplyr.

combined_df <- do.call(rbind.fill, df_list)

The whole script

Here are all the different parts combined while adding some example taxonids as a test:

install.packages(c("taxize","myTAI","plyr")
library(taxize)
library(myTAI)
library(plyr)

taxids <- c(644223,1099808,169507,324739,324740)

taxon_summary <- ncbi_get_taxon_summary(id = taxids)

df_list <- list()
for (i in 1:nrow(taxon_summary)){
tax  <- taxonomy(organism = taxon_summary[i,]$name, db = "ncbi",output = "classification")
df <- data.frame(lapply(tax$name, function(x) data.frame(x)))
colnames(df) <- tax$rank
df_list[[i]] <- df
Sys.sleep(0.5)
}

combined_df <- do.call(rbind.fill, df_list)

For this example the final dataframe combined_df looks like this:

 no rank superkingdom kingdom subkingdom     phylum        subphylum           class             order   
cellular organisms    Eukaryota   Fungi    Dikarya Ascomycota Saccharomycotina Saccharomycetes Saccharomycetales
cellular organisms    Eukaryota   Fungi    Dikarya Ascomycota Saccharomycotina Saccharomycetes Saccharomycetales
cellular organisms    Eukaryota   Fungi    Dikarya Ascomycota Saccharomycotina Saccharomycetes Saccharomycetales
cellular organisms    Eukaryota   Fungi    Dikarya Ascomycota Saccharomycotina Saccharomycetes Saccharomycetales
cellular organisms    Eukaryota   Fungi    Dikarya Ascomycota Saccharomycotina Saccharomycetes Saccharomycetales
            family         genus                     species
Phaffomycetaceae  Komagataella        Komagataella phaffii
Phaffomycetaceae  Komagataella         Komagataella populi
Phaffomycetaceae  Komagataella Komagataella pseudopastoris
     Pichiaceae Kregervanrija    Kregervanrija delftensis
     Pichiaceae Kregervanrija       Kregervanrija fluxuum

Here is also my RSession information:

> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS  10.14

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] plyr_1.8.4           taxize_0.9.4         myTAI_0.8.0          biomartr_0.9.9000    BiocInstaller_1.24.0
[6] bindrcpp_0.2.2       biomaRt_2.30.0      

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0           ape_5.0              lattice_0.20-34      Biostrings_2.42.1    zoo_1.8-1           
 [6] utf8_1.1.3           assertthat_0.2.0     digest_0.6.15        foreach_1.4.4        R6_2.2.2            
[11] stats4_3.3.3         RSQLite_2.0          httr_1.3.1           pillar_1.2.1         zlibbioc_1.20.0     
[16] rlang_0.3.1          curl_3.1             rstudioapi_0.7       data.table_1.10.4-3  blob_1.1.0          
[21] S4Vectors_0.12.2     urltools_1.6.0       devtools_1.13.4      downloader_0.4       readr_1.3.1         
[26] stringr_1.3.0        RCurl_1.95-4.10      bit_1.1-12           triebeard_0.3.0      pkgconfig_2.0.1     
[31] BiocGenerics_0.20.0  tidyselect_0.2.5     tibble_1.4.2         httpcode_0.2.0       IRanges_2.8.2       
[36] codetools_0.2-15     XML_3.98-1.10        reshape_0.8.8        crayon_1.3.4         dplyr_0.7.8         
[41] withr_2.1.1          bitops_1.0-6         crul_0.7.0           grid_3.3.3           nlme_3.1-131        
[46] jsonlite_1.5         DBI_0.7              git2r_0.21.0         magrittr_1.5         cli_1.0.0           
[51] stringi_1.1.6        XVector_0.14.1       reshape2_1.4.3       xml2_1.1.1           iterators_1.0.10    
[56] tools_3.3.3          bold_0.8.6           bit64_0.9-7          Biobase_2.34.0       glue_1.3.0          
[61] purrr_0.2.5          hms_0.4.2            parallel_3.3.3       yaml_2.2.0           AnnotationDbi_1.36.2
[66] memoise_1.1.0        knitr_1.20           bindr_0.1.1 

Final remarks

Of course this simple script can be extended in many ways to incorporate exception handling, adding additional columns to the final dataframe and many more. I still hope that this will be useful to some people.

Written on January 23, 2019