Title: | The ConTax Data Package |
---|---|
Description: | The consensus taxonomy for prokaryotes is a set of data-sets for best possible taxonomic classification based on 16S rRNA sequence data. |
Authors: | Hilde Vinje, Kristian Hovde Liland, Lars Snipen. |
Maintainer: | Lars Snipen <[email protected]> |
License: | GPL-2 |
Version: | 1.2 |
Built: | 2024-11-08 05:13:41 UTC |
Source: | https://github.com/larssnip/microcontax |
The consensus taxonomy for prokaryotes is a package of data sets designed to be the best possible for training taxonomic classifiers based on 16S rRNA sequence data.
microcontax()
Package: | microcontax |
Type: | Package |
Version: | 1.2 |
Date: | 2020-06-06 |
License: | GPL-2 |
Hilde Vinje, Kristian Liland, Lars Snipen.
Maintainer: Lars Snipen <[email protected]>
The trimmed version of the ConTax data set.
data(contax.trim)
contax.trim is a data.frame object containing 38 781 full-length 16S rRNA sequences. It is the trimmed version of the full data set (see below). Large taxa (many sequences) have been trimmed as described in Vinje et al. (2016) to obtain a data set with a more even representation of the prokaryotic taxonomy.
contax.full is the full consensus taxonomy data set as described in Vinje et al. (2016). The data set is too large for CRAN and is thus available as a separate package, microcontax.data. See the example below for how to obtain contax.full.
The Header of every sequence starts with a unique tag, in this case the text "ConTax" and some integer. This is followed by a token describing the origin of the sequence. It is typically
"Intersection=SRG"
meaning it is found in all three of the Silva, RDP and Greengenes data repositories. Intersections can also be SR, SG or RG if the sequence was found in only two repositories. The taxonomy information for each sequence is found in the third token. It follows a commonly used format:
"k__<...>;p__<...>;c__<...>;o__<...>;f__<...>;g__<...>;"
where <...> is some proper text. The letters, followed by a double underscore, refer to the taxonomic levels Domain (Kingdom), Phylum, Class, Order, Family and Genus. Here is an example of a proper string:
"k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Staphylococcaceae;g__Staphylococcus;"
As long as this format is used, the taxonomy information can be extracted by the supplied extractor functions getDomain, getPhylum, ..., getGenus.
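The header layout described above can be illustrated with a minimal base R sketch. The header text below is made up for illustration (it is not copied from the data set), and the sketch assumes that tokens are separated by single spaces; it is not the package's own extractor code.

```r
# Hypothetical header following the format described above.
header <- paste(
  "ConTax1",
  "Intersection=SRG",
  "k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Staphylococcaceae;g__Staphylococcus;"
)

# Assumption: tokens are separated by single spaces.
tokens   <- strsplit(header, " ", fixed = TRUE)[[1]]
tag      <- tokens[1]                              # unique tag
origin   <- sub("^Intersection=", "", tokens[2])   # e.g. "SRG"
taxonomy <- tokens[3]                              # the taxonomy string

# Split the taxonomy string into its six "x__name" fields and name them by rank.
fields <- strsplit(taxonomy, ";", fixed = TRUE)[[1]]
ranks  <- c(k = "Domain", p = "Phylum", c = "Class",
            o = "Order", f = "Family", g = "Genus")
taxa   <- sub("^[a-z]__", "", fields)
names(taxa) <- unname(ranks[substr(fields, 1, 1)])

print(taxa)
```

For real data, the supplied extractor functions are the intended way to obtain this information; the sketch only shows what the three tokens contain.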
Hilde Vinje, Kristian Hovde Liland, Lars Snipen.
medoids, getDomain, contax.full.
data(contax.trim)
dim(contax.trim)

# Write to FASTA-file
## Not run: 
writeFasta(contax.trim, out.file = "ConTax_trim.fasta")

# Install microcontax.data with the BIG contax.full data set
if (!requireNamespace("microcontax.data", quietly = TRUE)) {
  install.packages("microcontax.data")
}
# Load data
data("contax.full", package = "microcontax.data")

## End(Not run)
Converts a genus to a string containing the full taxonomy.
fullTaxonomy(genera)
genera |
A vector of texts, the genera names to look up. |
The argument genera must consist of names in the Genus column of the data set taxonomy.table.
"k__<...>;p__<...>;c__<...>;o__<...>;f__<...>;g__<...>;"
where <...> is some proper text.
A character vector containing the taxonomy information.
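As a rough illustration of how a genus name maps to a full taxonomy string, here is a minimal base R sketch. The one-row table is a toy stand-in for taxonomy.table (its values come from the example string used elsewhere in this manual), and full_taxonomy_sketch is a hypothetical helper, not the implementation of fullTaxonomy.

```r
# Toy stand-in for taxonomy.table with a single genus.
toy.table <- data.frame(
  Domain = "Bacteria", Phylum = "Firmicutes", Class = "Bacilli",
  Order = "Bacillales", Family = "Staphylococcaceae", Genus = "Staphylococcus",
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up each genus and paste the six ranks together.
full_taxonomy_sketch <- function(genera, tab) {
  idx      <- match(genera, tab$Genus)   # NA for genera not in the table
  prefixes <- c("k__", "p__", "c__", "o__", "f__", "g__")
  cols     <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus")
  parts <- mapply(function(p, col) paste0(p, tab[[col]][idx], ";"),
                  prefixes, cols, SIMPLIFY = FALSE)
  out <- do.call(paste0, unname(parts))
  out[is.na(idx)] <- NA_character_       # assumption: unknown genera give NA
  out
}

res <- full_taxonomy_sketch(c("Staphylococcus", "NotAGenus"), toy.table)
print(res)
```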
Lars Snipen.
genera <- c("Bacillus","Clostridium","Hyphomonas")
fullTaxonomy(genera)
Extracting taxonomic information from the taxonomy.table.
genusLookup(genera, rank = "Phylum")
genera |
A vector of texts, the genera names to look up. |
rank |
A single text, the level of the taxonomy to look up. |
Function for looking up higher-level taxonomy of specified genera.
The argument genera must consist of names in the Genus column of the data set taxonomy.table.
A character vector containing the taxonomy information. Names in genera that are not recognized will return NA. Please note that there are some cases of unassigned taxonomy at some ranks (Class, Order or Family); these are returned as "unknown".
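The return behaviour described above (NA for unrecognized genera, "unknown" for unassigned ranks) can be mimicked with a small base R sketch. The table is a toy stand-in for taxonomy.table, "GenusX" and "PhylumX" are made-up placeholder names, and genus_lookup_sketch is a hypothetical helper rather than the package's genusLookup.

```r
# Toy stand-in for taxonomy.table; the second row uses placeholder names.
toy.table <- data.frame(
  Genus  = c("Staphylococcus", "GenusX"),
  Phylum = c("Firmicutes", "PhylumX"),
  Class  = c("Bacilli", "unknown"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: match() returns NA for names not found in the table,
# and indexing with NA propagates NA to the result.
genus_lookup_sketch <- function(genera, tab, rank = "Phylum") {
  tab[[rank]][match(genera, tab$Genus)]
}

phyla   <- genus_lookup_sketch(c("Staphylococcus", "NotAGenus"), toy.table)
classes <- genus_lookup_sketch(c("Staphylococcus", "GenusX"), toy.table, rank = "Class")
print(phyla)
print(classes)
```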
Hilde Vinje, Lars Snipen.
genus <- c("Acidilobus","Nitrosopumilus","Hyphomonas")
genusLookup(genus, rank = "Phylum")
genusLookup(genus, rank = "Class")
Extracting taxonomic information from ConTax data sets.
getDomain(header)
getPhylum(header)
getClass(header)
getOrder(header)
getFamily(header)
getGenus(header)
getTag(header)
getTaxonomy(header)
header |
A vector of texts, typically the |
The ConTax data sets are tables in the FASTA format (see readFasta), where the Header column contains texts according to a strict format. The header always starts with a short text, a Tag, which is a unique identifier for every sequence. The function getTag will extract this from the header.
After the Tag follows one or more tokens. One of these tokens must be a string with the following format:
"k__<...>;p__<...>;c__<...>;o__<...>;f__<...>;g__<...>;"
where <...> is some proper text. Here is an example of a proper string:
"k__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Staphylococcaceae;g__Staphylococcus;"
The functions getDomain, ..., getGenus extract the corresponding information from the header. getTaxonomy runs all the taxonomy extractors, combines the results in a table and imputes missing taxa with parent taxa.
A vector containing the sub-texts extracted from each header text, except for getTaxonomy, which returns a table with the full taxonomy, one row for each input header.
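One possible reading of "imputing missing taxa with parent taxa" is sketched below in base R. This is an assumption about what the imputation means, not a description of what getTaxonomy actually does internally: a missing rank is filled with the nearest assigned rank above it. The helper name impute_with_parent and the input vector are hypothetical.

```r
# Hypothetical helper: fill each missing rank with the value of its parent rank.
# Ranks must be ordered from highest (Domain) to lowest (Genus).
impute_with_parent <- function(taxa) {
  for (i in seq_along(taxa)[-1]) {
    if (is.na(taxa[i]) || taxa[i] == "") {
      taxa[i] <- taxa[i - 1]
    }
  }
  taxa
}

# Made-up example with Class and Order missing.
x <- c(Domain = "Bacteria", Phylum = "Firmicutes", Class = NA, Order = NA,
       Family = "Staphylococcaceae", Genus = "Staphylococcus")
imputed <- impute_with_parent(x)
print(imputed)
```

Under this reading, both missing ranks inherit the Phylum value, since that is the closest assigned parent.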
Lars Snipen.
data(contax.trim)
getTag(contax.trim$Header)
getGenus(contax.trim$Header)
getPhylum(contax.trim$Header)
The genus medoids from the ConTax data set.
data(medoids)
medoids is a data.frame object containing the medoid sequences for each genus in the ConTax data sets (both contax.trim and contax.full).
The medoid sequence of a genus is the sequence having the smallest sum of distances to all other members of the same genus. Thus, it is the sequence closest to the centre of the genus. The medoids can be used as the representative of each genus, e.g. for building trees for the entire taxonomy.
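The medoid definition can be made concrete with a small base R sketch. The distance matrix below is invented for illustration (three hypothetical sequences in one genus); how the distances between real sequences were computed for the package is not covered here.

```r
# Made-up symmetric distance matrix between three sequences of one genus.
ids <- c("seqA", "seqB", "seqC")
d <- matrix(c(0, 1, 4,
              1, 0, 2,
              4, 2, 0),
            nrow = 3, byrow = TRUE, dimnames = list(ids, ids))

# Sum of distances from each sequence to all other members of the genus.
sum.dist <- rowSums(d)

# The medoid is the sequence with the smallest sum.
medoid <- names(which.min(sum.dist))
print(sum.dist)
print(medoid)
```

Here seqB has the smallest sum of distances (1 + 2 = 3), so it is the medoid of this toy genus.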
The taxonomy information for each sequence can be extracted from the Header column by the supplied extractor functions getDomain, getPhylum, ..., getGenus.
Hilde Vinje, Kristian Hovde Liland, Lars Snipen.
data(medoids)
summary(medoids)
A data frame consisting of the taxonomy information used in the ConTax data sets.
data(taxonomy.table)
taxonomy.table is a data.frame consisting of the seven columns Domain, Phylum, Class, Order, Family, Genus and LPSN. The first six contain taxonomy information; the last is "Yes" or "No", indicating whether the Genus listed is also found in the List of Prokaryotic names with Standing in Nomenclature (LPSN) database, see http://www.bacterio.net/.
Each row contains the taxonomy information for a genus, hence the number of rows equals the number of unique genera.
To quickly look up the higher-rank taxonomy for a given genus, see the function genusLookup.
Hilde Vinje, Kristian Hovde Liland, Lars Snipen.
genusLookup, contax.full, contax.trim, getDomain.
data(taxonomy.table)
dim(taxonomy.table)
taxonomy.table[1:10,]
genusLookup(taxonomy.table$Genus[1:10], rank = "Family")