All methods in this package are based on K-mer counting. K-mers are words of length K, found in the DNA sequence being analysed. There are 4K possible words of length K, and all instances in a sequence are counted. Given the sequence “ATGCCTGAACTGACCTGC” we can, for instance, count 1-mers, 2-mers and 3-mers as follows:
## A C G T
## [1,] 4 6 4 4
## AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
## [1,] 1 2 0 1 0 2 0 3 2 2 0 0 0 0 4 0
## AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC AGG AGT ATA ATC ATG ATT CAA CAC
## [1,] 0 1 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0
## CAG CAT CCA CCC CCG CCT CGA CGC CGG CGT CTA CTC CTG CTT GAA GAC GAG GAT
## [1,] 0 0 0 0 0 2 0 0 0 0 0 0 3 0 1 1 0 0
## GCA GCC GCG GCT GGA GGC GGG GGT GTA GTC GTG GTT TAA TAC TAG TAT TCA TCC
## [1,] 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## TCG TCT TGA TGC TGG TGT TTA TTC TTG TTT
## [1,] 0 0 2 2 0 0 0 0 0 0
The quickest way to perform prokaryote classification is through the
taxMachine
. It is a genus level classification methods
based on the multinomial method, trained on the microcontax
data set.
genus <- sapply(strsplit(small.16S$Header,split=" "),function(x){x[2]})
cat("Number of errors:", sum(genus != tax.tab$Genus))
Notice that a model is created and cached in memory the first time
you call taxMachine()
in a session. This is to avoid
storing a huge matrix. The next call to taxMachine()
in
this session will be (much) faster, since it will re-use the created
model.
The number of processing cores to be used is controlled by the
setParallel()
command. The default is to used all available
cores (logical and physical). Please remember to set this prior to a
call if you run this on a computing cluster!
All input sequences are given a classification by the
taxMachine()
, but some may be very uncertain. The
taxMachine()
also calculates two types of uncertainties:
the D.score
and the R.score
. A small
D.score
(close to 0, less than 1) indicates an uncertain
result because the sequence is also very similar to another genus,
i.e. its posterior probability is almost equal for two different genera.
The R.score
indicates if the sequence is very different
from any previously seen sequence, and the more negative the
R.score
, the more ‘alien’ the sequence is to the
taxMachine
.
par(mfrow = c(1,2), mar = c(2.5,2,3,1))
boxplot(tax.tab$D.score, factor(genus), main = "D-score")
boxplot(tax.tab$R.score, factor(genus), main = "R-score")
Along with each R.score
value is also a
P.recognized
value. This is simply the probability that one
of the training data sequences get an R.score
of the same
value, or lower. It is perhaps easier to interpret than the score
itself. We typically use this to mark some sequences as unrecognized,
and risky to include in further analyses:
In most cases we would accept as recognized any sequence with
P.recognized>1e-4
, since there is a substantial
variation also in ‘proper’ sequences. Sequences with a lot of sequencing
errors, chimera or completely new taxa usually get extremely negative
R.scores
.
If classification is to be carried out on a different taxonomic level than genus, using a different training data set, or using different parameters, the RDP and multinomial methods are available.
The following example uses half the small.16
data set to
train an RDP model and then tries to classify the other half of the
data.