---
title: "Introduction to microclass"
author: "Kristian Hovde Liland, Hilde Vinje and Lars Snipen"
date: "`r Sys.Date()`"
output: 
  rmarkdown::html_vignette:
        fig_caption: yes
vignette: >
  %\VignetteIndexEntry{Introduction to microclass}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## K-mers

All methods in this package are based on K-mer counting. K-mers are words of length $K$, found in the DNA sequence being analysed. There are $4^K$ possible words of length $K$, and all instances in a sequence are counted. Given the sequence "ATGCCTGAACTGACCTGC" we can, for instance, count 1-mers, 2-mers and 3-mers as follows:

```{r,message=FALSE}
library(microclass)
KmerCount("ATGCCTGAACTGACCTGC", K = 1, col.names = TRUE)
KmerCount("ATGCCTGAACTGACCTGC", K = 2, col.names = TRUE)
KmerCount("ATGCCTGAACTGACCTGC", K = 3, col.names = TRUE)
```


## Classification in one line of code

The quickest way to perform prokaryote classification is through the `taxMachine`. It is a genus level
classification methods based on the multinomial method, trained on the `microcontax` data set.

```{r, eval=F}
library(microcontax)
data(small.16S)
tax.tab <- taxMachine(small.16S$Sequence)
head(tax.tab)
```

```{r, eval=F}
genus <- sapply(strsplit(small.16S$Header,split=" "),function(x){x[2]})
cat("Number of errors:", sum(genus != tax.tab$Genus))
```
Notice that a model is created and cached in memory the first time you call `taxMachine()` in a session. This is to avoid storing a huge matrix. The next call to `taxMachine()` in this session will be (much) faster, since it will re-use the created model.


### Parallell computing
The number of processing cores to be used is controlled by the `setParallel()` command. The default is to used all available cores (logical and physical). Please remember to set this prior to a call if you run this on a computing cluster!


## Uncertainty
All input sequences are given a classification by the `taxMachine()`, but some may be very uncertain.
The `taxMachine()` also calculates two types of uncertainties: the `D.score` and the `R.score`. A small `D.score` (close to 0, less than 1) indicates an uncertain result because the sequence is also very similar to another genus, i.e. its posterior probability is almost equal for two different genera. The `R.score` indicates if the sequence is very different from any previously seen sequence, and the more negative the `R.score`, the more 'alien' the sequence is to the `taxMachine`.

```{r, fig.cap = "Classification uncertainty scores", eval=F}
par(mfrow = c(1,2), mar = c(2.5,2,3,1))
boxplot(tax.tab$D.score, factor(genus), main = "D-score")
boxplot(tax.tab$R.score, factor(genus), main = "R-score")
```

Along with each `R.score` value is also a `P.recognized` value. This is simply the probability that one of the training data sequences get an `R.score` of the same value, or lower. It is perhaps easier to interpret than the score itself. We typically use this to mark some sequences as unrecognized, and risky to include in further analyses:

```{r, eval=F}
tax.tab$Is.Recognized <- tax.tab$P.recognized>0.01
tax.tab[35:45,]
```

In most cases we would accept as recognized any sequence with `P.recognized>1e-4`, since there is a substantial variation also in 'proper' sequences. Sequences with a lot of sequencing errors, chimera or completely new taxa usually get extremely negative `R.scores`.


## Custom classification

If classification is to be carried out on a different taxonomic level than genus, using a
different training data set, or using different parameters, the RDP and multinomial methods are available.

The following example uses half the `small.16` data set to train an RDP model and then tries
to classify the other half of the data.

```{r, eval=F}
rdp <- rdpTrain(small.16S$Sequence[seq(1,71,2)], genus[seq(1,71,2)])  # training step
predicted <- rdpClassify(small.16S$Sequence[seq(2,71,2)], rdp)        # classification step
cat( "Number of errors:", sum(predicted != genus[seq(2,71,2)]) )
```