Title: | Basic Biological Sequence Handling |
---|---|
Description: | Basic functions for microbial sequence data analysis. The idea is to use generic R data structures as much as possible, making R data wrangling possible also for sequence data. |
Authors: | Lars Snipen, Kristian Hovde Liland |
Maintainer: | Lars Snipen <[email protected]> |
License: | GPL-2 |
Version: | 2.1.6 |
Built: | 2025-01-29 04:38:53 UTC |
Source: | https://github.com/larssnip/microseq |
A collection of functions for basic analysis of microbial sequence data.
microseq()
microseq()
Package: | microseq |
Type: | Package |
Version: | 2.1.6 |
Date: | 2023-08-21 |
License: | GPL-2 |
Lars Snipen, Kristian Hovde Liland
Maintainer: Lars Snipen <[email protected]>
Replaces aligned amino acids with their original codon triplets.
backTranslate(aa.msa, nuc.ffn)
backTranslate(aa.msa, nuc.ffn)
aa.msa |
A fasta object with a multiple alignment, see |
nuc.ffn |
A fasta object with the coding sequences, see |
This function replaces the aligned amino acids in aa.msa
with their
original codon triplets. This is possible only when the nucleotide sequences in nuc.ffn
are the exact nucleotide sequences behind the protein sequences that are aligned in aa.msa
.
It is required that the first token of the ‘Header’ lines is identical for a protein sequence
in aa.msa
and its nucleotide version in ‘nuc.ffn’, otherwise it is impossible to
match them. Thus, they may not appear in the same order in the two input fasta objects.
When aligning coding sequences, one should in general always align their protein sequences, to
keep the codon structure, and then use backTranslate
to convert this into a
nucleotide alignment, if required.
If the nuclotide sequences contain the stop codons, these will be removed.
A fasta object similar to aa.msa
, but where each amino acid has been replace by
its corresponding codon. All gaps ‘"-"’ are replaced by triplets ‘"---"’.
Lars Snipen.
msa.file <- file.path(file.path(path.package("microseq"),"extdata"), "small.msa") aa.msa <- readFasta(msa.file) nuc.file <- file.path(file.path(path.package("microseq"),"extdata"), "small.ffn") nuc <- readFasta(nuc.file) nuc.msa <- backTranslate(aa.msa, nuc)
msa.file <- file.path(file.path(path.package("microseq"),"extdata"), "small.msa") aa.msa <- readFasta(msa.file) nuc.file <- file.path(file.path(path.package("microseq"),"extdata"), "small.ffn") nuc <- readFasta(nuc.file) nuc.msa <- backTranslate(aa.msa, nuc)
Finding coding genes in genomic DNA using the Prodigal software.
findGenes( genome, prodigal.exe = "prodigal", faa.file = "", ffn.file = "", proc = "single", trans.tab = 11, mask.N = FALSE, bypass.SD = FALSE )
findGenes( genome, prodigal.exe = "prodigal", faa.file = "", ffn.file = "", proc = "single", trans.tab = 11, mask.N = FALSE, bypass.SD = FALSE )
genome |
A table with columns Header and Sequence, containing the genome sequence(s). |
prodigal.exe |
Command to run the external software prodigal on the system (text). |
faa.file |
If provided, prodigal will output all proteins to this fasta-file (text). |
ffn.file |
If provided, prodigal will output all DNA sequences to this fasta-file (text). |
proc |
Either |
trans.tab |
Either 11 or 4 (see below). |
mask.N |
Turn on masking of N's (logical) |
bypass.SD |
Bypass Shine-Dalgarno filter (logical) |
The external software Prodigal is used to scan through a prokaryotic genome to detect the protein
coding genes. The text in prodigal.exe
must contain the exact command to invoke barrnap on the system.
In addition to the standard output from this function, FASTA files with protein and/or DNA sequences may
be produced directly by providing filenames in faa.file
and ffn.file
.
The input proc
allows you to specify if the input data should be treated as a single genome
(default) or as a metagenome. In the latter case the genome
are (un-binned) contigs.
The translation table is by default 11 (the standard code), but table 4 should be used for Mycoplasma etc.
The mask.N
will prevent genes having runs of N inside. The bypass.SD
turn off the search
for a Shine-Dalgarno motif.
A GFF-table (see readGFF
for details) with one row for each detected
coding gene.
The prodigal software must be installed on the system for this function to work, i.e. the command ‘system("prodigal -h")’ must be recognized as a valid command if you run it in the Console window.
Lars Snipen and Kristian Hovde Liland.
## Not run: # This example requires the external prodigal software # Using a genome file in this package. genome.file <- file.path(path.package("microseq"),"extdata","small.fna") # Searching for coding sequences, this is Mycoplasma (trans.tab = 4) genome <- readFasta(genome.file) gff.tbl <- findGenes(genome, trans.tab = 4) # Retrieving the sequences cds.tbl <- gff2fasta(gff.tbl, genome) # You may use the pipe operator library(ggplot2) readFasta(genome.file) %>% findGenes(trans.tab = 4) %>% filter(Score >= 50) %>% ggplot() + geom_histogram(aes(x = Score), bins = 25) ## End(Not run)
## Not run: # This example requires the external prodigal software # Using a genome file in this package. genome.file <- file.path(path.package("microseq"),"extdata","small.fna") # Searching for coding sequences, this is Mycoplasma (trans.tab = 4) genome <- readFasta(genome.file) gff.tbl <- findGenes(genome, trans.tab = 4) # Retrieving the sequences cds.tbl <- gff2fasta(gff.tbl, genome) # You may use the pipe operator library(ggplot2) readFasta(genome.file) %>% findGenes(trans.tab = 4) %>% filter(Score >= 50) %>% ggplot() + geom_histogram(aes(x = Score), bins = 25) ## End(Not run)
Finds all ORFs in prokaryotic genome sequences.
findOrfs(genome, circular = F, trans.tab = 11)
findOrfs(genome, circular = F, trans.tab = 11)
genome |
A fasta object ( |
circular |
Logical indicating if the genome sequences are completed, circular sequences. |
trans.tab |
Translation table. |
A prokaryotic Open Reading Frame (ORF) is defined as a sub-sequence
starting with a start-codon (ATG, GTG or TTG), followed by an integer number
of triplets (codons), and ending with a stop-codon (TAA, TGA or TAG, unless
trans.tab = 4
, see below). This function will locate all such ORFs in
a genome.
The argument genome
is a fasta object, i.e. a table with columns
‘Header’ and ‘Sequence’, and will typically have several sequences
(chromosomes/plasmids/scaffolds/contigs). It is vital that the first
token (characters before first space) of every ‘Header’ is
unique, since this will be used to identify these genome sequences in the
output.
By default the genome sequences are assumed to be linear, i.e. contigs or
other incomplete fragments of a genome. In such cases there will usually be
some truncated ORFs at each end, i.e. ORFs where either the start- or the
stop-codon is lacking. In the orf.table
returned by this function this
is marked in the ‘Attributes’ column. The texts "Truncated=10" or
"Truncated=01" indicates truncated at the beginning or end of the genomic
sequence, respectively. If the supplied genome
is a completed genome,
with circular chromosome/plasmids, set the flag circular = TRUE
and no
truncated ORFs will be listed. In cases where an ORF runs across the origin
of a circular genome sequences, the stop coordinate will be larger than the
length of the genome sequence. This is in line with the specifications of
the GFF3 format, where a ‘Start’ cannot be larger than the
corresponding ‘End’.
An alternative translation table may be specified, and as of now the only alternative implemented is table 4. This means codon TGA is no longer a stop, but codes for Tryptophan. This coding is used by some bacteria (e.g. under the orders Entomoplasmatales and Mycoplasmatales).
Note that for any given stop-codon there are usually multiple start-codons
in the same reading frame. This function will return all such nested ORFs,
i.e. the same stop position may appear multiple times. If you want ORFs with
the most upstream start-codon only (LORFs), then filter the output from this
function with lorfs
.
This function returns an orf.table
, which is simply a
tibble
with columns adhering to the GFF3 format specifications
(a gff.table
), see readGFF
. If you want to retrieve
the actual ORF sequences, use gff2fasta
.
Lars Snipen and Kristian Hovde Liland.
# Using a genome file in this package genome.file <- file.path(path.package("microseq"),"extdata","small.fna") # Reading genome and finding orfs genome <- readFasta(genome.file) orf.tbl <- findOrfs(genome) # Pipeline for finding LORFs of minimum length 100 amino acids # and collecting their sequences from the genome findOrfs(genome) %>% lorfs() %>% filter(orfLength(., aa = TRUE) > 100) %>% gff2fasta(genome) -> lorf.tbl
# Using a genome file in this package genome.file <- file.path(path.package("microseq"),"extdata","small.fna") # Reading genome and finding orfs genome <- readFasta(genome.file) orf.tbl <- findOrfs(genome) # Pipeline for finding LORFs of minimum length 100 amino acids # and collecting their sequences from the genome findOrfs(genome) %>% lorfs() %>% filter(orfLength(., aa = TRUE) > 100) %>% gff2fasta(genome) -> lorf.tbl
Finding rRNA genes in genomic DNA using the barrnap software.
findrRNA(genome, barrnap.exe = "barrnap", bacteria = TRUE, cpu = 1)
findrRNA(genome, barrnap.exe = "barrnap", bacteria = TRUE, cpu = 1)
genome |
A table with columns Header and Sequence, containing the genome sequence(s). |
barrnap.exe |
Command to run the external software barrnap on the system (text). |
bacteria |
Logical, the genome is either a bacteria (default) or an archea. |
cpu |
Number of CPUs to use, default is 1. |
The external software barrnap is used to scan through a prokaryotic genome to detect the
rRNA genes (5S, 16S, 23S).
The text in barrnap.exe
must contain the exact command to invoke barrnap on the system.
A GFF-table (see readGFF
for details) with one row for each detected
rRNA sequence.
Lars Snipen and Kristian Hovde Liland.
## Not run: # This example requires the external barrnap software # Using a genome file in this package. genome.file <- file.path(path.package("microseq"),"extdata","small.fna") # Searching for rRNA sequences, and inspecting genome <- readFasta(genome.file) gff.tbl <- findrRNA(genome) print(gff.table) # Retrieving the sequences rRNA <- gff2fasta(gff.tbl, genome) ## End(Not run)
## Not run: # This example requires the external barrnap software # Using a genome file in this package. genome.file <- file.path(path.package("microseq"),"extdata","small.fna") # Searching for rRNA sequences, and inspecting genome <- readFasta(genome.file) gff.tbl <- findrRNA(genome) print(gff.table) # Retrieving the sequences rRNA <- gff2fasta(gff.tbl, genome) ## End(Not run)
Retrieving from a genome the sequences specified in a gff.table
.
gff2fasta(gff.table, genome)
gff2fasta(gff.table, genome)
gff.table |
A |
genome |
A fasta object ( |
Each row in gff.table
(see readGFF
) describes a genomic feature
in the genome
, which is a tibble
with columns ‘Header’ and
‘Sequence’. The information in the columns Seqid, Start, End and Strand are used to
retrieve the sequences from the ‘Sequence’ column of genome
. Every Seqid in
the gff.table
must match the first token in one of the ‘Header’ texts, in
order to retrieve from the correct ‘Sequence’.
A fasta object with one row for each row in gff.table
.
The Header
for each sequence is a summary of the information in the
corresponding row of gff.table
.
Lars Snipen and Kristian Hovde Liland.
# Using two files in this package gff.file <- file.path(path.package("microseq"),"extdata","small.gff") genome.file <- file.path(path.package("microseq"),"extdata","small.fna") # Reading the genome first genome <- readFasta(genome.file) # Retrieving sequences gff.table <- readGFF(gff.file) fa.tbl <- gff2fasta(gff.table, genome) # Alternative, using piping readGFF(gff.file) %>% gff2fasta(genome) -> fa.tbl
# Using two files in this package gff.file <- file.path(path.package("microseq"),"extdata","small.gff") genome.file <- file.path(path.package("microseq"),"extdata","small.fna") # Reading the genome first genome <- readFasta(genome.file) # Retrieving sequences gff.table <- readGFF(gff.file) fa.tbl <- gff2fasta(gff.table, genome) # Alternative, using piping readGFF(gff.file) %>% gff2fasta(genome) -> fa.tbl
gregexpr
with substring retrievalAn extension of the function base::gregexpr
enabling retrieval of the
matching substrings.
gregexpr( pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE, extract = FALSE )
gregexpr( pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE, extract = FALSE )
pattern |
Character string containing a regular expression (or character string for
|
text |
A character vector where matches are sought, or an object which can be coerced by
|
ignore.case |
If |
perl |
Logical. Should perl-compatible regexps be used? Has priority over |
fixed |
Logical. If |
useBytes |
Logical. If |
extract |
Logical indicating if matching substrings should be extracted and returned. |
Extended version of base:gregexpr
that enables the return of the substrings matching
the pattern. The last argument ‘extract’ is the only difference to base::gregexpr
. The default
behaviour is identical to base::gregexpr
, but setting extract=TRUE
means the matching substrings
are returned.
It will either return what the base::gregexpr
would (extract = FALSE
) or a ‘list’
of substrings matching the pattern (extract = TRUE
). There is one ‘list’ element for each string in
‘text’, and each list element contains a character vector of all matching substrings in the corresponding
entry of ‘text’.
Lars Snipen and Kristian Liland.
sequences <- c("ACATGTCATGTCC", "CTTGTATGCTG") gregexpr("ATG", sequences, extract = TRUE)
sequences <- c("ACATGTCATGTCC", "CTTGTATGCTG") gregexpr("ATG", sequences, extract = TRUE)
Converting DNA ambiguity symbols to regular expressions, and vice versa.
iupac2regex(sequence) regex2iupac(sequence)
iupac2regex(sequence) regex2iupac(sequence)
sequence |
Character vector containing DNA sequences. |
The DNA alphabet may contain ambiguity symbols, e.g. a W means either A or T.
When using a regular expression search, these letters must be replaced by the proper
regular expression, e.g. W is replaced by [AT] in the string. The iupac2regex
makes this
translation, while regex2iupac
converts the other way again (replace [AT] with W).
A string where the ambiguity symbol has been replaced by a regular expression
(iupac2regex
) or a regular expression has been replaced by an ambiguity symbol
(regex2iupac
).
Lars Snipen.
iupac2regex("ACWGT") regex2iupac("AC[AG]GT")
iupac2regex("ACWGT") regex2iupac("AC[AG]GT")
Filtering an orf.table
with ORF information to keep only the LORFs.
lorfs(orf.tbl)
lorfs(orf.tbl)
orf.tbl |
A |
For every stop-codon there are usually multiple possible start-codons in the same reading frame (nested ORFs). The LORF (Longest ORF) is defined as the longest of these nested ORFs, i.e. the ORF starting at the most upstream start-codon matching the stop-codon.
A tibble
with a subset of the rows of the argument orf.tbl
.
After this filtering the Type variable in orf.tbl
is changed to "LORF"
. If you want to
retrieve the LORF sequences, use gff2fasta
.
Lars Snipen and Kristian Hovde Liland.
# See the example in the Help-file for findOrfs.
# See the example in the Help-file for findOrfs.
Converts a FASTA formatted multiple alignment to a matrix.
msa2mat(msa)
msa2mat(msa)
msa |
A fasta object with a multiple alignment, see |
This function converts the fasta object msa
, containing a multiple alignment,
to a matrix. This means each position in the alignment is a column in the matrix, and the
content of the ‘Header’ column of msa
is used as rownames of theh matrix.
Such a matrix is useful for conversion to a DNAbin
object that is used by the ape
package for re-constructing phylogenetic trees.
A matrix
where each row is a vector of aligned bases/amino acids.
Lars Snipen.
msa.file <- file.path(path.package("microseq"),"extdata", "small.msa") msa <- readFasta(msa.file) msa.mat <- msa2mat(msa) # to use with ape::as.DNAbin(msa.mat)
msa.file <- file.path(path.package("microseq"),"extdata", "small.msa") msa <- readFasta(msa.file) msa.mat <- msa2mat(msa) # to use with ape::as.DNAbin(msa.mat)
Quickly computing a smallish multiple sequence alignment.
msalign(fsa.tbl, machine = "microseq::muscle")
msalign(fsa.tbl, machine = "microseq::muscle")
fsa.tbl |
A fasta object ( |
machine |
Function that does the 'dirty work'. |
This function computes a multiple sequence alignment given a set of sequences in
a fasta object, see readFasta
for more on fasta objects.
It is merely a wrapper for the function named in machine
to avoid explicit writing
and reading of files. This function should only be used for small data sets, since no result
files are stored. For heavier jobs, use the machine
function directly.
At present, the only machine
function implemented is muscle
, but other
third-party machine
s may be included later.
Note that this function will run muscle
with default settings, which is fine
for small data sets.
Results are returned as a fasta object, i.e. a tibble
with columns
‘Header’ and ‘Sequence’.
Lars Snipen.
## Not run: prot.file <- file.path(file.path(path.package("microseq"),"extdata"),"small.faa") faa <- readFasta(prot.file) msa <- msalign(faa) ## End(Not run)
## Not run: prot.file <- file.path(file.path(path.package("microseq"),"extdata"),"small.faa") faa <- readFasta(prot.file) msa <- msalign(faa) ## End(Not run)
Trimming a multiple sequence alignment by discarding columns with too many gaps.
msaTrim(msa, gap.end = 0.5, gap.mid = 0.9)
msaTrim(msa, gap.end = 0.5, gap.mid = 0.9)
msa |
A fasta object containing a multiple alignment. |
gap.end |
Fraction of gaps tolerated at the ends of the alignment (0-1). |
gap.mid |
Fraction of gaps tolerated inside the alignment (0-1). |
A multiple alignment is trimmed by removing columns with too many indels (gap-symbols). Any
columns containing a fraction of gaps larger than gap.mid
are discarded. For this reason, gap.mid
should always be farily close to 1.0 therwise too many columns may be discarded, destroying the alignment.
Due to the heuristics of multiple alignment methods, both ends of the alignment tend to be uncertain and most
of the trimming should be done at the ends. Starting at each end, columns are discarded as long as their fraction of gaps
surpasses gap.end
. Typically gap.end
can be much smaller than gap.mid
, but if
set too low you risk that all columns are discarded!
The trimmed alignment is returned as a fasta object.
Lars Snipen.
msa.file <- file.path(path.package("microseq"),"extdata", "small.msa") msa <- readFasta(msa.file) print(str_length(msa$Sequence)) msa.trimmed <- msaTrim(msa) print(str_length(msa.trimmed$Sequence)) msa.mat <- msa2mat(msa) # for use with ape::as.DNAbin(msa.mat)
msa.file <- file.path(path.package("microseq"),"extdata", "small.msa") msa <- readFasta(msa.file) print(str_length(msa$Sequence)) msa.trimmed <- msaTrim(msa) print(str_length(msa.trimmed$Sequence)) msa.mat <- msa2mat(msa) # for use with ape::as.DNAbin(msa.mat)
Computing a multiple sequence alignment using the MUSCLE software.
muscle( in.file, out.file, muscle.exe = "muscle", quiet = FALSE, diags = FALSE, maxiters = 16 )
muscle( in.file, out.file, muscle.exe = "muscle", quiet = FALSE, diags = FALSE, maxiters = 16 )
in.file |
Name of FASTA file with input sequences. |
out.file |
Name of file to store the result. |
muscle.exe |
Command to run the external software muscle on the system (text). |
quiet |
Logical, |
diags |
Logical, |
maxiters |
Maximum number of iterations. |
The software MUSCLE (Edgar, 2004) must be installed and available on the system. The text in
muscle.exe
must contain the exact command to invoke muscle on the system.
By default diags = FALSE
but can be set to TRUE
to increase speed. This should be done
only if sequences are highly similar.
By default maxiters = 16
. If you have a large number of sequences (a few thousand), or they are
very long, then this may be too slow for practical use. A good compromise between speed and accuracy
is to run just the first two iterations of the algorithm. On average, this gives accuracy equal to
T-Coffee and speeds much faster than CLUSTALW. This is done by the option maxiters = 2
.
The result is written to the file specified in out.file
.
Lars Snipen.
Edgar, R.C. (2004). MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Res, 32, 1792-1797.
## Not run: fa.file <- file.path(file.path(path.package("microseq"),"extdata"),"small.faa") muscle(in.file = fa.file, out.file = "delete_me.msa") ## End(Not run)
## Not run: fa.file <- file.path(file.path(path.package("microseq"),"extdata"),"small.faa") muscle(in.file = fa.file, out.file = "delete_me.msa") ## End(Not run)
Computing the lengths of all ORFs in an orf.table
.
orfLength(orf.table, aa = FALSE)
orfLength(orf.table, aa = FALSE)
orf.table |
A GFF-formatted |
aa |
Logical, length in amino acids instead of bases. |
By default, computes the length of an ORF in bases, including the
stop codon. However, if aa = TRUE
, then the length is in amino acids
after translation. This aa-length is the base-length divided by 3 and minus
1, unless the ORF is truncated and lacks a stop codon.
A vector of lengths.
Lars Snipen and Kristian Hovde Liland.
# See the example in the Help-file for findOrfs.
# See the example in the Help-file for findOrfs.
Creates a signature text for orfs in an orf.table
.
orfSignature(orf.table, full = TRUE)
orfSignature(orf.table, full = TRUE)
orf.table |
A |
full |
Logical indicating type of signature. |
A signature is a text that uniquely identifies each ORF in an
orf.table
, which is a GFF-table with columns Seqid
, Start
,
End
and Strand
.
The full signature (full = TRUE
) contains the Seqid
, Start
,
End
and Strand
information for each ORF, separated by
semicolon ";"
. This text is always unique
to each ORF. If full = FALSE
the Signature
will not contain
the starting position information for each ORF. This means all nested ORFs ending
at the same stop-codon will then get identical Signature
s. This is
useful for identifying which ORFs are nested within the same LORF.
Note that the signature you get with full = FALSE
contains Seqid
,
then End
if on the positive Strand
, Start
otherwise, and then
the Strand
.
A text vector with the Signature
for each ORF.
Lars Snipen.
# Using a genome file in this package genome.file <- file.path(path.package("microseq"),"extdata","small.fna") # Reading genome and finding orfs genome <- readFasta(genome.file) orf.tbl <- findOrfs(genome) # Compute signatures signature.full <- orfSignature(orf.tbl) signature.reduced <- orfSignature(orf.tbl, full = FALSE)
# Using a genome file in this package genome.file <- file.path(path.package("microseq"),"extdata","small.fna") # Reading genome and finding orfs genome <- readFasta(genome.file) orf.tbl <- findOrfs(genome) # Compute signatures signature.full <- orfSignature(orf.tbl) signature.reduced <- orfSignature(orf.tbl, full = FALSE)
Reads and writes biological sequences (DNA, RNA, protein) in the FASTA format.
readFasta(in.file) writeFasta(fdta, out.file, width = 0)
readFasta(in.file) writeFasta(fdta, out.file, width = 0)
in.file |
url/directory/name of (gzipped) FASTA file to read. |
fdta |
A |
out.file |
Name of (gzipped) FASTA file to create. |
width |
Number of characters per line, or 0 for no linebreaks. |
These functions handle input/output of sequences in the commonly used FASTA format.
For every sequence it is presumed there is one Header-line starting with a ‘>’. If
filenames (in.file
or out.file
) have the extension .gz
they will automatically be
compressed/uncompressed.
The sequences are stored in a tibble
, opening up all the possibilities in R for
fast and easy manipulations. The content of the file is stored as two columns, ‘Header’
and ‘Sequence’. If other columns are added, these will be ignored by writeFasta
.
The default width = 0
in writeFasta
results in no linebreaks in the sequences
(one sequence per line).
readFasta
returns a tibble
with the contents of the (gzipped) FASTA
file stored in two columns of text. The first, named ‘Header’, contains
the headerlines and the second, named ‘Sequence’, contains the sequences.
writeFasta
produces a (gzipped) FASTA file.
Lars Snipen and Kristian Hovde Liland.
## Not run: # We need a FASTA-file to read, here is one example file: fa.file <- file.path(file.path(path.package("microseq"),"extdata"),"small.ffn") # Read and write fdta <- readFasta(fa.file) ok <- writeFasta(fdta[4:5,], out.file = "delete_me.fasta") # Make use of dplyr to copy parts of the file to another file readFasta(fa.file) %>% filter(str_detect(Sequence, "TGA$")) %>% writeFasta(out.file = "TGAstop.fasta", width = 80) -> ok ## End(Not run)
## Not run: # We need a FASTA-file to read, here is one example file: fa.file <- file.path(file.path(path.package("microseq"),"extdata"),"small.ffn") # Read and write fdta <- readFasta(fa.file) ok <- writeFasta(fdta[4:5,], out.file = "delete_me.fasta") # Make use of dplyr to copy parts of the file to another file readFasta(fa.file) %>% filter(str_detect(Sequence, "TGA$")) %>% writeFasta(out.file = "TGAstop.fasta", width = 80) -> ok ## End(Not run)
Reads and writes files in the FASTQ format.
readFastq(in.file) writeFastq(fdta, out.file)
readFastq(in.file) writeFastq(fdta, out.file)
in.file |
url/directory/name of (gzipped) FASTQ file to read. |
fdta |
FASTQ object to write. |
out.file |
url/directory/name of (gzipped) FASTQ file to write. |
These functions handle input/output of sequences in the commonly used FASTQ format,
typically used for storing DNA sequences (reads) after sequencing. If
filenames (in.file
or out.file
) have the extension .gz
they will automatically be
compressed/uncompressed.
The sequences are stored in a tibble
, opening up all the possibilities in R for
fast and easy manipulations. The content of the file is stored as three columns, ‘Header’,
‘Sequence’ and ‘Quality’. If other columns are added, these will be ignored by
writeFastq
.
readFastq
returns a tibble
with the contents of the (gzipped) FASTQ
file stored in three columns of text. The first, named ‘Header’, contains
the headerlines, the second, named ‘Sequence’, contains the sequences and the third, named
‘Quality’ contains the base quality scores.
writeFastq
produces a (gzipped) FASTQ file.
These functions will only handle files where each entry spans one single line, i.e. not the (uncommon) multiline FASTQ format.
Lars Snipen and Kristian Hovde Liland.
codereadFasta.
## Not run: # We need a FASTQ-file to read, here is one example file: fq.file <- file.path(file.path(path.package("microseq"),"extdata"),"small.fastq.gz") # Read and write fdta <- readFastq(fq.file) ok <- writeFastq(fdta[1:3,], out.file = "delete_me.fq") # Make use of dplyr to copy parts of the file to another file readFastq(fq.file) %>% mutate(Length = str_length(Sequence)) %>% filter(Length > 200) %>% writeFasta(out.file = "long_reads.fa") # writing to FASTA file ## End(Not run)
## Not run: # We need a FASTQ-file to read, here is one example file: fq.file <- file.path(file.path(path.package("microseq"),"extdata"),"small.fastq.gz") # Read and write fdta <- readFastq(fq.file) ok <- writeFastq(fdta[1:3,], out.file = "delete_me.fq") # Make use of dplyr to copy parts of the file to another file readFastq(fq.file) %>% mutate(Length = str_length(Sequence)) %>% filter(Length > 200) %>% writeFasta(out.file = "long_reads.fa") # writing to FASTA file ## End(Not run)
Reading or writing a GFF-table from/to file.
readGFF(in.file) writeGFF(gff.table, out.file)
readGFF(in.file) writeGFF(gff.table, out.file)
in.file |
Name of file with a GFF-table. |
gff.table |
A table ( |
out.file |
Name of file. |
A GFF-table is simply a tibble
with columns
adhering to the format specified by the GFF3 format, see
https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md for details. There is
one row for each feature.
The following columns should always be in a full gff.table
of the GFF3 format:
Seqid. A unique identifier of the genomic sequence on which the feature resides.
Source. A description of the procedure that generated the feature, e.g. "R-package micropan::findOrfs"
.
Type The type of feature, e.g. "ORF"
, "16S"
etc.
Start. The leftmost coordinate. This is the start if the feature is on the Sense strand, but the end if it is on the Antisense strand.
End. The rightmost coordinate. This is the end if the feature is on the Sense strand, but the start if it is on the Antisense strand.
Score. A numeric score (E-value, P-value) from the Source
.
Strand. A "+"
indicates Sense strand, a "-"
Antisense.
Phase. Only relevant for coding genes. the values 0, 1 or 2 indicates the reading frame, i.e.
the number of bases to offset the Start
in order to be in the reading frame.
Attributes. A single string with semicolon-separated tokens prociding additional information.
Missing values are described by "."
in the GFF3 format. This is also done here, except for the
numerical columns Start, End, Score and Phase. Here NA
is used, but this is replaced by
"."
when writing to file.
The readGFF
function will also read files where sequences in FASTA format are added after
the GFF-table. This file section must always start with the line ##FASTA
. This fasta object
is added to the GFF-table as an attribute (use attr(gff.tbl, "FASTA")
to retrieve it).
readGFF
returns a gff.table
with the columns described above.
writeGFF
writes the supplied gff.table
to a text-file.
Lars Snipen and Kristian Hovde Liland.
# Using a GFF file in this package gff.file <- file.path(path.package("microseq"),"extdata","small.gff") # Reading gff-file gff.tbl <- readGFF(gff.file)
# Using a GFF file in this package gff.file <- file.path(path.package("microseq"),"extdata","small.gff") # Reading gff-file gff.tbl <- readGFF(gff.file)
The standard reverse-complement of nucleotide sequences.
reverseComplement(nuc.sequences, reverse = TRUE)
reverseComplement(nuc.sequences, reverse = TRUE)
nuc.sequences |
Character vector containing the nucleotide sequences. |
reverse |
Logical indicating if complement should be reversed. |
With ‘reverse = FALSE’ the DNA sequence is only complemented, not reversed.
This function will handle the IUPAC ambiguity symbols, i.e. ‘R’ is reverse-complemented to ‘Y’ etc.
A character vector of reverse-complemented sequences.
Lars Snipen and Kristian Hovde Liland.
fa.file <- file.path(file.path(path.package("microseq"),"extdata"),"small.ffn") fa <- readFasta(fa.file) reverseComplement(fa$Sequence) #' # Or, make use of dplyr to manipulate tables readFasta(fa.file) %>% mutate(RevComp = reverseComplement(Sequence)) -> fa.tbl
fa.file <- file.path(file.path(path.package("microseq"),"extdata"),"small.ffn") fa <- readFasta(fa.file) reverseComplement(fa$Sequence) #' # Or, make use of dplyr to manipulate tables readFasta(fa.file) %>% mutate(RevComp = reverseComplement(Sequence)) -> fa.tbl
The translation from DNA(RNA) to amino acid sequence according to the standard genetic code.
translate(nuc.sequences, M.start = TRUE, no.stop = TRUE, trans.tab = 11)
translate(nuc.sequences, M.start = TRUE, no.stop = TRUE, trans.tab = 11)
nuc.sequences |
Character vector containing the nucleotide sequences. |
M.start |
A logical indicating if the amino acid sequence should start with M regardless of start codon. |
no.stop |
A logical indicating if terminal stops (*) should be eliminated from the translated sequence |
trans.tab |
Translation table, either 11 or 4 |
Codons are by default translated according to translation table 11, i.e. the possible start codons
are ATG, GTG or TTG and stop codons are TAA, TGA and TAG. The only alternative implemented here is
translation table 4, which is used by some bacteria (e.g. Mycoplasma, Mesoplasma). If trans.tab
is 4,
the stop codon TGA istranslated to W (Tryptophan).
A character vector of translated sequences.
Lars Snipen and Kristian Hovde Liland.
fa.file <- file.path(file.path(path.package("microseq"),"extdata"),"small.ffn") fa <- readFasta(fa.file) translate(fa$Sequence) # Or, make use of dplyr to manipulate tables readFasta(fa.file) %>% mutate(Protein = translate(Sequence)) -> fa.tbl
fa.file <- file.path(file.path(path.package("microseq"),"extdata"),"small.ffn") fa <- readFasta(fa.file) translate(fa$Sequence) # Or, make use of dplyr to manipulate tables readFasta(fa.file) %>% mutate(Protein = translate(Sequence)) -> fa.tbl