vtamR is a revised, completed version of VTAM (Validation and Taxonomic Assignation of Metabarcoding Data) rewritten in R. It is a complete metabarcoding pipeline.
Novelties compared to VTAM: vtamR is highly adaptable, letting you include, exclude, and reorder the different steps of the analyses.
Please follow the Installation instructions.
Load library
library(vtamR)
Set the path to third-party programs
If the third-party programs are not in your
PATH (see Installation),
adjust the xxx_path variables according to
your setup.
# Example for Windows
cutadapt_path <- "C:/Users/Public/cutadapt"
vsearch_path <- "C:/Users/Public/vsearch-2.23.0-win-x86_64/bin/vsearch"
blast_path <- "C:/Users/Public/blast-2.16.0+/bin/blastn"
swarm_path <- "C:/Users/Public/swarm-3.1.5-win-x86_64/bin/swarm"
pigz_path <- "C:/Users/Public/pigz-win32/pigz" # optional
# Example for Linux
cutadapt_path <- "~/miniconda3/envs/vtam/bin/cutadapt"
vsearch_path <- "~/miniconda3/envs/vtam/bin/vsearch"
blast_path <- "~/miniconda3/envs/vtam/bin/blastn"
swarm_path <- "swarm" # swarm is in the PATH
pigz_path <- "pigz" # optional; pigz is in the PATH
If the third-party programs are in your PATH, you have two options:
- Set the xxx_path variables as shown below, and run the tutorial examples exactly as written.
# All executables are in the PATH
cutadapt_path <- "cutadapt"
vsearch_path <- "vsearch"
blast_path <- "blastn"
swarm_path <- "swarm"
pigz_path <- "pigz"
- Do not set the xxx_path arguments at all, and simply omit them when calling the vtamR functions. For example, run:
read_count_df <- FilterChimera(read_count_df)
instead of:
read_count_df <- FilterChimera(
read_count_df,
vsearch_path = vsearch_path) # can be omitted if VSEARCH is in the PATH
Set general parameters
sep <- ","
outdir <- "vtamR_demo_out"
- sep: Separator used in CSV files.
- outdir: Name of the output directory. Make sure there is no space in the path.
Input Data
The demo files below are included with the vtamR
package, which is why we use system.file() to access them
in this tutorial. When using your own data, simply provide the file and
directory names (e.g. ~/vtamR/fastq)
fastq_dir <- system.file("extdata/demo/fastq", package = "vtamR")
fastqinfo <- system.file("extdata/demo/fastqinfo.csv", package = "vtamR")
mock_composition <- system.file("extdata/demo/mock_composition.csv", package = "vtamR")
asv_list <- system.file("extdata/demo/asv_list.csv", package = "vtamR")
- fastq_dir: Directory containing the input fastq files.
- fastqinfo: CSV file describing the input fastq files (samples, replicates, tags, and primers).
- mock_composition: CSV file listing the ASVs expected in the mock samples.
- asv_list: CSV file with ASVs and their asv_ids from earlier data sets. Optional.
Database for taxonomic assignment
Set the path to the database for taxonomic assignment and to the accompanying taxonomy file. For this example, we use a very small database included in the vtamR package. To see how to obtain a full database, visit the TaxAssign reference database section.
When using your own data, just enter the file names without using the system.file function.
taxonomy <- system.file("extdata/db_test/taxonomy_reduced.tsv", package = "vtamR")
blast_db <- system.file("extdata/db_test", package = "vtamR")
blast_db <- file.path(blast_db, "COInr_reduced")
- taxonomy: CSV file with taxonomic information.
- blast_db: BLAST database.
Details are in the Reference database for taxonomic assignments section.
Check the format of input files
The CheckFileinfo function tests whether all obligatory columns are present and performs some sanity checks. This can be helpful, since these files are produced by the user and may contain errors that are difficult to spot by eye.
CheckFileinfo(file=fastqinfo, dir=fastq_dir, file_type="fastqinfo")
#> [1] "Sample names are alphanumerical : OK"
#> [1] "Only IUPAC characters in tag_fw: OK"
#> [1] "Only IUPAC characters in tag_rv: OK"
#> [1] "Only IUPAC characters in primer_fw: OK"
#> [1] "Only IUPAC characters in primer_rv: OK"
#> [1] "Coherence between samples and sample_type: OK"
#> [1] "Coherence between samples and habitat: OK"
#> [1] "sample_type: OK"
#> [1] "Coherence between samples and replicates: OK"
#> [1] "File extension: OK"
#> [1] "Coherence between fw and rv fastq filename: OK"
#> [1] "Unique tag combinations : OK"
CheckFileinfo(file=mock_composition, file_type="mock_composition")
#> [1] "Sample names are alphanumerical : OK"
#> [1] "Only IUPAC characters in ASV: OK"
#> [1] "Action types : OK"
CheckFileinfo(file=asv_list, file_type="asv_list")
#> [1] "Only IUPAC characters in ASV: OK"
#> [1] "1 to 1 relation between asv_id and asv : OK"
Before starting the analyses, check the Tips – Good to Know section. It summarizes several general features and options of vtamR and will help you better understand how to run the functions effectively.
According to your wet-lab and sequencing protocol, each fastq file can contain one or more sample-replicates, and sequences may or may not contain tags (for demultiplexing) and primer sequences.
In this tutorial, the fastq files contain reads from several samples, so read pairs must first be merged (Merge) and then demultiplexed and trimmed (SortReads). See the From fastq to data frame section on how to deal with other experimental designs.
Merge and quality filter
Merge will make a single sequence from each read-pair.
It can also be used for quality filtering.
compress = FALSE: In this tutorial we will use uncompressed files, but see the To compress or not to compress section to find the best strategy for you.
See also Files to keep on how to handle the large intermediate files that should be compressed or deleted once the analyses are complete.
See ?Merge for more options on quality filtering.
The output (fastainfo_df) is an updated version of fastqinfo, where fastq file names have been replaced by fasta file names and the read counts are included for each file.
merged_dir <- file.path(outdir, "merged")
fastainfo_df <- Merge(fastqinfo=fastqinfo,
fastq_dir=fastq_dir, # directory of input fastq files
vsearch_path=vsearch_path, # can be omitted if VSEARCH is in the PATH
outdir=merged_dir, # directory of output fasta files
fastq_maxee=1, # Maximum Expected Error for each read
fastq_maxns=0, # no Ns in output
fastq_allowmergestagger=T, # reads might be longer than the marker
compress=FALSE) # uncompressed output
Randomly select sequences
The RandomSeq function randomly selects n
sequences (without replacement) from each input FASTA file listed in the
fastainfo data frame.
This step can be useful to normalize the number of reads across sequencing libraries, but it is entirely up to you whether to include it in your workflow.
In this tutorial, I use RandomSeq to ensure that all
replicate series (i.e., each FASTA file contains different replicates of
the same samples) contain the same (or similar) number of reads.
There are two different methods to perform random sampling:
Using fastx_subsample command of VSEARCH
(use_vsearch = TRUE). This method is very fast but
only available on Linux-like operating systems.
Using R Only (use_vsearch = FALSE). This version is
slower but works on all operating systems.
The following example uses a slower but cross-platform version. See RandomSeq for more information.
randomseq_dir <- file.path(outdir, "random_seq")
fastainfo_df <- RandomSeq(fastainfo = fastainfo_df, # input df (output of Merge)
fasta_dir = merged_dir, # directory of input fasta files.
outdir = randomseq_dir, # directory of output fasta files.
use_vsearch = FALSE, # use cross-platform version
n = 40000, # select n sequences from each file
compress = FALSE) # uncompressed output
#> Warning: vtamR_demo_out/merged/mfzr_1_fw.fasta contains 37351 sequences.
#> The input file is copied to output.
A warning indicates that the file
mfzr_1_fw.fasta contains fewer than 40,000 sequences. Since
sampling is performed without replacement, this file is simply copied to
the output directory and will contain slightly fewer reads than the
other two files.
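To check the resulting number of reads per file, you can simply print the updated fastainfo data frame (the exact column names may differ between package versions):
knitr::kable(fastainfo_df, format = "markdown")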
Demultiplex and Trim
The SortReads function will demultiplex the fasta files according to the tag combinations and trim
the primers from the reads.
See the help (?SortReads) for setting the correct
parameters for demultiplexing and primer trimming:
- If check_reverse is TRUE, SortReads also checks the reverse complementary strand.
- sampleinfo_df is an updated version of fastainfo_df. This data frame and the files listed in it are the input of Dereplicate.
demultiplexed_dir <- file.path(outdir, "demultiplexed")
sampleinfo_df <- SortReads(fastainfo=fastainfo_df, # input df (output of RandomSeq)
fasta_dir=randomseq_dir, # input directory
outdir=demultiplexed_dir, # output directory
check_reverse=TRUE,
cutadapt_path=cutadapt_path, # can be omitted if CUTADAPT is in the PATH
vsearch_path=vsearch_path, # can be omitted if VSEARCH is in the PATH
cutadapt_minimum_length = 170,
cutadapt_maximum_length = 185,
compress=FALSE)
At this stage, you should have one FASTA file per sample-replicate, containing merged reads with tags and primers removed.
Dereplicate
These FASTA files should now be dereplicated, and
each unique sequence (ASV) will be assigned a
numerical ID (asv_id). The result is a data
frame (read_count_df) that can also be
written to a CSV file if outfile is
specified. Saving this file is recommended, as it serves as the starting
point for filtering. You can test different filtering strategies
starting from this file without needing to re-run the time-consuming merging, demultiplexing,
and trimming steps.
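If you come back to the analysis in a later session, a minimal way to pick up from this file is to read it back into a data frame (base R sketch, assuming it was saved with the default comma separator):
# Resume a later session from the saved dereplication output
read_count_df <- read.csv("vtamR_demo_out/filter/1_before_filter.csv")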
If you wish to retain ASV IDs from a previous analysis, you can
provide a reference file (asv_list) containing
earlier ASVs and their IDs. The complete, updated list of ASVs will then
be written to output_asv_list.
If only output_asv_list is provided (and
input_asv_list is omitted), the function will write the
full list of current ASVs with assigned IDs. This will be useful when
analyzing further data sets to ensure consistent ASV IDs across
studies.
outfile <- file.path(outdir, "filter", "1_before_filter.csv")
updated_asv_list <- file.path(outdir, "ASV_list_with_IDs.csv")
read_count_df <- Dereplicate(sampleinfo_df, # input df (output of SortReads)
dir=demultiplexed_dir, # directory with input fasta files
outfile=outfile, # write the output data frame to this file
input_asv_list=asv_list,
output_asv_list=updated_asv_list)
vtamR has a large number of functions to filter
out ASVs or occurrences of ASVs in a given
sample-replicate, which are the results of technical or
biological problems associated with metabarcoding.
The different filters can be run in any order, and any of them can be skipped. Filters run in parallel on the same input can be combined with the PoolFilters function (accepting only occurrences that pass all filters).
In this section, I give an example using most filtering options. It is up to you to construct your own pipeline.
Each function returns a data frame with the filtered
output. You can also write the results to a csv
file (using the outfile argument). This can be
useful for tracking the presence/absence of some of the ASVs/samples
throughout the analyses using the HistoryBy and
SummarizeBy functions.
I will start the names of the output files with a number to keep track of the order of the different steps and to allow the SummarizeBy function to compute final statistics at each step of the analyses.
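The sketch below is a base-R stand-in (not the built-in HistoryBy or SummarizeBy functions) that checks in which of the numbered filter outputs a given ASV is still present, once several steps have been run. It assumes the saved CSVs contain an asv_id column and use the default comma separator.
filter_files <- list.files(file.path(outdir, "filter"), pattern = "\\.csv$", full.names = TRUE)
asv_of_interest <- 5 # hypothetical asv_id
present <- sapply(filter_files, function(f) asv_of_interest %in% read.csv(f)$asv_id)
data.frame(file = basename(filter_files), present = present)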
We can also follow the evolution of the number of reads and ASVs remaining in the data set after each filtering step. Let's define a data frame (stat_df) that will contain this information, and we will complete it after each step.
stat_df <- data.frame(parameters=character(),
asv_count=integer(),
read_count=integer(),
sample_count=integer(),
sample_replicate_count=integer())
Then we add a line with information on the unfiltered data set.
stat_df <- GetStat(read_count_df, stat_df, stage="Input", params=NA)
- params: String containing the major parameters of each filtering step.
- stage: String referring to the filtering step.
Denoising with SWARM
SWARM (Mahé et al., 2015) is a powerful and quick program to denoise the data set by clustering ASVs that are likely the result of sequencing errors with the ASVs representing real biological sequences. It will considerably reduce the number of ASVs.
SWARM can be run from the ClusterASV function of vtamR. For more detailed information on ClusterASV, see the Clustering or denoising section.
When the aim is to denoise the sequencing reads and not to cluster
valid ASVs to mOTUs, SWARM should be run with d=1 and the
fastidious algorithm. (See details in SWARM’s
Documentation)
It is possible to run SWARM separately for each sample
(by_sample=TRUE) or all at once for the whole data set.
This second option is more efficient in reducing the number of ASVs but
can cluster very similar real sequences appearing in different
samples.
When interested in intra-specific variability, I suggest by_sample=TRUE or skipping this step, since SWARM can cluster together very similar real ASVs.
The group argument pools all ASVs belonging to the same
swarm into one line under the asv_id of the swarm’s
centroid, summing read counts across ASVs while keeping the sample and
replicate structure unchanged.
by_sample <- TRUE
d=1
fastidious= TRUE
outfile <- file.path(outdir, "filter", "2_Swarm_by_sample.csv")
read_count_df <- ClusterASV(read_count_df,# input df
method = "swarm",
swarm_d=d,
fastidious=fastidious,
by_sample=by_sample,
group = TRUE,
path=swarm_path, # can be omitted if SWARM is in the PATH
outfile=outfile) # write the output data frame to this file
stat_df <- GetStat(read_count_df, stat_df, stage="SWARM", params=by_sample)
Let’s check the reduction of the number of variants after running
swarm.
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| Input | NA | 10067 | 116445 | 4 | 12 |
| SWARM | TRUE | 1115 | 116445 | 4 | 12 |
LFNglobalReadCount
Although SWARM has considerably reduced the number of ASVs, there are still many ASVs with low read counts. The LFNglobalReadCount function filters out all ASVs with a total read count below a threshold.
Let's eliminate singletons and see how the numbers of ASVs and reads have changed.
global_read_count_cutoff = 2
outfile <- file.path(outdir, "filter", "3_LFNglobalReadCount.csv")
read_count_df <- LFNglobalReadCount(read_count_df,
cutoff=global_read_count_cutoff,
outfile=outfile)
stat_df <- GetStat(read_count_df,
stat_df,
stage="LFNglobalReadCount",
params=global_read_count_cutoff)
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| Input | NA | 10067 | 116445 | 4 | 12 |
| SWARM | TRUE | 1115 | 116445 | 4 | 12 |
| LFNglobalReadCount | 2 | 150 | 115436 | 4 | 12 |
FilterIndel
This filter is applicable only to coding sequences. The idea is that if the length of an ASV differs from the most frequent ASV length by a number that is not a multiple of 3, it is likely to be an erroneous sequence or a pseudogene.
outfile <- file.path(outdir, "filter", "4_FilterIndel.csv")
read_count_df <- FilterIndel(read_count_df,
outfile=outfile)
stat_df <- GetStat(read_count_df,
stat_df,
stage="FilterIndel")
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| Input | NA | 10067 | 116445 | 4 | 12 |
| SWARM | TRUE | 1115 | 116445 | 4 | 12 |
| LFNglobalReadCount | 2 | 150 | 115436 | 4 | 12 |
| FilterIndel | NA | 146 | 115404 | 4 | 12 |
FilterCodonStop
This filter is applicable only to coding sequences. It checks for STOP codons in all 3 reading frames and eliminates ASVs that have a STOP codon in every frame.
The numerical ID of the genetic code can be chosen from NCBI. By default, it is the invertebrate mitochondrial genetic code (5), since the STOP codons of this genetic code are also STOP codons in almost all other genetic codes.
outfile <- file.path(outdir, "filter", "5_FilterCodonStop.csv")
genetic_code = 5
read_count_df <- FilterCodonStop(read_count_df,
outfile=outfile,
genetic_code=genetic_code)
stat_df <- GetStat(read_count_df,
stat_df,
stage="FilerCodonStop",
params=genetic_code)
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| Input | NA | 10067 | 116445 | 4 | 12 |
| SWARM | TRUE | 1115 | 116445 | 4 | 12 |
| LFNglobalReadCount | 2 | 150 | 115436 | 4 | 12 |
| FilterIndel | NA | 146 | 115404 | 4 | 12 |
| FilerCodonStop | 5 | 146 | 115404 | 4 | 12 |
FilterExternalContaminant
We start from the hypothesis that, in cases of slight external contamination, contaminant DNA may be amplified in negative control samples, since there is no competition with abundant target DNA. Conversely, in real or mock samples, the amplification of contaminant DNA is usually less efficient due to this competition.
This filter identifies ASVs whose read counts are higher in at least one negative control than in any other sample. These ASVs are considered potential contaminants and are removed from the data set.
The conta_file will contain the data corresponding to
the removed ASVs. It is a good idea to check its contents to ensure that
nothing looks abnormal. For example, in our case, 8 ASVs were removed.
However, they all contain only a few reads, so this likely reflects
random low-frequency noise rather than genuine contamination.
conta_file <- file.path(outdir, "filter", "external_contamination.csv")
outfile <- file.path(outdir, "filter", "6_FilterCodonStop.csv")
read_count_df <- FilterExternalContaminant(read_count_df,
sampleinfo=sampleinfo_df,
conta_file=conta_file,
outfile=outfile)
stat_df <- GetStat(read_count_df,
stat_df,
stage="FilterExternalContaminant",
params=NA)
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| Input | NA | 10067 | 116445 | 4 | 12 |
| SWARM | TRUE | 1115 | 116445 | 4 | 12 |
| LFNglobalReadCount | 2 | 150 | 115436 | 4 | 12 |
| FilterIndel | NA | 146 | 115404 | 4 | 12 |
| FilerCodonStop | 5 | 146 | 115404 | 4 | 12 |
| FilterExternalContaminant | NA | 134 | 115296 | 4 | 12 |
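As suggested above, you can read the contamination file back in to review the removed occurrences (base R; the exact column layout may differ between package versions):
conta_df <- read.csv(conta_file)
knitr::kable(head(conta_df), format = "markdown")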
FilterChimera
This function runs the uchime3_denovo algorithm implemented in VSEARCH.
FilterChimera will eliminate ASVs detected as chimeras when running on the whole data set at once (by_sample=FALSE). If by_sample=TRUE, ASVs are eliminated if they have been identified as chimeras in at least a sample_prop proportion of the samples where they are present.
abskew: A chimera must be at least abskew times less frequent than the parental ASVs.
abskew=2
by_sample = T
sample_prop = 0.8
outfile <- file.path(outdir, "filter", "7_FilterChimera.csv")
read_count_df <- FilterChimera(read_count_df,
outfile=outfile,
vsearch_path=vsearch_path, # can be omitted if VSEARCH is in the PATH
by_sample=by_sample,
sample_prop=sample_prop,
abskew=abskew)
params <- paste(abskew, by_sample, sample_prop, sep=";")
stat_df <- GetStat(read_count_df,
stat_df,
stage="FilterChimera",
params=params)
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| Input | NA | 10067 | 116445 | 4 | 12 |
| SWARM | TRUE | 1115 | 116445 | 4 | 12 |
| LFNglobalReadCount | 2 | 150 | 115436 | 4 | 12 |
| FilterIndel | NA | 146 | 115404 | 4 | 12 |
| FilerCodonStop | 5 | 146 | 115404 | 4 | 12 |
| FilterExternalContaminant | NA | 134 | 115296 | 4 | 12 |
| FilterChimera | 2;TRUE;0.8 | 128 | 115270 | 4 | 12 |
FilterRenkonen
Let's eliminate aberrant replicates that are not similar to the other replicates of the same sample.
We will calculate the Renkonen distances among all pairs of replicates within each sample.
renkonen_within_df <- MakeRenkonenDistances(read_count_df, compare_all=FALSE)
Let's make a density plot of the Renkonen distances.
Note: This is a very small test data set, so the graph you generate with your own data will likely look quite different.
DensityPlot_RenkonenDistance(renkonen_within_df)
We can also make a barplot of the Renkonen distances, using different colors for the different sample types (real/mock/negative).
Note:
- sampleinfo_df contains sample and sample_type columns, but any data frame with these columns would do.
- sampleinfo_df is the output of the SortReads function; if sampleinfo_df is not in your environment, just use the sampleinfo file (in the vtamR_demo_out/demultiplexed/ directory).
Barplot_RenkonenDistance(renkonen_within_df, sampleinfo=sampleinfo_df, x_axis_label_size=6)
In this example, there are only a few distance values, so the density
plot is not very informative. The bar plot shows that replicate 1 of the
tnegtag sample (a negative control) differs substantially
from the other two replicates, and setting cutoff = 0.4
appears reasonable as a cutoff for the
FilterRenkonen function.
This command filters out replicates whose Renkonen distances exceed
the cutoff relative to most other replicates of the same sample. In this
case, replicate 1 of the tnegtag sample will be
removed.
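For reference, the fixed-cutoff call would look like the sketch below; it is not run in this tutorial, where the quantile option is used instead, and the cutoff argument name follows the description above.
# Not run: fixed-cutoff alternative to the quantile-based call used below
read_count_df_fixed_cutoff <- FilterRenkonen(read_count_df, cutoff=0.4)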
An alternative way to determine the cutoff value is to set the
renkonen_distance_quantile parameter instead of specifying
a fixed cutoff. This parameter automatically sets the
cutoff at the limit defined by the chosen quantile of the within-sample
distance distribution. For example, setting
renkonen_distance_quantile = 0.9 will set the
cutoff between the 9th and 10th decile (i.e., the 90th
percentile) of the distance distribution.
Let’s choose this option.
outfile <- file.path(outdir, "filter", "8_FilterRenkonen.csv")
renkonen_distance_quantile <- 0.9
read_count_df <- FilterRenkonen(read_count_df,
outfile=outfile,
renkonen_distance_quantile=renkonen_distance_quantile)
#> [1] "The cutoff value for Renkonen distances is 0.295077972709552"
stat_df <- GetStat(read_count_df,
stat_df,
stage="FilerRenkonen",
params=renkonen_distance_quantile)
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| Input | NA | 10067 | 116445 | 4 | 12 |
| SWARM | TRUE | 1115 | 116445 | 4 | 12 |
| LFNglobalReadCount | 2 | 150 | 115436 | 4 | 12 |
| FilterIndel | NA | 146 | 115404 | 4 | 12 |
| FilerCodonStop | 5 | 146 | 115404 | 4 | 12 |
| FilterExternalContaminant | NA | 134 | 115296 | 4 | 12 |
| FilterChimera | 2;TRUE;0.8 | 128 | 115270 | 4 | 12 |
| FilerRenkonen | 0.9 | 128 | 115243 | 4 | 11 |
If you have done the denoising step by SWARM
(ClusterASV), this function is probably redundant.
FilterPCRerror filters out an ASV if it is very similar to another, more frequent ASV. The max_mismatch argument sets how many mismatches are allowed between a pair of ASVs for them to be evaluated for PCR error. If the ratio of the read counts of two similar ASVs is less than or equal to pcr_error_var_prop, the less abundant ASV is flagged as a PCR error.
OptimizePCRerror
The default value of pcr_error_var_prop (0.1) is
arbitrary. To choose a value adapted to your data set, you can use the
OptimizePCRerror function.
OptimizePCRerror will find all highly similar ASV pairs (max_mismatch=1 by default) within a mock sample, where one ASV is expected and the other is not. Their read count ratio is printed in the output file.
The function considers only ASVs with more than
min_read_count reads in the sample to avoid a ratio based
on low read counts, which are more influenced by stochastic events and will probably be filtered out anyway.
See details using the help: ?OptimizePCRerror.
Notes:
I use max_mismatch=2 in this example, to have some output from my very small (and artificial) data set, but I recommend using max_mismatch=1.
outfile <- file.path(outdir, "optimize", "OptimizePCRerror.csv")
OptimizePCRerror_df <- OptimizePCRerror(read_count_df,
mock_composition=mock_composition,
vsearch_path=vsearch_path, # can be omitted if VSEARCH is in the PATH
outfile=outfile,
max_mismatch=2,
min_read_count=5)
Let’s look at the beginning of the output. The lines are sorted in
decreasing order of the pcr_error_var_prop column, so the
first few entries are sufficient to inspect.
knitr::kable(head(OptimizePCRerror_df), format = "markdown")
| sample | expected_read_count | unexpected_read_count | pcr_error_var_prop | expected_asv_id | unexpected_asv_id | expected_asv | unexpected_asv |
|---|---|---|---|---|---|---|---|
| tpos1 | 963 | 11 | 0.0114226 | 5 | 5746 | CTTATATTTTATTTTTGGTGCTTGATCAGGGATAGTGGGAACTTCTTTAAGAATTCTTATTCGAGCTGAACTTGGTCATGCGGGATCTTTAATCGGAGACGATCAAATTTACAATGTAATTGTTACTGCACACGCCTTTGTAATAATTTTTTTTATAGTTATACCTATTTTAATT | CCTTTATTTTATTTTTGGTGCTTGATCAGGGATAGTGGGAACTTCTTTAAGAATTCTTATTCGAGCTGAACTTGGTCATGCGGGATCTTTAATCGGAGACGATCAAATTTACAATGTAATTGTTACTGCACACGCCTTTGTAATAATTTTTTTTATAGTTATACCTATTTTAATT |
| tpos1 | 12169 | 12 | 0.0009861 | 6 | 6022 | TCTATATTTCATTTTTGGTGCTTGGGCAGGTATGGTAGGTACCTCATTAAGACTTTTAATTCGAGCCGAGTTGGGTAACCCGGGTTCATTAATTGGGGACGATCAAATTTATAACGTAATCGTAACTGCTCATGCCTTTATTATGATTTTTTTTATAGTGATACCTATTATAATT | TCTATATTTCATTTTTGGTGCTTGGGCAGGTATGGTAGGTACCTCATTAAGACTTTTAATTCGAGCCGAGTTGGGTAACCCGGGTTCATTAATTGGGGACGATCAAATTTATAACGTAATCGTAACTGATCATGGCTTTATTATGATTTTTTTTATAGTGATACCTATTATAATT |
| tpos1 | 12169 | 12 | 0.0009861 | 6 | 2015 | TCTATATTTCATTTTTGGTGCTTGGGCAGGTATGGTAGGTACCTCATTAAGACTTTTAATTCGAGCCGAGTTGGGTAACCCGGGTTCATTAATTGGGGACGATCAAATTTATAACGTAATCGTAACTGCTCATGCCTTTATTATGATTTTTTTTATAGTGATACCTATTATAATT | TCTATATTTCATTTTTGGTGCTTGGGCAGGTATGGTAGGTACCTCATTAAGACTTTTAATTCGAGCCGAGTTGGGTAACCCGGGTTCATTAATTGGGGACGATCAAATTTATAACGTAATCGTAACTGCTCATGCCTTTATTATGATTTTTTTTATAGTTATACCTATTTTAATT |
It seems that SWARM has done a good job. The highest read count ratio
is 0.011 in the output of OptimizePCRerror. This should be
taken as a lower limit to pcr_error_var_prop.
FilterPCRerror
Let's use 0.05 for filtering out PCR errors sample by sample.
See ?FilterPCRerror for more detail.
Note: Again, I use max_mismatch <- 2 in this example, since I used this value for OptimizePCRerror, but I recommend using max_mismatch <- 1.
pcr_error_var_prop <- 0.05
max_mismatch <- 2
outfile <- file.path(outdir, "filter", "9_FilterPCRerror.csv")
read_count_df <- FilterPCRerror(read_count_df,
outfile=outfile,
vsearch_path=vsearch_path, # can be omitted if VSEARCH is in the PATH
pcr_error_var_prop=pcr_error_var_prop,
max_mismatch=max_mismatch)
params <- paste(pcr_error_var_prop, max_mismatch, by_sample, sep=";")
stat_df <- GetStat(read_count_df,
stat_df,
stage="FilterPCRerror",
params=params)
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| Input | NA | 10067 | 116445 | 4 | 12 |
| SWARM | TRUE | 1115 | 116445 | 4 | 12 |
| LFNglobalReadCount | 2 | 150 | 115436 | 4 | 12 |
| FilterIndel | NA | 146 | 115404 | 4 | 12 |
| FilerCodonStop | 5 | 146 | 115404 | 4 | 12 |
| FilterExternalContaminant | NA | 134 | 115296 | 4 | 12 |
| FilterChimera | 2;TRUE;0.8 | 128 | 115270 | 4 | 12 |
| FilerRenkonen | 0.9 | 128 | 115243 | 4 | 11 |
| FilterPCRerror | 0.05;2;TRUE | 120 | 115188 | 4 | 11 |
LFNsampleReplicate will eliminate occurrences with very
low read counts compared to the total number of reads in the
sample-replicate. The default cutoff proportion is 0.001. We can have an
idea of the maximum value of this cutoff, by examining the proportions
of expected ASVs in the mock samples using the
OptimizeLFNsampleReplicate function.
OptimizeLFNsampleReplicate
Note: If you do not know the exact sequences of the species present in your mock samples, and thus, you do not have the mock_composition file yet, read the How to make a mock_composition file.
outfile = file.path(outdir, "optimize", "OptimizeLFNsampleReplicate.csv")
OptimizeLFNsampleReplicate_df <- OptimizeLFNsampleReplicate(read_count=read_count_df,
mock_composition=mock_composition,
outfile=outfile)
knitr::kable(head(OptimizeLFNsampleReplicate_df), format = "markdown")
| sample | replicate | action | asv_id | lfn_sample_replicate_cutoff | read_count | read_count_sample_replicate | asv | taxon |
|---|---|---|---|---|---|---|---|---|
| tpos1 | 3 | keep | 1 | 0.0047 | 98 | 20458 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC | Caenis pusilla |
| tpos1 | 1 | keep | 3 | 0.0084 | 186 | 22032 | CCTTTATCTTGTATTTGGTGCCTGGGCCGGAATGGTAGGGACCGCCCTAAGCCTTCTTATTCGGGCCGAACTAAGCCAGCCTGGCTCGCTATTAGGTGATAGCCAAATTTATAATGTTATTGTTACCGCCCACGCCTTCGTAATAATTTTCTTTATAGTCATGCCAATTCTCATT | Phoxinus phoxinus |
| tpos1 | 2 | keep | 3 | 0.0087 | 239 | 27374 | CCTTTATCTTGTATTTGGTGCCTGGGCCGGAATGGTAGGGACCGCCCTAAGCCTTCTTATTCGGGCCGAACTAAGCCAGCCTGGCTCGCTATTAGGTGATAGCCAAATTTATAATGTTATTGTTACCGCCCACGCCTTCGTAATAATTTTCTTTATAGTCATGCCAATTCTCATT | Phoxinus phoxinus |
| tpos1 | 2 | keep | 1 | 0.0089 | 246 | 27374 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC | Caenis pusilla |
| tpos1 | 3 | keep | 3 | 0.0094 | 193 | 20458 | CCTTTATCTTGTATTTGGTGCCTGGGCCGGAATGGTAGGGACCGCCCTAAGCCTTCTTATTCGGGCCGAACTAAGCCAGCCTGGCTCGCTATTAGGTGATAGCCAAATTTATAATGTTATTGTTACCGCCCACGCCTTCGTAATAATTTTCTTTATAGTCATGCCAATTCTCATT | Phoxinus phoxinus |
| tpos1 | 2 | keep | 5 | 0.0119 | 326 | 27374 | CTTATATTTTATTTTTGGTGCTTGATCAGGGATAGTGGGAACTTCTTTAAGAATTCTTATTCGAGCTGAACTTGGTCATGCGGGATCTTTAATCGGAGACGATCAAATTTACAATGTAATTGTTACTGCACACGCCTTTGTAATAATTTTTTTTATAGTTATACCTATTTTAATT | Synorthocladius semivirens |
The lines of OptimizeLFNsampleReplicate_df are sorted in increasing order of the lfn_sample_replicate_cutoff column. The lowest proportion of the read count of an expected ASV to the total read count of its sample-replicate is 0.0047 (sequence of Caenis pusilla in replicate 3 of the tpos1 mock sample). A cutoff value for LFNsampleReplicate higher than this would eliminate some of the expected occurrences and create false negatives. Therefore, we choose 0.004 as a cutoff.
Note:
If the different species in the mock community are amplified equally
well, the minimum value of lfn_sample_replicate_cutoff may
be relatively high. In such cases, it is often better to use an
arbitrarily low value (e.g., 0.001–0.005). The optimal threshold depends
strongly on the complexity of your environmental samples: the more taxa
you expect, the lower this cutoff should be.
In practice, strong PCR amplification bias is very common, so situations where all mock species amplify equally well are rare.
When designing mock communities for a metabarcoding experiment, it is advisable to include some species known to amplify poorly. This requires an optimization and testing beforehand, but it ensures that your mock sample reflects realistic amplification biases before you process your actual environmental samples.
LFNsampleReplicate
lfn_sample_replicate_cutoff <- 0.004
outfile <- file.path(outdir, "filter", "10_LFNsampleReplicate.csv")
read_count_df <- LFNsampleReplicate(read_count_df,
cutoff=lfn_sample_replicate_cutoff,
outfile=outfile
)
stat_df <- GetStat(read_count_df,
stat_df,
stage="LFNsampleReplicate",
params=lfn_sample_replicate_cutoff)
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| Input | NA | 10067 | 116445 | 4 | 12 |
| SWARM | TRUE | 1115 | 116445 | 4 | 12 |
| LFNglobalReadCount | 2 | 150 | 115436 | 4 | 12 |
| FilterIndel | NA | 146 | 115404 | 4 | 12 |
| FilerCodonStop | 5 | 146 | 115404 | 4 | 12 |
| FilterExternalContaminant | NA | 134 | 115296 | 4 | 12 |
| FilterChimera | 2;TRUE;0.8 | 128 | 115270 | 4 | 12 |
| FilerRenkonen | 0.9 | 128 | 115243 | 4 | 11 |
| FilterPCRerror | 0.05;2;TRUE | 120 | 115188 | 4 | 11 |
| LFNsampleReplicate | 0.004 | 51 | 113238 | 4 | 11 |
FilterMinReplicate
To ensure repeatability, occurrences are retained only if they appear in at least min_replicate_number replicates of the sample (2 by default).
min_replicate_number <- 2
outfile <- file.path(outdir, "filter", "11_FilterMinReplicate.csv")
read_count_df <- FilterMinReplicate(read_count_df,
cutoff=min_replicate_number,
outfile=outfile
)
stat_df <- GetStat(read_count_df,
stat_df,
stage="FilterMinReplicate",
params=min_replicate_number
)
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| Input | NA | 10067 | 116445 | 4 | 12 |
| SWARM | TRUE | 1115 | 116445 | 4 | 12 |
| LFNglobalReadCount | 2 | 150 | 115436 | 4 | 12 |
| FilterIndel | NA | 146 | 115404 | 4 | 12 |
| FilerCodonStop | 5 | 146 | 115404 | 4 | 12 |
| FilterExternalContaminant | NA | 134 | 115296 | 4 | 12 |
| FilterChimera | 2;TRUE;0.8 | 128 | 115270 | 4 | 12 |
| FilerRenkonen | 0.9 | 128 | 115243 | 4 | 11 |
| FilterPCRerror | 0.05;2;TRUE | 120 | 115188 | 4 | 11 |
| LFNsampleReplicate | 0.004 | 51 | 113238 | 4 | 11 |
| FilterMinReplicate | 2 | 32 | 111797 | 4 | 11 |
The LFNvariant filter removes occurrences with very low
read counts relative to the total number of reads of their ASV. By
default, the cutoff proportion is 0.001.
This filter is designed to remove spurious detections caused by tag-jumps or light cross-sample contamination.
The LFNreadCount filter, on the other hand, removes
occurrences with absolute read counts below a fixed threshold, which is
set to 10 by default.
Choosing the best cutoff values
The function OptimizeLFNreadCountLFNvariant helps to
determine the most appropriate cutoff values for both filters. It
evaluates a range of cutoff combinations and counts the resulting number
of false positives, false
negatives, and true positives.
To run this optimization, you need a known_occurrences_df data
frame that identifies occurrences known to be TPs (true
positives) or FPs (false positives).
There are two ways to obtain the known_occurrences_df
data frame:
1. Full control
You can generate it manually by running the MakeKnownOccurrences
function. You may then review or edit the resulting
known_occurrences_df as needed before passing it as an
input to OptimizeLFNreadCountLFNvariant.
2. Automatic generation
Alternatively, you can run
OptimizeLFNreadCountLFNvariant and set
known_occurrences = NULL (the default), but provide values
for the mock_composition, sampleinfo, and
habitat_proportion arguments. In this case,
OptimizeLFNreadCountLFNvariant will automatically call
MakeKnownOccurrences and use its output to estimate the
optimal cutoff values.
All data frames generated by OptimizeLFNreadCountLFNvariant, and by the MakeKnownOccurrences call it runs internally, are written as CSV files to the specified outdir.
The main output is OptimizeLFNreadCountLFNvariant_df, which summarizes the performance of the different cutoff combinations. This table is also saved in outdir.
The rows are ordered so that the first line corresponds to the parameter combination that yields the lowest number of false negatives and false positives, while also using the least stringent values for lfn_variant_cutoff and lfn_read_count_cutoff.
Let’s use the second approach, where the
known_occurrences_df is generated
automatically by
OptimizeLFNreadCountLFNvariant. If you prefer to create it
manually, see the MakeKnownOccurrences section.
Here we will use the default range of cutoffs to test, but you can
set the minimum, the maximum and the increment for the cutoff values for
both filters. (see ?OptimizeLFNreadCountLFNvariant). We set
min_replicate_number to 2, to eliminate non-repeatable
occurrences among the three replicates of each sample. This is the
equivalent of FilterMinReplicate filter.
optimize_dir <- file.path(outdir, "OptimizeLFNreadCountLFNvariant")
OptimizeLFNreadCountLFNvariant_df <- OptimizeLFNreadCountLFNvariant(
read_count_df,
mock_composition = mock_composition,
sampleinfo = sampleinfo_df,
habitat_proportion = 0.5,
known_occurrences=NULL,
outdir=optimize_dir, # use a separate directory so the main outdir variable is not overwritten
min_replicate_number=2
)
knitr::kable(head(OptimizeLFNreadCountLFNvariant_df), format = "markdown")
| lfn_read_count_cutoff | lnf_variant_cutoff | FN | TP | FP |
|---|---|---|---|---|
| 10 | 0.001 | 0 | 6 | 4 |
| 15 | 0.001 | 0 | 6 | 4 |
| 20 | 0.001 | 0 | 6 | 4 |
| 25 | 0.001 | 0 | 6 | 4 |
| 30 | 0.001 | 0 | 6 | 4 |
| 35 | 0.001 | 0 | 6 | 4 |
LFNvariant, LFNreadCount
From the output, we choose 0.001 as the cutoff for LFNvariant and 10 for LFNreadCount, since this is the least stringent combination that keeps all expected occurrences (6 TP, 0 FN) and has the fewest FP (4).
We will run the two filters on the same input data frame (for which the parameters have been optimized), and pool the results by accepting only occurrences that pass both filters with the PoolFilters function.
See ?LFNvariant, ?LFNreadCount and
?PoolFilters for details.
LFNvariant
lfn_variant_cutoff = 0.001
outfile <- file.path(outdir, "filter", "12_LFNvariant.csv")
read_count_df_lfn_variant <- LFNvariant(read_count_df,
cutoff=lfn_variant_cutoff,
outfile=outfile)
stat_df <- GetStat(read_count_df_lfn_variant,
stat_df,
stage="LFNvariant",
params=lfn_variant_cutoff)
LFNreadCount
lfn_read_count_cutoff <- 10
outfile <- file.path(outdir, "filter", "13_LFNreadCount.csv")
read_count_df_lfn_read_count <- LFNreadCount(read_count_df,
cutoff=lfn_read_count_cutoff,
outfile=outfile)
stat_df <- GetStat(read_count_df_lfn_read_count,
stat_df, stage="LFNreadCount",
params=lfn_read_count_cutoff)
Combine results
outfile <- file.path(outdir, "filter", "14_poolLFN.csv")
read_count_df <- PoolFilters(read_count_df_lfn_read_count,
read_count_df_lfn_variant,
outfile=outfile)
stat_df <- GetStat(read_count_df,
stat_df,
stage="FilterLFN")
# delete temporary data frames
rm(read_count_df_lfn_read_count)
rm(read_count_df_lfn_variant)
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| Input | NA | 10067 | 116445 | 4 | 12 |
| SWARM | TRUE | 1115 | 116445 | 4 | 12 |
| LFNglobalReadCount | 2 | 150 | 115436 | 4 | 12 |
| FilterIndel | NA | 146 | 115404 | 4 | 12 |
| FilerCodonStop | 5 | 146 | 115404 | 4 | 12 |
| FilterExternalContaminant | NA | 134 | 115296 | 4 | 12 |
| FilterChimera | 2;TRUE;0.8 | 128 | 115270 | 4 | 12 |
| FilerRenkonen | 0.9 | 128 | 115243 | 4 | 11 |
| FilterPCRerror | 0.05;2;TRUE | 120 | 115188 | 4 | 11 |
| LFNsampleReplicate | 0.004 | 51 | 113238 | 4 | 11 |
| FilterMinReplicate | 2 | 32 | 111797 | 4 | 11 |
| LFNvariant | 0.001 | 32 | 111733 | 4 | 11 |
| LFNreadCount | 10 | 31 | 111762 | 4 | 11 |
| FilterLFN | NA | 31 | 111714 | 4 | 10 |
Let's run FilterMinReplicate again to ensure repeatability among the replicates of each sample (2 by default).
min_replicate_number <- 2
outfile <- file.path(outdir, "filter", "15_FilterMinReplicate.csv")
read_count_df <- FilterMinReplicate(read_count_df,
cutoff=min_replicate_number,
outfile=outfile)
stat_df <- GetStat(read_count_df,
stat_df,
stage="FilterMinReplicate",
params=min_replicate_number)
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| Input | NA | 10067 | 116445 | 4 | 12 |
| SWARM | TRUE | 1115 | 116445 | 4 | 12 |
| LFNglobalReadCount | 2 | 150 | 115436 | 4 | 12 |
| FilterIndel | NA | 146 | 115404 | 4 | 12 |
| FilerCodonStop | 5 | 146 | 115404 | 4 | 12 |
| FilterExternalContaminant | NA | 134 | 115296 | 4 | 12 |
| FilterChimera | 2;TRUE;0.8 | 128 | 115270 | 4 | 12 |
| FilerRenkonen | 0.9 | 128 | 115243 | 4 | 11 |
| FilterPCRerror | 0.05;2;TRUE | 120 | 115188 | 4 | 11 |
| LFNsampleReplicate | 0.004 | 51 | 113238 | 4 | 11 |
| FilterMinReplicate | 2 | 32 | 111797 | 4 | 11 |
| LFNvariant | 0.001 | 32 | 111733 | 4 | 11 |
| LFNreadCount | 10 | 31 | 111762 | 4 | 11 |
| FilterLFN | NA | 31 | 111714 | 4 | 10 |
| FilterMinReplicate1 | 2 | 30 | 111594 | 3 | 9 |
Run MakeKnownOccurrences again to get performance metrics (TP, FP, FN). We will also write the output data frames to files. The performance_metrics file gives you the counts of TP, FP, and FN, together with Precision and Sensitivity. You can find false negatives in missing_occurrences, and true and false positives in known_occurrences.
missing_occurrences <- file.path(outdir, "Missing_occurrences.csv")
performance_metrics <- file.path(outdir, "Performance_metrics.csv")
known_occurrences <- file.path(outdir, "Known_occurrences.csv")
results <- MakeKnownOccurrences(read_count_df,
sampleinfo=sampleinfo_df,
mock_composition=mock_composition,
known_occurrences=known_occurrences,
missing_occurrences=missing_occurrences,
performance_metrics=performance_metrics)
# give explicit names to the 3 output data frames
known_occurrences_df <- results[[1]]
missing_occurrences_df <- results[[2]]
performance_metrics_df <- results[[3]]
knitr::kable(performance_metrics_df, format = "markdown")
| TP | FP | FN | Precision | Sensitivity |
|---|---|---|---|---|
| 6 | 4 | 0 | 0.6 | 1 |
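In this example, Precision = TP / (TP + FP) = 6 / 10 = 0.6 and Sensitivity = TP / (TP + FN) = 6 / 6 = 1, which matches the standard definitions of these metrics.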
Taxonomic assignment
If you have already assigned ASVs to taxa (e.g., before making the mock_composition file), you can reuse that output (the asv_tax data frame) and skip this step.
The TaxAssignLTG function is a BLAST-based Last Common Ancestor method. See the brief description of the algorithm with ?TaxAssignLTG.
For the format of taxonomy and blast_db, check the Reference database for taxonomic assignments section. You can also download a ready-to-use COI database (see Installation).
outfile <- file.path(outdir, "TaxAssignLTG.csv")
asv_tax <- TaxAssignLTG(asv=read_count_df,
taxonomy=taxonomy,
blast_db=blast_db,
blast_path=blast_path, # can be omitted if BLAST is in the PATH
outfile=outfile)
Clustering ASVs into mOTUs
If you are not interested in intra-specific variation, the ASVs can be clustered into mOTUs.
The ClusterASV function supports two algorithms for clustering ASVs:
method = "vsearch": This approach uses VSEARCH’s
cluster_size command, a greedy, centroid-based algorithm
that ranks ASVs by decreasing read count. As a result, cluster centroids
correspond to the most abundant ASVs. The cluster radius is controlled
by the identity parameter.
method = "swarm": SWARM is a fast, iterative
clustering method that groups ASVs differing by d or fewer
nucleotides. Clusters form around local abundance peaks and are largely
unaffected by the order of input sequences.
The main parameters are identity for VSEARCH and
swarm_d for SWARM. In this tutorial, we’ll focus on
determining an appropriate identity threshold for
clustering with VSEARCH, but a similar approach applies to SWARM.
To choose a suitable threshold, you can visualize the distribution of
pairwise identities between ASVs—comparing pairs within the same cluster
versus those from different clusters—using the function
PairwiseIdentityPlotPerClusterIdentityThreshold (for
VSEARCH). If you’re clustering with SWARM, see
?PairwiseIdentityPlotPerSwarmD for the equivalent plot.
For more detailed information, see the Clustering or denoising section.
outfile <- file.path(outdir, "cluster", "pairwise_identities_vsearch.csv")
plotfile <- file.path(outdir, "cluster", "PairwiseIdentityPlotPerClusterIdentityThreshold.png")
plot_vsearch <- PairwiseIdentityPlotPerClusterIdentityThreshold(
read_count_df,
identity_min=0.9,
identity_max=0.99,
identity_increment=0.01,
min_id = 0.8,
vsearch_path=vsearch_path, # can be omitted if VSEARCH is in the PATH
outfile=outfile,
plotfile=plotfile)
print(plot_vsearch)
Note: This graph is based on a very
small test data set and is therefore not informative. Below is
the graph obtained using the complete (though still relatively small)
original data set.
Clusters can also be classified based on the taxonomic assignments of the ASVs they contain:
- closed: All ASVs in the cluster are assigned to the same taxon, and every ASV belonging to that taxon is found exclusively within this cluster.
- open: All ASVs in the cluster share the same taxonomic assignment, but additional ASVs of that taxon appear in other clusters.
- hybrid: The cluster includes ASVs assigned to multiple taxa.
This classification can be performed after clustering ASVs across a
range of parameter values for identity (VSEARCH) or
swarm_d (SWARM). The number of clusters in each category
can then be plotted for each parameter setting. A clustering parameter
that maximizes the number of closed clusters while
minimizing the number of open and hybrid ones
is often a good candidate for defining mOTUs.
(See ?PlotClusterClassification for details on setting
parameters when using SWARM.)
outfile <- file.path(outdir, "cluster", "classification_vsearch.csv")
plotfile <- file.path(outdir, "cluster", "PlotClusterClasstification_vsearch.png")
scatterplot_vsearch <- PlotClusterClasstification(
read_count=read_count_df,
taxa=asv_tax,
clustering_method="vsearch",
cluster_params=c(0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99),
vsearch_path=vsearch_path, # can be omitted if VSEARCH is in the PATH
taxlevels= c("species", "genus"),
outfile=outfile,
plotfile=plotfile)
print(scatterplot_vsearch)
Note: again, this graph is based on a very
small test data set and is therefore not informative. Below is
the graph obtained using the complete (though still relatively small)
original data set.
Even the original data set is relatively small, and the graphs do not show strong differences between parameter values. Nevertheless, several conclusions can be drawn:
- The number of closed clusters at the species level is highest between 0.94 and 0.97, while the numbers of open and hybrid clusters remain relatively low in this range.
ClusterASV
The ClusterASV function can perform clustering using either VSEARCH (method = "vsearch") or SWARM (method = "swarm").
The output format is controlled by the group
argument:
- group = TRUE: ASVs belonging to the same cluster are aggregated into a single row. In this case, the asv_id and asv columns contain the identifier and sequence of the cluster's centroid. Read counts are summed across ASVs, while the sample and replicate structure remains unchanged.
- group = FALSE: The function returns the original ASV table with an additional cluster_id column, so each row still represents a single ASV.
Clustering can be performed on the entire data set or separately for each sample. When the goal is to define mOTUs, it is recommended to cluster all ASVs together (by_sample = FALSE).
Attention! If clustering is done by sample (by_sample = TRUE) and group = FALSE, the same asv_id can have different cluster_id values in different samples, so this combination is not recommended.
Below, we’ll cluster the data using VSEARCH with an identity
threshold of 0.97. (See ?ClusterASV for
guidance on adjusting parameters when using SWARM.)
outfile <- file.path(outdir, "filter", "16_mOTU_vsearch.csv")
identity <- 0.97
read_count_samples_df_ClusterSize <- ClusterASV(
read_count_df,
method = "vsearch",
identity = identity, # 0.97, defined above
group = TRUE,
by_sample=FALSE,
path=vsearch_path, # can be omitted if VSEARCH is in the PATH
outfile=outfile)
stat_df <- GetStat(read_count_samples_df_ClusterSize,
stat_df,
stage="ClusterASV_vsearch",
params=identity)
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| Input | NA | 10067 | 116445 | 4 | 12 |
| SWARM | TRUE | 1115 | 116445 | 4 | 12 |
| LFNglobalReadCount | 2 | 150 | 115436 | 4 | 12 |
| FilterIndel | NA | 146 | 115404 | 4 | 12 |
| FilerCodonStop | 5 | 146 | 115404 | 4 | 12 |
| FilterExternalContaminant | NA | 134 | 115296 | 4 | 12 |
| FilterChimera | 2;TRUE;0.8 | 128 | 115270 | 4 | 12 |
| FilerRenkonen | 0.9 | 128 | 115243 | 4 | 11 |
| FilterPCRerror | 0.05;2;TRUE | 120 | 115188 | 4 | 11 |
| LFNsampleReplicate | 0.004 | 51 | 113238 | 4 | 11 |
| FilterMinReplicate | 2 | 32 | 111797 | 4 | 11 |
| LFNvariant | 0.001 | 32 | 111733 | 4 | 11 |
| LFNreadCount | 10 | 31 | 111762 | 4 | 11 |
| FilterLFN | NA | 31 | 111714 | 4 | 10 |
| FilterMinReplicate1 | 2 | 30 | 111594 | 3 | 9 |
| ClusterASV_vsearch | 0.97 | 26 | 111594 | 3 | 9 |
The WriteASVtable function restructures the
read_count_df data frame so that samples are
represented in columns and ASVs (or mOTU
centroids) appear in rows—converting the table from long format to wide
format.
Additional information can be included in the output:
- taxonomic assignments (asv_tax)
- read count sums for each ASV (add_sums_by_asv)
- read count sums for each sample (add_sums_by_sample)
- expected ASVs of the mock samples (add_expected_asv)
- samples that have lost all their reads during filtering (add_empty_samples)
If pool_replicates = TRUE, replicates within each sample are pooled, and the read count for each sample is calculated as the mean of non-zero read counts across replicates.
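As a small illustration of this non-zero mean (not the internal implementation), a sample whose three replicates contain 100, 120, and 0 reads of a given ASV would get a pooled value of 110:
x <- c(100, 120, 0) # read counts of one ASV in the three replicates of a sample
mean(x[x > 0]) # pooled read count: 110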
For further details, see ?WriteASVtable.
outfile=file.path(outdir, "Final_asvtable_with_TaxAssignLTG.csv")
asv_table_df <- WriteASVtable(read_count_df,
outfile=outfile,
asv_tax=asv_tax,
sampleinfo=sampleinfo_df,
pool_replicates=TRUE,
add_empty_samples=T,
add_sums_by_sample=T,
add_sums_by_asv=T,
add_expected_asv=T,
mock_composition=mock_composition)
Print the number of reads, ASVs, samples and replicates after each step to a CSV file.
write.csv(stat_df, file = file.path(outdir, "Summary.csv"))
Tips – Good to Know
Most vtamR functions can accept either a data
frame or a CSV file as input. This is a
convenient feature that allows you to resume your analysis without
rerunning previous steps — for example, if you have closed your R
session but saved the data frame to a file.
Below is a non-exhaustive list of arguments that can take either a data frame or a CSV file, as long as they contain the required columns:
- read_count
- fastqinfo
- fastainfo
- sampleinfo
- input_asv_list
- mock_composition
- asv
- ltg_params
- asv_tax
See the help of the functions to check for the required columns: ?function_name
Similarly, many functions return a data frame, and if outfile is specified at the same time, the function will also write the data frame to a CSV file.
Saving outputs as CSV files is a good practice for long workflows, as it allows you to inspect the data set after each filtering step and resume processing later if needed.
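For instance, since the read_count argument accepts a file as well as a data frame, a filter can be pointed directly at a CSV written by an earlier step (a sketch reusing the file names from this tutorial):
read_count_df <- LFNglobalReadCount("vtamR_demo_out/filter/1_before_filter.csv",
cutoff=2,
outfile="vtamR_demo_out/filter/3_LFNglobalReadCount.csv")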
To compress or not to compress
During the first steps of the workflow (Merge, RandomSeq, TrimPrimers, SortReads), many intermediate files are produced. Some of these can be very large, especially before dereplication (Dereplicate).
File Size vs. Processing Time
Compressing and decompressing files between steps can take
significant time. For this reason, vtamR outputs
uncompressed files by default to improve speed. However, this
can quickly consume disk space, so remember to delete
intermediate files once your analysis is complete. See Files to keep for guidance on which files to
retain.
If storage space is limited, you can enable file compression by
setting compress = TRUE in the relevant functions (e.g.,
RandomSeq, TrimPrimers,
SortReads). This will produce gzip-compressed
output files.
Compression Methods
The compression and decompression method is defined by the argument
compress_method. Three options are available:
"pigz" – Uses the PIGZ
program (parallel gzip). Requires installation and, if not in the system
PATH, its location must be provided via
pigz_path. Available on Linux-like systems and Windows.
Uses multithreading, making it the fastest
option.
"gzip" – Uses the standard
gzip command available on Linux-like systems.
Slower than pigz and not available
on Windows.
"R" – Fully implemented in R via
the R.utils package.
Cross-platform, works everywhere, but is the
slowest method.
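For example, to compress a file yourself with the R-only method, a minimal sketch (assuming the R.utils package is installed; the file name is taken from this tutorial) is:
# Compress the merged fasta file; remove = TRUE deletes the uncompressed original
R.utils::gzip("vtamR_demo_out/merged/mfzr_1_fw.fasta", remove = TRUE)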
Example: Compression Speed Comparison
The table below shows an example of compression and decompression times for a 19 GB FASTA file (compressed size: 3 GB) on a Linux system.
| Method | Compression (min) | Decompression (min) |
|---|---|---|
| R.utils | 16 | 3 |
| gzip | 13 | 3 |
| pigz | 3 | 2 |
Summary: PIGZ is clearly the best option.
Notes on Compatibility
The effect of compression may vary depending on which external programs are used.
Clustering or denoising
The ClusterASV function supports two algorithms for grouping ASVs:
method = "vsearch"
Uses VSEARCH’s cluster_size command, a greedy
centroid-based algorithm. ASVs are sorted by decreasing read count, so
the most abundant ASVs become cluster centroids. The cluster radius is
controlled by the identity parameter.
method = "swarm"
Uses SWARM, a fast, iterative algorithm that clusters ASVs differing
by d or fewer nucleotides. Clusters naturally form around
local abundance peaks and are mostly insensitive to input order.
Although both approaches perform clustering, they are typically used to achieve two different goals.
1. Denoising
SWARM was originally developed as a denoising method for metabarcoding data sets. The underlying assumption is:
True biological ASVs tend to have high read counts, while very similar ASVs with lower read counts are likely sequencing errors.
With this in mind, denoising attempts to group erroneous sequences into the cluster of their true ASV.
Using SWARM with d = 1 and with the fastidious
option activated corresponds to this denoising aim. The output still
represents ASV-level diversity, not mOTUs.
Limitation: If two real biological variants differ by only one mutation, they may be merged into the same cluster, causing a loss of true variability. However, SWARM denoising is extremely powerful for removing spurious sequences and is recommended early in the bioinformatic pipeline — unless very fine-scale intrataxon variation is the focus (e.g., biogeographic studies).
2. Clustering
In contrast, clustering traditionally aims to group both correct and erroneous ASVs into larger units approximating biological taxa (often species or genera).
The simplest clustering approaches rely on a fixed identity
threshold. This is the case for the VSEARCH
cluster_size method.
Although clustering is not SWARM’s original purpose, larger
d values can also produce clusters that approximate
taxonomic units.
Choosing an appropriate threshold is difficult, since it depends on
both the genetic marker and the taxonomic group. To guide this choice,
vtamR provides tools to examine clustering results across
multiple thresholds.
1. Pairwise Identity Plots
For a series of thresholds, vtamR can produce density
plots of pairwise identities:
PairwiseIdentityPlotPerClusterIdentityThreshold()
(for VSEARCH thresholds)
PairwiseIdentityPlotPerSwarmD() (for SWARM
d values)
These plots show how ASV similarities vary within and between clusters.
2. Cluster Classification
Clusters can also be classified based on their taxonomic assignments:
The function PlotClusterClassification() displays the
number of clusters in each class for a range of thresholds.
Interpreting the Plots
Together, these visualisations help identify an appropriate clustering threshold for your specific dataset. They allow you to evaluate:
- how identity thresholds or SWARM d values influence cluster structure.
By default, vtamR automatically uses all
available CPU cores for multithreaded processes. If you want to
take full advantage of your computer’s resources, there is no
need to specify the num_threads argument.
However, if you wish to limit the number of cores used (for example, to leave resources available for other tasks or users), you can specify this parameter manually in the functions.
Example: This will restrict vtamR to using 8 CPU threads when running Merge.
fastainfo_df <- Merge(fastqinfo,
fastq_dir = fastq_dir,
vsearch_path = vsearch_path, # can be omitted if VSEARCH is in the PATH
outdir = merged_dir,
num_threads = 8)
Some vtamR functions rely on third-party programs (e.g., VSEARCH, CUTADAPT, BLAST, SWARM, PIGZ). To run these tools, vtamR needs to know where to find their executables.
If these programs are already available in your system PATH, you don’t need to specify their locations when calling vtamR functions.
However, if they are not in your PATH, you must
provide their paths explicitly using the corresponding arguments:
vsearch_path, cutadapt_path,
pigz_path, blast_path, or
swarm_path.
To check which external programs a function uses, refer to its help
page (?function_name).
Some external programs produce a large amount of output on the screen, even when run in their own “quiet” mode. This can make it difficult to see important warnings or error messages.
The quiet argument controls whether this output is
displayed. By default, quiet = TRUE, which suppresses most
messages for cleaner logs.
If something goes wrong, re-run the function with
quiet = FALSE to view the full output and help identify the
problem.
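For example, to troubleshoot the merging step you could re-run it with the full third-party output shown, assuming Merge (which calls VSEARCH) accepts the quiet argument as described above:
fastainfo_df <- Merge(fastqinfo=fastqinfo,
fastq_dir=fastq_dir,
vsearch_path=vsearch_path, # can be omitted if VSEARCH is in the PATH
outdir=merged_dir,
quiet=FALSE) # show the full VSEARCH output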
sep: Separator used in CSV files. Comma by default. Use the same separator in all CSV files, except for the taxonomy file of the reference database, which is tab-separated.
Files to keep
During the first steps (Merge, RandomSeq, SortReads, TrimPrimers), many files are produced, and some of them can be very large, especially before dereplication (Dereplicate).
Compressing and decompressing between the different steps can take a while. Therefore, vtamR uses uncompressed files by default. This behavior can be changed with the compress option.
Here is a list of files I would tend to keep, delete, or compress.
- Merge: Delete the output fasta files, but keep fastainfo.csv for information on read counts after Merge.
- RandomSeq: Compress the fasta files, but keep them to ensure reproducibility in case of re-analysis.
- SortReads: Delete the output fasta files, but keep sampleinfo.csv for information on read counts after SortReads.
- Dereplicate: Write the dereplicated info to a file (1_before_filter.csv in the tutorial) and compress it after the analyses. This is an important file, since you can restart the analyses from here without re-doing the longest merging and demultiplexing steps. Compress and keep the updated ASV list (ASV_list_with_IDs.csv in the tutorial), since this can be used when analyzing subsequent data sets to synchronize the asv_ids.
RandomSeq
The RandomSeq function randomly selects n
sequences (without replacement) from each input FASTA file listed in the
fastainfo data frame.
This step can be useful to normalize the number of reads across
sequencing libraries. RandomSeq can be applied before or
after demultiplexing (SortReads).
However, if your workflow includes LFNvariant or
LFNreadCount, which partially rely on the total number of
reads in negative controls, standardizing read numbers across samples is
not recommended. For this reason, it is generally best to use
RandomSeq after Merge and before
SortReads, in order to ensure that all replicate series
(i.e., the same samples with different replicates in separate FASTA
files) contain the same number of reads.
There are two different methods to perform random sampling:
1. Using VSEARCH
(use_vsearch = TRUE)
Random sampling can be done with the fastx_subsample
command from VSEARCH. This method is very fast but only available on
Linux-like operating systems.
If compression of input/output files is required, three compression methods are supported:
- pigz: Fastest method. Requires the PIGZ program (if not in your PATH, specify its location via pigz_path).
- gzip: Slower method. Works only on Linux-like systems.
- R: Slowest but cross-platform method. Uses the R.utils package, no external dependencies.
2. Using R Only
(use_vsearch = FALSE)
Random sampling can also be performed entirely in R. This version is slower but works on all operating systems and does not depend on any third-party software for file compression or decompression.
- fastainfo: A CSV file or data frame containing metadata with the following columns: tag_fw, primer_fw, tag_rv, primer_rv, sample, sample_type, habitat, replicate, fasta.
- fasta_dir: Directory containing the input FASTA files.
- n: Number of sequences to sample (without replacement).
- outdir: Directory where the output FASTA files will be written.
Fast version (Linux-like systems)
The following example uses VSEARCH and PIGZ for fast compression:
randomseq_dir <- file.path(outdir, "random_seq")
fastainfo_df <- RandomSeq(
fastainfo = fastainfo_df,
fasta_dir = merged_dir,
outdir = randomseq_dir,
use_vsearch = TRUE,
vsearch_path = vsearch_path, # can be omitted if VSEARCH is in the PATH
compress_method = "pigz",
pigz_path = pigz_path,
n = 10000,
compress = TRUE)
This example runs entirely in R and works on any operating system:
randomseq_dir <- file.path(outdir, "random_seq")
fastainfo_df <- RandomSeq(
fastainfo = fastainfo_df,
fasta_dir = merged_dir,
outdir = randomseq_dir,
use_vsearch = FALSE,
n = 10000,
compress = TRUE)
The MakeKnownOccurrences function will identify false positive, false
negative and true positive occurrences in
control samples (mock and negative). Some false positives can also be
identified in real samples if samples of different habitats are included
in the data sets.
The false positive and the true positive occurrences are written to
the known_occurrences data frame (or file), false negatives
to the missing_occurrences data frame (or file), and a
performance_metrics data frame is also produced with the
count of these occurrences. For details see
?MakeKnownOccurrences.
This function takes a read_count_df data frame as an
input.
Note:
- The sampleinfo input should contain sample, sample_type, replicate and habitat columns, but any data frame with these columns would do.
- sampleinfo_df was produced by the SortReads function; if sampleinfo_df is not in your environment, just use the sampleinfo file (in the vtamR_demo_out/demultiplexed directory).
# Detect known occurrences
results <- MakeKnownOccurrences(read_count = read_count_df,
sampleinfo=sampleinfo_df,
mock_composition=mock_composition)
# give explicit names to the 3 output data frames
known_occurrences_df <- results[[1]]
missing_occurrences_df <- results[[2]]
performance_metrics_df <- results[[3]]
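For a quick overview of the results, the performance metrics can be displayed, for example:
knitr::kable(performance_metrics_df, format = "markdown")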
In some data sets, certain ASVs may occur in many samples with high
read counts.
As a result, reads originating from tag-jump or inter-sample contamination may remain after applying a fixed cutoff value in the
LFNvariant function.
The ASVspecificCutoff function computes ASV-specific
cutoff values for use in LFNvariant,
targeting only ASVs known to have false-positive occurrences (as
identified by
MakeKnownOccurrences).
For each ASV with known false-positive occurrences, the function computes a specific cutoff value; the calculation can be done per replicate by setting by_replicate = TRUE.
Because some false positives may not result from tag-jump contamination, the resulting cutoff values can sometimes be excessively high. Therefore, cutoffs are capped at the value given by the max_cutoff parameter.
asv_spec <- ASVspecificCutoff(read_count_df,
max_cutoff=0.05,
mock_composition=mock_composition_df)
The command above produces a data frame (asv_spec)
containing
asv_id and cutoff_asv_spec columns. Since
max_cutoff = 0.05, any cutoff_asv_spec values
greater than 0.05 have been adjusted to 0.05.
The asv_spec data frame contains cutoff values only for
ASVs with known false-positive occurrences.
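To check the resulting cutoff values, the asv_spec data frame can be displayed, for example:
knitr::kable(head(asv_spec), format = "markdown")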
These ASV-specific cutoff values can then be used with the
LFNvariant function to filter occurrences. If
LFNvariant has already been applied with a fixed
cutoff, it is sufficient to provide the
asv_spec data frame. The filtering will be applied only to
the variants listed in asv_spec.
read_count_df <- LFNvariant(read_count_df,
asv_specific_cutoffs = asv_spec,
min_read_count_prop=0.7)
The above command issues a warning if, for some variants, less than min_read_count_prop of the reads are retained after filtering. This indicates that the cutoff value for those variants was likely too high and should be lowered.
It is also possible to use both the asv_specific_cutoff
and cutoff arguments in LFNvariant. In this
case, the global cutoff is applied to all ASVs that do not
have an ASV-specific cutoff defined in asv_spec, as well as
to ASVs whose ASV-specific cutoff is lower than the global
cutoff.
read_count_df <- LFNvariant(read_count_df,
cutoff=0.002,
asv_specific_cutoffs = asv_spec,
min_read_count_prop=0.7)
Taxonomic assignment using the RDP classifier
For 16S rRNA data, the RDP (Ribosomal
Database Project) provides a curated and well-maintained
taxonomic reference database. The pre-trained set for
Bacteria and Archaea 16S sequences is
included in the rRDPData package, while the
rRDP package performs rapid taxonomic assignment of these
markers.
Installing rRDP and
rRDPData
These packages are available through Bioconductor,
but are not installed automatically when installing vtamR.
You can install them as follows:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("rRDP")
BiocManager::install("rRDPData")
Running the RDP classifier via
vtamR
The TaxAssignRDP() function in vtamR is a wrapper around
the RDP classifier, making it easy to apply on your ASV data set.
taxa <- TaxAssignRDP(
asv=read_count_df,
max_memory = 8,
confidence = 0.8,
rm_chloroplast = TRUE)
Arguments
- asv: Either a CSV file path or a data frame containing at least two columns: asv_id (unique numeric identifier) and asv (nucleotide sequence).
- max_memory: Amount of RAM available for RDP (in GB).
- confidence: Minimum confidence threshold for taxonomic assignment.
- rm_chloroplast: If TRUE, lineages of the sequences assigned to Chloroplast are replaced with NA in the taxonomy table.
Instead of the default training set from rRDPData, you can specify a directory containing a custom RDP-trained data set created with the trainRDP() function.
taxa <- TaxAssignRDP(
asv=read_count_df,
dir="path/to/custom_RDP_trained_DB"
max_memory = 8,
confidence = 0.8,
rm_chloroplast = TRUE)
See ?rRDP::rdp for more details on training and using
custom classifiers.
This function scans all files in dir whose names start with a number (see the file names of the output files of the different filtering steps). It selects all lines where the feature (asv_id/asv/sample/replicate) has the value we are looking for.
Examples
Get the history of asv_id (feature) 27
(value).
filter_dir <- file.path(outdir, "filter")
tmp_ASV_27 <- HistoryBy(dir=filter_dir,
feature="asv_id",
value="27")
knitr::kable(head(tmp_ASV_27), format = "markdown")
| file | asv_id | sample | replicate | read_count | asv |
|---|---|---|---|---|---|
| 12_LFNvariant.csv | 27 | tpos1 | 1 | 176 | ACTATACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
| 12_LFNvariant.csv | 27 | tpos1 | 2 | 341 | ACTATACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
| 12_LFNvariant.csv | 27 | tpos1 | 3 | 103 | ACTATACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
| 13_LFNreadCount.csv | 27 | tpos1 | 1 | 176 | ACTATACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
| 13_LFNreadCount.csv | 27 | tpos1 | 2 | 341 | ACTATACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
| 13_LFNreadCount.csv | 27 | tpos1 | 3 | 103 | ACTATACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
Get the history of an ASV identified by its sequence (feature = asv).
tmp_replicate_1 <- HistoryBy(
dir=filter_dir,
feature="asv",
value="ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC")
knitr::kable(tmp_replicate_1, format = "markdown")
| file | asv_id | sample | replicate | read_count | asv |
|---|---|---|---|---|---|
| 12_LFNvariant.csv | 1 | tpos1 | 1 | 520 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 12_LFNvariant.csv | 1 | tpos1 | 2 | 246 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 12_LFNvariant.csv | 1 | tpos1 | 3 | 98 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 13_LFNreadCount.csv | 1 | tpos1 | 1 | 520 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 13_LFNreadCount.csv | 1 | tpos1 | 2 | 246 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 13_LFNreadCount.csv | 1 | tpos1 | 3 | 98 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 14_poolLFN.csv | 1 | tpos1 | 1 | 520 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 14_poolLFN.csv | 1 | tpos1 | 2 | 246 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 14_poolLFN.csv | 1 | tpos1 | 3 | 98 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 15_FilterMinReplicate.csv | 1 | tpos1 | 1 | 520 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 15_FilterMinReplicate.csv | 1 | tpos1 | 2 | 246 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 15_FilterMinReplicate.csv | 1 | tpos1 | 3 | 98 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 16_mOTU_vsearch.csv | 1 | tpos1 | 1 | 520 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 16_mOTU_vsearch.csv | 1 | tpos1 | 2 | 246 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 16_mOTU_vsearch.csv | 1 | tpos1 | 3 | 98 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
Get the history of the sample (feature) tpos1
(value).
tmp_sample_tpos1 <- HistoryBy(dir=filter_dir,
feature="sample",
value="tpos1")
knitr::kable(head(tmp_sample_tpos1), format = "markdown")
| file | asv_id | sample | replicate | read_count | asv |
|---|---|---|---|---|---|
| 12_LFNvariant.csv | 1 | tpos1 | 1 | 520 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 12_LFNvariant.csv | 1 | tpos1 | 2 | 246 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 12_LFNvariant.csv | 1 | tpos1 | 3 | 98 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 12_LFNvariant.csv | 2 | tpos1 | 1 | 2130 | ACTTTATTTTATTTTTGGTGCTTGATCAGGAATAGTAGGAACTTCTTTAAGAATTCTAATTCGAGCTGAATTAGGTCATGCCGGTTCATTAATTGGAGATGATCAAATTTATAATGTAATTGTAACTGCTCATGCTTTTGTAATAATTTTCTTTATAGTTATACCTATTTTAATT |
| 12_LFNvariant.csv | 2 | tpos1 | 2 | 1917 | ACTTTATTTTATTTTTGGTGCTTGATCAGGAATAGTAGGAACTTCTTTAAGAATTCTAATTCGAGCTGAATTAGGTCATGCCGGTTCATTAATTGGAGATGATCAAATTTATAATGTAATTGTAACTGCTCATGCTTTTGTAATAATTTTCTTTATAGTTATACCTATTTTAATT |
| 12_LFNvariant.csv | 2 | tpos1 | 3 | 649 | ACTTTATTTTATTTTTGGTGCTTGATCAGGAATAGTAGGAACTTCTTTAAGAATTCTAATTCGAGCTGAATTAGGTCATGCCGGTTCATTAATTGGAGATGATCAAATTTATAATGTAATTGTAACTGCTCATGCTTTTGTAATAATTTTCTTTATAGTTATACCTATTTTAATT |
This function scans all files in dir whose names start with a number (see the file names of the output files of the different filtering steps).
It will group each file by a variable
(asv/asv_id/sample/replicate)
and summarize a feature
(asv/asv_id/sample/replicate/read_count).
If the feature is read_count, it will give the
sum of the read counts for each value of the variable in each file.
Otherwise, it returns the number of distinct values of the
feature for each value of the variable in each file.
Examples
Get the number of reads of each sample after each filtering step.
From this data frame, we can see that the negative control sample becomes “clean” after the LFN filters, and that there is considerable variation in the number of reads among the real samples.
read_count_by_sample <- SummarizeBy(dir=filter_dir,
feature="read_count",
grouped_by="sample")
knitr::kable(read_count_by_sample, format = "markdown")
| sample | 12_LFNvariant | 13_LFNreadCount | 14_poolLFN | 15_FilterMinReplicate | 16_mOTU_vsearch |
|---|---|---|---|---|---|
| 14ben01 | 25725 | 25725 | 25725 | 25725 | 25725 |
| 14ben02 | 17724 | 17724 | 17724 | 17724 | 17724 |
| tnegtag | 139 | 168 | 120 | 0 | 0 |
| tpos1 | 68145 | 68145 | 68145 | 68145 | 68145 |
Get the number of ASVs for each sample after each filtering step.
From this data frame, we can see that the negative control sample becomes “clean” after the LFN filters, and that the mock sample has 10 ASVs at the end.
asv_by_sample <- SummarizeBy(dir=filter_dir,
feature="asv",
grouped_by="sample")
knitr::kable(asv_by_sample, format = "markdown")
| sample | 12_LFNvariant | 13_LFNreadCount | 14_poolLFN | 15_FilterMinReplicate | 16_mOTU_vsearch |
|---|---|---|---|---|---|
| 14ben01 | 3 | 3 | 3 | 3 | 3 |
| 14ben02 | 19 | 19 | 19 | 19 | 17 |
| tnegtag | 5 | 4 | 3 | 0 | 0 |
| tpos1 | 10 | 10 | 10 | 10 | 10 |
Get the number of asv_ids for each replicate after each filtering step.
We can see that the number of ASVs is comparable across replicates.
asvid_by_replicate <- SummarizeBy(dir=filter_dir,
feature="asv_id",
grouped_by="replicate")
knitr::kable(asvid_by_replicate, format = "markdown")
| replicate | 12_LFNvariant | 13_LFNreadCount | 14_poolLFN | 15_FilterMinReplicate | 16_mOTU_vsearch |
|---|---|---|---|---|---|
| 1 | 27 | 27 | 27 | 27 | 23 |
| 2 | 30 | 29 | 29 | 28 | 24 |
| 3 | 28 | 26 | 26 | 26 | 24 |
The UpdateASVlist function combines unique
asv–asv_id pairs from two input data frames or
CSV files.
If any inconsistencies are detected within or between the input data
sets, the function stops and returns an error message. Otherwise, it
writes the merged list of asv–asv_id pairs to
the specified output file.
This function ensures consistency between asv_ids used
in previous and current data sets.
The safest approach is to provide both the
input_asv_list and output_asv_list arguments
directly to the Dereplicate function, as in the tutorial. This approach automatically
synchronizes asv_ids between the current and earlier data
sets (input_asv_list) and generates an updated ASV list
that includes all ASVs from both earlier and current data sets
(output_asv_list). In this case, you do
not need to call UpdateASVlist manually —
it is automatically invoked within Dereplicate.
However, when processing a large number of extensive data sets, the complete ASV list can become very large and may cause memory issues. Because most ASVs in this list are singletons that are later filtered out, you can choose a lighter workflow:
- Provide only input_asv_list in the Dereplicate function to align asv_ids with previous runs, without writing an output_asv_list.
- After eliminating rare ASVs (e.g. with LFNglobalReadCount), update the ASV list to include only frequent ASVs that are more likely to reappear in future data sets.
This strategy remains safe and substantially reduces the number of ASVs stored in the list file.
asv_list <- "PATH/TO/CSV/WITH/Earlier_asv-asv_id_pairs.csv"
updated_asv_list <- file.path(outdir, "updated_ASV_list.csv")
UpdateASVlist(
asv_list1 = read_count_df,
asv_list2 = asv_list,
outfile = updated_asv_list)
Replicates are used to ensure repeatability and to reduce experimental fluctuations, but after the analysis they can be pooled by sample. Pooling is called automatically from WriteASVtable if pool_replicates=TRUE. The PoolReplicates function takes the mean of the non-zero read counts of each ASV over the replicates of the same sample.
outfile <- file.path(outdir, "PooledReplicates.csv")
read_count_samples_df <- PoolReplicates(read_count_df,
outfile=outfile)
More than one overlapping marker
This function pools different data sets and is particularly useful when results from more than one overlapping marker should be pooled. In that case, ASVs identical over their overlapping regions are
pooled into groups, and different ASVs of the same group are represented
by their centroid (longest ASV of the group). Pooling can take the mean
read counts of the ASVs (mean_over_markers=T; default) or
their sum (mean_over_markers=F).
The function takes several input CSV files, each in long format containing asv_id,
sample, read_count and asv
columns. The file names should be organized in a data frame, together with the marker name for each file.
- outfile: Name of the output CSV file with the pooled data set (asv_id, sample, read_count, asv). ASVs are grouped on the same line if identical in their overlapping region, and only the centroids appear in the asv column.
- asv_with_centroids: Name of the output CSV file containing each of the original ASVs (with samples, markers, and read_count) as well as their centroids.
files <- data.frame(file=c("vtamR_test/out_mfzr/15_PoolReplicates.csv",
"vtamR_test/test/15_PoolReplicates_ZFZR.csv"),
marker=c("MFZR",
"ZFZR"))
outfile <- file.path(outdir, "Pooled_datasets.csv")
asv_with_centroids <- file.path(outdir,
"Pooled_datasets_asv_with_centroids.csv")
read_count_pool <- PoolDatasets(files,
outfile=outfile,
asv_with_centroids=asv_with_centroids,
mean_over_markers=T,
vsearch_path=vsearch_path) # can be omitted if VSEARCH is in the PATH
Only one marker
Pooling the results of different data sets of the same marker is very
simple. Basically, the input files (in long
format with asv, asv_id,
sample, read_count columns) are concatenated.
The PoolDatasets function also checks if sample names are
unique. If not, it sums the read counts of the same sample and same ASV,
but returns a warning.
The output file or data frame can be rearranged to wide format by the WriteASVtable function.
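A minimal sketch for pooling two data sets of the same marker is shown below; the file paths are illustrative, and it is assumed that the remaining PoolDatasets arguments can be left at their defaults when only one marker is involved.
files <- data.frame(file=c("run1/15_PoolReplicates.csv",
"run2/15_PoolReplicates.csv"),
marker=c("MFZR",
"MFZR"))
outfile <- file.path(outdir, "Pooled_one_marker.csv")
read_count_pool <- PoolDatasets(files,
outfile=outfile)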
Count the number of reads in fasta or fastq
files found in the input directory. Input files can be gz compressed or
uncompressed, but zip files are not supported.
The fastainfo and the sampleinfo files contain the number of reads
after Merge or SortReads, so there is no need to run
CountReadsDir separately. This function can be useful for
counting the number of reads in the input fastq files.
- dir: Input directory containing the fasta or fastq files.
- file_type: [fasta/fastq]
- pattern: Regular expression; only files whose names match pattern are checked.
df <- CountReadsDir(dir=fastq_dir,
pattern="_fw.fastq.gz",
file_type="fastq")
Make a bar plot of read counts by sample
(sample_replicate=F) or sample-replicate (sample_replicate=T).
Different colors can be used for different sample types (real/mock/negative).
- sampleinfo: Data frame or CSV file containing info on the sample type of each sample.
sampleinfo <- file.path(demultiplexed_dir, "sampleinfo.csv")
Barplot_ReadCountBySample(read_count_df=read_count_df,
sample_replicate=F,
sampleinfo=sampleinfo)
Calculate the Renkonen distances among all replicates (compare_all=TRUE) or among replicates of the same sample (compare_all=FALSE) and plot them.
- sampleinfo: Data frame or CSV file containing info on sample types for each sample.
sampleinfo <- file.path(demultiplexed_dir, "sampleinfo.csv")
renkonen_within_df <- MakeRenkonenDistances(read_count_df,
compare_all=FALSE)
Barplot_RenkonenDistance(renkonen_within_df,
sampleinfo=sampleinfo)
DensityPlot_RenkonenDistance(renkonen_within_df)
renkonen_all_df <- MakeRenkonenDistances(read_count_df,
compare_all=TRUE)
DensityPlot_RenkonenDistance(renkonen_all_df)
CSV file with information on the input fastq files, primers, tags and samples, with the following columns. Each line corresponds to a sample-replicate combination.
CSV file with information on the input fasta files, primers, tags and samples, with the following columns. Each line corresponds to a sample-replicate combination.
CSV file with information on the demultiplexed and primer-trimmed fasta files and samples, with the following columns. Each line corresponds to a sample-replicate combination.
CSV file with the following columns.
CSV file or data frame with the following columns.
This file lists ASVs and their IDs from earlier data sets. When provided (optional), identical ASVs in the present and earlier data sets have the same ID. New ASVs (not present in asv_list) will get unique IDs not present in asv_list. It is a CSV file with the following columns:
Data frame with the following columns:
A database is composed of two elements: a BLAST database (blast_db) and a taxonomy file.
blast_db can be produced using the
makeblastdb command of BLAST:
makeblastdb -dbtype nucl -in [FASTA_FILE] -parse_seqids -taxid_map [TAXID_FILE] -out [DB_NAME]
taxonomy is a tab-separated CSV file
with the following columns:
A ready-to-use COI database in BLAST format and the associated taxonomy file can be downloaded from https://osf.io/vrfwz/. It was created using mkCOInr. It is also possible to make a customized database using mkCOInr. The database can be downloaded and extracted manually or using the download_osf function of vtamR. See Installation.
read_count_df contains one line for each occurrence, with asv_id, asv, sample, replicate and read_count columns. This is the standard format for keeping occurrences throughout the analyses, and it is smaller than the wide format if many ASVs are present in only one or a few samples.
read_count_df can be rearranged in wide format, where lines are ASVs, columns are sample(-replicates) and cells contain read counts. It is a more human-friendly format and it is the basis for writing an ASV table, where this information can be completed by taxonomic assignments and other information.
Long format: each line is an occurrence, with columns such as asv, sample, and read_count. This format is tidy and works well with many tidyverse functions.
Wide format: in vtamR wide-format data frames, each row is an ASV and each column represents a sample. Read counts are in the cells.
LFNvariant will filter out all occurrences where
(number of reads of the ASV in a sample-replicate) / (total number of reads of the ASV) < cutoff
As a consequence, if an ASV is present in most samples, there are many samples in the data set, and the cutoff is relatively high, most occurrences can fall below the cutoff and the total read count of the ASV will decrease strongly at this step.
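For example (hypothetical numbers), with cutoff = 0.002 and an ASV totalling 50,000 reads in the whole data set, any occurrence of that ASV with fewer than 50,000 × 0.002 = 100 reads in a sample-replicate is filtered out.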
If the proportion of the read count of an ASV in the output compared
to the input is less than min_read_count_prop, vtamR prints a warning. Take a close look at the variant. It can be the result of
a general contamination, and in this case, the whole ASV could/should be
eliminated from the data set. Otherwise, you might want to reduce the
cutoff value for LFNvariant.
Possible workarounds:
- Run swarm separately for each sample (by_sample=T).
- Apply global_read_count first to eliminate singletons, then run swarm.
Both solutions make swarm a bit less efficient, but at least your analyses can get through.
If the input fastq files are very large and do not go through Merge, try working with uncompressed input files.
If the amplicon is shorter than the reads, use
fastq_allowmergestagger=T.
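For example, reusing the objects defined earlier in this tutorial (a minimal sketch; see ?Merge for the full argument list):
fastainfo_df <- Merge(fastqinfo,
fastq_dir = fastq_dir,
outdir = merged_dir,
fastq_allowmergestagger = T, # allow staggered read pairs when the amplicon is shorter than the reads
vsearch_path = vsearch_path) # can be omitted if VSEARCH is in the PATH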