vtamR is an advanced, R-based revision of VTAM (Validation and Taxonomic Assignment of Metabarcoding Data), providing a comprehensive metabarcoding pipeline.
Novelties Compared to VTAM
vtamR is highly adaptable, allowing users to include, exclude, or reorder analysis steps as needed.

To get started, follow the Installation instructions, then load the package:
library(vtamR)
If the third-party programs are not in your
PATH (see Installation),
adjust the xxx_path variables according to
your setup.
# Example for Windows
cutadapt_path <- "C:/Users/Public/cutadapt"
vsearch_path <- "C:/Users/Public/vsearch-2.23.0-win-x86_64/bin/vsearch"
blast_path <- "C:/Users/Public/blast-2.16.0+/bin/blastn"
swarm_path <- "C:/Users/Public/swarm-3.1.5-win-x86_64/bin/swarm"
pigz_path <- "C:/Users/Public/pigz-win32/pigz" # optional
# Example for Linux
cutadapt_path <- "~/miniconda3/envs/vtam/bin/cutadapt"
vsearch_path <- "~/miniconda3/envs/vtam/bin/vsearch"
blast_path <- "~/miniconda3/envs/vtam/bin/blastn"
swarm_path <- "swarm" # swarm is in the PATH
pigz_path <- "pigz" # optional; pigz is in the PATH
If the third-party programs are in your PATH, you have two options:

- Set the xxx_path variables as shown below, and run the tutorial examples exactly as written.

# All executables are in the PATH
cutadapt_path <- "cutadapt"
vsearch_path <- "vsearch"
blast_path <- "blastn"
swarm_path <- "swarm"
pigz_path <- "pigz"
- Do not set the xxx_path arguments at all, and simply omit them when calling the vtamR functions.

For example, run:
read_count_df <- filter_chimera(read_count_df)
instead of:
read_count_df <- filter_chimera(
read_count_df,
vsearch_path = vsearch_path
)
Dataset Composition:
Sequences are typically derived from 96 samples, corresponding to a 96-well plate used for PCR amplification.
Each plate includes negative controls, mock samples, and biological samples. A single sequencing library is generated per replicate, with each library representing a set of 96 PCR reactions.
While vtamR can analyze datasets without controls or replicates, its full potential is realized when these features are used.

For projects exceeding 96 samples, analyses are conducted plate by plate. Results from the different plates can then be combined with the pool_dataset() function for integrated analysis.

Multi-Marker Analysis
vtamR supports the analysis of the same samples using
multiple, strongly overlapping markers. For example,
our dataset originates from a project (Corse et al., 2017) that used two markers: MFZR and ZFZR.
Both markers target overlapping regions of the COI gene, where MFZR starts 18 nucleotides upstream of ZFZR.
The demo files provided in the vtamR package include data files for each marker-plate combination. Due to the limited sample size, numerical values and graphs may not be highly informative.
Results from different marker-plate combinations are stored in separate directories, each maintaining the same structure and filenames. These data sets are processed independently using the same pipeline.
We will first analyze the first “plate” of the MFZR marker, then demonstrate how to pool results from different plates and markers.
Let’s begin by setting the output directory for plate 1 of MFZR.
outdir <- "vtamR_demo_out/mfzr_plate1"
Loading Data
To load the demo files included in
vtamR, we will use the system.file()
function. For your own datasets, specify the file and directory paths
(e.g., ~/vtamR/fastq).
fastq_dir <- system.file("extdata/demo/fastq", package = "vtamR")
fastqinfo <- system.file("extdata/demo/fastqinfo_mfzr_plate1.csv", package = "vtamR")
mock_ncbi_fasta <- system.file("extdata/demo/mock_ncbi.fasta", package = "vtamR")
asv_list <- system.file("extdata/demo/asv_list.csv", package = "vtamR")
- fastq_dir: directory containing the input FASTQ files.
- fastqinfo: CSV file with metadata for input files, including primers, tags, and sample information.
- mock_ncbi_fasta: FASTA file containing COI sequences of species present in the mock samples; used to create the mock_composition file that defines the expected content of mock samples.
- asv_list: (optional) CSV file containing ASVs and asv_ids from previous datasets.

To perform taxonomic assignment, set the path to:

- a BLAST database (blast_db)
- a taxonomy file (taxonomy)

For this example, we use a small demo database included in the vtamR package.
To obtain a full reference database (for real data), visit the TaxAssign Reference Database section.
When using your own data, provide the file names directly,
without using the system.file()
function.
taxonomy <- system.file("extdata/db_test/taxonomy_reduced.tsv", package = "vtamR")
blast_db <- system.file("extdata/db_test", package = "vtamR")
blast_db <- file.path(blast_db, "COInr_reduced")
The check_file_info() function verifies the format and internal coherence of your input files. It is optional but useful, as user-generated files may contain subtle errors that are difficult to identify manually.
check_file_info(file=fastqinfo, dir=fastq_dir, file_type="fastqinfo")
#> [1] "Sample names are alphanumerical : OK"
#> [1] "Only IUPAC characters in tag_fw: OK"
#> [1] "Only IUPAC characters in tag_rv: OK"
#> [1] "Only IUPAC characters in primer_fw: OK"
#> [1] "Only IUPAC characters in primer_rv: OK"
#> [1] "Coherence between samples and sample_type: OK"
#> [1] "Coherence between samples and habitat: OK"
#> [1] "sample_type: OK"
#> [1] "Coherence between samples and replicates: OK"
#> [1] "File extension: OK"
#> [1] "Coherence between fw and rv fastq filename: OK"
#> [1] "Unique tag combinations : OK"
check_file_info(file=asv_list, file_type="asv_list")
#> [1] "Only IUPAC characters in ASV: OK"
#> [1] "1 to 1 relation between asv_id and asv : OK"
Before beginning your analysis, review the Tips – Good to Know section. This section highlights key features and options of
vtamR, helping you run the functions more effectively.
Your FASTQ files can vary depending on your lab and sequencing setup. They might include:
In this tutorial, our FASTQ files contain reads from several samples. So, we’ll need to:
- Merge the read pairs with merge_fastq_pairs.
- Optionally subsample reads with subsample_fasta.
- Demultiplex the reads and trim tags and primers with demultiplex_and_trim.

See the From fastq to data frame section on how to deal with other experimental designs.
The merge_fastq_pairs function combines each read-pair
into a single sequence in fasta format and can also handle
quality filtering.
compress = FALSE: In this tutorial,
we’ll work with uncompressed files. For guidance on whether to compress
your files, check out the To
Compress or Not to Compress section.
Managing Large Files: Once your analysis is complete, you might want to compress or delete large intermediate files. For tips on this, see Files to Keep.
More Options: For additional quality filtering
options, take a look at ?merge_fastq_pairs.
Key Files
The output of merge_fastq_pairs is fastainfo_df. It’s an updated version of fastqinfo, where FASTQ filenames are replaced with FASTA filenames, and read counts are included for each file.

merged_dir <- file.path(outdir, "1_merged")
fastainfo_df <- merge_fastq_pairs(
fastqinfo=fastqinfo,
fastq_dir=fastq_dir, # directory of input fastq files
vsearch_path=vsearch_path, # can be omitted if VSEARCH is in the PATH
outdir=merged_dir, # directory of output fasta files
fastq_maxee=1, # Maximum Expected Error for each read
fastq_maxns=0, # no Ns in output
fastq_allowmergestagger=TRUE # reads might be longer than the marker
)
The subsample_fasta function lets you randomly pick
n sequences (without replacement) from each FASTA file
listed in your fastainfo data frame.
This step is great for balancing read counts across your sequencing libraries. It’s optional, but can be really helpful if you want to make sure all your replicates have a similar number of reads.
Two ways to do random sampling:

- With VSEARCH (use_vsearch = TRUE): fast, but requires the VSEARCH executable.
- With base R only (use_vsearch = FALSE): slower, but cross-platform.

In this tutorial, I’m using the slower but cross-platform R-only version. For more details, check out the subsample_fasta documentation.
randomseq_dir <- file.path(outdir, "2_random_seq")
fastainfo_df <- subsample_fasta(
fastainfo = fastainfo_df, # input df (output of merge_fastq_pairs)
fasta_dir = merged_dir, # directory of input fasta files.
outdir = randomseq_dir, # directory of output fasta files.
use_vsearch = FALSE, # use cross-platform version
n = 40000, # select n sequences from each file
randseed=123 # a given non-zero seed always produces the same result
)
#> Warning: vtamR_demo_out/mfzr_plate1/1_merged/mfzr_1_fw.fasta contains 37351 sequences.
#> The input file is copied to output.
You may encounter a warning indicating that the file
mfzr_1_fw.fasta contains fewer than 40,000 sequences. Since
subsampling is performed without replacement, this file
will be copied directly to the output directory. As a result, it will
contain slightly fewer reads than the other files — this is expected and
not a cause for concern.
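The copy-instead-of-subsample behavior can be sketched in base R. This is a simplified illustration of the idea, not the vtamR implementation; subsample_ids is a hypothetical helper:

```r
# Simplified sketch of subsampling without replacement:
# if a file holds fewer than n sequences, keep them all.
subsample_ids <- function(seq_ids, n, randseed = 123) {
  if (length(seq_ids) <= n) {
    return(seq_ids)            # too few sequences: keep the file as is
  }
  set.seed(randseed)           # fixed non-zero seed => reproducible subsample
  sample(seq_ids, n, replace = FALSE)
}

length(subsample_ids(1:37351, 40000))  # 37351: all kept, as in the warning above
length(subsample_ids(1:50000, 40000))  # 40000
```

This is why the warned-about file ends up with slightly fewer reads than the others.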
The demultiplex_and_trim function is used to both
demultiplex your FASTA files
(i.e. separate reads based on their tag combinations) and trim tag and
primer sequences from those reads.
You can check the help page (?demultiplex_and_trim) to
choose the right settings for your data.
With check_reverse = TRUE, the function will also look for matches on the reverse complement strand.

The output, sampleinfo_df, is an updated version of fastainfo_df; it will be used in the dereplicate step.

demultiplexed_dir <- file.path(outdir, "3_demultiplexed")
sampleinfo_df <- demultiplex_and_trim(
fastainfo=fastainfo_df, # input df (output of subsample_fasta)
fasta_dir=randomseq_dir, # input directory with fasta files
outdir=demultiplexed_dir, # output directory
check_reverse=TRUE,
cutadapt_path=cutadapt_path, # can be omitted if CUTADAPT is in the PATH
vsearch_path=vsearch_path, # can be omitted if VSEARCH is in the PATH
cutadapt_minimum_length = 170,
cutadapt_maximum_length = 185
)
At this stage, you should have one FASTA file per sample-replicate, containing merged reads with tags and primers already removed.
The next step is to dereplicate these files. This means grouping identical sequences together so that each unique sequence (an ASV) is assigned a unique numerical ID (an asv_id).
Output
The output is a data frame called read_count_df, which can also be saved as a
CSV file if you provide the outfile
argument. Saving this file is strongly recommended—it becomes your
starting point for filtering. This way, you can try different filtering
strategies without having to re-run the more time-consuming
pre-processing steps.
ASV Identifiers
If you want to keep ASV IDs consistent with a previous
analysis, you can provide a reference file (asv_list) containing existing ASVs and their IDs.
The updated and complete list of ASVs will then be written to
output_asv_list.
If you only provide output_asv_list (without an
input_asv_list), the function will simply write out the
full list of current ASVs with their assigned IDs. This is useful if you
plan to analyze additional datasets later and want to keep ASV IDs
consistent across studies.
outfile <- file.path(outdir, "4_filter", "1_before_filter.csv")
updated_asv_list <- file.path(outdir, "ASV_list_with_IDs.csv")
read_count_df <- dereplicate(
sampleinfo_df, # input df (output of demultiplex_and_trim)
dir=demultiplexed_dir, # directory with input fasta files
outfile=outfile, # write the output data frame to this file
input_asv_list=asv_list,
output_asv_list=updated_asv_list
)
vtamR provides many functions to filter out
ASVs or ASV occurrences within a
sample-replicate that may result from technical or biological
issues in metabarcoding.
There are two main ways to apply these filters:
- Run them one by one, as in this tutorial.
- Run them together with the pool_filters function. In this case, only occurrences that pass all filters are kept.

In this section, I’ll show an example that uses several filtering options. However, you are encouraged to build your own pipeline based on your specific needs.
Output data frame and CSV
Each filtering function returns a data frame with the
filtered results. You can also save the output as a CSV
file using the outfile argument. This is
especially useful if you want to track how ASVs or samples are retained
or removed throughout the analysis, using the history_by and summarize_by functions.
To keep things organized, I recommend starting output file names with a number to reflect the order of the steps. This also allows the summarize_by function to easily generate summary statistics at each stage of the analysis.
Following the evolution of the number of reads and ASVs
(stat_df)
You can also track how the number of reads and ASVs
changes after each filtering step. For this, we will create a
data frame (stat_df) that stores this information and
update it along the way.
First, we create a data frame to store summary statistics at each step of the analysis:
stat_df <- data.frame(
parameters = character(),
asv_count = integer(),
read_count = integer(),
sample_count = integer(),
sample_replicate_count = integer()
)
Next, we add a first row with information from the unfiltered data set:
stat_df <- get_stat(read_count_df, stat_df, stage = "input", params = NA)
Here’s what the main arguments mean:
- params: a string describing the key parameters used at each filtering step
- stage: a string indicating which step of the analysis the statistics correspond to

In metabarcoding datasets, sequencing errors often produce many low-frequency ASVs that do not reflect real biological sequences. A common way to reduce this noise is to use SWARM (Mahé et al., 2015), a fast and widely used algorithm that groups highly similar ASVs together.
Running SWARM in vtamR
You can run SWARM directly with the denoise_by_swarm()
function.
Since the goal here is denoising (and not clustering into mOTUs), the recommended settings are:
- swarm_d = 1: allows only one difference between ASVs
- fastidious = TRUE: adds a second pass to refine clusters and reduce small ones

These settings are designed to remove sequencing errors while preserving real biological variation.
There are two ways to apply SWARM:
- by_sample = FALSE (default): SWARM is applied to the entire data set. This approach is more aggressive and efficiently reduces the total number of ASVs.
- by_sample = TRUE: SWARM is applied separately to each sample. This is more conservative and helps preserve biological variation.
If you are interested in intra-specific variation,
it is generally better to use by_sample = TRUE, as very
similar but real ASVs in different samples might otherwise be
merged.
split_clusters

Even with d = 1 and fastidious = TRUE,
SWARM may still group ASVs that differ by only one nucleotide but have
similar read abundances. This can lead to real
biological variants being merged into a single cluster.
To address this, denoise_by_swarm() includes an optional
post-processing step with split_clusters = TRUE. This step
attempts to recover meaningful variants that may have been incorrectly
merged.
For each cluster:
- ASVs are evaluated based on two criteria: a minimum relative abundance (min_abundance_ratio) and a minimum read count (min_read_count).
- ASVs that meet both criteria are considered biologically meaningful, and each forms a new cluster.
- The total reads from the original cluster are then redistributed among these ASVs, while keeping their original proportions.
- If no ASV meets the criteria, the cluster remains unchanged (all reads are pooled together).
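The proportional redistribution step can be sketched in base R. This is an illustration of the idea only, not the package code; redistribute_reads and the counts are hypothetical:

```r
# Sketch: redistribute a cluster's total reads among the ASVs kept
# after splitting, preserving their original relative proportions.
redistribute_reads <- function(kept_counts, cluster_total) {
  round(cluster_total * kept_counts / sum(kept_counts))
}

kept <- c(asv1 = 600, asv2 = 300)   # ASVs passing both criteria
redistribute_reads(kept, cluster_total = 1000)  # asv1: 667, asv2: 333
```

The round() call is why total read counts can shift by a few reads after splitting, as noted below in the stat_df table.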
The split_clusters option should be used with the
following settings:
- by_sample = TRUE
- swarm_d = 1 (default)
- fastidious = TRUE (default)

split_clusters <- TRUE
by_sample <- TRUE
outfile <- file.path(outdir, "4_filter", "2_swarm_denoise.csv")
read_count_df <- denoise_by_swarm(
read_count_df,
outfile=outfile,
by_sample=by_sample,
split_clusters=split_clusters,
swarm_path=swarm_path,
quiet=TRUE
)
param <- paste("by_sample:", by_sample, "split_clusters:", split_clusters, sep=" ")
stat_df <- get_stat(read_count_df, stat_df, stage="swarm_denoise", params=param)
Let’s check the reduction of the number of variants after running
swarm.
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| input | NA | 10013 | 116447 | 4 | 12 |
| swarm_denoise | by_sample: TRUE split_clusters: TRUE | 1111 | 116446 | 4 | 12 |
The small variation in the total number of reads comes from rounding when redistributing reads among clusters that have been split.
filter_asv_global

Although SWARM has already reduced the number of ASVs considerably, many ASVs with low read counts may still remain.
The filter_asv_global function removes all ASVs whose
total read count is below a chosen threshold.
Here, we will eliminate singletons and observe how the number of ASVs and total reads changes after this step.
global_read_count_cutoff = 2
outfile <- file.path(outdir, "4_filter", "3_filter_asv_global.csv")
read_count_df <- filter_asv_global(
read_count_df,
outfile=outfile,
cutoff=global_read_count_cutoff
)
param <- paste("global_read_count_cutoff:", global_read_count_cutoff, sep=" ")
stat_df <- get_stat(read_count_df, stat_df, stage="filter_asv_global", params=param)
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| input | NA | 10013 | 116447 | 4 | 12 |
| swarm_denoise | by_sample: TRUE split_clusters: TRUE | 1111 | 116446 | 4 | 12 |
| filter_asv_global | global_read_count_cutoff: 2 | 155 | 115442 | 4 | 12 |
filter_indel

This filter is only applicable to coding sequences.
The idea is simple: if the length of an ASV differs from the most common ASV length by a value that is not a multiple of 3, it is likely to be an erroneous sequence or a pseudogene.
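The length check can be sketched in base R. This is a simplified illustration of the rule, not the package implementation; has_indel is a hypothetical helper:

```r
# Sketch: flag ASVs whose length differs from the modal ASV length
# by a value that is not a multiple of 3 (likely indel / pseudogene).
has_indel <- function(asv_lengths) {
  modal_len <- as.integer(names(which.max(table(asv_lengths))))
  (asv_lengths - modal_len) %% 3 != 0
}

lens <- c(175, 175, 175, 178, 174, 172)
has_indel(lens)  # FALSE FALSE FALSE FALSE TRUE FALSE
```

Only the 174 bp ASV is flagged: its length differs from the modal 175 bp by one, while 178 and 172 differ by a full codon.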
outfile <- file.path(outdir, "4_filter", "4_filter_indel.csv")
read_count_df <- filter_indel(
read_count_df,
outfile=outfile
)
stat_df <- get_stat(read_count_df, stat_df, stage="filter_indel", params=NA)
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| input | NA | 10013 | 116447 | 4 | 12 |
| swarm_denoise | by_sample: TRUE split_clusters: TRUE | 1111 | 116446 | 4 | 12 |
| filter_asv_global | global_read_count_cutoff: 2 | 155 | 115442 | 4 | 12 |
| filter_indel | NA | 149 | 115282 | 4 | 12 |
filter_stop_codon

This filter is only applicable to coding sequences.
It checks for the presence of STOP codons in all three possible reading frames and removes ASVs that contain STOP codons in each of them.
You can specify the genetic code using its numerical ID from NCBI. By default, the function uses the invertebrate mitochondrial genetic code (code 5), as its STOP codons are also valid STOP codons in most other genetic codes.
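The three-frame check can be sketched in base R. This is a simplified illustration (forward frames only, using TAA and TAG, the STOP codons of the invertebrate mitochondrial code), not the package implementation; stops_in_all_frames is a hypothetical helper:

```r
# Sketch: an ASV is flagged only if every one of the three forward
# reading frames contains a STOP codon.
stops_in_all_frames <- function(seq, stops = c("TAA", "TAG")) {
  all(vapply(0:2, function(offset) {
    starts <- seq(1 + offset, nchar(seq) - 2, by = 3)
    codons <- substring(seq, starts, starts + 2)
    any(codons %in% stops)
  }, logical(1)))
}

stops_in_all_frames("TAAATAAATAG")   # TRUE: a STOP in frames 1, 2 and 3
stops_in_all_frames("ATGAAAGGATTT")  # FALSE: frame 1 is STOP-free
```

An ASV with at least one open reading frame is kept, since it may still be a genuine coding sequence.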
outfile <- file.path(outdir, "4_filter", "5_filter_stop_codon.csv")
genetic_code = 5
read_count_df <- filter_stop_codon(
read_count_df,
genetic_code=genetic_code,
outfile=outfile
)
param <- paste("genetic_code:", genetic_code, sep=" ")
stat_df <- get_stat(read_count_df, stat_df, stage="filter_stop_codon", params=param)
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| input | NA | 10013 | 116447 | 4 | 12 |
| swarm_denoise | by_sample: TRUE split_clusters: TRUE | 1111 | 116446 | 4 | 12 |
| filter_asv_global | global_read_count_cutoff: 2 | 155 | 115442 | 4 | 12 |
| filter_indel | NA | 149 | 115282 | 4 | 12 |
| filter_stop_codon | genetic_code: 5 | 149 | 115282 | 4 | 12 |
filter_contaminant

This filter detects ASVs whose read counts are higher in at least one negative control than in any real sample. These ASVs are considered potential contaminants and are removed from the dataset.
The conta_file records the data for the ASVs that were
removed. It’s a good idea to check this file to make
sure nothing unusual is happening.
In our case, 8 ASVs were removed. However, each of them had only a few reads, suggesting this was likely just random low-frequency noise rather than true contamination.
outfile <- file.path(outdir, "4_filter", "6_filter_contaminant.csv")
conta_file <- file.path(outdir, "4_filter", "potential_contaminants.csv")
read_count_df <- filter_contaminant(
read_count_df,
sampleinfo=sampleinfo_df,
conta_file=conta_file,
outfile=outfile
)
stat_df <- get_stat(read_count_df, stat_df, stage="filter_contaminant", params=NA)
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| input | NA | 10013 | 116447 | 4 | 12 |
| swarm_denoise | by_sample: TRUE split_clusters: TRUE | 1111 | 116446 | 4 | 12 |
| filter_asv_global | global_read_count_cutoff: 2 | 155 | 115442 | 4 | 12 |
| filter_indel | NA | 149 | 115282 | 4 | 12 |
| filter_stop_codon | genetic_code: 5 | 149 | 115282 | 4 | 12 |
| filter_contaminant | NA | 137 | 115172 | 4 | 12 |
filter_chimera

This function uses the uchime3_denovo algorithm
implemented in VSEARCH to detect chimeras.
- With by_sample = FALSE (default), ASVs identified as chimeras across the whole dataset are removed.
- With by_sample = TRUE, an ASV is removed only if it is detected as a chimera in at least a proportion of samples defined by sample_prop.

The abskew parameter sets the minimum frequency difference: a chimera must be at least abskew times less abundant than its parental ASVs to be flagged.
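The abskew criterion can be illustrated numerically. This is a sketch of the abundance rule only, not the VSEARCH chimera detection itself; passes_abskew and the counts are hypothetical:

```r
# Sketch of the abskew rule: a candidate chimera can only be flagged
# if each parent is at least `abskew` times more abundant than it.
passes_abskew <- function(chimera_count, parent_counts, abskew = 2) {
  all(parent_counts / chimera_count >= abskew)
}

passes_abskew(chimera_count = 10, parent_counts = c(50, 30))  # TRUE
passes_abskew(chimera_count = 10, parent_counts = c(50, 15))  # FALSE
```

In the second call, one parent is only 1.5 times more abundant than the candidate, so with abskew = 2 it would not be flagged.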
outfile <- file.path(outdir, "4_filter", "7_filter_chimera.csv")
abskew=2
by_sample = TRUE
sample_prop = 0.8
read_count_df <- filter_chimera(
read_count_df,
vsearch_path=vsearch_path,
outfile=outfile,
by_sample=by_sample,
sample_prop=sample_prop,
abskew=abskew
)
param <- paste("by_sample:", by_sample, "sample_prop:", sample_prop, "abskew:", abskew, sep=" ")
stat_df <- get_stat(read_count_df, stat_df, stage="filter_chimera", params=param)
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| input | NA | 10013 | 116447 | 4 | 12 |
| swarm_denoise | by_sample: TRUE split_clusters: TRUE | 1111 | 116446 | 4 | 12 |
| filter_asv_global | global_read_count_cutoff: 2 | 155 | 115442 | 4 | 12 |
| filter_indel | NA | 149 | 115282 | 4 | 12 |
| filter_stop_codon | genetic_code: 5 | 149 | 115282 | 4 | 12 |
| filter_contaminant | NA | 137 | 115172 | 4 | 12 |
| filter_chimera | by_sample: TRUE sample_prop: 0.8 abskew: 2 | 132 | 115152 | 4 | 12 |
filter_replicate

This filter removes replicates that are very different from other replicates of the same sample.
To do this, we calculate the Renkonen distances between all pairs of replicates within each sample and identify those that deviate significantly from the others.
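The Renkonen distance between two replicates is one minus the summed minimum relative abundances across ASVs. A minimal base-R sketch (not the vtamR implementation; renkonen_distance and the counts are illustrative):

```r
# Sketch: Renkonen distance between two replicates of one sample.
# Each vector holds read counts per ASV, in the same ASV order.
renkonen_distance <- function(counts_a, counts_b) {
  p <- counts_a / sum(counts_a)   # relative abundances, replicate A
  q <- counts_b / sum(counts_b)   # relative abundances, replicate B
  1 - sum(pmin(p, q))
}

renkonen_distance(c(50, 30, 20), c(50, 30, 20))  # 0: identical profiles
renkonen_distance(c(90, 10, 0), c(10, 90, 0))    # 0.8: very dissimilar
```

A distance near 0 means the two replicates have almost identical ASV profiles; values near 1 indicate a deviant replicate.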
renkonen_within_df <- compute_renkonen_distances(read_count_df, compare_all=FALSE)
We can create a density plot of the Renkonen distances to see how similar replicates are within each sample.
Note: This example uses a very small test dataset, so your own data will likely produce a different-looking graph.
plot_renkonen_distance_density(renkonen_within_df)
We can also create a barplot of the Renkonen distances,
using different colors to represent sample types (real,
mock, or negative).
Notes:

- sampleinfo_df must have sample and sample_type columns, but any data frame with these columns will work.
- If you ran demultiplex_and_trim earlier and sampleinfo_df is no longer in your environment, you can simply use the sample info file from vtamR_demo_out/mfzr_plate1/3_demultiplexed/sampleinfo.csv.

plot_renkonen_distance_barplot(
renkonen_within_df,
sampleinfo=sampleinfo_df,
x_axis_label_size=6
)
In this example, there are only a few Renkonen distance values, so
the density plot isn’t very informative. The
barplot, however, shows that replicate 1 of the
tnegtag sample (a negative control) is
substantially different from the other two replicates. Setting
cutoff = 0.4 would be a reasonable choice for the
filter_replicate function.
The filter_replicate function removes replicates whose
Renkonen distances exceed the cutoff relative to most
other replicates of the same sample. Here, this means that
replicate 1 of tnegtag will be
removed.
An alternative way to define the cutoff is to use the
renkonen_distance_quantile parameter instead of a fixed
cutoff. This automatically sets the cutoff based on a
chosen quantile of the within-sample distance
distribution. For example:
- renkonen_distance_quantile = 0.9 sets the cutoff at the 90th percentile of the distances.

In this tutorial, we will use the quantile-based option to determine the cutoff.
outfile <- file.path(outdir, "4_filter", "8_filter_replicate.csv")
renkonen_distance_quantile <- 0.9
read_count_df <- filter_replicate(
read_count_df,
renkonen_distance_quantile = renkonen_distance_quantile,
outfile = outfile
)
#> [1] "The cutoff value for Renkonen distances is 0.289308176100629"
param <- paste("renkonen_distance_quantile:", renkonen_distance_quantile, sep=" ")
stat_df <- get_stat(read_count_df, stat_df, stage="filter_replicate", params=param)
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| input | NA | 10013 | 116447 | 4 | 12 |
| swarm_denoise | by_sample: TRUE split_clusters: TRUE | 1111 | 116446 | 4 | 12 |
| filter_asv_global | global_read_count_cutoff: 2 | 155 | 115442 | 4 | 12 |
| filter_indel | NA | 149 | 115282 | 4 | 12 |
| filter_stop_codon | genetic_code: 5 | 149 | 115282 | 4 | 12 |
| filter_contaminant | NA | 137 | 115172 | 4 | 12 |
| filter_chimera | by_sample: TRUE sample_prop: 0.8 abskew: 2 | 132 | 115152 | 4 | 12 |
| filter_replicate | renkonen_distance_quantile: 0.9 | 132 | 115125 | 4 | 11 |
The mock_composition File

The mock_composition file is needed to suggest parameter values for some filtering steps. It should list the expected ASVs in each mock sample.
If you don’t know the exact sequences of the species in your mock samples, see How to make a mock_composition file for guidance.
Here’s a quick example of how to create it. After generating the
file, check
mock_composition_template_to_check.csv to
make sure all ASVs have been correctly identified.
outdir_mock <- file.path(outdir, "mock_composition")
mock_template <- match_variants_to_mock_species(
read_count=read_count_df,
fas=mock_ncbi_fasta, # fasta with reference sequences of the mock species or close relatives
taxonomy=taxonomy,
sampleinfo = sampleinfo_df,
outdir= outdir_mock,
blast_path=blast_path
)
mock_composition <- file.path(outdir_mock, "mock_composition_template_to_check.csv")
mock_composition_df <- read.csv(mock_composition)
knitr::kable(mock_composition_df, format = "markdown")
| sample | action | asv | taxon | asv_id |
|---|---|---|---|---|
| tpos1 | keep | TCTATATTTCATTTTTGGTGCTTGGGCAGGTATGGTAGGTACCTCATTAAGACTTTTAATTCGAGCCGAGTTGGGTAACCCGGGTTCATTAATTGGGGACGATCAAATTTATAACGTAATCGTAACTGCTCATGCCTTTATTATGATTTTTTTTATAGTGATACCTATTATAATT | Baetis rhodani | 5 |
| tpos1 | keep | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC | Caenis pusilla | 1 |
| tpos1 | keep | CCTTTATTTTATTTTCGGTATCTGATCAGGTCTCGTAGGATCATCACTTAGATTTATTATTCGAATAGAATTAAGAACTCCTGGTAGATTTATTGGCAACGACCAAATTTATAACGTAATTGTTACATCTCATGCATTTATTATAATTTTTTTTATAGTTATACCAATCATAATT | Hydropsyche pellucidula | 3 |
| tpos1 | keep | CCTTTATCTTGTATTTGGTGCCTGGGCCGGAATGGTAGGGACCGCCCTAAGCCTTCTTATTCGGGCCGAACTAAGCCAGCCTGGCTCGCTATTAGGTGATAGCCAAATTTATAATGTTATTGTTACCGCCCACGCCTTCGTAATAATTTTCTTTATAGTCATGCCAATTCTCATT | Phoxinus phoxinus | 6 |
| tpos1 | keep | ACTTTATTTTATTTTTGGTGCTTGATCAGGAATAGTAGGAACTTCTTTAAGAATTCTAATTCGAGCTGAATTAGGTCATGCCGGTTCATTAATTGGAGATGATCAAATTTATAATGTAATTGTAACTGCTCATGCTTTTGTAATAATTTTCTTTATAGTTATACCTATTTTAATT | Rheocricotopus chalybeatus | 2 |
| tpos1 | keep | CTTATATTTTATTTTTGGTGCTTGATCAGGGATAGTGGGAACTTCTTTAAGAATTCTTATTCGAGCTGAACTTGGTCATGCGGGATCTTTAATCGGAGACGATCAAATTTACAATGTAATTGTTACTGCACACGCCTTTGTAATAATTTTTTTTATAGTTATACCTATTTTAATT | Synorthocladius semivirens | 4 |
filter_occurrence_sample

The filter_occurrence_sample function removes ASV occurrences with very low read counts relative to the total reads in a sample-replicate. The default cutoff proportion is 0.001.
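The underlying proportion test can be sketched in base R. This is a simplified illustration (vtamR operates on the whole read_count data frame); keep_occurrence is a hypothetical helper, and the counts come from the suggest_sample_cutoff table shown below:

```r
# Sketch: keep an occurrence only if its reads represent at least
# `cutoff` of the total reads in its sample-replicate.
keep_occurrence <- function(read_count, total_in_sample_replicate,
                            cutoff = 0.001) {
  read_count / total_in_sample_replicate >= cutoff
}

keep_occurrence(98, 20468, cutoff = 0.004)  # TRUE  (proportion ~0.0048)
keep_occurrence(50, 20468, cutoff = 0.004)  # FALSE (proportion ~0.0024)
```

Because the test is relative to each sample-replicate's depth, the same absolute read count can pass in a shallow replicate and fail in a deep one.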
To choose a suitable cutoff, you can examine the
proportions of expected ASVs in your mock samples using
the suggest_sample_cutoff function. This ensures that
real, low-abundance ASVs are not accidentally removed
by setting the cutoff below the lowest observed
proportion.
suggest_sample_cutoff

Note: If you don’t yet know the exact sequences in your mock samples and therefore don’t have a mock_composition file, see How to make a mock_composition file for guidance.
outfile = file.path(outdir, "4_filter", "suggest_parameters", "sample_cutoff.csv")
suggest_sample_cutoff_df <- suggest_sample_cutoff(
read_count=read_count_df,
mock_composition=mock_composition,
outfile=outfile
)
knitr::kable(head(suggest_sample_cutoff_df), format = "markdown")
| sample | replicate | action | asv_id | sample_cutoff | read_count | read_count_sample_replicate | asv | taxon |
|---|---|---|---|---|---|---|---|---|
| tpos1 | 3 | keep | 1 | 0.0047 | 98 | 20468 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC | Caenis pusilla |
| tpos1 | 1 | keep | 6 | 0.0084 | 186 | 22053 | CCTTTATCTTGTATTTGGTGCCTGGGCCGGAATGGTAGGGACCGCCCTAAGCCTTCTTATTCGGGCCGAACTAAGCCAGCCTGGCTCGCTATTAGGTGATAGCCAAATTTATAATGTTATTGTTACCGCCCACGCCTTCGTAATAATTTTCTTTATAGTCATGCCAATTCTCATT | Phoxinus phoxinus |
| tpos1 | 2 | keep | 6 | 0.0087 | 239 | 27404 | CCTTTATCTTGTATTTGGTGCCTGGGCCGGAATGGTAGGGACCGCCCTAAGCCTTCTTATTCGGGCCGAACTAAGCCAGCCTGGCTCGCTATTAGGTGATAGCCAAATTTATAATGTTATTGTTACCGCCCACGCCTTCGTAATAATTTTCTTTATAGTCATGCCAATTCTCATT | Phoxinus phoxinus |
| tpos1 | 2 | keep | 1 | 0.0090 | 247 | 27404 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC | Caenis pusilla |
| tpos1 | 3 | keep | 6 | 0.0092 | 190 | 20468 | CCTTTATCTTGTATTTGGTGCCTGGGCCGGAATGGTAGGGACCGCCCTAAGCCTTCTTATTCGGGCCGAACTAAGCCAGCCTGGCTCGCTATTAGGTGATAGCCAAATTTATAATGTTATTGTTACCGCCCACGCCTTCGTAATAATTTTCTTTATAGTCATGCCAATTCTCATT | Phoxinus phoxinus |
| tpos1 | 2 | keep | 4 | 0.0118 | 324 | 27404 | CTTATATTTTATTTTTGGTGCTTGATCAGGGATAGTGGGAACTTCTTTAAGAATTCTTATTCGAGCTGAACTTGGTCATGCGGGATCTTTAATCGGAGACGATCAAATTTACAATGTAATTGTTACTGCACACGCCTTTGTAATAATTTTTTTTATAGTTATACCTATTTTAATT | Synorthocladius semivirens |
filter_occurrence_sample

The rows in suggest_sample_cutoff_df are sorted by the sample_cutoff column in increasing order.

- The first rows show the lowest proportions observed for expected ASVs (here, in the tpos1 mock sample).
- Setting a filter_occurrence_sample cutoff higher than this would remove some expected ASVs, creating false negatives.

Note:

- If all mock species amplify well, the lowest sample_cutoff may be relatively high. In such cases, it’s often safer to choose a low arbitrary value (e.g., 0.001–0.005).

In practice, strong PCR amplification bias is common, so scenarios where all mock species amplify equally well are rare.
When designing mock communities for a metabarcoding experiment, it’s recommended to include some species known to amplify poorly. This requires pre-testing and optimization, but ensures that your mock sample realistically reflects amplification biases before analyzing your actual environmental samples.
sample_cutoff <- 0.004
outfile <- file.path(outdir, "4_filter", "9_filter_occurrence_sample.csv")
read_count_df <- filter_occurrence_sample(
read_count_df,
cutoff=sample_cutoff,
outfile=outfile
)
param <- paste("cutoff:", sample_cutoff, sep=" ")
stat_df <- get_stat(read_count_df, stat_df, stage="filter_occurrence_sample", params=param)
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| input | NA | 10013 | 116447 | 4 | 12 |
| swarm_denoise | by_sample: TRUE split_clusters: TRUE | 1111 | 116446 | 4 | 12 |
| filter_asv_global | global_read_count_cutoff: 2 | 155 | 115442 | 4 | 12 |
| filter_indel | NA | 149 | 115282 | 4 | 12 |
| filter_stop_codon | genetic_code: 5 | 149 | 115282 | 4 | 12 |
| filter_contaminant | NA | 137 | 115172 | 4 | 12 |
| filter_chimera | by_sample: TRUE sample_prop: 0.8 abskew: 2 | 132 | 115152 | 4 | 12 |
| filter_replicate | renkonen_distance_quantile: 0.9 | 132 | 115125 | 4 | 11 |
| filter_occurrence_sample | cutoff: 0.004 | 51 | 112905 | 4 | 11 |
To improve repeatability, occurrences are kept only if they are
present in at least min_replicate_number replicates of the
same sample (2 by default).
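The rule can be sketched in base R on a toy long-format table (toy data for illustration only; filter_min_replicate is the real implementation):

```r
# Toy long-format read_count table (same columns as in this tutorial)
toy <- data.frame(
  asv_id     = c(1, 1, 1, 2, 2, 3),
  sample     = rep("s1", 6),
  replicate  = c("r1", "r2", "r3", "r1", "r2", "r1"),
  read_count = c(50, 40, 60, 10, 8, 5)
)

min_replicate_number <- 2
# In how many replicates of its sample does each occurrence appear?
n_rep <- ave(toy$read_count, toy$asv_id, toy$sample, FUN = length)
kept  <- toy[n_rep >= min_replicate_number, ]

# ASV 3 is seen in a single replicate only, so all of its rows are dropped
unique(kept$asv_id)
```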
min_replicate_number <- 2
outfile <- file.path(outdir, "4_filter", "10_filter_min_replicate.csv")
read_count_df <- filter_min_replicate(
read_count_df,
cutoff=min_replicate_number,
outfile=outfile
)
param <- paste("cutoff:", min_replicate_number, sep=" ")
stat_df <- get_stat(read_count_df, stat_df, stage="filter_min_replicate", params=param)
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| input | NA | 10013 | 116447 | 4 | 12 |
| swarm_denoise | by_sample: TRUE split_clusters: TRUE | 1111 | 116446 | 4 | 12 |
| filter_asv_global | global_read_count_cutoff: 2 | 155 | 115442 | 4 | 12 |
| filter_indel | NA | 149 | 115282 | 4 | 12 |
| filter_stop_codon | genetic_code: 5 | 149 | 115282 | 4 | 12 |
| filter_contaminant | NA | 137 | 115172 | 4 | 12 |
| filter_chimera | by_sample: TRUE sample_prop: 0.8 abskew: 2 | 132 | 115152 | 4 | 12 |
| filter_replicate | renkonen_distance_quantile: 0.9 | 132 | 115125 | 4 | 11 |
| filter_occurrence_sample | cutoff: 0.004 | 51 | 112905 | 4 | 11 |
| filter_min_replicate | cutoff: 2 | 32 | 111561 | 4 | 11 |
#### filter_occurrence_variant and filter_occurrence_read_count

The filter_occurrence_variant function removes
occurrences with low read counts relative to the total reads of
their ASV. By default, the cutoff proportion is
0.001. This filter is mainly designed to remove
spurious detections caused by tag-jumps or minor
cross-sample contamination.
The filter_occurrence_read_count function, in contrast,
removes occurrences based on an absolute read count
threshold, set to 10 by default.
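The difference between the two filters, and the way pool_filters later combines them, can be sketched on toy data (the ≥ comparisons are an assumption here; check the help pages for the exact semantics):

```r
toy <- data.frame(
  asv_id     = c(1, 1, 1, 2, 2),
  sample     = c("s1", "s2", "s3", "s1", "s2"),
  read_count = c(90000, 9000, 50, 500, 4)
)

variant_cutoff    <- 0.001  # proportion of the ASV's total reads
read_count_cutoff <- 10     # absolute reads per occurrence

asv_total     <- ave(toy$read_count, toy$asv_id, FUN = sum)
pass_variant  <- toy$read_count / asv_total >= variant_cutoff
pass_absolute <- toy$read_count >= read_count_cutoff

# Keep only occurrences that pass both filters:
# the 50-read occurrence fails the proportional filter (50/99050 < 0.001),
# the 4-read occurrence fails the absolute one (4 < 10)
kept <- toy[pass_variant & pass_absolute, ]
```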
#### Suggest cutoff values for filter_occurrence_variant and filter_occurrence_read_count

The suggest_variant_readcount_cutoffs function helps you
find suitable cutoff values for both filters. It tests different combinations of cutoffs and, for each, evaluates the number of true positives (TP), false positives (FP), and false negatives (FN).
To run this optimization, you need a known_occurrences_df data
frame that lists true positive (TP) and false
positive (FP) occurrences in control samples and in some cases
in real samples (identified due to incompatible habitat).
#### How to Obtain known_occurrences_df
There are two options:
1. Full control
You can create it manually using the classify_control_occurrences
function. You can then review or edit the resulting
known_occurrences_df before using it as input for
suggest_variant_readcount_cutoffs.
2. Automatic generation
Alternatively, you can run
suggest_variant_readcount_cutoffs with
known_occurrences = NULL (default), and provide:
- mock_composition
- sampleinfo
- habitat_proportion

In this case, the function will automatically call
classify_control_occurrences and use its output to estimate
optimal cutoff values.
All generated data frames are saved as CSV files in
the specified outdir.
#### Understanding The Output
The main result is a data frame that summarizes the performance of each cutoff combination.
Rows are sorted by:

- FN (increasing)
- FP (increasing)
- variant_cutoff
- read_count_cutoff

This means the first row corresponds to the best compromise: minimal FN and FP, with the least stringent cutoff values.
In this tutorial, we use the automatic approach,
where known_occurrences_df is generated by
suggest_variant_readcount_cutoffs. (If you prefer to build
it manually, see the classify_control_occurrences
section.)
We will use the default range of cutoff values, but
you can adjust the minimum, maximum, and step size for both filters (see
?suggest_variant_readcount_cutoffs).
Finally, we set min_replicate_number = 2 to remove
non-repeatable occurrences across the three replicates of each sample.
This is equivalent to applying the filter_min_replicate
filter.
outdir_suggest <- file.path(outdir, "4_filter", "suggest_parameters")
suggest_variant_readcount_cutoffs_df <- suggest_variant_readcount_cutoffs(
read_count_df,
mock_composition = mock_composition,
sampleinfo = sampleinfo_df,
habitat_proportion = 0.5,
known_occurrences=NULL,
outdir= outdir_suggest,
min_replicate_number=2
)
knitr::kable(head(suggest_variant_readcount_cutoffs_df), format = "markdown")
| read_count_cutoff | variant_cutoff | FN | TP | FP |
|---|---|---|---|---|
| 10 | 0.001 | 0 | 6 | 4 |
| 15 | 0.001 | 0 | 6 | 4 |
| 20 | 0.001 | 0 | 6 | 4 |
| 25 | 0.001 | 0 | 6 | 4 |
| 30 | 0.001 | 0 | 6 | 4 |
| 35 | 0.001 | 0 | 6 | 4 |
#### Apply filter_occurrence_variant and filter_occurrence_read_count

Based on the results, we choose:

- 0.001 as the cutoff for filter_occurrence_variant
- 10 as the cutoff for filter_occurrence_read_count

This is the least stringent combination that still keeps all expected occurrences (6 TP, 0 FN) while minimizing false positives (4 FP).
We then apply both filters to the same input data
frame (the one used to find the cutoff values) and combine the
results using the pool_filters function. Only occurrences
that pass both filters are retained.
For more details, see ?filter_occurrence_variant,
?filter_occurrence_read_count, and
?pool_filters.
#### filter_occurrence_variant
outfile <- file.path(outdir, "4_filter", "11_filter_occurrence_variant.csv")
variant_cutoff <- 0.001
read_count_df_variant <- filter_occurrence_variant(
read_count_df,
cutoff=variant_cutoff,
outfile=outfile
)
param <- paste("cutoff:", variant_cutoff, sep=" ")
stat_df <- get_stat(read_count_df_variant, stat_df, stage="filter_occurrence_variant", params=param)
#### filter_occurrence_read_count
read_count_cutoff <- 10
outfile <- file.path(outdir, "4_filter", "12_filter_occurrence_read_count.csv")
read_count_df_rc <- filter_occurrence_read_count(
read_count_df,
cutoff=read_count_cutoff,
outfile=outfile
)
param <- paste("cutoff:", read_count_cutoff, sep=" ")
stat_df <- get_stat(read_count_df_rc, stat_df, stage="filter_occurrence_read_count", params=param)
#### Combine results of filter_occurrence_variant and filter_occurrence_read_count
outfile <- file.path(outdir, "4_filter", "13_pool_filters.csv")
read_count_df <- pool_filters(
read_count_df_rc, read_count_df_variant,
outfile=outfile
)
stat_df <- get_stat(read_count_df, stat_df, stage="pool_filters", params=NA)
# delete temporary data frames
rm(read_count_df_rc)
rm(read_count_df_variant)
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| input | NA | 10013 | 116447 | 4 | 12 |
| swarm_denoise | by_sample: TRUE split_clusters: TRUE | 1111 | 116446 | 4 | 12 |
| filter_asv_global | global_read_count_cutoff: 2 | 155 | 115442 | 4 | 12 |
| filter_indel | NA | 149 | 115282 | 4 | 12 |
| filter_stop_codon | genetic_code: 5 | 149 | 115282 | 4 | 12 |
| filter_contaminant | NA | 137 | 115172 | 4 | 12 |
| filter_chimera | by_sample: TRUE sample_prop: 0.8 abskew: 2 | 132 | 115152 | 4 | 12 |
| filter_replicate | renkonen_distance_quantile: 0.9 | 132 | 115125 | 4 | 11 |
| filter_occurrence_sample | cutoff: 0.004 | 51 | 112905 | 4 | 11 |
| filter_min_replicate | cutoff: 2 | 32 | 111561 | 4 | 11 |
| filter_occurrence_variant | cutoff: 0.001 | 32 | 111482 | 4 | 11 |
| filter_occurrence_read_count | cutoff: 10 | 31 | 111518 | 4 | 11 |
| pool_filters | NA | 31 | 111454 | 4 | 10 |
#### filter_min_replicate (second pass)

We now run filter_min_replicate again to ensure consistency across replicates of each sample (2 by default).
min_replicate_number <- 2
outfile <- file.path(outdir, "4_filter", "14_filter_min_replicate.csv")
read_count_df <- filter_min_replicate(
read_count_df,
cutoff=min_replicate_number,
outfile=outfile
)
param <- paste("cutoff:", min_replicate_number, sep=" ")
stat_df <- get_stat(read_count_df, stat_df, stage="filter_min_replicate_2", params=param)
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| input | NA | 10013 | 116447 | 4 | 12 |
| swarm_denoise | by_sample: TRUE split_clusters: TRUE | 1111 | 116446 | 4 | 12 |
| filter_asv_global | global_read_count_cutoff: 2 | 155 | 115442 | 4 | 12 |
| filter_indel | NA | 149 | 115282 | 4 | 12 |
| filter_stop_codon | genetic_code: 5 | 149 | 115282 | 4 | 12 |
| filter_contaminant | NA | 137 | 115172 | 4 | 12 |
| filter_chimera | by_sample: TRUE sample_prop: 0.8 abskew: 2 | 132 | 115152 | 4 | 12 |
| filter_replicate | renkonen_distance_quantile: 0.9 | 132 | 115125 | 4 | 11 |
| filter_occurrence_sample | cutoff: 0.004 | 51 | 112905 | 4 | 11 |
| filter_min_replicate | cutoff: 2 | 32 | 111561 | 4 | 11 |
| filter_occurrence_variant | cutoff: 0.001 | 32 | 111482 | 4 | 11 |
| filter_occurrence_read_count | cutoff: 10 | 31 | 111518 | 4 | 11 |
| pool_filters | NA | 31 | 111454 | 4 | 10 |
| filter_min_replicate_2 | cutoff: 2 | 30 | 111364 | 3 | 9 |
Replicates can now be combined for each sample.
For a given sample and ASV, the read count can be calculated in different ways (e.g., the sum, mean, maximum, or minimum of the read counts across replicates).
For compatibility with other functions that use a
read_count data frame as input, the column name
read_count is kept, even if it actually represents a mean
or another type of aggregated value across replicates.
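The aggregation can be sketched with base R's aggregate() on toy data (pool_replicates additionally handles the digits rounding and the output columns):

```r
toy <- data.frame(
  asv_id     = c(1, 1, 1, 2, 2),
  sample     = rep("s1", 5),
  replicate  = c("r1", "r2", "r3", "r1", "r2"),
  read_count = c(50, 40, 63, 10, 9)
)

# Mean read count per (asv_id, sample), rounded to integers (digits = 0)
pooled <- aggregate(read_count ~ asv_id + sample, data = toy, FUN = mean)
pooled$read_count <- round(pooled$read_count, digits = 0)
pooled
#   asv_id sample read_count
# 1      1     s1         51
# 2      2     s1         10
```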
outfile <- file.path(outdir, "4_filter", "15_pool_replicates.csv")
read_count_sample_df <- pool_replicates(
read_count_df,
method = "mean", # Aggregate replicates using the mean read count
digits = 0, # Round aggregated read counts to integers
outfile = outfile
)
We now run classify_control_occurrences again to
evaluate the performance of the filtering. The output data frames are
also written to files for further inspection.
The performance_metrics file provides the TP, FP, and FN counts, together with precision and sensitivity.
You can find:

- the false negatives in the false_negatives file
- the classified occurrences in the known_occurrences file

false_negatives <- file.path(outdir, "4_filter", "false_negatives.csv")
performance_metrics <- file.path(outdir, "4_filter", "performance_metrics.csv")
known_occurrences <- file.path(outdir, "4_filter", "known_occurrences.csv")
results <- classify_control_occurrences(read_count_sample_df,
sampleinfo=sampleinfo_df,
mock_composition=mock_composition,
known_occurrences=known_occurrences,
false_negatives=false_negatives,
performance_metrics=performance_metrics)
# give explicit names to the 3 output data frames
known_occurrences_df <- results[[1]]
false_negatives_df <- results[[2]]
performance_metrics_df <- results[[3]]
knitr::kable(performance_metrics_df, format = "markdown")
| TP | FP | FN | Precision | Sensitivity |
|---|---|---|---|---|
| 6 | 4 | 0 | 0.6 | 1 |
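The two metrics reported above follow directly from the TP/FP/FN counts:

```r
TP <- 6; FP <- 4; FN <- 0

precision   <- TP / (TP + FP)  # fraction of retained occurrences that are expected
sensitivity <- TP / (TP + FN)  # fraction of expected occurrences that are retained

c(precision = precision, sensitivity = sensitivity)
#   precision sensitivity
#         0.6         1.0
```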
This step prints the number of reads, ASVs, samples, and replicates remaining after each processing step, and saves the information to a CSV file for tracking your workflow.
write.csv(stat_df, file = file.path(outdir, "summary.csv"))
The assign_taxonomy_ltg function uses a
BLAST-based Last Common Ancestor (LCA) approach. For a
short description of how the algorithm works, see
?assign_taxonomy_ltg.
For details on the required formats of the taxonomy and
blast_db inputs, refer to the Reference
database for taxonomic assignments section.
You can also download a ready-to-use COI reference database (see Installation).
outfile <- file.path(outdir, "4_filter", "assign_taxonomy_ltg.csv")
asv_tax <- assign_taxonomy_ltg(asv=read_count_sample_df,
taxonomy=taxonomy,
blast_db=blast_db,
blast_path=blast_path, # can be omitted if BLAST is in the PATH
outfile=outfile)
If you are not interested in intra-specific variation, ASVs can be grouped into mOTUs.
The cluster_asv function supports two clustering
methods:
method = "vsearch": This method uses VSEARCH’s
cluster_size command, a greedy, centroid-based algorithm
that processes ASVs in order of decreasing read count. As a result,
cluster centroids correspond to the most abundant ASVs.
The clustering threshold is controlled by the identity
parameter.
method = "swarm": SWARM is a fast, iterative
clustering algorithm that groups ASVs differing by d or
fewer nucleotides. Clusters form around local abundance peaks and are
largely independent of the input order.
The main parameters are:
- identity for VSEARCH
- swarm_d for SWARM

For more details, see the Clustering or denoising section.
In this tutorial, we focus on choosing an appropriate
identity threshold for VSEARCH, but the same general
approach can also be applied to SWARM.
To help choose a suitable clustering threshold, you can visualize the distribution of pairwise identities between ASVs, comparing identities within clusters with identities between clusters.
This plot is generated using the
plot_pairwise_identity_vsearch function.
If you are using SWARM, see
?plot_pairwise_identity_swarm for the equivalent tool.
outfile <- file.path(outdir, "5_cluster", "pairwise_identities_vsearch.csv")
plotfile <- file.path(outdir, "5_cluster", "plot_pairwise_identity_vsearch.png")
plot_vsearch <- plot_pairwise_identity_vsearch(
read_count_sample_df,
identity_min=0.9,
identity_max=0.99,
identity_increment=0.01,
min_id = 0.8,
vsearch_path=vsearch_path, # can be omitted if VSEARCH is in the PATH
outfile=outfile,
plotfile=plotfile)
print(plot_vsearch)
Note: This graph is based on a very small test dataset and is therefore not very informative.
Below is the same plot generated using the full (though still relatively small) original dataset, which provides a clearer view of the patterns:
Clusters can also be evaluated using the taxonomic assignments of the ASVs they contain:
- closed: All ASVs in the cluster are assigned to the same taxon, and all ASVs from that taxon are found only in this cluster.
- open: All ASVs share the same taxonomic assignment, but other ASVs from that taxon are also found in other clusters.
- hybrid: The cluster contains ASVs assigned to multiple taxa.
You can perform this classification after clustering ASVs across a
range of parameter values (identity for VSEARCH or
swarm_d for SWARM), and then plot the number of clusters in
each category for each setting.
A good clustering parameter typically maximizes the number of
closed clusters while minimizing the number of
open and hybrid clusters.
(See ?plot_cluster_classification for more details.)
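The three categories can be reproduced on a toy table of clusters and assigned taxa (toy data; plot_cluster_classification performs this classification from the real assignments):

```r
toy <- data.frame(
  cluster_id = c("c1", "c1", "c2", "c3", "c3", "c4"),
  species    = c("A",  "A",  "A",  "B",  "C",  "D")
)

# In how many clusters does each taxon occur?
clusters_per_taxon <- tapply(toy$cluster_id, toy$species,
                             function(x) length(unique(x)))

classify <- function(cl) {
  sp <- unique(toy$species[toy$cluster_id == cl])
  if (length(sp) > 1) return("hybrid")  # several taxa in one cluster
  if (clusters_per_taxon[[sp]] == 1) "closed" else "open"
}

classes <- sapply(unique(toy$cluster_id), classify)
classes
#     c1       c2       c3       c4
# "open"   "open" "hybrid" "closed"
```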
outfile <- file.path(outdir, "5_cluster", "classification_vsearch.csv")
plotfile <- file.path(outdir, "5_cluster", "plot_cluster_classification_vsearch.png")
scatterplot_vsearch <- plot_cluster_classification(
read_count=read_count_sample_df,
taxa=asv_tax,
clustering_method="vsearch",
cluster_params=c(0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99),
vsearch_path=vsearch_path, # can be omitted if VSEARCH is in the PATH
taxlevels= c("species", "genus"),
outfile=outfile,
plotfile=plotfile)
print(scatterplot_vsearch)
Note: This graph is based on a very small test dataset and is therefore not very informative.
Below is the same plot generated using the full (though still relatively small) original dataset, which provides a clearer view of the patterns:
#### Interpretation of the plots
Even though the original dataset is relatively small and the differences between parameter values are not very pronounced, a few useful conclusions can still be drawn:
For clustering thresholds below 0.94, the distribution of pairwise identities among ASVs within the same cluster becomes bimodal. This likely indicates that ASVs from different taxa (e.g. species) are being grouped together, suggesting that the clustering threshold should be at least 0.94.
The number of closed clusters at the species level
is highest between 0.94 and 0.97, while the number of
open and hybrid clusters remains relatively
low in this range.
Overall, these results suggest that a clustering threshold between 0.94 and 0.97 is a good choice for defining mOTUs as proxies for species.
The cluster_asv function can cluster ASVs using either:

- VSEARCH (method = "vsearch")
- SWARM (method = "swarm")

The output format is controlled by the group argument:

- group = TRUE: ASVs in the same cluster are combined into a single row. The asv_id and asv columns correspond to the cluster centroid.
- group = FALSE: The original ASV table is returned, with an additional cluster_id column. Each row still represents a single ASV.
Clustering can be performed on the entire dataset
(by_sample = FALSE) or separately for each sample
(by_sample = TRUE). When defining mOTUs, it is recommended
to cluster all ASVs together (by_sample = FALSE).
Important: If you cluster by sample
(by_sample = TRUE) and use group = FALSE, the
same asv_id may be assigned different
cluster_id values in different samples, which can be
confusing. This combination is therefore not recommended.
In this tutorial, we cluster the data using VSEARCH with an identity
threshold of 0.97. (See ?cluster_asv for guidance on
parameter settings when using SWARM.)
outfile <- file.path(outdir, "5_cluster", "16_mOTU_vsearch.csv")
identity <- 0.97
read_count_samples_df_ClusterSize <- cluster_asv(
read_count_sample_df,
method = "vsearch",
identity = identity,
group = TRUE,
by_sample=FALSE,
path=vsearch_path, # can be omitted if VSEARCH is in the PATH
outfile=outfile)
stat_df <- get_stat(read_count_samples_df_ClusterSize,
stat_df,
stage="cluster_asv_vsearch",
params=identity)
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| input | NA | 10013 | 116447 | 4 | 12 |
| swarm_denoise | by_sample: TRUE split_clusters: TRUE | 1111 | 116446 | 4 | 12 |
| filter_asv_global | global_read_count_cutoff: 2 | 155 | 115442 | 4 | 12 |
| filter_indel | NA | 149 | 115282 | 4 | 12 |
| filter_stop_codon | genetic_code: 5 | 149 | 115282 | 4 | 12 |
| filter_contaminant | NA | 137 | 115172 | 4 | 12 |
| filter_chimera | by_sample: TRUE sample_prop: 0.8 abskew: 2 | 132 | 115152 | 4 | 12 |
| filter_replicate | renkonen_distance_quantile: 0.9 | 132 | 115125 | 4 | 11 |
| filter_occurrence_sample | cutoff: 0.004 | 51 | 112905 | 4 | 11 |
| filter_min_replicate | cutoff: 2 | 32 | 111561 | 4 | 11 |
| filter_occurrence_variant | cutoff: 0.001 | 32 | 111482 | 4 | 11 |
| filter_occurrence_read_count | cutoff: 10 | 31 | 111518 | 4 | 11 |
| pool_filters | NA | 31 | 111454 | 4 | 10 |
| filter_min_replicate_2 | cutoff: 2 | 30 | 111364 | 3 | 9 |
| cluster_asv_vsearch | 0.97 | 26 | 37378 | 3 | NA |
The write_asv_table function restructures your
read_count_df so that:
In other words, it converts the table from long format to wide format.
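The long-to-wide reshaping itself can be sketched with base R's xtabs() on toy data (write_asv_table also adds the optional columns described in this section):

```r
toy <- data.frame(
  asv_id     = c(1, 1, 2),
  sample     = c("s1", "s2", "s1"),
  read_count = c(50, 30, 7)
)

# One row per ASV, one column per sample, read counts in the cells
wide <- xtabs(read_count ~ asv_id + sample, data = toy)
wide
#       sample
# asv_id s1 s2
#      1 50 30
#      2  7  0
```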
You can also include additional information in the output:

- taxonomic assignments (asv_tax)
- read count sums per ASV (add_sums_by_asv)
- read count sums per sample (add_sums_by_sample)
- the expected ASVs of mock samples (add_expected_asv)
- samples left without reads after filtering (add_empty_samples)

If pool_replicates = TRUE, replicates within each sample are pooled, and the read count for each sample can be calculated as the sum, mean, maximum, or minimum of the read counts across replicates.
For more details, see ?write_asv_table.
outfile <- file.path(outdir, "motu_table_taxa.csv")
asv_table_df <- write_asv_table(read_count_sample_df,
outfile=outfile,
asv_tax=asv_tax,
sampleinfo=sampleinfo_df,
add_empty_samples=FALSE,
add_sums_by_sample=TRUE,
add_sums_by_asv=TRUE,
add_expected_asv=TRUE,
mock_composition=mock_composition)
The second plate of the MFZR marker can be processed in the same way as the first one. Just make sure to:

- use mfzr_plate1/ASV_list_with_IDs.csv as the input_asv_list in the dereplicate step. This ensures that ASV IDs remain consistent across both plates.

An R script for analyzing the second MFZR plate is available here: xxxxxxxxxxxxx
#### Pool Filtered Datasets
Filtered datasets can be easily combined using the long format
(columns: asv_id, sample,
replicate (optional), read_count,
asv).
For pooling data from multiple plates, it is recommended to use the
filtered ASV tables
(e.g. 15_pool_replicates_xxx.csv) before clustering into
mOTUs. This helps avoid differences in cluster centroids between
datasets.
In this example, we will:
pooled_replicates_mfzr1 <- system.file(
"extdata/demo/15_pool_replicates_mfzr_plate1.csv",
package = "vtamR"
)
pooled_replicates_mfzr2 <- system.file(
"extdata/demo/15_pool_replicates_mfzr_plate2.csv",
package = "vtamR"
)
filtered_files <- c(
pooled_replicates_mfzr1,
pooled_replicates_mfzr2
)
outfile <- file.path("vtamR_demo_out", "pooled_mfzr_plates", "filtered_asv.csv")
pooled_read_count_df <- pool_datasets(
files = filtered_files,
outfile = outfile,
method = "mean" # If a sample appears in multiple plates, take the mean read count
)
outfile <- file.path("vtamR_demo_out", "pooled_mfzr_plates", "motu.csv")
identity <- 0.97
read_count_motu <- cluster_asv(
pooled_read_count_df,
method = "vsearch",
identity = identity,
group = TRUE,
path=vsearch_path, # can be omitted if VSEARCH is in the PATH
outfile=outfile
)
While you could reuse taxonomic assignments from previous runs, it is usually simpler to rerun the assignment on the pooled dataset.
outfile <- file.path("vtamR_demo_out", "pooled_mfzr_plates", "tax_assign_ltg.csv")
asv_tax_pooled <- assign_taxonomy_ltg(
asv=read_count_motu,
taxonomy=taxonomy,
blast_db=blast_db,
blast_path=blast_path, # can be omitted if BLAST is in the PATH
outfile=outfile
)
outfile <- file.path("vtamR_demo_out", "pooled_mfzr_plates", "motu_table_taxa.csv")
asv_table_df <- write_asv_table(
read_count_motu,
outfile=outfile,
asv_tax=asv_tax_pooled,
add_sums_by_sample=TRUE,
add_sums_by_asv=TRUE
)
Using two or more overlapping markers on the same samples can improve the taxonomic coverage of PCR amplification.
Each marker should first be analyzed separately. The results can then be combined by aggregating ASVs that are identical in their overlapping region.
The first plate of the ZFZR marker can be processed in the same way as the MFZR plates. Make sure to:

- use mfzr_plate2/ASV_list_with_IDs.csv as the input_asv_list in the dereplicate step. This ensures that ASV IDs remain consistent across both plates and markers.

The corresponding results are included in the package:
mfzr_filtered <- system.file(
"extdata/demo/15_pool_replicates_mfzr_plate1.csv",
package = "vtamR"
)
zfzr_filtered <- system.file(
"extdata/demo/15_pool_replicates_zfzr_plate1.csv",
package = "vtamR"
)
If you have consistently used the ASV_list_with_IDs.csv
from previous runs for processing the ZFZR plate, the ASV IDs are
already compatible and can be used directly.
If the ASV IDs are not compatible (for example, the
same asv_id refers to different sequences across markers),
you can fix this by assigning a numerical identifier to each marker, for
instance, 1 for MFZR and 2 for ZFZR in this
example.
These identifiers are appended to the ASV IDs, keeping them unique and preventing confusion between markers.
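The idea can be illustrated with a toy example (the exact ID format used internally by pool_markers may differ):

```r
mfzr_asv_ids <- c(23, 107)
zfzr_asv_ids <- c(23, 54)  # same numeric IDs, but different sequences

# Appending the marker identifier keeps IDs unique across markers
paste0(mfzr_asv_ids, "_", 1)  # "23_1" "107_1"
paste0(zfzr_asv_ids, "_", 2)  # "23_2" "54_2"
```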
input_files <- c(mfzr_filtered, zfzr_filtered)
marker_ids <- c(1, 2)
outfile <- file.path("vtamR_demo_out", "pooled_markers", "filtered_asv.csv")
asv_with_centroids <- file.path(
"vtamR_demo_out",
"pooled_markers",
"filtered_asv_with_centroids.csv"
)
pooled_markers <- pool_markers(
files = input_files,
marker_ids = marker_ids,
outfile = outfile, # ASVs identical over their overlapping region are merged into a single line
asv_with_centroids = asv_with_centroids, # same as input, with added centroid_id and centroid
method = "mean", # mean read counts across pooled ASVs
vsearch_path = "vsearch",
quiet = TRUE
)
As with multiple datasets for a single marker, we will cluster the ASVs from the pooled dataset into mOTUs, then assign the sequences to taxa, and finally create an mOTU table.
#### Cluster
outfile <- file.path("vtamR_demo_out", "pooled_markers", "motu.csv")
identity <- 0.97
read_count_motu <- cluster_asv(
pooled_markers,
method = "vsearch",
identity = identity,
group = TRUE,
path=vsearch_path,
outfile=outfile)
#### TaxAssign
outfile <- file.path("vtamR_demo_out", "pooled_markers", "assign_taxonomy_ltg.csv")
asv_tax_pooled <- assign_taxonomy_ltg(
asv=read_count_motu,
taxonomy=taxonomy,
blast_db=blast_db,
blast_path=blast_path,
outfile=outfile
)
#### Print mOTU table
outfile <- file.path("vtamR_demo_out", "pooled_markers", "motu_table_taxa.csv")
asv_table_df <- write_asv_table(
read_count_motu,
outfile=outfile,
asv_tax=asv_tax_pooled,
add_sums_by_sample=TRUE,
add_sums_by_asv=TRUE
)
Most vtamR functions can take either a data frame or a
CSV file as input.
This is useful if you want to resume your analysis without rerunning earlier steps: for example, if you’ve closed your R session but saved your data frame to a file.
Many function arguments can accept either format, as long as the file or data frame contains the required columns. Common examples include:
- read_count
- fastqinfo
- fastainfo
- sampleinfo
- input_asv_list
- mock_composition
- asv
- ltg_params
- asv_tax

Check the help page of each function for the exact
required columns: ?function_name.
Many functions return a data frame. If you also specify
outfile, the function will write the data frame to a CSV
file at the same time.
Saving outputs as CSV is a good practice for long workflows. It allows you to resume an analysis later and to inspect intermediate results.
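A minimal save-and-resume round trip, using toy data and a temporary path for illustration:

```r
toy <- data.frame(asv_id = 1:2, sample = "s1", read_count = c(50, 7))

# Save an intermediate result...
outfile <- file.path(tempdir(), "read_count.csv")
write.csv(toy, outfile, row.names = FALSE)

# ...and reload it later instead of rerunning earlier steps.
# Most vtamR functions also accept the file path itself as input.
resumed <- read.csv(outfile)
```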
During the first steps of the workflow
(merge_fastq_pairs, subsample_fasta,
trim_primers, demultiplex_and_trim), many
intermediate files are generated. Some of these files can be very large,
especially before dereplication (dereplicate).
Compressing and decompressing files between steps can take
significant time. For this reason, vtamR outputs
uncompressed files by default to speed up processing.
However, uncompressed files can quickly consume disk space. Remember to delete intermediate files once your analysis is complete. See Files to Keep for guidance on which files to retain.
If storage is limited, you can enable file compression by setting
compress = TRUE in relevant functions (e.g.,
subsample_fasta, trim_primers,
demultiplex_and_trim). This produces
gzip-compressed output files.
The method used is controlled by the compress_method
argument. Three options are available:
"pigz" – Uses the PIGZ program
(parallel gzip). Requires installation, and if not in the system
PATH, its location must be provided via
pigz_path. Works on Linux-like systems and Windows. Uses
multithreading, making it the fastest option.
"gzip" – Uses the standard gzip command
available on Linux-like systems. Slower than pigz and not
available on Windows.
"R" – Implemented in R via the R.utils
package. Cross-platform, works everywhere, but is the slowest
method.
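The "R" method relies on the R.utils package; the same gzip round trip can be sketched with base-R connections:

```r
# Write a tiny FASTA file gzip-compressed, then read it back
f   <- file.path(tempdir(), "demo.fasta.gz")
con <- gzfile(f, "w")
writeLines(c(">seq1", "ACGT"), con)
close(con)

con <- gzfile(f, "r")
restored <- readLines(con)  # transparently decompressed
close(con)
```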
For a 19 GB FASTA file (compressed size: 3 GB) on Linux:
| Method | Compression (min) | Decompression (min) |
|---|---|---|
| R.utils | 16 | 3 |
| gzip | 13 | 3 |
| pigz | 3 | 2 |
PIGZ is clearly the best option.

The effect of compression on runtime may also vary depending on which external programs are used.
The cluster_asv function supports two algorithms for
grouping ASVs:
method = "vsearch" Uses VSEARCH’s
cluster_size command, a greedy, centroid-based
algorithm. ASVs are sorted by decreasing read count, so the
most abundant ASVs become cluster centroids. The cluster radius is
controlled by the identity parameter.
method = "swarm" Uses SWARM, a
fast, iterative algorithm that clusters ASVs differing
by d or fewer nucleotides. Clusters form naturally around
local abundance peaks and are mostly insensitive to the input order.
Although both approaches perform clustering, they are often used for different goals.
SWARM was originally designed as a denoising method for metabarcoding data. The principle is simple:
True biological ASVs tend to have high read counts, while very similar ASVs with lower read counts are likely sequencing errors.
Denoising groups likely erroneous sequences into the cluster of their true ASV, preserving ASV-level diversity.
Running SWARM with d = 1 and the fastidious
option activated corresponds to this denoising goal.
Limitation: If two real biological variants differ by only one mutation, they may be merged, causing some loss of true variability.
Tip: While cluster_asv can run SWARM
with d = 1 for denoising, the
denoise_by_swarm function is specifically
designed for this purpose. It runs SWARM with the same settings but adds
a split_clusters option to retain
intra-specific variability. After running SWARM sample by sample,
clusters that contain ASVs with high relative abundance compared to the
centroid can be split, preserving biologically meaningful diversity.
Clustering aims to group ASVs (both true and erroneous) into units approximating biological taxa (species, genera, etc.).

- VSEARCH (cluster_size) uses a fixed identity threshold for this.
- SWARM with higher d values can also produce taxon-like clusters.

Choosing the right threshold depends on the genetic marker and taxonomic group. vtamR provides tools to explore clustering outcomes across multiple thresholds.
For a range of thresholds, vtamR can generate density
plots of pairwise identities:
- plot_pairwise_identity_vsearch() – for VSEARCH thresholds
- plot_pairwise_identity_swarm() – for SWARM d values

These plots show how ASV similarities vary within and between clusters.
Clusters can also be classified as closed, open, or hybrid based on the taxonomic assignments of the ASVs they contain (see the Make mOTUs section for the definitions).
The plot_cluster_classification function shows the number of clusters in each class across thresholds.
These visualizations help you choose an appropriate clustering threshold by showing how identity (VSEARCH) or d (SWARM) affects cluster structure.

For more guidance, see the Make mOTUs section.
#### num_threads

By default, vtamR automatically uses all available CPU cores for multithreaded processes. If you want to maximize performance, there is no need to specify the num_threads argument.
However, you may want to limit the number of cores used—for example,
to leave resources for other tasks or users. In that case, simply set
the num_threads argument in the relevant function.
Example: This limits merge_fastq_pairs
to use 8 CPU threads:
fastainfo_df <- merge_fastq_pairs(
fastqinfo,
fastq_dir =fastq_dir,
vsearch_path=vsearch_path, # can be omitted if VSEARCH is in the PATH
outdir =merged_dir,
num_threads =8
)
#### quiet

Some external programs used by vtamR can produce a lot of console output, even when they are run in their own “quiet” mode. This can make it hard to spot important warnings or errors.
The quiet argument controls whether this output is
displayed:
- quiet = TRUE (default): suppresses most messages, keeping logs clean.
- quiet = FALSE: shows all output, which can be useful for troubleshooting.

Tip: If a function fails or behaves unexpectedly, rerun it with quiet = FALSE to inspect the full messages and identify the problem.
#### sep

The sep argument defines the separator used in CSV files.

- Default: "," (comma)
- Exception: The taxonomy file of the reference database must be tab-separated (\t).
Some vtamR functions rely on external programs such as VSEARCH, CUTADAPT, BLAST, SWARM, or PIGZ. To use these tools, vtamR needs to know where their executables are located.
If these programs are already in your system PATH, vtamR will find them automatically. You don’t need to specify their locations when calling functions.
If the programs are not in your PATH, you must provide their paths explicitly using the relevant arguments:
vsearch_path # Path to VSEARCH
cutadapt_path # Path to CUTADAPT
blast_path # Path to BLAST
swarm_path # Path to SWARM
pigz_path # Path to PIGZ
To see which external tools a function requires, check its help page:
?function_name
During the first steps of the workflow
(merge_fastq_pairs, subsample_fasta,
demultiplex_and_trim, trim_primers), many intermediate
files are generated, and some can be very large, especially before
dereplication (dereplicate).
Compressing and decompressing files between steps can take time, so vtamR produces uncompressed files by default. You can change this behavior using the compress option.
Here’s a practical guide on what to keep, delete, or compress:
- merge_fastq_pairs: keep fastainfo.csv for read count information after merging.
- subsample_fasta: the subsampled FASTA files can be deleted once downstream steps are complete.
- demultiplex_and_trim: keep sampleinfo.csv for read count information after trimming.
- dereplicate: keep the dereplicated read count file (1_before_filter.csv in the tutorial) and the ASV list (ASV_list_with_IDs.csv), which can be reused for later datasets to synchronize asv_ids.
- Output of filtering steps: keep the final filtered tables; intermediate files can be deleted once the analysis is complete.
This strategy balances disk space and reproducibility, keeping only the essential information to resume or audit your analyses.
#### filter_pcr_error

If you have already performed denoising using SWARM (denoise_by_swarm), this function may be redundant, as denoising already removes many likely PCR/sequencing errors.
filter_pcr_error works by flagging an ASV as a PCR error
if it is very similar to a more abundant ASV.
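The flagging rule itself is a simple ratio test (toy read counts for illustration):

```r
abundant_count     <- 5000  # read count of the more abundant ASV
similar_count      <- 300   # read count of a highly similar, less abundant ASV
pcr_error_var_prop <- 0.1   # default cutoff

# The less abundant ASV is flagged as a likely PCR error
similar_count / abundant_count <= pcr_error_var_prop  # 0.06 <= 0.1 -> TRUE
```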
The main parameters are:

- max_mismatch: maximum number of mismatches allowed between two ASVs to evaluate for PCR error.
- pcr_error_var_prop: if the less abundant ASV has a read count proportion ≤ this value relative to the more abundant ASV, it is flagged as a PCR error. Default = 0.1 (arbitrary).

#### suggest_pcr_error_cutoff: Suggesting a PCR Error Cutoff

The default pcr_error_var_prop may not be ideal for your dataset. Use suggest_pcr_error_cutoff to guide the choice.
This function:
- compares pairs of similar ASVs (max_mismatch = 1) in mock samples where one ASV is expected and the other is not.
- considers only occurrences with at least min_read_count reads (default = 5) to avoid stochastic noise from very low counts.

The output CSV lists all pairs, sorted by decreasing
pcr_error_var_prop. The highest read count ratio observed
among unexpected ASVs can be used as a lower bound for
pcr_error_var_prop.
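The ratio logic behind this cutoff can be sketched in base R. This is a toy illustration of the rule, not vtamR's implementation: for a pair of similar ASVs, the less abundant one is flagged when its read count proportion is at or below `pcr_error_var_prop`.

``` r
# Toy read counts for two similar ASVs in one sample-replicate
abundant_count <- 950
rare_count     <- 60
pcr_error_var_prop <- 0.1

# Proportion of the rare ASV relative to the abundant one
prop <- rare_count / abundant_count

# Flagged as a PCR error when the proportion is at or below the cutoff
is_pcr_error <- prop <= pcr_error_var_prop
is_pcr_error  # TRUE here: 60/950 is about 0.063
```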
outfile <- file.path(outdir, "suggest_parameters", "suggest_pcr_error_cutoff.csv")
suggest_pcr_error_cutoff_df <- suggest_pcr_error_cutoff(
read_count_df,
mock_composition=mock_composition,
vsearch_path=vsearch_path,
outfile=outfile
)
Tip: If you do not yet have the exact sequences for your mock samples, see How to make a mock_composition file before running this function.
filter_pcr_error
Once you have determined an appropriate
pcr_error_var_prop, run:
outfile <- file.path(outdir, "filter_pcr_error.csv")
read_count_df <- filter_pcr_error(
read_count_df,
outfile=outfile,
vsearch_path=vsearch_path,
pcr_error_var_prop=0.15
)
This will remove ASVs likely caused by PCR errors, cleaning your dataset for downstream analyses.
For more details and advanced options, see
?filter_pcr_error.
subsample_fasta
The subsample_fasta function randomly selects
n sequences (without replacement) from each input FASTA
file listed in the fastainfo data frame.
This step is useful for normalizing sequencing depth across
libraries. It can be applied before or after demultiplexing
(demultiplex_and_trim).
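Conceptually, each file is reduced by simple random sampling without replacement. A minimal base R sketch of the idea, using toy read IDs in memory rather than vtamR's file-based implementation:

``` r
set.seed(42)  # for reproducibility
# Toy pool of read IDs standing in for the sequences of one FASTA file
reads <- paste0("read_", 1:50000)
n <- 10000

# Random subsample without replacement, as done per input file
subsampled <- sample(reads, size = n, replace = FALSE)
length(subsampled)         # 10000
anyDuplicated(subsampled)  # 0: no read is selected twice
```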
Note: If your workflow includes
filter_occurrence_variant or filter_occurrence_read_count, which rely on total reads in negative controls, standardizing read numbers across samples is not recommended. Typically, subsample_fasta is applied after merge_fastq_pairs but before demultiplex_and_trim to ensure all replicate series have the same number of reads.
1. Using VSEARCH
(use_vsearch = TRUE)
Uses the fastx_subsample
command from VSEARCH.
Very fast, but only available on Linux-like systems.
Supports compression of input/output files using:
- pigz: fastest, multithreaded, requires pigz_path if not in PATH.
- gzip: slower, Linux only.
- R: cross-platform but slowest (via R.utils).

2. Using R only
(use_vsearch = FALSE)
- fastainfo – CSV file or data frame with columns: tag_fw, primer_fw, tag_rv, primer_rv, sample, sample_type, habitat, replicate, fasta.
- fasta_dir – Directory containing input FASTA files.
- n – Number of sequences to sample (without replacement).
- outdir – Directory for output FASTA files.
- use_vsearch – Whether to use VSEARCH for fast sampling.
- compress – Whether to compress output files (TRUE or FALSE).
- compress_method – "pigz", "gzip", or "R".

randomseq_dir <- file.path(outdir, "random_seq")
fastainfo_df <- subsample_fasta(
fastainfo = fastainfo_df,
fasta_dir = merged_dir,
outdir = randomseq_dir,
use_vsearch = TRUE,
vsearch_path = vsearch_path, # can be omitted if VSEARCH is in the PATH
compress_method = "pigz",
pigz_path = pigz_path, # can be omitted if PIGZ is in the PATH
n = 10000,
compress = TRUE)
randomseq_dir <- file.path(outdir, "random_seq")
fastainfo_df <- subsample_fasta(
fastainfo = fastainfo_df,
fasta_dir = merged_dir,
outdir = randomseq_dir,
use_vsearch = FALSE,
n = 10000,
compress = TRUE)
Tip: Choose
use_vsearch = TRUE when possible for speed. Use FALSE on Windows or macOS, or when VSEARCH is not installed.
classify_control_occurrences
The classify_control_occurrences function identifies
false positive, false negative, and
true positive occurrences in control
samples (mock and negative).
Some false positives can also be detected in real samples, if the dataset contains samples from different habitats.
- known_occurrences: Data frame (or CSV file) containing:
- false_negatives: Data frame (or CSV file) of ASVs that were expected in mock samples but not detected.
- performance_metrics: Data frame summarizing counts of true positives, false positives, and false negatives.
For details on parameters and options, see
?classify_control_occurrences.
- read_count_df: Data frame containing ASV read counts for each sample.
- sampleinfo_df: Must include columns: sample, sample_type, replicate, and habitat. Any data frame with these columns is acceptable. The easiest is the output of demultiplex_and_trim.

Tip: If you restarted the analysis after
demultiplex_and_trim and sampleinfo_df is not in memory, use the sampleinfo CSV file in vtamR_demo_out/demultiplexed.
# Detect known occurrences
results <- classify_control_occurrences(read_count = read_count_df,
sampleinfo=sampleinfo_df,
mock_composition=mock_composition)
# give explicit names to the 3 output data frames
known_occurrences_df <- results[[1]]
false_negatives_df <- results[[2]]
performance_metrics_df <- results[[3]]
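From the true positive, false positive, and false negative counts summarized in performance_metrics_df, you can derive standard accuracy statistics. A small base R sketch with illustrative counts (the derived metric names are our addition, not columns produced by vtamR):

``` r
# Toy counts, standing in for values from performance_metrics_df
tp <- 18  # true positives
fp <- 3   # false positives
fn <- 2   # false negatives

# Standard summary statistics derived from the counts
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
round(c(precision = precision, recall = recall, f1 = f1), 3)
```

Tracking these values while tuning filtering parameters helps compare parameter settings objectively.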
compute_asv_specific_cutoff
Some ASVs may appear in many samples with high read counts, which can
make them resistant to a fixed cutoff in
filter_occurrence_variant. This can occur due to tag-jump
or inter-sample contamination.
The compute_asv_specific_cutoff function computes
ASV-specific cutoff values for variants with known false-positive
occurrences (identified by classify_control_occurrences),
so filtering can be more targeted.
For each ASV, a cutoff is computed from its known false-positive occurrences; cutoffs can be computed separately for each replicate series with by_replicate = TRUE.

Some false positives may not be due to tag-jump, so raw cutoff values can be excessively high.
- Filter the dataset thoroughly before computing ASV-specific cutoffs.
- Use the max_cutoff parameter to set a reasonable upper limit.
asv_spec <- compute_asv_specific_cutoff(
read_count_df,
max_cutoff=0.05,
mock_composition=mock_composition_df
)
Output: A data frame asv_spec with columns:
- asv_id: Identifier of the ASV
- cutoff_asv_spec: ASV-specific cutoff, capped at max_cutoff = 0.05

Only ASVs with known false-positive occurrences are included.
Pass asv_spec to
filter_occurrence_variant:
read_count_df <- filter_occurrence_variant(
read_count_df,
asv_specific_cutoffs = asv_spec,
min_read_count_prop=0.7
)
A warning is issued if less than min_read_count_prop of
reads remain for some ASVs after filtering. This suggests the cutoff may
be too high and should be reduced.

You can also use a global cutoff together with ASV-specific cutoffs.
The global cutoff is applied to all ASVs without an ASV-specific cutoff:
read_count_df <- filter_occurrence_variant(
read_count_df,
cutoff=0.002,
asv_specific_cutoffs = asv_spec,
min_read_count_prop=0.7
)
This approach allows flexible, per-ASV filtering while maintaining a minimum global threshold for all other variants.
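The fallback logic can be sketched in base R. This is a toy illustration (the `prop` column and cutoff values are invented for the example, not vtamR's code): each occurrence is compared to the ASV-specific cutoff when one exists, and to the global cutoff otherwise.

``` r
# Toy per-occurrence read-count proportions
occ <- data.frame(
  asv_id = c(1, 1, 2, 3),
  prop   = c(0.030, 0.004, 0.001, 0.003)
)
asv_spec <- data.frame(asv_id = 1, cutoff_asv_spec = 0.05)
global_cutoff <- 0.002

# Use the ASV-specific cutoff where defined, the global cutoff otherwise
occ <- merge(occ, asv_spec, by = "asv_id", all.x = TRUE)
occ$cutoff <- ifelse(is.na(occ$cutoff_asv_spec),
                     global_cutoff, occ$cutoff_asv_spec)

# Keep only occurrences above their applicable cutoff
kept <- occ[occ$prop > occ$cutoff, ]
kept$asv_id  # only ASV 3 survives in this toy example
```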
assign_taxonomy_rdp
Taxonomic assignment using the RDP classifier
For 16S rRNA data, the RDP (Ribosomal
Database Project) provides a curated reference database for
Bacteria and Archaea. The pre-trained sets are included
in the rRDPData package, and the
rRDP package performs the taxonomic
classification.
rRDP and rRDPData
These packages are available through Bioconductor
and are not installed automatically with
vtamR:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("rRDP")
BiocManager::install("rRDPData")
The assign_taxonomy_rdp() function is a wrapper around
the RDP classifier, allowing direct application to your ASV dataset.
taxa <- assign_taxonomy_rdp(
asv=read_count_df,
max_memory = 8,
confidence = 0.8,
rm_chloroplast = TRUE
)
Arguments:
- asv: CSV file path or data frame containing at least:
  - asv_id: unique numeric identifier
  - asv: nucleotide sequence
- max_memory: RAM available for RDP (GB)
- confidence: minimum confidence threshold for assignment
- rm_chloroplast: if TRUE, sequences classified as Chloroplast are replaced with NA
You can provide your own RDP-trained dataset instead of the default
from rRDPData. This requires a folder containing the
trained classifier created with trainRDP():
taxa <- assign_taxonomy_rdp(
asv=read_count_df,
dir="path/to/custom_RDP_trained_DB",
max_memory = 8,
confidence = 0.8,
rm_chloroplast = TRUE)
For more details, see ?rRDP::rdp for training and using
custom classifiers.
history_by
The history_by function allows you to track the
filtering history of specific features (ASVs, samples, or
replicates) across multiple filtering steps.
It scans all files in a directory (dir) whose names start
with a number. These are typically the outputs from different filtering steps.
- dir: Path to the directory containing the filtering output files.
- feature: Column to track (asv_id, asv, sample, or replicate).
- value: The specific value of the feature you want to follow.

Get the history of asv_id (feature) 27
(value).
filter_dir <- file.path(outdir, "4_filter")
tmp_ASV_27 <- history_by(dir=filter_dir,
feature="asv_id",
value="27")
knitr::kable(head(tmp_ASV_27), format = "markdown")
| file | asv_id | sample | replicate | read_count | asv |
|---|---|---|---|---|---|
| 1_before_filter.csv | 27 | tpos1 | 1 | 1 | ACTATACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCATTTTCATGGTTATGCCAGGACTTGTT |
Get the history of an ASV by its sequence (feature
asv).
tmp_replicate_1 <- history_by(
dir=filter_dir,
feature="asv",
value="ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC")
knitr::kable(tmp_replicate_1, format = "markdown")
| file | asv_id | sample | replicate | read_count | asv |
|---|---|---|---|---|---|
| 1_before_filter.csv | 1 | tpos1 | 1 | 375 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 1_before_filter.csv | 1 | tpos1 | 2 | 188 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 1_before_filter.csv | 1 | tpos1 | 3 | 66 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 2_swarm_denoise.csv | 1 | tpos1 | 1 | 520 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 2_swarm_denoise.csv | 1 | tpos1 | 2 | 247 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 2_swarm_denoise.csv | 1 | tpos1 | 3 | 98 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 3_filter_asv_global.csv | 1 | tpos1 | 1 | 520 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 3_filter_asv_global.csv | 1 | tpos1 | 2 | 247 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 3_filter_asv_global.csv | 1 | tpos1 | 3 | 98 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 4_filter_indel.csv | 1 | tpos1 | 1 | 520 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 4_filter_indel.csv | 1 | tpos1 | 2 | 247 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 4_filter_indel.csv | 1 | tpos1 | 3 | 98 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 5_filter_stop_codon.csv | 1 | tpos1 | 1 | 520 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 5_filter_stop_codon.csv | 1 | tpos1 | 2 | 247 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 5_filter_stop_codon.csv | 1 | tpos1 | 3 | 98 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 6_filter_contaminant.csv | 1 | tpos1 | 1 | 520 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 6_filter_contaminant.csv | 1 | tpos1 | 2 | 247 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 6_filter_contaminant.csv | 1 | tpos1 | 3 | 98 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 7_filter_chimera.csv | 1 | tpos1 | 1 | 520 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 7_filter_chimera.csv | 1 | tpos1 | 2 | 247 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 7_filter_chimera.csv | 1 | tpos1 | 3 | 98 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 8_filter_replicate.csv | 1 | tpos1 | 1 | 520 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 8_filter_replicate.csv | 1 | tpos1 | 2 | 247 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 8_filter_replicate.csv | 1 | tpos1 | 3 | 98 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 9_filter_occurrence_sample.csv | 1 | tpos1 | 1 | 520 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 9_filter_occurrence_sample.csv | 1 | tpos1 | 2 | 247 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 9_filter_occurrence_sample.csv | 1 | tpos1 | 3 | 98 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 10_filter_min_replicate.csv | 1 | tpos1 | 1 | 520 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 10_filter_min_replicate.csv | 1 | tpos1 | 2 | 247 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 10_filter_min_replicate.csv | 1 | tpos1 | 3 | 98 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 11_filter_occurrence_variant.csv | 1 | tpos1 | 1 | 520 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 11_filter_occurrence_variant.csv | 1 | tpos1 | 2 | 247 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 11_filter_occurrence_variant.csv | 1 | tpos1 | 3 | 98 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 12_filter_occurrence_read_count.csv | 1 | tpos1 | 1 | 520 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 12_filter_occurrence_read_count.csv | 1 | tpos1 | 2 | 247 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 12_filter_occurrence_read_count.csv | 1 | tpos1 | 3 | 98 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 13_pool_filters.csv | 1 | tpos1 | 1 | 520 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 13_pool_filters.csv | 1 | tpos1 | 2 | 247 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 13_pool_filters.csv | 1 | tpos1 | 3 | 98 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 14_filter_min_replicate.csv | 1 | tpos1 | 1 | 520 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 14_filter_min_replicate.csv | 1 | tpos1 | 2 | 247 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 14_filter_min_replicate.csv | 1 | tpos1 | 3 | 98 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 15_pool_replicates.csv | 1 | tpos1 | NA | 288 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
Get the history of the sample (feature) tpos1
(value).
tmp_sample_tpos1 <- history_by(dir=filter_dir,
feature="sample",
value="tpos1")
knitr::kable(head(tmp_sample_tpos1), format = "markdown")
| file | asv_id | sample | replicate | read_count | asv |
|---|---|---|---|---|---|
| 1_before_filter.csv | 12 | tpos1 | 1 | 6 | AATGTATCTAATCTTTGGAGGTTTTTCTGGTATTATTGGAACAGCTTTATCTATTTTAATCAGAATAGAATTATCGCAACCAGGAAACCAAATTTTAATGGGAAACCATCAATTATATAATGTAATTGTAACTTCTCACGCTTTTATTATGATTTTTTTTATGGTAATGCCAATTTTATTA |
| 1_before_filter.csv | 13 | tpos1 | 1 | 3 | ACCTTATTTTATTTTTGGTGCTTGATCAGGAATAGTAGGAACTTCTTTAAGAATTCTAATTCGAGCTGAATTAGGTCATGCCGGTTCATTAATTGGAGATGATCAAATTTATAATGTAATTGTAACTGCTCATGCTTTTGTAATAATTTTCTTTATAGTTATACCTATTTTAATT |
| 1_before_filter.csv | 14 | tpos1 | 1 | 1 | ACTACACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
| 1_before_filter.csv | 15 | tpos1 | 1 | 1 | ACTATACCTGATGTTTGCCTTATTCGCAGGTTTAGTAGGTACAGCATTTTCTGTACTTATTAGAATGGAATTAAGTGCACCAGGAGTTCAATACATCAGTGATAACCAGTTATATAATAGTATTATAACAGCTCACGCTATTGTTATGATATTCTTTATGGTTATGCCTGCCATGATT |
| 1_before_filter.csv | 16 | tpos1 | 1 | 1 | ACTATACCTTACCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGGAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
| 1_before_filter.csv | 17 | tpos1 | 1 | 1 | ACTATACCTTATCTTCGCAGTATTCTCAAGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
summarize_by
The summarize_by function is used to summarize how your
data changes across filtering steps.
Scans all files in a directory (dir) whose names
start with a number (i.e. outputs from the different filtering steps).
For each file, it:
- groups the data by the grouped_by column (asv, asv_id, sample, or replicate)
- summarizes the feature column (asv, asv_id, sample, replicate, or read_count)

If feature = "read_count", it returns the
sum of reads for each group.
Otherwise, it returns the number of distinct values of the feature within each group.
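The two summary modes can be sketched with base R's aggregate. This is a toy illustration of the logic applied to each file, not vtamR's implementation:

``` r
# Toy long-format read counts for one filtering step
df <- data.frame(
  sample     = c("s1", "s1", "s1", "s2", "s2"),
  asv_id     = c(1, 1, 2, 1, 3),
  read_count = c(10, 5, 7, 3, 2)
)

# feature = "read_count": sum of reads per group
aggregate(read_count ~ sample, data = df, FUN = sum)

# feature = "asv_id": number of distinct ASVs per group
aggregate(asv_id ~ sample, data = df,
          FUN = function(x) length(unique(x)))
```

summarize_by repeats this for every numbered file in the directory and binds the results into one table, one column per filtering step.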
Get the number of reads per sample after each filtering step:
read_count_by_sample <- summarize_by(
dir=filter_dir,
feature="read_count",
grouped_by="sample"
)
knitr::kable(read_count_by_sample, format = "markdown")
| sample | 1_before_filter | 2_swarm_denoise | 3_filter_asv_global | 4_filter_indel | 5_filter_stop_codon | 6_filter_contaminant | 7_filter_chimera | 8_filter_replicate | 9_filter_occurrence_sample | 10_filter_min_replicate | 11_filter_occurrence_variant | 12_filter_occurrence_read_count | 13_pool_filters | 14_filter_min_replicate | 15_pool_replicates |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14ben01 | 26149 | 26149 | 25994 | 25972 | 25972 | 25969 | 25969 | 25969 | 25750 | 25620 | 25620 | 25620 | 25620 | 25620 | 8540 |
| 14ben02 | 19428 | 19427 | 19152 | 19031 | 19031 | 19031 | 19031 | 19031 | 18494 | 17600 | 17600 | 17600 | 17600 | 17600 | 6008 |
| tnegtag | 388 | 388 | 334 | 334 | 334 | 227 | 227 | 200 | 200 | 197 | 118 | 154 | 90 | 0 | 0 |
| tpos1 | 70482 | 70482 | 69962 | 69945 | 69945 | 69945 | 69925 | 69925 | 68461 | 68144 | 68144 | 68144 | 68144 | 68144 | 22830 |
Get the number of ASVs for each sample after each filtering step.
asv_by_sample <- summarize_by(
dir=filter_dir,
feature="asv",
grouped_by="sample"
)
knitr::kable(asv_by_sample, format = "markdown")
| sample | 1_before_filter | 2_swarm_denoise | 3_filter_asv_global | 4_filter_indel | 5_filter_stop_codon | 6_filter_contaminant | 7_filter_chimera | 8_filter_replicate | 9_filter_occurrence_sample | 10_filter_min_replicate | 11_filter_occurrence_variant | 12_filter_occurrence_read_count | 13_pool_filters | 14_filter_min_replicate | 15_pool_replicates |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14ben01 | 1906 | 176 | 26 | 25 | 25 | 24 | 24 | 24 | 5 | 2 | 2 | 2 | 2 | 2 | 2 |
| 14ben02 | 2842 | 328 | 64 | 60 | 60 | 60 | 60 | 60 | 36 | 19 | 19 | 19 | 19 | 19 | 19 |
| tnegtag | 174 | 75 | 25 | 25 | 25 | 13 | 13 | 12 | 12 | 9 | 5 | 5 | 3 | 0 | 0 |
| tpos1 | 5381 | 572 | 79 | 77 | 77 | 77 | 72 | 72 | 13 | 10 | 10 | 10 | 10 | 10 | 10 |
Get the number of asv_ids for each replicate after each filtering step.
asvid_by_replicate <- summarize_by(
dir=filter_dir,
feature="asv_id",
grouped_by="replicate"
)
#> [1] "WARNING: replicate variable is not present in vtamR_demo_out/mfzr_plate1/4_filter/15_pool_replicates.csv"
knitr::kable(asvid_by_replicate, format = "markdown")
| replicate | 1_before_filter | 2_swarm_denoise | 3_filter_asv_global | 4_filter_indel | 5_filter_stop_codon | 6_filter_contaminant | 7_filter_chimera | 8_filter_replicate | 9_filter_occurrence_sample | 10_filter_min_replicate | 11_filter_occurrence_variant | 12_filter_occurrence_read_count | 13_pool_filters | 14_filter_min_replicate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4389 | 438 | 119 | 117 | 117 | 111 | 107 | 107 | 38 | 27 | 27 | 27 | 27 | 27 |
| 2 | 4461 | 484 | 114 | 112 | 112 | 100 | 96 | 96 | 36 | 30 | 30 | 29 | 29 | 28 |
| 3 | 4283 | 378 | 95 | 89 | 89 | 82 | 81 | 81 | 30 | 28 | 28 | 26 | 26 | 26 |
update_asv_list
The update_asv_list function merges unique
asv–asv_id pairs from two input data
frames or CSV files.
- Combines ASV lists from two sources
- Checks for inconsistencies within or between the datasets

This ensures that asv_ids remain
consistent across different analyses.
The safest and simplest approach is to let dereplicate
handle this automatically:
Provide both:
- input_asv_list (from previous analyses)
- output_asv_list (to store the updated list)

In this case, asv_ids are automatically synchronized with previous
datasets and there is no need to run update_asv_list manually.

When processing many large datasets, the ASV list can become very large (often filled with singletons that will be filtered out later). This can lead to memory issues.
A lighter workflow is often more practical:
- Use input_asv_list in dereplicate to keep asv_ids consistent
- Do not write an output_asv_list at this stage
- After global filtering (filter_asv_global), run update_asv_list on the filtered data only.

This way, the stored ASV list stays compact while asv_ids remain consistent across datasets.
asv_list <- "PATH/TO/CSV/WITH/Earlier_asv-asv_id_pairs.csv"
updated_asv_list <- file.path(outdir, "updated_ASV_list.csv")
update_asv_list(
asv_list1 = read_count_df,
asv_list2 = asv_list,
outfile = updated_asv_list)
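The merging logic can be sketched in base R. This is a toy example, not the actual implementation: ASVs already in the list keep their IDs, and new ASVs receive fresh IDs above the current maximum.

``` r
# Existing asv-asv_id pairs from an earlier dataset
old <- data.frame(asv_id = c(1, 2), asv = c("AAA", "CCC"))
# New dataset: one known ASV, one new ASV
new <- data.frame(asv = c("CCC", "GGG"))

# Keep existing IDs; give new ASVs IDs not used before
new_only <- setdiff(new$asv, old$asv)
added <- data.frame(asv_id = max(old$asv_id) + seq_along(new_only),
                    asv = new_only)
updated <- rbind(old, added)
updated  # CCC keeps id 2, GGG gets id 3
```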
pool_replicates
Replicates are used to ensure repeatability and reduce experimental variation. After the analysis, they can be pooled at the sample level.
The pool_replicates function performs this step and is
automatically called by write_asv_table when
pool_replicates = TRUE.
For each sample and ASV, the function computes the read count across
replicates using the method specified by the method
argument. The default is mean, which calculates the average
of non-zero read counts. Alternatively, you can use sum to
obtain the total read count across replicates, min to
retain the lowest value, or max to keep the highest
value.
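The default method = "mean" averages only the non-zero replicate counts. A base R sketch of that rule with toy values:

``` r
# Read counts of one ASV in one sample across three replicates
counts <- c(520, 247, 0)

# Mean of non-zero counts, the default pooling rule
pooled_mean <- mean(counts[counts > 0])  # (520 + 247) / 2 = 383.5
# Alternative: total read count across replicates
pooled_sum <- sum(counts)                # 767
c(mean = pooled_mean, sum = pooled_sum)
```

Note that with method = "mean", a replicate with zero reads does not drag the pooled value down; only replicates where the ASV was detected contribute.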
outfile <- file.path(outdir, "PooledReplicates.csv")
read_count_samples_df <- pool_replicates(
read_count_df,
outfile=outfile
)
count_reads_in_dir
The count_reads_in_dir function counts the number of
reads in all FASTA or FASTQ files
found in a given directory. It works with both compressed
(.gz) and uncompressed files, but
.zip files are not supported.
In most cases, you do not need to use this function during a standard
workflow. The fastainfo and sampleinfo files already include read counts
after merge_fastq_pairs and
demultiplex_and_trim.
However, this function can be useful if you want to check the number of reads in your raw input FASTQ files before starting the analysis.
- dir: Input directory containing the fasta or fastq files
- file_type: fasta or fastq
- pattern: Regular expression; only files whose names match this pattern are processed

df <- count_reads_in_dir(
dir=fastq_dir,
pattern="_fw.fastq.gz",
file_type="fastq"
)
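For reference, counting reads yourself in base R is straightforward. A sketch assuming standard record layouts (one ">" header per FASTA record, four lines per FASTQ record, no wrapped FASTQ sequences):

``` r
# Count records from file lines already read into memory
count_fasta_reads <- function(lines) sum(startsWith(lines, ">"))
count_fastq_reads <- function(lines) length(lines) / 4

# Toy files
fasta <- c(">r1", "ACGT", ">r2", "GGCC")
fastq <- c("@r1", "ACGT", "+", "IIII")
count_fasta_reads(fasta)  # 2
count_fastq_reads(fastq)  # 1
```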
CSV file with information on input fastq files, primers, tags, and samples, with the following columns. Each line corresponds to a sample–replicate combination.
- tag_fw: Sequence tag on the 5’ of the fw read (NA if file is already demultiplexed)
- primer_fw: Forward primer (NA if primer has been trimmed)
- tag_rv: Sequence tag on the 3’ of the rv read (NA if file is already demultiplexed)
- primer_rv: Reverse primer (NA if primer has been trimmed)
- sample: Name of the sample (alpha-numerical)
- sample_type: real/mock/negative
- habitat: If real or mock samples are from different habitats that cannot contain the same type of organisms (e.g. terrestrial vs. marine), this information is used for detecting false positives. Use NA otherwise. Use NA for negative controls.
- replicate: Numerical id of a replicate within sample (e.g. Sample1 can have replicate 1, 2 or 3)
- fastq_fw: Forward fastq file
- fastq_rv: Reverse fastq file

CSV file with information on input fasta files, primers, tags, and samples, with the following columns. Each line corresponds to a sample–replicate combination.
- tag_fw: Sequence tag on the 5’ of the fw read (NA if file is already demultiplexed)
- primer_fw: Forward primer (NA if primer has been trimmed)
- tag_rv: Sequence tag on the 3’ of the rv read (NA if file is already demultiplexed)
- primer_rv: Reverse primer (NA if primer has been trimmed)
- sample: Name of the sample (alpha-numerical)
- sample_type: real/mock/negative
- habitat: If real or mock samples are from different habitats that cannot contain the same type of organisms (e.g. terrestrial vs. marine), this information is used for detecting false positives. Use NA otherwise. Use NA for negative controls.
- replicate: Numerical id of a replicate (e.g. Sample1 can have replicate 1, 2 or 3)
- fasta: Fasta file
- Read_count: Number of reads in the fasta file. Optional.

CSV file with information on demultiplexed and primer-trimmed fasta files and samples, with the following columns. Each line corresponds to a sample–replicate combination.
- sample: Name of the sample (alpha-numerical)
- sample_type: real/mock/negative
- habitat: If real or mock samples are from different habitats that cannot contain the same type of organisms (e.g. terrestrial vs. marine), this information is used for detecting false positives. Use NA otherwise. Use NA for negative controls.
- replicate: Numerical id of a replicate (e.g. Sample1 can have replicate 1, 2 or 3)
- fasta: Fasta file
- Read_count: Number of reads in the fasta file. Optional.

CSV file with the following columns.
- sample: Name of the mock sample
- action:
- asv: sequence of the ASV
- taxon: Optional; name of the organism
- asv_id: Optional; if there is a conflict between asv and asv_id, the asv_id is ignored
Fasta file containing one reference sequence for each species expected in the mock samples. Each sequence should span at least 70% of the region targeted by the primers. It can be longer or slightly shorter than the ASV and may include minor mismatches relative to the true sequence.
Fasta headers must include a valid NCBI taxonomic identifier in the following format:
>SequenceName taxID=12345
The taxID must correspond to an entry in the NCBI Taxonomy
database.
CSV file or data frame with the following columns.
- sample: Name of the sample
- action:
- asv_id: optional
- asv: sequence of the ASV
This file lists ASVs and their IDs from earlier data sets. When provided (optional), identical ASVs in the present and earlier data sets have the same ID. New ASVs (not present in asv_list) will get unique IDs not present in asv_list. It is a CSV file with the following columns:
Data frame with the following columns:
assign_taxonomy_ltg
A database is composed of two elements: a BLAST database
(blast_db) and a taxonomy file.
blast_db can be produced using the
makeblastdb command of BLAST:
makeblastdb -dbtype nucl -in [FASTA_FILE] -parse_seqids -taxid_map [TAXID_FILE] -out [DB_NAME]
taxonomy is a tab-separated CSV file
with the following columns:
- tax_id: Numerical taxonomic ID. It can be a valid NCBI taxID, or an arbitrary negative number for taxa not in NCBI.
- parent_tax_id: taxID of the closest parent of tax_id.
- rank: taxonomic rank (e.g. species, genus, subgenus, no_rank).
- name_txt: Scientific name of the taxon.
- old_tax_id: taxIDs that have been merged to the tax_id by NCBI; if there is more than one for a given tax_id, make one line for each old_tax_id.
- taxlevel: Integer associated to each major taxonomic rank (0 => root, 1 => domain, 2 => kingdom, 3 => phylum, 4 => class, 5 => order, 6 => family, 7 => genus, 8 => species). Intermediate ranks have 0.5 added to the next highest level (e.g. 5.5 for infraorder and for superfamily).

A ready-to-use COI database in BLAST format and the associated
taxonomy file can be downloaded from https://osf.io/vrfwz/. It was created
using mkCOInr. It is
also possible to make a customized
database using mkCOInr. It can be downloaded and extracted manually
or by using the download_osf function of vtamR. See Installation.
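As an illustration, the first lines of a taxonomy file might look like this (tab-separated; the tax_id values and names below are made up for the example, except that a negative tax_id marks a taxon absent from NCBI, and old_tax_id is left empty when no IDs were merged):

```
tax_id	parent_tax_id	rank	name_txt	old_tax_id	taxlevel
1	1	no_rank	root		0
101	1	phylum	Arthropoda		3
-5	101	class	Undescribed_class_1		4
```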
ASV: Amplicon Sequence Variant. A unique sequence, characterized by the number of reads in each sample–replicate.
asv_id: Unique numerical ID of an ASV.
demultiplexing: Sorting reads in a fasta (or fastq) file into different sample–replicates according to the tags present at their extremities.
dereplication: The merged reads contain many identical sequences. Dereplication reduces the data set to unique sequences (ASVs) and counts the number of reads for each ASV in each sample–replicate.
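As a toy illustration (plain base R, not the vtamR implementation), dereplication amounts to counting identical sequences:

```r
# Toy example: dereplicate reads from one sample-replicate by
# counting identical sequences (base R, not the vtamR implementation).
reads <- c("ACGT", "ACGT", "TTGA", "ACGT", "TTGA")

derep <- as.data.frame(table(reads), stringsAsFactors = FALSE)
names(derep) <- c("asv", "read_count")

derep # ACGT occurs 3 times, TTGA twice
```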
data frame (or csv) in long format: The
read_count_df contains one line for each occurrence with
asv_id, asv, sample,
replicate, read_count columns. This is the
standard format used throughout the analyses and is generally more
compact than wide format when many ASVs occur in only a few
samples.
data frame (or csv) in wide format: The
read_count_df can be rearranged so that rows are ASVs,
columns are sample(-replicates), and cells contain read counts. This
format is more human-readable and is typically used for final ASV or
mOTU tables, often combined with taxonomic assignments and other
metadata.
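The two formats can be interconverted with standard tidyverse tools; the following is a minimal sketch (plain tidyr/dplyr, not a vtamR function) of turning a long-format read_count_df into a wide ASV table:

```r
# Sketch: convert a long-format read_count_df to a wide ASV table
# (one row per ASV, one column per sample). Uses tidyr/dplyr directly;
# the toy data frame below is illustrative.
library(dplyr)
library(tidyr)

read_count_df <- data.frame(
  asv_id     = c(1, 1, 2),
  asv        = c("ACGT", "ACGT", "TTGA"),
  sample     = c("s1", "s2", "s1"),
  replicate  = c(1, 1, 1),
  read_count = c(100, 5, 40)
)

wide <- read_count_df %>%
  pivot_wider(
    id_cols     = c(asv_id, asv),
    names_from  = sample,
    values_from = read_count,
    values_fn   = sum,  # sum replicates of the same sample
    values_fill = 0     # ASV absent from a sample => 0 reads
  )

wide
```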
false negative occurrence: An expected ASV in a mock sample that is not detected in the data.
false positive occurrence: An unexpected presence of an ASV in a sample.
habitat: Habitat type of organisms in real or mock samples. Use this only if organisms from different habitats cannot occur in each other's samples; use NA otherwise.
long format: Each row represents a single
measurement, typically with columns such as asv,
sample, and read_count. This tidy format is
well suited for many data manipulation tools.
Low Frequency Noise (LFN): ASVs present at low frequencies, likely resulting from sequencing or amplification errors.
merge: Assemble a forward and reverse read pair into a single sequence.
mock sample: An artificial mixture of DNA from known organisms.
negative sample: Negative control sample.
precision: The proportion of detected positives that are true positives (TP / (TP + FP)).
real sample: An environmental sample.
replicate: Technical or biological replicate of
a sample. Replicates must have numerical identifiers (e.g. sample
tops1 can have replicates 1, 2, and 3).
sample: Name of the environmental or control sample. Must be alphanumeric and contain no spaces.
sample–replicate: A specific replicate of a given sample.
sample_type: Type of the sample (real, mock, or negative).
sensitivity: The proportion of true positives correctly detected (TP / (TP + FN)).
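For concreteness, both metrics can be computed directly from counts of true positives (TP), false positives (FP), and false negatives (FN) in the mock samples; the counts below are made up for the example:

```r
# Toy example: precision and sensitivity from occurrence counts
# observed in mock samples (values are illustrative).
TP <- 8
FP <- 2
FN <- 4

precision   <- TP / (TP + FP)  # 8/10 = 0.8
sensitivity <- TP / (TP + FN)  # 8/12 = 0.667 (rounded)
```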
singleton: ASV with a single read in the entire data set.
tag: Short sequence at the extremity of the amplicon, used during demultiplexing to identify the sample–replicate origin of each read.
taxIDs: Numerical taxonomic identifiers. These can be valid NCBI taxID values or arbitrary negative numbers for taxa not present in NCBI.
trimming: Cutting sequence extremities, either based on sequence quality or by detecting tags or primers.
true positive occurrence: An expected occurrence in a mock sample.
wide format: In vtamR, wide-format
data frames have one row per ASV and one column per sample, with read
counts in the cells.
Pooling different sequencing runs is useful when the same samples were sequenced in multiple runs, for example if one sequencing run was of lower quality or if technical replication was performed. Pooling can be done before filtering to create a unified dataset.
For each sequencing run, perform the following steps:
Follow the steps in from-fastq-to-df, adapting to your wet lab and sequencing protocol. Store the trimmed, merged, and demultiplexed FASTA files in a separate directory for each run.
After dereplication and harmonization of ASV identifiers, the
resulting read_count_df objects can be combined.
library(dplyr) # for %>%, group_by, and summarize

read_count_df <- rbind(read_count_df1, read_count_df2, read_count_df3) %>%
  group_by(asv_id, sample, replicate, asv) %>%
  summarize(read_count = sum(read_count), .groups = "drop")
This pooling step sums the read counts over runs for each combination of asv_id, sample, and replicate.
filter_occurrence_variant can eliminate most occurrences of a frequent ASV
The filter_occurrence_variant function removes all occurrences where the proportion of reads of an ASV in a sample-replicate relative to its total read count is below the specified cutoff:
(number of reads of the ASV in a sample-replicate) / (total number of reads of the ASV) < cutoff
When an ASV is frequent (i.e. occurs in many sample-replicates), most occurrences of that ASV may fall below the cutoff, causing its total read count to drop sharply during this filtering step.
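The criterion can be sketched with plain dplyr (an illustration of the rule, not the actual filter_occurrence_variant code; the data frame and cutoff below are made up):

```r
library(dplyr)

# Toy data: ASV 1 is dominated by sample s1, so its occurrences in
# s2 and s3 fall below the cutoff and would be removed.
read_count_df <- data.frame(
  asv_id     = c(1, 1, 1, 2),
  sample     = c("s1", "s2", "s3", "s1"),
  replicate  = c(1, 1, 1, 1),
  read_count = c(960, 30, 10, 500)
)

cutoff <- 0.05
flagged <- read_count_df %>%
  group_by(asv_id) %>%
  mutate(prop = read_count / sum(read_count)) %>%
  ungroup() %>%
  filter(prop < cutoff)

flagged # the two low-proportion occurrences of ASV 1
```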
If the proportion of reads retained for an ASV after filtering is less than min_read_count_prop, vtamR issues a warning. This should be carefully inspected: if the occurrences look genuine, lower the cutoff of filter_occurrence_variant to retain more occurrences of the ASV.

When working with very large datasets, running SWARM on the entire dataset at once can cause memory issues. Possible workarounds:
Run SWARM separately for each sample (by_sample = TRUE).
Use global_read_count to remove singletons before SWARM.
Both may slightly reduce clustering efficiency but allow the analysis to complete.
merge_fastq_pairs on Windows
If very large input fastq files fail to merge, try using uncompressed input files.
merge_fastq_pairs
If your amplicon is shorter than the reads, enable staggered merging by setting fastq_allowmergestagger = TRUE.
demultiplex_and_trim
By default, the demultiplex_and_trim function searches for tags and primers only at the very ends of the reads (tag_to_end = TRUE, primer_to_end = TRUE).
This setting is both efficient and appropriate for most use cases.
However, if you used sequencing spacers, the primers or tags may lie a few nucleotides away from the read ends. In that case, adjust these two arguments so that the function searches for tags and primers at the correct positions.
For more details, refer to the function documentation:
?demultiplex_and_trim.