vtamR is a revised and completed version of VTAM (Validation and Taxonomic Assignation of Metabarcoding Data), rewritten in R. It is a complete metabarcoding pipeline.

Novelties compared to VTAM:

- vtamR is highly adaptable to include/exclude and order different steps of the analyses.

A detailed Tutorial on how to construct a full metabarcoding pipeline is available online.

Please follow the Installation instructions.
Load library
library(vtamR)
Set path to third party programs
# Example for Windows
cutadapt_path <- "C:/Users/Public/cutadapt"
vsearch_path <- "C:/Users/Public/vsearch-2.23.0-win-x86_64/bin/vsearch"
blast_path <- "C:/Users/Public/blast-2.16.0+/bin/blastn"
swarm_path <- "C:/Users/Public/swarm-3.1.5-win-x86_64/bin/swarm"
# Example for Linux
cutadapt_path <- "~/miniconda3/envs/vtam/bin/cutadapt"
vsearch_path <- "~/miniconda3/envs/vtam/bin/vsearch"
blast_path <- "~/miniconda3/envs/vtam/bin/blastn"
swarm_path <- "swarm"
Set general parameters
num_threads <- 8
sep <- ","
outdir <- "vtamR_demo_out"
- num_threads: Number of CPUs for multithreaded programs
- sep: Separator used in csv files
- outdir: Name of the output directory. Make sure there is no space in the path.

Input Data
The demo files below are included with the vtamR
package, which is why we use system.file()
to access them in this tutorial. When using your own data, simply provide the file and directory names (e.g. ~/vtamR/fastq
)
fastq_dir <- system.file("extdata/demo/fastq", package = "vtamR")
fastqinfo <- system.file("extdata/demo/fastqinfo.csv", package = "vtamR")
mock_composition <- system.file("extdata/demo/mock_composition.csv", package = "vtamR")
asv_list <- system.file("extdata/demo/asv_list.csv", package = "vtamR")
- fastq_dir: Directory containing the input fastq files.

Taxassign database
Set the path to the database for taxonomic assignment and to the accompanying taxonomy file. For this example, we use a very small database included in the vtamR package. To see how to get a real database, visit the TaxAssign reference data base section. When using your own data, just enter the file names, without using the system.file function.
taxonomy <- system.file("extdata/db_test/taxonomy_reduced.tsv", package = "vtamR")
blast_db <- system.file("extdata/db_test", package = "vtamR")
blast_db <- file.path(blast_db, "COInr_reduced")
- taxonomy: CSV file with taxonomic information
- blast_db: BLAST database

Details are in the Reference database for taxonomic assignments section.
Check the format of input files
The CheckFileinfo
function tests if all obligatory columns are present, and makes some sanity checks. This can be helpful, since these files are produced by the users, and may contain errors difficult to spot by eye.
CheckFileinfo(file=fastqinfo, dir=fastq_dir, file_type="fastqinfo")
#> [1] "Coherence between samples and sample_type: OK"
#> [1] "Coherence between samples and habitat: OK"
#> [1] "sample_type: OK"
#> [1] "Coherence between samples and replicates: OK"
#> [1] "File extension: OK"
#> [1] "Coherence between fw and rv fastq filename: OK"
#> [1] "Unique tag combinations : OK"
CheckFileinfo(file=mock_composition, file_type="mock_composition")
#> [1] "Action types : OK"
CheckFileinfo(file=asv_list, file_type="asv_list")
#> [1] "1 to 1 relation between asv_id and asv : OK"
According to your wetlab and sequencing protocol each fastq file can contain one or more sample-replicates, and sequences may or may not contain tags (for demultiplexing) and primer sequences.
In this tutorial, the fastq files contain reads from several samples, so read pairs must be:

- merged (Merge)
- demultiplexed and primer-trimmed (SortReads)

See the From fastq to data frame section on how to deal with other experimental designs.
Merge and quality filter
- fastq_dir: Directory containing the input fastq files.
- fastainfo_df: Output of Merge. It is the updated version of fastqinfo, where fastq file names have been replaced by fasta file names.
- fasta_dir: Directory containing the input fasta files for SortReads. This directory is created by Merge.

See ?Merge for the options of quality filtering.
merged_dir <- file.path(outdir, "merged")
fastainfo_df <- Merge(fastqinfo,
fastq_dir=fastq_dir,
vsearch_path=vsearch_path,
outdir=merged_dir,
fastq_maxee=1,
fastq_maxns=0,
fastq_allowmergestagger=F
)
Demultiplex and Trim
The SortReads
function will demultiplex the fasta files according to the tag combinations and trim the primers from the reads.
See the help (?SortReads
) for setting the correct parameters for demultiplexing and primer trimming:
- If check_reverse is TRUE, SortReads checks the reverse complementary strand as well.
- sortedinfo_df is the updated version of fastainfo. This data frame and the files listed in it are the input of Dereplicate.

sorted_dir <- file.path(outdir, "sorted")
sortedinfo_df <- SortReads(fastainfo_df,
fasta_dir=merged_dir,
outdir=sorted_dir,
check_reverse=TRUE,
cutadapt_path=cutadapt_path,
vsearch_path=vsearch_path
)
At this stage, you should have one FASTA file per sample-replicate, containing merged reads with tags and primers removed.
These FASTA files should now be dereplicated, and each unique sequence (ASV) will be assigned a numerical ID (asv_id). The result is a data frame (read_count_df) that can also be written to a CSV file if outfile
is specified. Saving this file is recommended, as it serves as the starting point for filtering. You can test different filtering strategies from this file without needing to re-run the time-consuming merging, demultiplexing, and trimming steps.
If you wish to retain ASV IDs from a previous analysis, you can provide a reference file (asv_list) containing earlier ASVs and their IDs. The complete, updated list of ASVs will then be written to updated_asv_list
.
If only updated_asv_list
is provided (and asv_list
is omitted), the function will write the full list of current ASVs with assigned IDs. This will be useful when analyzing further data sets to ensure consistent ASV IDs across studies.
outfile <- file.path(outdir, "1_before_filter.csv")
updated_asv_list <- file.path(outdir, "ASV_list_with_IDs.csv")
read_count_df <- Dereplicate(sortedinfo_df,
dir=sorted_dir,
outfile=outfile,
asv_list=asv_list,
updated_asv_list=updated_asv_list
)
vtamR has a large number of functions to filter out ASVs, or occurrences of ASVs in a given sample-replicate, that result from technical or biological problems associated with metabarcoding.
The different filters can be run one after the other, or separately on the same input and then combined with the PoolFilters function (accepting only occurrences that pass all filters).
In this section, I give an example using most filtering options. It is up to you to construct your own pipeline.
Each function returns a data frame with the filtered output. You can also write the results to a csv file. This can be useful for tracking the presence/absence of some of the ASVs/samples throughout the analyses using the HistoryBy and SummarizeBy functions.
I will start the names of the output files with a number to keep the order of the different steps and allow the SummarizeBy function to make final statistics at each step of the analyses.
We can also follow the evolution of the number of reads and ASVs remaining in the data set after each filtering step. Let's define a data frame (stat_df) that will contain this information, and complete it after each step.
stat_df <- data.frame(parameters=character(),
asv_count=integer(),
read_count=integer(),
sample_count=integer(),
sample_replicate_count=integer())
Then we add a line with information on the unfiltered data set.
stat_df <- GetStat(read_count_df, stat_df, stage="Input", params=NA)
- params: String containing the major parameters of each filtering step.
- stage: String referring to the filtering step.

Swarm (Mahé et al., 2015) is a powerful and quick program to denoise the data set by clustering ASVs that are likely to be the result of sequencing errors with the ASVs representing real biological sequences. It will considerably reduce the number of ASVs.
By default, vtamR
uses 1 as Swarm's d
and the fastidious
algorithm. See ?Swarm
on how to change this. It is possible to run Swarm
separately for each sample (by_sample=TRUE
; default) or at once for the whole data set. This second option is more efficient in reducing the number of ASVs but can cluster very similar real sequences appearing in different samples.
If you are interested in intra-specific variability, I suggest by_sample=TRUE or skipping this step, since Swarm can cluster together very similar real ASVs.
by_sample <- TRUE
outfile <- file.path(outdir, "2_Swarm_by_sample.csv")
read_count_df <- Swarm(read_count_df,
outfile=outfile,
swarm_path=swarm_path,
num_threads=num_threads,
by_sample=by_sample
)
stat_df <- GetStat(read_count_df, stat_df, stage="Swarm", params=by_sample)
Let’s check the reduction of the number of variants after running swarm
.
knitr::kable(stat_df, format = "markdown")
|  | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
Input | NA | 10461 | 123530 | 4 | 12 |
Swarm | TRUE | 418 | 123530 | 4 | 12 |
We can also plot the number of reads per variant.
Histogram_ReadCountByVariant(read_count_df, binwidth=100)
As we can see, most ASVs have very low read counts.
Although swarm has considerably reduced the number of ASVs, there are still many ASVs with low read counts. The LFNglobalReadCount function filters out all ASVs with a total read count below a threshold.
Let's eliminate singletons and see how the numbers of ASVs and reads have changed.
global_read_count_cutoff = 2
outfile <- file.path(outdir, "3_LFNglobalReadCount.csv")
read_count_df <- LFNglobalReadCount(read_count_df,
cutoff=global_read_count_cutoff,
outfile=outfile
)
stat_df <- GetStat(read_count_df,
stat_df,
stage="LFNglobalReadCount",
params=global_read_count_cutoff
)
knitr::kable(stat_df, format = "markdown")
|  | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
Input | NA | 10461 | 123530 | 4 | 12 |
Swarm | TRUE | 418 | 123530 | 4 | 12 |
LFNglobalReadCount | 2 | 147 | 123236 | 4 | 12 |
This filter is applicable only for coding sequences. The idea is that if the length of the ASV differs from the most frequent ASV length by a number that is not a multiple of 3, it is likely to be an erroneous sequence or a pseudo-gene.
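To make the length criterion concrete, here is a minimal base-R sketch of the modulo-3 check (illustrative only, not the vtamR implementation; the lengths are made-up examples):
# Illustration of the FilterIndel length criterion (not the vtamR code itself)
most_frequent_length <- 175                 # hypothetical modal ASV length
asv_lengths <- c(175, 178, 172, 174, 176)   # made-up example lengths
# Keep an ASV if its length differs from the modal length by a multiple of 3
keep <- (asv_lengths - most_frequent_length) %% 3 == 0
keep
#> [1]  TRUE  TRUE  TRUE FALSE FALSE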
outfile <- file.path(outdir, "4_FilterIndel.csv")
read_count_df <- FilterIndel(read_count_df, outfile=outfile)
stat_df <- GetStat(read_count_df, stat_df, stage="FilterIndel")
knitr::kable(stat_df, format = "markdown")
|  | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
Input | NA | 10461 | 123530 | 4 | 12 |
Swarm | TRUE | 418 | 123530 | 4 | 12 |
LFNglobalReadCount | 2 | 147 | 123236 | 4 | 12 |
FilterIndel | NA | 143 | 123196 | 4 | 12 |
This filter is applicable only to coding sequences. It checks the number of STOP codons in all 3 reading frames, and eliminates ASVs with a STOP codon in all of them.
The numerical id of the genetic code can be chosen from NCBI. By default, it is the invertebrate mitochondrial genetic code (5), since the STOP codons of this genetic code are also STOP codons in almost all genetic codes.
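As an illustration of the reading-frame logic (a sketch only, not the vtamR implementation), the STOP codons of the invertebrate mitochondrial code (TAA and TAG) can be counted in the three forward reading frames; the example sequence below is made up:
# Count TAA/TAG STOP codons in the three reading frames (illustration only)
count_stops <- function(sequence, frame) {
  starts <- seq(frame, nchar(sequence) - 2, by = 3)
  codons <- substring(sequence, starts, starts + 2)
  sum(codons %in% c("TAA", "TAG"))
}
asv <- "ATGTTAGCCTAAGGT"                          # made-up example sequence
stops <- sapply(1:3, function(f) count_stops(asv, f))
stops
#> [1] 1 1 0
# The ASV would be eliminated only if all three frames contained a STOP codon
all(stops > 0)
#> [1] FALSE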
outfile <- file.path(outdir, "5_FilterCodonStop.csv")
genetic_code = 5
read_count_df <- FilterCodonStop(read_count_df, outfile=outfile, genetic_code=genetic_code)
stat_df <- GetStat(read_count_df, stat_df, stage="FilerCodonStop", params=genetic_code)
knitr::kable(stat_df, format = "markdown")
|  | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
Input | NA | 10461 | 123530 | 4 | 12 |
Swarm | TRUE | 418 | 123530 | 4 | 12 |
LFNglobalReadCount | 2 | 147 | 123236 | 4 | 12 |
FilterIndel | NA | 143 | 123196 | 4 | 12 |
FilerCodonStop | 5 | 143 | 123196 | 4 | 12 |
This function will run the uchime3_denovo
function implemented in vsearch. It can be run sample by sample (by_sample=T
; default), or on the whole data set at once.
FilterChimera will eliminate ASVs detected as chimeras when running on the whole dataset at once (by_sample=F
). If by_sample=T
, ASVs are eliminated if they have been identified as chimeras in at least sample_prop
proportion of the samples where they are present.
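As a toy illustration of the by-sample rule (not the vtamR code, and with hypothetical numbers):
# Toy example of the by-sample chimera decision (illustration only)
samples_with_asv   <- 5    # samples where the ASV is present
flagged_as_chimera <- 4    # samples where uchime3_denovo flags it as a chimera
sample_prop        <- 0.8
flagged_as_chimera / samples_with_asv >= sample_prop   # TRUE -> the ASV is removed
#> [1] TRUE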
- abskew: A chimera must be at least abskew times less frequent than the parental ASVs.

abskew = 2
by_sample = T
sample_prop = 0.8
outfile <- file.path(outdir, "6_FilterChimera.csv")
read_count_df <- FilterChimera(read_count_df, outfile=outfile, vsearch_path=vsearch_path, by_sample=by_sample, sample_prop=sample_prop, abskew=abskew)
params <- paste(abskew, by_sample, sample_prop, sep=";")
stat_df <- GetStat(read_count_df, stat_df, stage="FilterChimera", params=params)
knitr::kable(stat_df, format = "markdown")
|  | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
Input | NA | 10461 | 123530 | 4 | 12 |
Swarm | TRUE | 418 | 123530 | 4 | 12 |
LFNglobalReadCount | 2 | 147 | 123236 | 4 | 12 |
FilterIndel | NA | 143 | 123196 | 4 | 12 |
FilerCodonStop | 5 | 143 | 123196 | 4 | 12 |
FilterChimera | 2;TRUE;0.8 | 122 | 122619 | 4 | 12 |
Let's eliminate aberrant replicates that are not similar to the other replicates of the same sample.
We will calculate the Renkonen distances among all pairs of replicates within each sample.
renkonen_within_df <- MakeRenkonenDistances(read_count_df, compare_all=FALSE)
Let's make a density plot of the Renkonen distances.
DensityPlot_RenkonenDistance(renkonen_within_df)
We can also make a barplot of the Renkonen distances, and use different colors for different sample types (real/mock/negative).
Note:

- sortedinfo_df contains sample and sample_type columns, but any data frame with these columns would do.
- If you have not run the SortReads function, and sortedinfo_df is not in your environment, just use the sortedinfo file (in the vtamR_demo_out/sorted/ directory).

Barplot_RenkonenDistance(renkonen_within_df, sample_types=sortedinfo_df, x_axis_label_size=6)
In this example, we have only a few distances, so the density plot is not very relevant. The barplot indicates that replicate 1 of the tnegtag sample (negative control) is distant from the other two replicates. Let's set the Renkonen cutoff value to 0.4. We will filter out replicates that have Renkonen distances above this cutoff to most other replicates of the sample. In this case, replicate 1 of the tnegtag sample will be eliminated.
outfile <- file.path(outdir, "7_FilterRenkonen.csv")
cutoff <- 0.4
read_count_df <- FilterRenkonen(read_count_df, outfile=outfile, cutoff=cutoff)
stat_df <- GetStat(read_count_df, stat_df, stage="FilerRenkonen", params=cutoff)
knitr::kable(stat_df, format = "markdown")
|  | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
Input | NA | 10461 | 123530 | 4 | 12 |
Swarm | TRUE | 418 | 123530 | 4 | 12 |
LFNglobalReadCount | 2 | 147 | 123236 | 4 | 12 |
FilterIndel | NA | 143 | 123196 | 4 | 12 |
FilerCodonStop | 5 | 143 | 123196 | 4 | 12 |
FilterChimera | 2;TRUE;0.8 | 122 | 122619 | 4 | 12 |
FilerRenkonen | 0.4 | 122 | 122570 | 4 | 11 |
If you have done the denoising step with swarm, this function is probably redundant.
FilterPCRerror filters out an ASV if it is very similar to another, more frequent ASV. The max_mismatch argument tells how many mismatches are allowed between a pair of ASVs to be evaluated for PCR error. If the proportion of the read counts of two similar ASVs (less abundant / more abundant) is less than or equal to pcr_error_var_prop, the less abundant ASV is flagged as a PCR error.
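As a quick numeric illustration (hypothetical read counts, not from the demo data):
# Toy illustration of the FilterPCRerror ratio rule (not the vtamR code itself)
read_count_frequent <- 1000   # the more frequent ASV
read_count_similar  <- 40     # a similar ASV (within max_mismatch mismatches)
pcr_error_var_prop  <- 0.1
read_count_similar / read_count_frequent <= pcr_error_var_prop   # TRUE -> flagged as PCR error
#> [1] TRUE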
OptimizePCRerror
The default value of pcr_error_var_prop
(0.1) is arbitrary. To choose a value adapted to your data set, you can use the OptimizePCRerror
function.
OptimizePCRerror will find all highly similar ASV pairs (max_mismatch=1 by default) within a mock sample, where one ASV is expected and the other is not. Their read_count ratio is printed in the output file.
The function considers only ASVs with more than min_read_count reads in the sample, to avoid ratios based on low read counts, which are more influenced by stochastic events and will probably be filtered out anyway.
See details using the help: ?OptimizePCRerror
.
Note: I use max_mismatch=2 in this example, to have some output in my very small (and artificial) data set, but I recommend using max_mismatch=1.
Note: If you do not know the exact sequences of the species present in your mock samples, and thus, you do not have the mock_composition file yet, read the How to make a mock_composition file.
outfile <- file.path(outdir, "OptimizePCRerror.csv")
OptimizePCRerror_df <- OptimizePCRerror(read_count_df,
mock_composition=mock_composition,
vsearch_path=vsearch_path,
outfile=outfile,
max_mismatch=2,
min_read_count=5
)
Let’s see the beginning of the output:
knitr::kable(head(OptimizePCRerror_df), format = "markdown")
sample | expected_read_count | unexpected_read_count | pcr_error_var_prop | expected_asv_id | unexpected_asv_id | expected_asv | unexpected_asv |
---|---|---|---|---|---|---|---|
tpos1 | 1022 | 14 | 0.0136986 | 1688 | 5765 | CTTATATTTTATTTTTGGTGCTTGATCAGGGATAGTGGGAACTTCTTTAAGAATTCTTATTCGAGCTGAACTTGGTCATGCGGGATCTTTAATCGGAGACGATCAAATTTACAATGTAATTGTTACTGCACACGCCTTTGTAATAATTTTTTTTATAGTTATACCTATTTTAATT | CCTTTATTTTATTTTTGGTGCTTGATCAGGGATAGTGGGAACTTCTTTAAGAATTCTTATTCGAGCTGAACTTGGTCATGCGGGATCTTTAATCGGAGACGATCAAATTTACAATGTAATTGTTACTGCACACGCCTTTGTAATAATTTTTTTTATAGTTATACCTATTTTAATT |
tpos1 | 1022 | 5 | 0.0048924 | 1688 | 1682 | CTTATATTTTATTTTTGGTGCTTGATCAGGGATAGTGGGAACTTCTTTAAGAATTCTTATTCGAGCTGAACTTGGTCATGCGGGATCTTTAATCGGAGACGATCAAATTTACAATGTAATTGTTACTGCACACGCCTTTGTAATAATTTTTTTTATAGTTATACCTATTTTAATT | CTTATATTTTATTTTTGGTGCTTGATCAGGGATAGTGGGAACTTCTTTAAGAATTCTTATTCGAGCTGAACTTGGTCATGCGGGATCTTTAATCGGAGACGATCAAATTTACAATGTAATTGTTACTGCACACGCCTTTGTAATAATTTTTTTTATAGTGATACCTATTATAATT |
tpos1 | 12965 | 12 | 0.0009256 | 2009 | 6042 | TCTATATTTCATTTTTGGTGCTTGGGCAGGTATGGTAGGTACCTCATTAAGACTTTTAATTCGAGCCGAGTTGGGTAACCCGGGTTCATTAATTGGGGACGATCAAATTTATAACGTAATCGTAACTGCTCATGCCTTTATTATGATTTTTTTTATAGTGATACCTATTATAATT | TCTATATTTCATTTTTGGTGCTTGGGCAGGTATGGTAGGTACCTCATTAAGACTTTTAATTCGAGCCGAGTTGGGTAACCCGGGTTCATTAATTGGGGACGATCAAATTTATAACGTAATCGTAACTGATCATGGCTTTATTATGATTTTTTTTATAGTGATACCTATTATAATT |
FilterPCRerror
It seems that Swarm has done a good job. The highest read count ratio is 0.013 in the output of OptimizePCRerror
. This should be taken as a lower limit to pcr_error_var_prop
. Let’s use 0.05 for filtering out PCR errors sample by sample.
See the help for more detail ?FilterPCRerror
.
Note: Again, I use max_mismatch <- 2 in this example, since I have used this value for OptimizePCRerror, but I recommend using max_mismatch <- 1.
pcr_error_var_prop <- 0.05
max_mismatch <- 2
outfile <- file.path(outdir, "8_FilterPCRerror.csv")
read_count_df <- FilterPCRerror(read_count_df,
outfile=outfile,
vsearch_path=vsearch_path,
pcr_error_var_prop=pcr_error_var_prop,
max_mismatch=max_mismatch
)
params <- paste(pcr_error_var_prop, max_mismatch, by_sample, sep=";")
stat_df <- GetStat(read_count_df, stat_df, stage="FilterPCRerror", params=params)
knitr::kable(stat_df, format = "markdown")
|  | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
Input | NA | 10461 | 123530 | 4 | 12 |
Swarm | TRUE | 418 | 123530 | 4 | 12 |
LFNglobalReadCount | 2 | 147 | 123236 | 4 | 12 |
FilterIndel | NA | 143 | 123196 | 4 | 12 |
FilerCodonStop | 5 | 143 | 123196 | 4 | 12 |
FilterChimera | 2;TRUE;0.8 | 122 | 122619 | 4 | 12 |
FilerRenkonen | 0.4 | 122 | 122570 | 4 | 11 |
FilterPCRerror | 0.05;2;TRUE | 117 | 122531 | 4 | 11 |
LFNsampleReplicate
will eliminate occurrences with very low read counts compared to the total number of reads in the sample-replicate. The default cutoff proportion is 0.001. We can have an idea of the maximum value of this cutoff, by examining the proportions of expected ASVs in the mock samples using the OptimizeLFNsampleReplicate
function.
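The criterion can be sketched with dplyr as follows (illustrative only, not the vtamR implementation; the exact handling of the cutoff boundary may differ):
# Sketch of the LFNsampleReplicate criterion (illustration only)
library(dplyr)
cutoff <- 0.001
sketch <- read_count_df %>%
  group_by(sample, replicate) %>%
  mutate(prop = read_count / sum(read_count)) %>%   # proportion within the sample-replicate
  ungroup() %>%
  filter(prop >= cutoff)                            # occurrences below the cutoff are removed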
OptimizeLFNsampleReplicate
Note: If you do not know the exact sequences of the species present in your mock samples, and thus, you do not have the mock_composition file yet, read the How to make a mock_composition file.
outfile = file.path(outdir, "OptimizeLFNsampleReplicate.csv")
OptimizeLFNsampleReplicate_df <- OptimizeLFNsampleReplicate(read_count=read_count_df,
mock_composition=mock_composition,
outfile=outfile)
knitr::kable(head(OptimizeLFNsampleReplicate_df), format = "markdown")
sample | replicate | action | asv_id | read_count | read_count_sample_replicate | lfn_sample_replicate_cutoff | asv | taxon |
---|---|---|---|---|---|---|---|---|
tpos1 | 3 | keep | 120 | 118 | 23972 | 0.0049 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC | Caenis pusilla |
tpos1 | 1 | keep | 708 | 189 | 21939 | 0.0086 | CCTTTATCTTGTATTTGGTGCCTGGGCCGGAATGGTAGGGACCGCCCTAAGCCTTCTTATTCGGGCCGAACTAAGCCAGCCTGGCTCGCTATTAGGTGATAGCCAAATTTATAATGTTATTGTTACCGCCCACGCCTTCGTAATAATTTTCTTTATAGTCATGCCAATTCTCATT | Phoxinus phoxinus |
tpos1 | 2 | keep | 708 | 244 | 27584 | 0.0088 | CCTTTATCTTGTATTTGGTGCCTGGGCCGGAATGGTAGGGACCGCCCTAAGCCTTCTTATTCGGGCCGAACTAAGCCAGCCTGGCTCGCTATTAGGTGATAGCCAAATTTATAATGTTATTGTTACCGCCCACGCCTTCGTAATAATTTTCTTTATAGTCATGCCAATTCTCATT | Phoxinus phoxinus |
tpos1 | 2 | keep | 120 | 252 | 27584 | 0.0091 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC | Caenis pusilla |
tpos1 | 3 | keep | 708 | 228 | 23972 | 0.0095 | CCTTTATCTTGTATTTGGTGCCTGGGCCGGAATGGTAGGGACCGCCCTAAGCCTTCTTATTCGGGCCGAACTAAGCCAGCCTGGCTCGCTATTAGGTGATAGCCAAATTTATAATGTTATTGTTACCGCCCACGCCTTCGTAATAATTTTCTTTATAGTCATGCCAATTCTCATT | Phoxinus phoxinus |
tpos1 | 2 | keep | 1688 | 331 | 27584 | 0.0119 | CTTATATTTTATTTTTGGTGCTTGATCAGGGATAGTGGGAACTTCTTTAAGAATTCTTATTCGAGCTGAACTTGGTCATGCGGGATCTTTAATCGGAGACGATCAAATTTACAATGTAATTGTTACTGCACACGCCTTTGTAATAATTTTTTTTATAGTTATACCTATTTTAATT | Synorthocladius semivirens |
The lowest proportion of the read count of an expected ASV to the total read count of its sample-replicate is 0.0049 (sequence of Caenis pusilla in replicate 3 of the tpos1 mock sample). A cutoff value for LFNsampleReplicate higher than this will eliminate some of the expected occurrences and create false negatives. Therefore, we choose 0.004 as a cutoff.
LFNsampleReplicate
lfn_sample_replicate_cutoff <- 0.004
outfile <- file.path(outdir, "9_LFNsampleReplicate.csv")
read_count_df <- LFNsampleReplicate(read_count_df,
cutoff=lfn_sample_replicate_cutoff,
outfile=outfile
)
stat_df <- GetStat(read_count_df,
stat_df,
stage="LFNsampleReplicate",
params=lfn_sample_replicate_cutoff
)
knitr::kable(stat_df, format = "markdown")
|  | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
Input | NA | 10461 | 123530 | 4 | 12 |
Swarm | TRUE | 418 | 123530 | 4 | 12 |
LFNglobalReadCount | 2 | 147 | 123236 | 4 | 12 |
FilterIndel | NA | 143 | 123196 | 4 | 12 |
FilerCodonStop | 5 | 143 | 123196 | 4 | 12 |
FilterChimera | 2;TRUE;0.8 | 122 | 122619 | 4 | 12 |
FilerRenkonen | 0.4 | 122 | 122570 | 4 | 11 |
FilterPCRerror | 0.05;2;TRUE | 117 | 122531 | 4 | 11 |
LFNsampleReplicate | 0.004 | 60 | 120784 | 4 | 11 |
To ensure repeatability, we can accept occurrences if they are present in at least min_replicate_number
replicates of the sample (2 by default).
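Conceptually, the rule can be sketched with dplyr as follows (illustration only, not the vtamR implementation):
# Sketch of the FilterMinReplicate rule (illustration only)
library(dplyr)
sketch <- read_count_df %>%
  group_by(sample, asv_id) %>%
  filter(n_distinct(replicate) >= 2) %>%   # keep occurrences seen in at least 2 replicates of the sample
  ungroup()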
min_replicate_number <- 2
outfile <- file.path(outdir, "10_FilterMinReplicate.csv")
read_count_df <- FilterMinReplicate(read_count_df,
min_replicate_number,
outfile=outfile
)
stat_df <- GetStat(read_count_df,
stat_df,
stage="FilterMinReplicate",
params=min_replicate_number
)
knitr::kable(stat_df, format = "markdown")
|  | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
Input | NA | 10461 | 123530 | 4 | 12 |
Swarm | TRUE | 418 | 123530 | 4 | 12 |
LFNglobalReadCount | 2 | 147 | 123236 | 4 | 12 |
FilterIndel | NA | 143 | 123196 | 4 | 12 |
FilerCodonStop | 5 | 143 | 123196 | 4 | 12 |
FilterChimera | 2;TRUE;0.8 | 122 | 122619 | 4 | 12 |
FilerRenkonen | 0.4 | 122 | 122570 | 4 | 11 |
FilterPCRerror | 0.05;2;TRUE | 117 | 122531 | 4 | 11 |
LFNsampleReplicate | 0.004 | 60 | 120784 | 4 | 11 |
FilterMinReplicate | 2 | 38 | 119435 | 4 | 11 |
The LFNvariant filter will eliminate occurrences with very low read counts compared to the total number of reads of the ASV. The default cutoff proportion is 0.001. This filter is designed to filter out occurrences present in the data set due to tag-jumps or light inter-sample contamination.
The LFNreadCount filter simply eliminates occurrences with read counts below a cutoff (10 by default).
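To make the two criteria concrete, here is a dplyr sketch of the quantities involved (illustrative only, not the vtamR implementation; note that LFNvariant compares each occurrence to the ASV total, whereas LFNsampleReplicate above compared it to the sample-replicate total):
# Sketch of the LFNvariant and LFNreadCount criteria (illustration only)
library(dplyr)
lfn_variant_cutoff    <- 0.001
lfn_read_count_cutoff <- 10
sketch <- read_count_df %>%
  group_by(asv_id) %>%
  mutate(prop_of_asv = read_count / sum(read_count)) %>%   # proportion of the ASV's total reads
  ungroup() %>%
  filter(prop_of_asv >= lfn_variant_cutoff,                # LFNvariant criterion
         read_count >= lfn_read_count_cutoff)              # LFNreadCount criterion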
To find the best cutoff values for these two filters, the OptimizeLFNreadCountLFNvariant function will count the number of false positive, false negative and true positive occurrences over a series of combinations of cutoff values of these two filters. To run this function, we need a known_occurrences_df data frame that lists all occurrences known to be FP or TP.
MakeKnownOccurrences
The MakeKnownOccurrences
function will identify false positive, false negative and true positive occurrences in control samples (mock and negative). Some false positives can also be identified in real samples if samples of different habitats are included in the data sets.
The false positive and the true positive occurrences are written to the known_occurrences
data frame (or file), false negatives to the missing_occurrences
data frame (or file), and a performance_metrics
data frame is also produced with the count of these occurrences. For details see ?MakeKnownOccurrences
.
This function takes a read_count_samples data frame as input, where the replicates of the same sample have been pooled (see ?PoolReplicates).
Note:

- sortedinfo_df contains sample, sample_type, replicate and habitat columns, but any data frame with these columns would do.
- If you have not run the SortReads function, and sortedinfo_df is not in your environment, just use the sortedinfo file (in the vtamR_demo_out/sorted directory).

# Pool replicates
read_count_samples_df <- PoolReplicates(read_count_df)
# Detect known occurrences
results <- MakeKnownOccurrences(read_count_samples = read_count_samples_df,
sortedinfo=sortedinfo_df,
mock_composition=mock_composition
)
# give explicit names to the 3 output data frames
known_occurrences_df <- results[[1]]
missing_occurrences_df <- results[[2]]
performance_metrics_df <- results[[3]]
OptimizeLFNreadCountLFNvariant
The LFNreadCount
and LFNvariant
functions are run for a series of cutoff value combinations of the two filters, followed by FilterMinReplicate
. For each parameter combination, the number of FN, TP, and FP is reported. Choose the parameter setting that minimizes FN and FP.
Here we will use the default range of cutoffs to test, but you can set the minimum, the maximum and the increment for the cutoff values for both filters. (see ?OptimizeLFNreadCountLFNvariant
). We set min_replicate_number
to 2, to eliminate non-repeatable occurrences among the three replicates of each sample.
outfile = file.path(outdir, "OptimizeLFNreadCountLFNvariant.csv")
OptimizeLFNreadCountLFNvariant_df <- OptimizeLFNreadCountLFNvariant(
read_count_df,
known_occurrences=known_occurrences_df,
outfile= outfile,
min_replicate_number=2
)
#head(OptimizeLFNreadCountLFNvariant_df)
knitr::kable(head(OptimizeLFNreadCountLFNvariant_df), format = "markdown")
lfn_read_count_cutoff | lnf_variant_cutoff | FN | TP | FP |
---|---|---|---|---|
10 | 0.001 | 0 | 6 | 4 |
15 | 0.001 | 0 | 6 | 4 |
20 | 0.001 | 0 | 6 | 4 |
25 | 0.001 | 0 | 6 | 4 |
30 | 0.001 | 0 | 6 | 4 |
35 | 0.001 | 0 | 6 | 4 |
LFNvariant, LFNreadCount
From the output, we choose 0.001 for the cutoff of LFNvariant
and 10 for LFNreadCount
, since this is the least stringent combination that keeps all expected occurrences (6 TP, 0 FN) and has the fewest FP (4 FP).
We will run the two filters on the same input data frame (for which the parameters have been optimized), and pool the results by accepting only occurrences that pass both filters using the PoolFilters function.
See ?LFNvariant, ?LFNreadCount
and ?PoolFilters
for details.
LFNvariant
lnf_variant_cutoff = 0.001
outfile <- file.path(outdir, "11_LFNvariant.csv")
read_count_df_lnf_variant <- LFNvariant(read_count_df,
cutoff=lnf_variant_cutoff,
outfile=outfile
)
stat_df <- GetStat(read_count_df_lnf_variant,
stat_df,
stage="LFNvariant",
params=lnf_variant_cutoff)
LFNreadCount
lfn_read_count_cutoff <- 10
outfile <- file.path(outdir, "12_LFNreadCount.csv")
read_count_df_lfn_read_count <- LFNreadCount(read_count_df,
cutoff=lfn_read_count_cutoff,
outfile=outfile
)
stat_df <- GetStat(read_count_df_lfn_read_count,
stat_df, stage="LFNreadCount",
params=lfn_read_count_cutoff
)
Combine results
outfile <- file.path(outdir, "13_poolLFN.csv")
read_count_df <- PoolFilters(read_count_df_lfn_read_count,
read_count_df_lnf_variant,
outfile=outfile
)
stat_df <- GetStat(read_count_df,
stat_df,
stage="FilterLFN"
)
# delete temporary data frames
rm(read_count_df_lfn_read_count)
rm(read_count_df_lnf_variant)
knitr::kable(stat_df, format = "markdown")
|  | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
Input | NA | 10461 | 123530 | 4 | 12 |
Swarm | TRUE | 418 | 123530 | 4 | 12 |
LFNglobalReadCount | 2 | 147 | 123236 | 4 | 12 |
FilterIndel | NA | 143 | 123196 | 4 | 12 |
FilerCodonStop | 5 | 143 | 123196 | 4 | 12 |
FilterChimera | 2;TRUE;0.8 | 122 | 122619 | 4 | 12 |
FilerRenkonen | 0.4 | 122 | 122570 | 4 | 11 |
FilterPCRerror | 0.05;2;TRUE | 117 | 122531 | 4 | 11 |
LFNsampleReplicate | 0.004 | 60 | 120784 | 4 | 11 |
FilterMinReplicate | 2 | 38 | 119435 | 4 | 11 |
LFNvariant | 0.001 | 38 | 119366 | 4 | 11 |
LFNreadCount | 10 | 34 | 119372 | 4 | 11 |
FilterLFN | NA | 34 | 119320 | 4 | 10 |
Let's run FilterMinReplicate again to ensure repeatability among replicates of the same sample (2 by default).
min_replicate_number <- 2
outfile <- file.path(outdir, "14_FilterMinReplicate.csv")
read_count_df <- FilterMinReplicate(read_count_df,
min_replicate_number,
outfile=outfile
)
stat_df <- GetStat(read_count_df,
stat_df,
stage="FilterMinReplicate",
params=min_replicate_number
)
knitr::kable(stat_df, format = "markdown")
|  | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
Input | NA | 10461 | 123530 | 4 | 12 |
Swarm | TRUE | 418 | 123530 | 4 | 12 |
LFNglobalReadCount | 2 | 147 | 123236 | 4 | 12 |
FilterIndel | NA | 143 | 123196 | 4 | 12 |
FilerCodonStop | 5 | 143 | 123196 | 4 | 12 |
FilterChimera | 2;TRUE;0.8 | 122 | 122619 | 4 | 12 |
FilerRenkonen | 0.4 | 122 | 122570 | 4 | 11 |
FilterPCRerror | 0.05;2;TRUE | 117 | 122531 | 4 | 11 |
LFNsampleReplicate | 0.004 | 60 | 120784 | 4 | 11 |
FilterMinReplicate | 2 | 38 | 119435 | 4 | 11 |
LFNvariant | 0.001 | 38 | 119366 | 4 | 11 |
LFNreadCount | 10 | 34 | 119372 | 4 | 11 |
FilterLFN | NA | 34 | 119320 | 4 | 10 |
FilterMinReplicate1 | 2 | 31 | 119147 | 3 | 9 |
Let's pool the replicates of the same sample. The PoolReplicates function will take the mean of the non-zero read counts of each ASV over the replicates of the same sample.
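For example, for an ASV with read counts of 10, 0 and 20 in the three replicates of a sample, the pooled value is the mean over the non-zero replicates:
# Toy example of the "mean of non-zero read counts" rule (illustration only)
replicate_counts <- c(10, 0, 20)
mean(replicate_counts[replicate_counts > 0])
#> [1] 15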
outfile <- file.path(outdir, "15_PoolReplicates.csv")
read_count_samples_df <- PoolReplicates(read_count_df,
outfile=outfile
)
stat_df <- GetStat(read_count_samples_df,
stat_df,
stage="PoolReplicates"
)
knitr::kable(stat_df, format = "markdown")
|  | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
Input | NA | 10461 | 123530 | 4 | 12 |
Swarm | TRUE | 418 | 123530 | 4 | 12 |
LFNglobalReadCount | 2 | 147 | 123236 | 4 | 12 |
FilterIndel | NA | 143 | 123196 | 4 | 12 |
FilerCodonStop | 5 | 143 | 123196 | 4 | 12 |
FilterChimera | 2;TRUE;0.8 | 122 | 122619 | 4 | 12 |
FilerRenkonen | 0.4 | 122 | 122570 | 4 | 11 |
FilterPCRerror | 0.05;2;TRUE | 117 | 122531 | 4 | 11 |
LFNsampleReplicate | 0.004 | 60 | 120784 | 4 | 11 |
FilterMinReplicate | 2 | 38 | 119435 | 4 | 11 |
LFNvariant | 0.001 | 38 | 119366 | 4 | 11 |
LFNreadCount | 10 | 34 | 119372 | 4 | 11 |
FilterLFN | NA | 34 | 119320 | 4 | 10 |
FilterMinReplicate1 | 2 | 31 | 119147 | 3 | 9 |
PoolReplicates | NA | 31 | 40041 | 3 | NA |
Run MakeKnownOccurrences again to get performance metrics (TP, FP, FN). This time we will write the output data frames to files as well. The performance_metrics file will give you the counts of TP, FP and FN, as well as precision and sensitivity. You can find false negatives in missing_occurrences, and true and false positives in known_occurrences.
missing_occurrences <- file.path(outdir, "missing_occurrences.csv")
performance_metrics <- file.path(outdir, "performance_metrics.csv")
known_occurrences <- file.path(outdir, "known_occurrences.csv")
results <- MakeKnownOccurrences(read_count_samples_df,
sortedinfo=sortedinfo_df,
mock_composition=mock_composition,
known_occurrences=known_occurrences,
missing_occurrences=missing_occurrences,
performance_metrics=performance_metrics
)
# give explicit names to the 3 output data frames
known_occurrences_df <- results[[1]]
missing_occurrences_df <- results[[2]]
performance_metrics_df <- results[[3]]
knitr::kable(performance_metrics_df, format = "markdown")
TP | FP | FN | Precision | Sensitivity |
---|---|---|---|---|
6 | 4 | 0 | 0.6 | 1 |
If you have already assigned ASVs to taxa (before making the mock_composition file), you can use the same output (asv_tax data frame) and skip this step.
If you haven't assigned ASVs to taxa yet, it is time to do it.
See the brief description of the algorithm with ?TaxAssign
.
For the format of taxonomy
and blast_db
check the Reference database for taxonomic assignments section. You can also download a ready-to-use COI database (see Installation).
outfile <- file.path(outdir, "TaxAssign.csv")
asv_tax <- TaxAssign(asv=read_count_samples_df,
taxonomy=taxonomy,
blast_db=blast_db,
blast_path=blast_path,
outfile=outfile,
num_threads=num_threads
)
WriteASVtable will reorganize the read_count_samples_df data frame, with samples in columns and ASVs in lines (from long format to wide format).
It is possible to add supplementary information as well:

- taxonomic assignments (asv_tax)
- read count sums by ASV (add_sums_by_asv)
- read count sums by sample (add_sums_by_sample)
- expected ASVs of the mock samples (add_expected_asv)
- samples with no ASVs left after filtering (add_empty_samples)

For more information see ?WriteASVtable
outfile=file.path(outdir, "Final_asvtable_with_TaxAssign.csv")
asv_table_df <- WriteASVtable(read_count_samples_df,
outfile=outfile,
asv_tax=asv_tax,
sortedinfo=sortedinfo_df,
add_empty_samples=T,
add_sums_by_sample=T,
add_sums_by_asv=T,
add_expected_asv=T,
mock_composition=mock_composition
)
Print the number of reads, ASVs, samples and replicates after each step to a CSV file.
write.csv(stat_df, file = file.path(outdir, "stat_steps.csv"))
Randomly select n sequences from each input fasta file. It can be used before or after demultiplexing (SortReads). However, if using LFNvariant and LFNreadCount, which are partially based on the number of reads in negative controls, standardizing the number of reads among samples does not make sense. Thus I use RandomSeq after Merge, to get the same number of reads for each replicate series (same samples, different replicates in each fasta file).

- fasta_dir: Directory containing the input fasta files.
- n: Number of sequences to be taken randomly (without replacement).

randomseq_dir = file.path(outdir, "random_seq")
fastainfo_df <- RandomSeq(fastainfo_df,
fasta_dir=merged_dir,
outdir=randomseq_dir,
vsearch_path=vsearch_path,
n=10000
)
#> [1] "mfzr_1_fw.fasta"
#> [1] "mfzr_2_fw.fasta"
#> [1] "mfzr_3_fw.fasta"
This function scans all files in the dir that start with a number (see the file names of the output files of the different filtering steps). It will select all lines where the feature (asv_id/asv/sample/replicate) has the value we are looking for.
Examples
Get the history of asv_id (feature
) 27 (value
).
tmp_ASV_27 <- HistoryBy(dir=outdir,
feature="asv_id",
value="27"
)
knitr::kable(head(tmp_ASV_27), format = "markdown")
file | asv_id | sample | replicate | read_count | asv |
---|---|---|---|---|---|
1_before_filter.csv | 27 | tpos1 | 1 | 131 | ACTATACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
1_before_filter.csv | 27 | tpos1 | 2 | 257 | ACTATACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
1_before_filter.csv | 27 | tpos1 | 3 | 89 | ACTATACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
2_Swarm_by_sample.csv | 27 | tpos1 | 1 | 178 | ACTATACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
2_Swarm_by_sample.csv | 27 | tpos1 | 2 | 349 | ACTATACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
2_Swarm_by_sample.csv | 27 | tpos1 | 3 | 117 | ACTATACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
Get the history of an ASV given its sequence (feature = "asv").
tmp_replicate_1 <- HistoryBy(dir=outdir,
feature="asv",
value="CCTTTATTTTATTTTCGGTATCTGGTCAGGTCTCGTAGGATCATCACTTAGATTTATTATTCGAATAGAATTAAGAACTCCTGGTAGATTTATTGGCAACGACCAAATTTATAACGTAATTGTTACATCTCATGCATTTATTATAATTTTTTTTATAGTTATACCAATCATAATT"
)
knitr::kable(tmp_replicate_1, format = "markdown")
file | asv_id | sample | replicate | read_count | asv |
---|---|---|---|---|---|
1_before_filter.csv | 1589 | tpos1 | 1 | 23 | CCTTTATTTTATTTTCGGTATCTGGTCAGGTCTCGTAGGATCATCACTTAGATTTATTATTCGAATAGAATTAAGAACTCCTGGTAGATTTATTGGCAACGACCAAATTTATAACGTAATTGTTACATCTCATGCATTTATTATAATTTTTTTTATAGTTATACCAATCATAATT |
1_before_filter.csv | 1589 | tpos1 | 2 | 23 | CCTTTATTTTATTTTCGGTATCTGGTCAGGTCTCGTAGGATCATCACTTAGATTTATTATTCGAATAGAATTAAGAACTCCTGGTAGATTTATTGGCAACGACCAAATTTATAACGTAATTGTTACATCTCATGCATTTATTATAATTTTTTTTATAGTTATACCAATCATAATT |
1_before_filter.csv | 1589 | tpos1 | 3 | 31 | CCTTTATTTTATTTTCGGTATCTGGTCAGGTCTCGTAGGATCATCACTTAGATTTATTATTCGAATAGAATTAAGAACTCCTGGTAGATTTATTGGCAACGACCAAATTTATAACGTAATTGTTACATCTCATGCATTTATTATAATTTTTTTTATAGTTATACCAATCATAATT |
Get the history of the sample (feature
) tpos1 (value
).
tmp_sample_tpos1 <- HistoryBy(dir=outdir,
feature="sample",
value="tpos1"
)
knitr::kable(head(tmp_sample_tpos1), format = "markdown")
file | asv_id | sample | replicate | read_count | asv |
---|---|---|---|---|---|
1_before_filter.csv | 7 | tpos1 | 1 | 6 | AATGTATCTAATCTTTGGAGGTTTTTCTGGTATTATTGGAACAGCTTTATCTATTTTAATCAGAATAGAATTATCGCAACCAGGAAACCAAATTTTAATGGGAAACCATCAATTATATAATGTAATTGTAACTTCTCACGCTTTTATTATGATTTTTTTTATGGTAATGCCAATTTTATTA |
1_before_filter.csv | 8 | tpos1 | 1 | 3 | ACCTTATTTTATTTTTGGTGCTTGATCAGGAATAGTAGGAACTTCTTTAAGAATTCTAATTCGAGCTGAATTAGGTCATGCCGGTTCATTAATTGGAGATGATCAAATTTATAATGTAATTGTAACTGCTCATGCTTTTGTAATAATTTTCTTTATAGTTATACCTATTTTAATT |
1_before_filter.csv | 9 | tpos1 | 1 | 1 | ACTACACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
1_before_filter.csv | 10 | tpos1 | 1 | 1 | ACTATACCTGATGTTTGCCTTATTCGCAGGTTTAGTAGGTACAGCATTTTCTGTACTTATTAGAATGGAATTAAGTGCACCAGGAGTTCAATACATCAGTGATAACCAGTTATATAATAGTATTATAACAGCTCACGCTATTGTTATGATATTCTTTATGGTTATGCCTGCCATGATT |
1_before_filter.csv | 11 | tpos1 | 1 | 1 | ACTATACCTTACCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGGAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
1_before_filter.csv | 12 | tpos1 | 1 | 1 | ACTATACCTTATCTTCGCAGTATTCTCAAGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
This function scans all files in the dir that start with a number (see the file names of the output files of the different filtering steps).
It will group each file by a variable (asv/asv_id/sample/replicate) and summarize a feature
(asv/asv_id/sample/replicate/read_count). If the feature
is read_count
, it will give the sum of the read counts for each value of the variable in each file. Otherwise, it returns the number of distinct values of the feature
for each value of the variable in each file.
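A rough dplyr equivalent of what SummarizeBy computes for a single file could look like this (illustrative only; it assumes the default comma separator used in this tutorial):
# Rough equivalent of SummarizeBy for one file (illustration only)
library(dplyr)
one_file <- read.csv(file.path(outdir, "1_before_filter.csv"))
# feature = "read_count", grouped_by = "sample": sum of read counts per sample
one_file %>% group_by(sample) %>% summarise(read_count = sum(read_count))
# feature = "asv", grouped_by = "sample": number of distinct ASVs per sample
one_file %>% group_by(sample) %>% summarise(asv = n_distinct(asv))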
Examples
Get the number of reads of each sample after each filtering step.
From this data frame, we can see that the negative control sample becomes "clean" after the LFN filters, and there is considerable variation among the read numbers of the different real samples.
read_count_by_sample <- SummarizeBy(dir=outdir,
feature="read_count",
grouped_by="sample"
)
knitr::kable(read_count_by_sample, format = "markdown")
sample | 1_before_filter | 2_Swarm_by_sample | 3_LFNglobalReadCount | 4_FilterIndel | 5_FilterCodonStop | 6_FilterChimera | 7_FilterRenkonen | 8_FilterPCRerror | 9_LFNsampleReplicate | 10_FilterMinReplicate | 11_LFNvariant | 12_LFNreadCount | 13_poolLFN | 14_FilterMinReplicate | 15_PoolReplicates |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14ben01 | 28115 | 28115 | 28079 | 28079 | 28079 | 28067 | 28067 | 28067 | 27877 | 27799 | 27799 | 27799 | 27799 | 27799 | 9288 |
14ben02 | 20750 | 20750 | 20683 | 20664 | 20664 | 20664 | 20664 | 20664 | 20134 | 19206 | 19206 | 19206 | 19206 | 19206 | 6583 |
tnegtag | 406 | 406 | 363 | 363 | 363 | 354 | 305 | 305 | 305 | 288 | 219 | 225 | 173 | 0 | 0 |
tpos1 | 74259 | 74259 | 74111 | 74090 | 74090 | 73534 | 73534 | 73495 | 72468 | 72142 | 72142 | 72142 | 72142 | 72142 | 24170 |
Get the number of ASVs for each sample after each filtering step.
From this data frame, we can see that the negative control sample becomes "clean" after the LFN filters, and the mock sample has 10 ASVs at the end.
asv_by_sample <- SummarizeBy(dir=outdir,
feature="asv",
grouped_by="sample"
)
knitr::kable(asv_by_sample, format = "markdown")
sample | 1_before_filter | 2_Swarm_by_sample | 3_LFNglobalReadCount | 4_FilterIndel | 5_FilterCodonStop | 6_FilterChimera | 7_FilterRenkonen | 8_FilterPCRerror | 9_LFNsampleReplicate | 10_FilterMinReplicate | 11_LFNvariant | 12_LFNreadCount | 13_poolLFN | 14_FilterMinReplicate | 15_PoolReplicates |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14ben01 | 2006 | 59 | 24 | 24 | 24 | 22 | 22 | 22 | 5 | 3 | 3 | 3 | 3 | 3 | 3 |
14ben02 | 3002 | 122 | 59 | 57 | 57 | 57 | 57 | 57 | 35 | 20 | 20 | 20 | 20 | 20 | 20 |
tnegtag | 182 | 61 | 23 | 23 | 23 | 22 | 21 | 21 | 21 | 13 | 10 | 6 | 5 | 0 | 0 |
tpos1 | 5579 | 212 | 76 | 74 | 74 | 56 | 56 | 51 | 13 | 10 | 10 | 10 | 10 | 10 | 10 |
Get the number of asv_ids for each replicate after each filtering step.
We can see that the numbers of ASVs are comparable across replicates.
asvid_by_replicate <- SummarizeBy(dir=outdir,
feature="asv_id",
grouped_by="replicate"
)
#> [1] "WARNING: replicate variable is not present in vtamR_demo_out/15_PoolReplicates.csv"
knitr::kable(asvid_by_replicate, format = "markdown")
replicate | 1_before_filter | 2_Swarm_by_sample | 3_LFNglobalReadCount | 4_FilterIndel | 5_FilterCodonStop | 6_FilterChimera | 7_FilterRenkonen | 8_FilterPCRerror | 9_LFNsampleReplicate | 10_FilterMinReplicate | 11_LFNvariant | 12_LFNreadCount | 13_poolLFN | 14_FilterMinReplicate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 4392 | 211 | 115 | 114 | 114 | 95 | 90 | 87 | 37 | 28 | 28 | 28 | 28 | 28 |
2 | 4485 | 231 | 112 | 111 | 111 | 90 | 90 | 86 | 47 | 36 | 36 | 32 | 32 | 29 |
3 | 4821 | 160 | 92 | 88 | 88 | 77 | 77 | 76 | 35 | 33 | 33 | 26 | 26 | 26 |
Pools unique asv
- asv_id
combinations from the input data frame (read_count_df
) and from the input file (asv_list
).
The input file is typically a CSV file containing ASVs seen in earlier data sets with their asv_id
. If there is a conflict within or between the input data, the function quits with an error message. Otherwise, it writes the updated_asv_list to the outfile.
The safest way to avoid incoherence between the asv_ids of earlier and present runs is to use the asv_list and updated_asv_list parameters in the Dereplicate function, as is done in this tutorial. This will synchronize the asv_ids in this run with earlier ones, and write an updated file with all ASVs from the earlier and present data sets. In this case, calling the UpdateASVlist function is not necessary, since it is automatically called from Dereplicate.
However, if you are analyzing a large number of very large data sets, the complete asv_list will grow quickly and might cause memory issues. Most of the ASVs in this list are singletons that are filtered out during the analyses. Therefore, you can opt for homogenizing the asv_ids with earlier runs via the Dereplicate function, without writing an updated_asv_list, and then update the asv_list after the first steps of filtering (e.g. Swarm, LFNglobalReadCount) to keep only ASVs that are more frequent and more likely to appear in future data sets. This is still a quite safe option, and it greatly reduces the number of ASVs kept in this file.
updated_asv_list <- file.path(outdir,
"updated_ASV_list.csv"
)
UpdateASVlist(read_count_df,
asv_list=asv_list,
outfile=updated_asv_list
)
More than one overlapping marker
This function pools different data sets and is particularly useful if the results of more than one overlapping marker should be pooled. In that case, ASVs identical over their overlapping regions are pooled into groups, and the different ASVs of the same group are represented by their centroid (the longest ASV of the group). Pooling can take the mean read counts of the ASVs (mean_over_markers=T
; default) or their sum (mean_over_markers=F
).
The function takes several input CSV files, each in long format containing asv_id
, sample
, read_count
and asv
columns. The file names should be organized in a data frame, with the marker name for each file.
- outfile: Name of the output CSV file with the pooled data set (asv_id, sample, read_count, asv). ASVs are grouped to the same line if identical in their overlapping region, and only the centroids appear in the asv column.
- asv_with_centroids: Name of the output CSV file containing each of the original ASVs (with samples, markers, and read_count) as well as their centroids, i.e. the sequences written to outfile.

files <- data.frame(file=c("vtamR_test/out_mfzr/15_PoolReplicates.csv",
"vtamR_test/test/15_PoolReplicates_ZFZR.csv"),
marker=c("MFZR",
"ZFZR")
)
outfile <- file.path(outdir, "Pooled_datasets.csv")
asv_with_centroids <- file.path(outdir,
"Pooled_datasets_asv_with_centroids.csv"
)
read_count_pool <- PoolDatasets(files,
outfile=outfile,
asv_with_centroids=asv_with_centroids,
mean_over_markers=T,
vsearch_path=vsearch_path
)
Only one marker
Pooling the results of different data sets of the same marker is very simple. Basically, the input files (in long format with asv
, asv_id
, sample
, read_count
columns) are concatenated. The PoolDatasets
function also checks if sample names are unique. If not, it sums the read count of the same sample and same ASV, but returns a warning.
The output file or data frame can be rearranged to wide format by the WriteASVtable function.
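For instance, a minimal sketch of pooling two runs of the same marker might look like the call below; the file paths are hypothetical, and I assume the same arguments as in the multi-marker example apply:
# Hypothetical example: pooling two runs of the same marker (file paths are made up)
files <- data.frame(file=c("run1/15_PoolReplicates.csv",
                           "run2/15_PoolReplicates.csv"),
                    marker=c("MFZR",
                             "MFZR")
)
outfile <- file.path(outdir, "Pooled_runs_MFZR.csv")
read_count_pool <- PoolDatasets(files,
                                outfile=outfile,
                                vsearch_path=vsearch_path
)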
Count the number of reads in fasta
or fastq
files found in the input directory. Input files can be gz compressed or uncompressed, but zip files are not supported.
The fastainfo and the sortedinfo files contain the number of reads after Merge
or SortReads
, so no need to run CountReadsDir
separately. This function can be useful for counting the number of reads in the input fastq
files.
- dir: Input directory containing the fasta or fastq files
- file_type: [fasta/fastq]
- pattern: Regular expression; check only files with pattern in the file name

dir <- "vtamR_test/data"
df <- CountReadsDir(dir,
pattern="_fw.fastq.gz",
file_type="fastq"
)
Make a bar plot of read counts by sample (sample_replicate=F
) or sample-replicate (sample_replicate=T
). Can use different colors for different sample types (real/mock/negative)
- sample_types: Data frame or CSV file containing info on sample types for each sample

sortedinfo <- file.path(sorted_dir, "sortedinfo.csv")
Barplot_ReadCountBySample(read_count_df=read_count_df,
sample_replicate=F,
sample_types=sortedinfo
)
Histogram of read counts by ASV
- min_read_count: Ignore variants with read count below this value
- binwidth: Width of bins

Histogram_ReadCountByVariant(read_count_df,
min_read_count=10,
binwidth=1000
)
Calculate the Renkonen distances among all replicates (compare_all=TRUE) or among replicates of the same sample (compare_all=FALSE) and plot them.
- sample_types: Data frame or CSV file containing info on sample types for each sample.

renkonen_within_df <- MakeRenkonenDistances(read_count_df, compare_all=FALSE)
Barplot_RenkonenDistance(renkonen_within_df, sample_types=sortedinfo)
DensityPlot_RenkonenDistance(renkonen_within_df)
renkonen_all_df <- MakeRenkonenDistances(read_count_df, compare_all=TRUE)
DensityPlot_RenkonenDistance(renkonen_all_df)
CSV file with information on input fastq files, primers, tags, samples with the following columns. Each line corresponds to a sample-replicate combination.
CSV file with information on input fasta files, primers, tags, samples with the following columns. Each line corresponds to a sample-replicate combination.
CSV file with information on demultiplexed and primer trimmed fasta files and samples with the following columns. Each line corresponds to a sample-replicate combination.
CSV file with the following columns.
CSV file or data frame with the following columns.
This file lists ASVs and their IDs from earlier data sets. When provided (optional), identical ASVs in the present and earlier data sets have the same ID. New ASVs (not present in asv_list) will get unique IDs not present in asv_list. It is a CSV file with the following columns:
Data frame with the following columns:
A database is composed of two elements: a BLAST database (blast_db) and a taxonomy file.
blast_db
can be produced using the makeblastdb
command of BLAST:
makeblastdb -dbtype nucl -in [FASTA_FILE] -parse_seqids -taxid_map [TAXID_FILE] -out [DB_NAME]
taxonomy
is a tab separated csv file with the following columns:
A ready-to-use COI database in BLAST format and the associated taxonomy file can be downloaded from https://osf.io/vrfwz/. It was created using mkCOInr. It is also possible to make a customized database using mkCOInr. It can be downloaded and extracted manually or using the download_osf function of vtamR. See Installation.
- compress: If TRUE, output files of the Merge, RandomSeq and SortReads functions are compressed. This saves space, but compressing/decompressing increases run time in some cases. I suggest avoiding compression of intermediate files and deleting/compressing them as soon as the analyses are finished (see more in the troubleshooting section). The input fasta and fastq files of these functions can be compressed (.gz, .bz2, but NOT .zip) or uncompressed; compression is automatically detected based on the file extension.
- delete_tmp: Automatically delete intermediate files. TRUE by default.
- sep: Separator used in csv files. Use the same separator in all CSV files except for the taxonomy file of the reference database, which is tab separated.
- quiet: If TRUE, print as little information to the terminal as possible. TRUE by default.
These are temporary directories that are automatically deleted at the end of the function. In some cases, if the function ends prematurely due to an error, the directory is not deleted. Just delete it manually.
LFNvariant will filter out all occurrences where
(number of reads of the ASV in a sample-replicate) / (total number of reads of the ASV) < cutoff
As a consequence, if an ASV is present in most samples, there are many samples in the data set and the cutoff is relatively high, most occurrences can fall below the cutoff and the total read count of the ASV will decrease strongly at this step.
If the proportion of the read count of an ASV in the output compared to the input is less than min_read_count_prop, vtamR prints a Warning. Take a close look at the variant. It can be the result of a general contamination, and in this case, the whole ASV could/should be eliminated from the data set. Otherwise, you might want to reduce the cutoff value for LFNvariant.
Possible workarounds:

- Run swarm sample by sample (by_sample=T).
- Run LFNglobalReadCount first to eliminate singletons, then swarm.

Both solutions make swarm a bit less efficient, but at least your analyses can get through.
If input fastq files are very large and do not go through Merge
, try working with uncompressed input files.
At the first steps (Merge, RandomSeq, SortReads), many files are produced, and some of them can be very large, especially before the dereplication (before Dereplicate
).
Compressing and decompressing between the different steps can take a while and can cause memory issues in some cases (especially on Windows). Therefore, vtamR uses uncompressed files by default. This behavior can be changed with the compress option.
Here is a list of files I would tend to keep, delete or compress.
- Merge: Delete the output fasta files, but keep fastainfo.csv for information on read counts after Merge.
- RandomSeq: Compress the fasta files, but keep them to ensure reproducibility in case of re-analysis.
- SortReads: Delete the output fasta files, but keep sortedinfo.csv for information on read counts after SortReads.
- Dereplicate: Write the dereplicated info to a file (1_before_filter.csv in the tutorial) and compress it after the analyses. This is an important file, since you can restart the analyses from here without re-doing the long merging and demultiplexing steps. Compress and keep the updated ASV list (ASV_list_with_IDs.csv in the tutorial), since this can be used when analyzing subsequent data sets to synchronize the asv_ids.

If the amplicon is shorter than the reads, use fastq_allowmergestagger=T.