Summary

vtamR is a revised and completed version of VTAM (Validation and Taxonomic Assignation of Metabarcoding Data), rewritten in R. It is a complete metabarcoding pipeline:

  • Analyzes amplicon sequences from raw fastq files through to an Amplicon Sequence Variant (ASV) table of validated ASVs assigned to taxonomic groups.
  • Handles technical or biological replicates of the same sample
  • Uses positive and negative control samples to fine-tune the filtering and reduce false positive and false negative occurrences.
  • Can pool multiple data sets (results of earlier analyses)
  • Can pool results from overlapping markers

Novelties compared to VTAM:

  • Since it is a series of R functions, vtamR is highly adaptable: the different steps of the analyses can be included, excluded, or reordered
  • Includes swarm for denoising
  • Graphic options
  • Includes functions to get statistics for each filtering step (read and variant counts, etc.)
  • The notion of marker and run has been dropped to simplify the analyses

A detailed Tutorial on how to construct a full metabarcoding pipeline is available online.

Installation

Please, follow the Installation instructions.

Tutorial

Set up

Load library

library(vtamR)

Set path to third party programs

# Example for Windows
cutadapt_path <- "C:/Users/Public/cutadapt"
vsearch_path <- "C:/Users/Public/vsearch-2.23.0-win-x86_64/bin/vsearch"
blast_path <- "C:/Users/Public/blast-2.16.0+/bin/blastn"
swarm_path <- "C:/Users/Public/swarm-3.1.5-win-x86_64/bin/swarm"
#  Example for Linux
cutadapt_path <- "~/miniconda3/envs/vtam/bin/cutadapt"
vsearch_path <- "~/miniconda3/envs/vtam/bin/vsearch"
blast_path <- "~/miniconda3/envs/vtam/bin/blastn"
swarm_path <- "swarm"
  • Adapt the paths to the third party programs according to your installation (See Installation).

Set general parameters

num_threads <- 8
sep <- ","
outdir <- "vtamR_demo_out"
  • num_threads: Number of CPUs for multithreaded programs
  • sep: Separator used in csv files
  • outdir: Name of the output directory. Make sure there is no space in the path.

Input Data

The demo files below are included with the vtamR package, which is why we use system.file() to access them in this tutorial. When using your own data, simply provide the file and directory names (e.g. ~/vtamR/fastq).

fastq_dir <- system.file("extdata/demo/fastq", package = "vtamR")
fastqinfo <-  system.file("extdata/demo/fastqinfo.csv", package = "vtamR")
mock_composition <-  system.file("extdata/demo/mock_composition.csv", package = "vtamR")
asv_list <-  system.file("extdata/demo/asv_list.csv", package = "vtamR")

Taxassign database

Set the path to the database for taxonomic assignment and to the accompanying taxonomy file. For this example, we use a very small database included in the vtamR package. To see how to get a real database, visit TaxAssign reference data base. When using your own data, just enter the file names, without using the system.file function.

taxonomy <- system.file("extdata/db_test/taxonomy_reduced.tsv", package = "vtamR")
blast_db <- system.file("extdata/db_test", package = "vtamR")
blast_db <- file.path(blast_db, "COInr_reduced")
  • taxonomy: Tab-separated file with taxonomic information
  • blast_db: BLAST database

Details are in the Reference database for taxonomic assignments section.

Check the format of input files

The CheckFileinfo function tests if all obligatory columns are present and makes some sanity checks. This can be helpful, since these files are produced by the user and may contain errors that are difficult to spot by eye.

CheckFileinfo(file=fastqinfo, dir=fastq_dir, file_type="fastqinfo")
#> [1] "Coherence between samples and sample_type: OK"
#> [1] "Coherence between samples and habitat: OK"
#> [1] "sample_type: OK"
#> [1] "Coherence between samples and replicates: OK"
#> [1] "File extension: OK"
#> [1] "Coherence between fw and rv fastq filename: OK"
#> [1] "Unique tag combinations : OK"
CheckFileinfo(file=mock_composition, file_type="mock_composition")
#> [1] "Action types : OK"
CheckFileinfo(file=asv_list, file_type="asv_list")
#> [1] "1 to 1 relation between asv_id and asv : OK"

Merge - Demultiplex

Depending on your wet-lab and sequencing protocol, each fastq file can contain one or more sample-replicates, and sequences may or may not contain tags (for demultiplexing) and primer sequences.

In this tutorial, the fastq files contain reads from several samples, so read pairs must be

  • merged and quality filtered (Merge)
  • demultiplexed and tags and primers trimmed off (SortReads)

See the From fastq to data frame section on how to deal with other experimental designs.

Merge and quality filter

  • fastqinfo: Either a csv file, or a data frame.
  • fastq_dir: Directory containing the input fastq files.
  • fastainfo_df: The output of Merge. It is the updated version of fastqinfo, where fastq file names have been replaced by fasta file names.
  • fasta_dir: Directory containing the input fasta files for SortReads. This directory is created by Merge.

See ?Merge for the options of quality filtering.

merged_dir <- file.path(outdir, "merged")
fastainfo_df <- Merge(fastqinfo, 
                      fastq_dir=fastq_dir, 
                      vsearch_path=vsearch_path, 
                      outdir=merged_dir,
                      fastq_maxee=1,
                      fastq_maxns=0,
                      fastq_allowmergestagger=F
                      )

Demultiplex and Trim

The SortReads function will demultiplex the fasta files according to the tag combinations and trim the primers from the reads.

See the help (?SortReads) for setting the correct parameters for demultiplexing and primer trimming:

  • If check_reverse is TRUE, SortReads checks the reverse complementary strand as well.
  • sortedinfo_df is the updated version of fastainfo. This data frame and the files listed in it are the input of the Dereplicate function.
sorted_dir <- file.path(outdir, "sorted")
sortedinfo_df <- SortReads(fastainfo_df, 
                           fasta_dir=merged_dir, 
                           outdir=sorted_dir, 
                           check_reverse=TRUE, 
                           cutadapt_path=cutadapt_path, 
                           vsearch_path=vsearch_path
                           )

Dereplicate

At this stage, you should have one FASTA file per sample-replicate, containing merged reads with tags and primers removed.

These FASTA files should now be dereplicated, and each unique sequence (ASV) will be assigned a numerical ID (asv_id). The result is a data frame (read_count_df) that can also be written to a CSV file if outfile is specified. Saving this file is recommended, as it serves as the starting point for filtering. You can test different filtering strategies from this file without needing to re-run the time-consuming merging, demultiplexing, and trimming steps.

If you wish to retain ASV IDs from a previous analysis, you can provide a reference file (asv_list) containing earlier ASVs and their IDs. The complete, updated list of ASVs will then be written to updated_asv_list.

If only updated_asv_list is provided (and asv_list is omitted), the function will write the full list of current ASVs with assigned IDs. This will be useful when analyzing further data sets to ensure consistent ASV IDs across studies.

outfile <- file.path(outdir, "1_before_filter.csv")
updated_asv_list <- file.path(outdir, "ASV_list_with_IDs.csv")

read_count_df <- Dereplicate(sortedinfo_df, 
                             dir=sorted_dir, 
                             outfile=outfile, 
                             asv_list=asv_list, 
                             updated_asv_list=updated_asv_list
                             )

Filter

vtamR has a large number of functions to filter out ASVs, or occurrences of ASVs in a given sample-replicate, that are the results of technical or biological problems associated with metabarcoding.

The different filters

  • can be applied to the original data set, and the results then pooled with the PoolFilters function (accepting only occurrences that pass all filters)
  • or can be applied sequentially, in an order defined according to your needs.

In this section, I give an example using most filtering options. It is up to you to construct your own pipeline.

Each function returns a data frame with the filtered output. You can also write the results to a csv file. This can be useful for tracking the presence/absence of some of the ASVs/samples throughout the analyses using the HistoryBy and SummarizeBy functions.

I will start the names of the output files with a number to keep the order of the different steps and to allow the SummarizeBy function to make final statistics at each step of the analyses.

We can also follow the evolution of the number of reads and ASVs remaining in the data set after each filtering step. Let’s define a data frame (stat_df) that will contain this information, and we will complete it after each step.

stat_df <- data.frame(parameters=character(),
                      asv_count=integer(),
                      read_count=integer(),
                      sample_count=integer(),
                      sample_replicate_count=integer())

Then we add a line with information on the unfiltered data set.

stat_df <- GetStat(read_count_df, stat_df, stage="Input", params=NA)
  • params: String containing the major parameters of each filtering step.
  • stage: String referring to the filtering step.

Denoising by swarm

Swarm (Mahé et al., 2015) is a powerful and fast program that denoises the data set by clustering ASVs that are likely the result of sequencing errors with the ASVs representing real biological sequences. It considerably reduces the number of ASVs.

By default, vtamR uses 1 as swarm’s d and uses the fastidious algorithm. See ?Swarm on how to change this. It is possible to run Swarm separately for each sample (by_sample=TRUE; default) or at once for the whole data set. The second option is more efficient in reducing the number of ASVs, but can cluster very similar real sequences appearing in different samples.

If you are interested in intra-specific variability, I suggest using by_sample=TRUE or skipping this step, since Swarm can cluster together very similar real ASVs.

by_sample <- TRUE
outfile <- file.path(outdir, "2_Swarm_by_sample.csv")

read_count_df <- Swarm(read_count_df, 
                       outfile=outfile, 
                       swarm_path=swarm_path, 
                       num_threads=num_threads, 
                       by_sample=by_sample
                       )

stat_df <- GetStat(read_count_df, stat_df, stage="Swarm", params=by_sample)

Let’s check the reduction of the number of variants after running swarm.

knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| Input | NA | 10461 | 123530 | 4 | 12 |
| Swarm | TRUE | 418 | 123530 | 4 | 12 |

We can also plot the number of reads per variant.

Histogram_ReadCountByVariant(read_count_df, binwidth=100)

As we can see, most ASVs have very low read counts.

LFNglobalReadCount

Although swarm has considerably reduced the number of ASVs, there are still many ASVs with low read counts. The LFNglobalReadCount function filters out all ASVs with a total read count below a threshold.

Let’s eliminate singletons and see how the numbers of ASVs and reads have changed.

global_read_count_cutoff = 2
outfile <- file.path(outdir, "3_LFNglobalReadCount.csv")

read_count_df <- LFNglobalReadCount(read_count_df, 
                                    cutoff=global_read_count_cutoff, 
                                    outfile=outfile
                                    )

stat_df <- GetStat(read_count_df, 
                   stat_df, 
                   stage="LFNglobalReadCount", 
                   params=global_read_count_cutoff
                   )
knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| Input | NA | 10461 | 123530 | 4 | 12 |
| Swarm | TRUE | 418 | 123530 | 4 | 12 |
| LFNglobalReadCount | 2 | 147 | 123236 | 4 | 12 |

FilterIndel

This filter is applicable only to coding sequences. The idea is that if the length of an ASV differs from the most frequent ASV length by a number that is not a multiple of 3, it is likely to be an erroneous sequence or a pseudogene.
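
To illustrate the rule itself, here is a base-R sketch of the idea (not the package implementation; use FilterIndel for the actual filtering):

# Keep ASVs whose length differs from the most frequent length
# (among unique ASVs) by a multiple of 3
asv_lengths <- nchar(unique(read_count_df$asv))
modal_length <- as.integer(names(which.max(table(asv_lengths))))
keep <- (nchar(read_count_df$asv) - modal_length) %% 3 == 0
illustration_df <- read_count_df[keep, ]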

outfile <- file.path(outdir, "4_FilterIndel.csv")

read_count_df <- FilterIndel(read_count_df, outfile=outfile)

stat_df <- GetStat(read_count_df, stat_df, stage="FilterIndel")

knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| Input | NA | 10461 | 123530 | 4 | 12 |
| Swarm | TRUE | 418 | 123530 | 4 | 12 |
| LFNglobalReadCount | 2 | 147 | 123236 | 4 | 12 |
| FilterIndel | NA | 143 | 123196 | 4 | 12 |

FilterCodonStop

This filter is applicable only to coding sequences. It checks for STOP codons in all 3 reading frames, and eliminates ASVs with a STOP codon in all of them.

The numerical ID of the genetic code can be chosen from NCBI. By default, it is the invertebrate mitochondrial genetic code (5), since the STOP codons of this genetic code are also STOP codons in almost all genetic codes.
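
As an illustration of the rule, here is a base-R sketch that checks whether a sequence contains a STOP codon in each of the 3 reading frames, assuming the two STOP codons of the invertebrate mitochondrial code (TAA and TAG). This is not the package implementation, which handles any NCBI genetic code via genetic_code:

# TRUE if each of the 3 reading frames contains a STOP codon (TAA/TAG)
has_stop_all_frames <- function(x, stops=c("TAA", "TAG")) {
  all(sapply(0:2, function(frame) {
    codon_starts <- seq(1 + frame, nchar(x) - 2, by=3)
    codons <- substring(x, codon_starts, codon_starts + 2)
    any(codons %in% stops)
  }))
}
has_stop_all_frames("TAACTAACTAA")  # TRUE: a STOP codon in every frame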

outfile <- file.path(outdir, "5_FilterCodonStop.csv")
genetic_code = 5

read_count_df <- FilterCodonStop(read_count_df, outfile=outfile, genetic_code=genetic_code)

stat_df <- GetStat(read_count_df, stat_df, stage="FilterCodonStop", params=genetic_code)

knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| Input | NA | 10461 | 123530 | 4 | 12 |
| Swarm | TRUE | 418 | 123530 | 4 | 12 |
| LFNglobalReadCount | 2 | 147 | 123236 | 4 | 12 |
| FilterIndel | NA | 143 | 123196 | 4 | 12 |
| FilterCodonStop | 5 | 143 | 123196 | 4 | 12 |

FilterChimera

This function runs the uchime3_denovo algorithm implemented in vsearch. It can be run sample by sample (by_sample=T; default), or on the whole data set at once.

FilterChimera will eliminate ASVs detected as chimeras when running on the whole data set at once (by_sample=F). If by_sample=T, ASVs are eliminated if they have been identified as chimeras in at least a sample_prop proportion of the samples where they are present.

  • abskew: A chimera must be at least abskew times less frequent than the parental ASVs.
abskew <- 2
by_sample <- TRUE
sample_prop <- 0.8
outfile <- file.path(outdir, "6_FilterChimera.csv")

read_count_df <- FilterChimera(read_count_df, 
                               outfile=outfile, 
                               vsearch_path=vsearch_path, 
                               by_sample=by_sample, 
                               sample_prop=sample_prop, 
                               abskew=abskew
                               )

params <- paste(abskew, by_sample, sample_prop, sep=";")
stat_df <- GetStat(read_count_df, stat_df, stage="FilterChimera", params=params)

knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| Input | NA | 10461 | 123530 | 4 | 12 |
| Swarm | TRUE | 418 | 123530 | 4 | 12 |
| LFNglobalReadCount | 2 | 147 | 123236 | 4 | 12 |
| FilterIndel | NA | 143 | 123196 | 4 | 12 |
| FilterCodonStop | 5 | 143 | 123196 | 4 | 12 |
| FilterChimera | 2;TRUE;0.8 | 122 | 122619 | 4 | 12 |

FilterRenkonen

Let’s eliminate aberrant replicates that are not similar to other replicates of the same sample.

We will calculate the Renkonen distances among all pairs of replicates within each sample.

renkonen_within_df <- MakeRenkonenDistances(read_count_df, compare_all=FALSE)
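
For reference, the Renkonen distance between two replicates is one minus the Renkonen similarity: D = 1 - sum(min(p_i, q_i)), where p_i and q_i are the relative read counts of ASV i in the two replicates. A minimal R sketch of the calculation:

# Renkonen distance between two aligned read-count vectors (one element per ASV)
renkonen_distance <- function(p, q) {
  1 - sum(pmin(p / sum(p), q / sum(q)))
}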

Let’s make a density plot of the Renkonen distances.

DensityPlot_RenkonenDistance(renkonen_within_df)

We can also make a barplot of the Renkonen distances, and use different colors for different sample types (real/mock/negative).

Note:

  • sortedinfo_df contains sample and sample_type columns, but any data frame with these columns would do.
  • If you have restarted the analyses after the SortReads function, and
    sortedinfo_df is not in your environment, just use the sortedinfo file (in the vtamR_demo_out/sorted/ directory).
Barplot_RenkonenDistance(renkonen_within_df, sample_types=sortedinfo_df, x_axis_label_size=6)

In this example, we have only a few distances, so the density plot is not very informative. The barplot indicates that replicate 1 of the tnegtag sample (negative control) is distant from the other two replicates. Let’s set the Renkonen cutoff value to 0.4. We will filter out replicates that have Renkonen distances above this cutoff to most other replicates of the sample. In this case, replicate 1 of the tnegtag sample will be eliminated.

outfile <- file.path(outdir, "7_FilterRenkonen.csv")
cutoff <- 0.4

read_count_df <- FilterRenkonen(read_count_df, outfile=outfile, cutoff=cutoff)

stat_df <- GetStat(read_count_df, stat_df, stage="FilterRenkonen", params=cutoff)

knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| Input | NA | 10461 | 123530 | 4 | 12 |
| Swarm | TRUE | 418 | 123530 | 4 | 12 |
| LFNglobalReadCount | 2 | 147 | 123236 | 4 | 12 |
| FilterIndel | NA | 143 | 123196 | 4 | 12 |
| FilterCodonStop | 5 | 143 | 123196 | 4 | 12 |
| FilterChimera | 2;TRUE;0.8 | 122 | 122619 | 4 | 12 |
| FilterRenkonen | 0.4 | 122 | 122570 | 4 | 11 |

FilterPCRerror

If you have done the denoising step with swarm, this function is probably redundant.

FilterPCRerror filters out an ASV if it is very similar to another, more frequent ASV. The max_mismatch argument sets how many mismatches are allowed between a pair of ASVs for them to be evaluated for PCR error. If the ratio of the read counts of two similar ASVs is less than or equal to pcr_error_var_prop, the less abundant ASV is flagged as a PCR error. For example, if ASV B differs by one mismatch from ASV A, A has 1000 reads and B has 8 reads, then 8/1000 = 0.008 is below the default pcr_error_var_prop of 0.1, so B is flagged as a PCR error.

OptimizePCRerror

The default value of pcr_error_var_prop (0.1) is arbitrary. To choose a value adapted to your data set, you can use the OptimizePCRerror function.

OptimizePCRerror will find all highly similar ASV pairs (max_mismatch=1 by default) within a mock sample, where one ASV is expected and the other is not. Their read count ratio is printed in the output file.

The function considers only ASVs with more than min_read_count reads in the sample, to avoid ratios based on low read counts, which are more influenced by stochastic events and will probably be filtered out anyway.

See details using the help: ?OptimizePCRerror.

Note: I use max_mismatch=2 in this example to have some output in my very small (and artificial) data set, but I recommend using max_mismatch=1.

Note: If you do not know the exact sequences of the species present in your mock samples, and thus, you do not have the mock_composition file yet, read the How to make a mock_composition file.

outfile <- file.path(outdir, "OptimizePCRerror.csv")
OptimizePCRerror_df <- OptimizePCRerror(read_count_df, 
                                        mock_composition=mock_composition, 
                                        vsearch_path=vsearch_path, 
                                        outfile=outfile, 
                                        max_mismatch=2, 
                                        min_read_count=5
                                        )

Let’s see the beginning of the output:

knitr::kable(head(OptimizePCRerror_df), format = "markdown")
| sample | expected_read_count | unexpected_read_count | pcr_error_var_prop | expected_asv_id | unexpected_asv_id | expected_asv | unexpected_asv |
|---|---|---|---|---|---|---|---|
| tpos1 | 1022 | 14 | 0.0136986 | 1688 | 5765 | CTTATATTTTATTTTTGGTGCTTGATCAGGGATAGTGGGAACTTCTTTAAGAATTCTTATTCGAGCTGAACTTGGTCATGCGGGATCTTTAATCGGAGACGATCAAATTTACAATGTAATTGTTACTGCACACGCCTTTGTAATAATTTTTTTTATAGTTATACCTATTTTAATT | CCTTTATTTTATTTTTGGTGCTTGATCAGGGATAGTGGGAACTTCTTTAAGAATTCTTATTCGAGCTGAACTTGGTCATGCGGGATCTTTAATCGGAGACGATCAAATTTACAATGTAATTGTTACTGCACACGCCTTTGTAATAATTTTTTTTATAGTTATACCTATTTTAATT |
| tpos1 | 1022 | 5 | 0.0048924 | 1688 | 1682 | CTTATATTTTATTTTTGGTGCTTGATCAGGGATAGTGGGAACTTCTTTAAGAATTCTTATTCGAGCTGAACTTGGTCATGCGGGATCTTTAATCGGAGACGATCAAATTTACAATGTAATTGTTACTGCACACGCCTTTGTAATAATTTTTTTTATAGTTATACCTATTTTAATT | CTTATATTTTATTTTTGGTGCTTGATCAGGGATAGTGGGAACTTCTTTAAGAATTCTTATTCGAGCTGAACTTGGTCATGCGGGATCTTTAATCGGAGACGATCAAATTTACAATGTAATTGTTACTGCACACGCCTTTGTAATAATTTTTTTTATAGTGATACCTATTATAATT |
| tpos1 | 12965 | 12 | 0.0009256 | 2009 | 6042 | TCTATATTTCATTTTTGGTGCTTGGGCAGGTATGGTAGGTACCTCATTAAGACTTTTAATTCGAGCCGAGTTGGGTAACCCGGGTTCATTAATTGGGGACGATCAAATTTATAACGTAATCGTAACTGCTCATGCCTTTATTATGATTTTTTTTATAGTGATACCTATTATAATT | TCTATATTTCATTTTTGGTGCTTGGGCAGGTATGGTAGGTACCTCATTAAGACTTTTAATTCGAGCCGAGTTGGGTAACCCGGGTTCATTAATTGGGGACGATCAAATTTATAACGTAATCGTAACTGATCATGGCTTTATTATGATTTTTTTTATAGTGATACCTATTATAATT |

FilterPCRerror

It seems that Swarm has done a good job. The highest read count ratio in the output of OptimizePCRerror is about 0.014. This should be taken as a lower limit for pcr_error_var_prop. Let’s use 0.05 for filtering out PCR errors sample by sample.

See the help for more detail ?FilterPCRerror.

Note: Again, I use max_mismatch <- 2 in this example, since I have used this value for OptimizePCRerror, but I recommend using max_mismatch <- 1.

pcr_error_var_prop <- 0.05
max_mismatch <- 2
outfile <- file.path(outdir, "8_FilterPCRerror.csv")

read_count_df <- FilterPCRerror(read_count_df, 
                                outfile=outfile, 
                                vsearch_path=vsearch_path, 
                                pcr_error_var_prop=pcr_error_var_prop, 
                                max_mismatch=max_mismatch
                                )

params <- paste(pcr_error_var_prop, max_mismatch, by_sample, sep=";")
stat_df <- GetStat(read_count_df, stat_df, stage="FilterPCRerror", params=params)

knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| Input | NA | 10461 | 123530 | 4 | 12 |
| Swarm | TRUE | 418 | 123530 | 4 | 12 |
| LFNglobalReadCount | 2 | 147 | 123236 | 4 | 12 |
| FilterIndel | NA | 143 | 123196 | 4 | 12 |
| FilterCodonStop | 5 | 143 | 123196 | 4 | 12 |
| FilterChimera | 2;TRUE;0.8 | 122 | 122619 | 4 | 12 |
| FilterRenkonen | 0.4 | 122 | 122570 | 4 | 11 |
| FilterPCRerror | 0.05;2;TRUE | 117 | 122531 | 4 | 11 |

LFNsampleReplicate

LFNsampleReplicate will eliminate occurrences with very low read counts compared to the total number of reads in the sample-replicate. The default cutoff proportion is 0.001. We can get an idea of the maximum value of this cutoff by examining the proportions of expected ASVs in the mock samples using the OptimizeLFNsampleReplicate function.

OptimizeLFNsampleReplicate

Note: If you do not know the exact sequences of the species present in your mock samples, and thus, you do not have the mock_composition file yet, read the How to make a mock_composition file.

outfile = file.path(outdir, "OptimizeLFNsampleReplicate.csv")
OptimizeLFNsampleReplicate_df <- OptimizeLFNsampleReplicate(read_count=read_count_df,
                                                            mock_composition=mock_composition,
                                                            outfile=outfile)
knitr::kable(head(OptimizeLFNsampleReplicate_df), format = "markdown")
| sample | replicate | action | asv_id | read_count | read_count_sample_replicate | lfn_sample_replicate_cutoff | asv | taxon |
|---|---|---|---|---|---|---|---|---|
| tpos1 | 3 | keep | 120 | 118 | 23972 | 0.0049 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC | Caenis pusilla |
| tpos1 | 1 | keep | 708 | 189 | 21939 | 0.0086 | CCTTTATCTTGTATTTGGTGCCTGGGCCGGAATGGTAGGGACCGCCCTAAGCCTTCTTATTCGGGCCGAACTAAGCCAGCCTGGCTCGCTATTAGGTGATAGCCAAATTTATAATGTTATTGTTACCGCCCACGCCTTCGTAATAATTTTCTTTATAGTCATGCCAATTCTCATT | Phoxinus phoxinus |
| tpos1 | 2 | keep | 708 | 244 | 27584 | 0.0088 | CCTTTATCTTGTATTTGGTGCCTGGGCCGGAATGGTAGGGACCGCCCTAAGCCTTCTTATTCGGGCCGAACTAAGCCAGCCTGGCTCGCTATTAGGTGATAGCCAAATTTATAATGTTATTGTTACCGCCCACGCCTTCGTAATAATTTTCTTTATAGTCATGCCAATTCTCATT | Phoxinus phoxinus |
| tpos1 | 2 | keep | 120 | 252 | 27584 | 0.0091 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC | Caenis pusilla |
| tpos1 | 3 | keep | 708 | 228 | 23972 | 0.0095 | CCTTTATCTTGTATTTGGTGCCTGGGCCGGAATGGTAGGGACCGCCCTAAGCCTTCTTATTCGGGCCGAACTAAGCCAGCCTGGCTCGCTATTAGGTGATAGCCAAATTTATAATGTTATTGTTACCGCCCACGCCTTCGTAATAATTTTCTTTATAGTCATGCCAATTCTCATT | Phoxinus phoxinus |
| tpos1 | 2 | keep | 1688 | 331 | 27584 | 0.0119 | CTTATATTTTATTTTTGGTGCTTGATCAGGGATAGTGGGAACTTCTTTAAGAATTCTTATTCGAGCTGAACTTGGTCATGCGGGATCTTTAATCGGAGACGATCAAATTTACAATGTAATTGTTACTGCACACGCCTTTGTAATAATTTTTTTTATAGTTATACCTATTTTAATT | Synorthocladius semivirens |

The lowest proportion of the read count of an expected ASV to the total read count of its sample-replicate is 0.0049 (the sequence of Caenis pusilla in replicate 3 of the tpos1 mock sample). A cutoff value for LFNsampleReplicate higher than this would eliminate some of the expected occurrences and create false negatives. Therefore, we choose 0.004 as a cutoff.

LFNsampleReplicate

lfn_sample_replicate_cutoff <- 0.004
outfile <- file.path(outdir, "9_LFNsampleReplicate.csv")

read_count_df <- LFNsampleReplicate(read_count_df, 
                                    cutoff=lfn_sample_replicate_cutoff, 
                                    outfile=outfile
                                    )

stat_df <- GetStat(read_count_df, 
                   stat_df, 
                   stage="LFNsampleReplicate", 
                   params=lfn_sample_replicate_cutoff
                   )

knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| Input | NA | 10461 | 123530 | 4 | 12 |
| Swarm | TRUE | 418 | 123530 | 4 | 12 |
| LFNglobalReadCount | 2 | 147 | 123236 | 4 | 12 |
| FilterIndel | NA | 143 | 123196 | 4 | 12 |
| FilterCodonStop | 5 | 143 | 123196 | 4 | 12 |
| FilterChimera | 2;TRUE;0.8 | 122 | 122619 | 4 | 12 |
| FilterRenkonen | 0.4 | 122 | 122570 | 4 | 11 |
| FilterPCRerror | 0.05;2;TRUE | 117 | 122531 | 4 | 11 |
| LFNsampleReplicate | 0.004 | 60 | 120784 | 4 | 11 |

FilterMinReplicate 1

To ensure repeatability, we can accept occurrences only if they are present in at least min_replicate_number replicates of the sample (2 by default).

min_replicate_number <- 2
outfile <- file.path(outdir, "10_FilterMinReplicate.csv")

read_count_df <- FilterMinReplicate(read_count_df, 
                                    min_replicate_number, 
                                    outfile=outfile
                                    )
stat_df <- GetStat(read_count_df, 
                   stat_df, 
                   stage="FilterMinReplicate", 
                   params=min_replicate_number
                   )

knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| Input | NA | 10461 | 123530 | 4 | 12 |
| Swarm | TRUE | 418 | 123530 | 4 | 12 |
| LFNglobalReadCount | 2 | 147 | 123236 | 4 | 12 |
| FilterIndel | NA | 143 | 123196 | 4 | 12 |
| FilterCodonStop | 5 | 143 | 123196 | 4 | 12 |
| FilterChimera | 2;TRUE;0.8 | 122 | 122619 | 4 | 12 |
| FilterRenkonen | 0.4 | 122 | 122570 | 4 | 11 |
| FilterPCRerror | 0.05;2;TRUE | 117 | 122531 | 4 | 11 |
| LFNsampleReplicate | 0.004 | 60 | 120784 | 4 | 11 |
| FilterMinReplicate | 2 | 38 | 119435 | 4 | 11 |

LFNvariant and LFNreadCount

The LFNvariant filter will eliminate occurrences with very low read counts compared to the total number of reads of the ASV. The default cutoff proportion is 0.001. This filter is designed to filter out occurrences present in the data set due to tag-jump or slight inter-sample contamination.

The LFNreadCount filter simply eliminates occurrences with read counts below a cutoff (10 by default).
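
As an illustration, the occurrences removed by the LFNvariant rule can be expressed with dplyr (a sketch of the idea only; use LFNvariant for the actual filtering):

library(dplyr)
# Occurrences whose read count is below cutoff * total read count of the ASV
lfn_variant_removed <- read_count_df %>%
  group_by(asv_id) %>%
  mutate(asv_total_reads = sum(read_count)) %>%
  ungroup() %>%
  filter(read_count / asv_total_reads < 0.001)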

To find the best cutoff values for these two filters, the OptimizeLFNreadCountLFNvariant function will count the numbers of false positive, false negative, and true positive occurrences for a series of combinations of the cutoff values of the two filters. To run this function, we need a known_occurrences_df data frame that lists all occurrences known to be FPs or TPs.

MakeKnownOccurrences

The MakeKnownOccurrences function will identify false positive, false negative, and true positive occurrences in control samples (mock and negative). Some false positives can also be identified in real samples if samples from different habitats are included in the data set.

The false positive and the true positive occurrences are written to the known_occurrences data frame (or file), false negatives to the missing_occurrences data frame (or file), and a performance_metrics data frame is also produced with the count of these occurrences. For details see ?MakeKnownOccurrences.

This function takes a read_count_samples data frame as input, where the replicates of the same sample have been pooled (see ?PoolReplicates).

Note:

  • If you do not know the exact sequences of the species present in your mock samples, and thus, you do not have the mock_composition file yet, read the How to make a mock_composition file.
  • sortedinfo_df contains sample, sample_type, replicate, and habitat columns, but any data frame with these columns would do.
  • If you have restarted the analyses after the SortReads function, and
    sortedinfo_df is not in your environment, just use the sortedinfo file (in the vtamR_demo_out/sorted directory).
# Pool replicates
read_count_samples_df <- PoolReplicates(read_count_df)

# Detect known occurrences
results <- MakeKnownOccurrences(read_count_samples = read_count_samples_df, 
                                sortedinfo=sortedinfo_df, 
                                mock_composition=mock_composition
                                )

# give explicit names to the 3 output data frames
known_occurrences_df <- results[[1]]
missing_occurrences_df <- results[[2]]
performance_metrics_df <- results[[3]]

OptimizeLFNreadCountLFNvariant

The LFNreadCount and LFNvariant functions are run for a series of cutoff value combinations of the two filters, followed by FilterMinReplicate. For each parameter combination, the numbers of FN, TP, and FP are reported. Choose the parameter setting that minimizes FN and FP.

Here we will use the default range of cutoffs to test, but you can set the minimum, the maximum, and the increment of the cutoff values for both filters (see ?OptimizeLFNreadCountLFNvariant). We set min_replicate_number to 2 to eliminate non-repeatable occurrences among the three replicates of each sample.

outfile = file.path(outdir, "OptimizeLFNreadCountLFNvariant.csv")

OptimizeLFNreadCountLFNvariant_df <- OptimizeLFNreadCountLFNvariant(
  read_count_df,
  known_occurrences=known_occurrences_df,
  outfile= outfile, 
  min_replicate_number=2
  )

knitr::kable(head(OptimizeLFNreadCountLFNvariant_df), format = "markdown")
| lfn_read_count_cutoff | lnf_variant_cutoff | FN | TP | FP |
|---|---|---|---|---|
| 10 | 0.001 | 0 | 6 | 4 |
| 15 | 0.001 | 0 | 6 | 4 |
| 20 | 0.001 | 0 | 6 | 4 |
| 25 | 0.001 | 0 | 6 | 4 |
| 30 | 0.001 | 0 | 6 | 4 |
| 35 | 0.001 | 0 | 6 | 4 |

LFNvariant, LFNreadCount

From the output, we choose 0.001 as the cutoff for LFNvariant and 10 for LFNreadCount, since this is the least stringent combination that keeps all expected occurrences (6 TP, 0 FN) and has the fewest FP (4 FP).

We will run the two filters on the same input data frame (the one for which the parameters have been optimized), and pool the results with the PoolFilters function, accepting only occurrences that pass both filters.

See ?LFNvariant, ?LFNreadCount and ?PoolFilters for details.

LFNvariant

lnf_variant_cutoff = 0.001
outfile <- file.path(outdir, "11_LFNvariant.csv")

read_count_df_lnf_variant <- LFNvariant(read_count_df, 
                                        cutoff=lnf_variant_cutoff, 
                                        outfile=outfile
                                        )
stat_df <- GetStat(read_count_df_lnf_variant, 
                   stat_df, 
                   stage="LFNvariant", 
                   params=lnf_variant_cutoff)

LFNreadCount

lfn_read_count_cutoff <- 10
outfile <- file.path(outdir, "12_LFNreadCount.csv")

read_count_df_lfn_read_count <- LFNreadCount(read_count_df, 
                                             cutoff=lfn_read_count_cutoff, 
                                             outfile=outfile
                                             )
stat_df <- GetStat(read_count_df_lfn_read_count, 
                   stat_df, stage="LFNreadCount", 
                   params=lfn_read_count_cutoff
                   )

Combine results

outfile <- file.path(outdir, "13_poolLFN.csv")
read_count_df <- PoolFilters(read_count_df_lfn_read_count, 
                             read_count_df_lnf_variant, 
                             outfile=outfile
                             )
stat_df <- GetStat(read_count_df, 
                   stat_df, 
                   stage="FilterLFN"
                   )
# delete temporary data frames
rm(read_count_df_lfn_read_count)
rm(read_count_df_lnf_variant)

knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| Input | NA | 10461 | 123530 | 4 | 12 |
| Swarm | TRUE | 418 | 123530 | 4 | 12 |
| LFNglobalReadCount | 2 | 147 | 123236 | 4 | 12 |
| FilterIndel | NA | 143 | 123196 | 4 | 12 |
| FilterCodonStop | 5 | 143 | 123196 | 4 | 12 |
| FilterChimera | 2;TRUE;0.8 | 122 | 122619 | 4 | 12 |
| FilterRenkonen | 0.4 | 122 | 122570 | 4 | 11 |
| FilterPCRerror | 0.05;2;TRUE | 117 | 122531 | 4 | 11 |
| LFNsampleReplicate | 0.004 | 60 | 120784 | 4 | 11 |
| FilterMinReplicate | 2 | 38 | 119435 | 4 | 11 |
| LFNvariant | 0.001 | 38 | 119366 | 4 | 11 |
| LFNreadCount | 10 | 34 | 119372 | 4 | 11 |
| FilterLFN | NA | 34 | 119320 | 4 | 10 |

FilterMinReplicate 2

Let’s run FilterMinReplicate again to ensure repeatability among the replicates of each sample (2 by default).

min_replicate_number <- 2
outfile <- file.path(outdir, "14_FilterMinReplicate.csv")

read_count_df <- FilterMinReplicate(read_count_df, 
                                    min_replicate_number, 
                                    outfile=outfile
                                    )
stat_df <- GetStat(read_count_df, 
                   stat_df, 
                   stage="FilterMinReplicate", 
                   params=min_replicate_number
                   )

knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| Input | NA | 10461 | 123530 | 4 | 12 |
| Swarm | TRUE | 418 | 123530 | 4 | 12 |
| LFNglobalReadCount | 2 | 147 | 123236 | 4 | 12 |
| FilterIndel | NA | 143 | 123196 | 4 | 12 |
| FilterCodonStop | 5 | 143 | 123196 | 4 | 12 |
| FilterChimera | 2;TRUE;0.8 | 122 | 122619 | 4 | 12 |
| FilterRenkonen | 0.4 | 122 | 122570 | 4 | 11 |
| FilterPCRerror | 0.05;2;TRUE | 117 | 122531 | 4 | 11 |
| LFNsampleReplicate | 0.004 | 60 | 120784 | 4 | 11 |
| FilterMinReplicate | 2 | 38 | 119435 | 4 | 11 |
| LFNvariant | 0.001 | 38 | 119366 | 4 | 11 |
| LFNreadCount | 10 | 34 | 119372 | 4 | 11 |
| FilterLFN | NA | 34 | 119320 | 4 | 10 |
| FilterMinReplicate1 | 2 | 31 | 119147 | 3 | 9 |

Pool, TaxAssign and document

PoolReplicates

Let’s pool the replicates of the same sample. The PoolReplicates function will take the mean of the non-zero read counts of each ASV over the replicates of the same sample.

outfile <- file.path(outdir, "15_PoolReplicates.csv")
read_count_samples_df <- PoolReplicates(read_count_df, 
                                        outfile=outfile
                                        )
stat_df <- GetStat(read_count_samples_df, 
                   stat_df, 
                   stage="PoolReplicates"
                   )

knitr::kable(stat_df, format = "markdown")
| stage | parameters | asv_count | read_count | sample_count | sample_replicate_count |
|---|---|---|---|---|---|
| Input | NA | 10461 | 123530 | 4 | 12 |
| Swarm | TRUE | 418 | 123530 | 4 | 12 |
| LFNglobalReadCount | 2 | 147 | 123236 | 4 | 12 |
| FilterIndel | NA | 143 | 123196 | 4 | 12 |
| FilterCodonStop | 5 | 143 | 123196 | 4 | 12 |
| FilterChimera | 2;TRUE;0.8 | 122 | 122619 | 4 | 12 |
| FilterRenkonen | 0.4 | 122 | 122570 | 4 | 11 |
| FilterPCRerror | 0.05;2;TRUE | 117 | 122531 | 4 | 11 |
| LFNsampleReplicate | 0.004 | 60 | 120784 | 4 | 11 |
| FilterMinReplicate | 2 | 38 | 119435 | 4 | 11 |
| LFNvariant | 0.001 | 38 | 119366 | 4 | 11 |
| LFNreadCount | 10 | 34 | 119372 | 4 | 11 |
| FilterLFN | NA | 34 | 119320 | 4 | 10 |
| FilterMinReplicate1 | 2 | 31 | 119147 | 3 | 9 |
| PoolReplicates | NA | 31 | 40041 | 3 | NA |

Get performance metrics

Run MakeKnownOccurrences again to get performance metrics (TP, FP, FN). This time we will write the output data frames to files as well. The performance_metrics file will give you the counts of TP, FP, and FN, as well as precision (TP/(TP+FP)) and sensitivity (TP/(TP+FN)). You can find false negatives in missing_occurrences, and true and false positives in known_occurrences.

missing_occurrences <- file.path(outdir, "missing_occurrences.csv")
performance_metrics <- file.path(outdir, "performance_metrics.csv")
known_occurrences <- file.path(outdir, "known_occurrences.csv")

results <- MakeKnownOccurrences(read_count_samples_df, 
                                sortedinfo=sortedinfo_df, 
                                mock_composition=mock_composition, 
                                known_occurrences=known_occurrences, 
                                missing_occurrences=missing_occurrences,
                                performance_metrics=performance_metrics
                                )
# give explicit names to the 3 output data frames
known_occurrences_df <- results[[1]]
missing_occurrences_df <- results[[2]]
performance_metrics_df <- results[[3]]

knitr::kable(performance_metrics_df, format = "markdown")
| TP | FP | FN | Precision | Sensitivity |
|---|---|---|---|---|
| 6 | 4 | 0 | 0.6 | 1 |

TaxAssign

If you have already assigned ASVs to taxa (before making the mock_composition file), you can use the same output (asv_tax data frame) and skip this step.

If you haven’t assigned ASVs to taxa yet, it is time to do it.

See the brief description of the algorithm with ?TaxAssign.

For the format of taxonomy and blast_db, check the Reference database for taxonomic assignments section. You can also download a ready-to-use COI database (See Installation).

outfile <- file.path(outdir, "TaxAssign.csv")

asv_tax <- TaxAssign(asv=read_count_samples_df, 
                     taxonomy=taxonomy, 
                     blast_db=blast_db, 
                     blast_path=blast_path, 
                     outfile=outfile, 
                     num_threads=num_threads
                     )

Supplementary functions

RandomSeq

Randomly selects n sequences from each input fasta file. It can be used before or after demultiplexing (SortReads). However, when using LFNvariant and LFNreadCount, which are partially based on the number of reads in negative controls, standardizing the number of reads among samples does not make sense. Thus, I use RandomSeq after Merge, to get the same number of reads for each replicate series (same samples, different replicates in each fasta file).

  • fastainfo: Either a csv file, or a data frame, with the following columns: tag_fw, primer_fw, tag_rv, primer_rv, sample, sample_type, habitat, replicate, fasta.
  • fasta_dir: Directory containing the input fasta files.
  • n: Number of sequences to be taken randomly (without replacement).
randomseq_dir = file.path(outdir, "random_seq")
fastainfo_df <- RandomSeq(fastainfo_df, 
                          fasta_dir=merged_dir, 
                          outdir=randomseq_dir, 
                          vsearch_path=vsearch_path, 
                          n=10000
                          )
#> [1] "mfzr_1_fw.fasta"
#> [1] "mfzr_2_fw.fasta"
#> [1] "mfzr_3_fw.fasta"

HistoryBy

This function scans all files in dir whose names start with a number (see the file names of the output files of the different filtering steps). It will select all lines where the feature (asv_id/asv/sample/replicate) has the value we are looking for.

Examples

Get the history of asv_id (feature) 27 (value).

tmp_ASV_27 <- HistoryBy(dir=outdir, 
                        feature="asv_id", 
                        value="27"
                        )

knitr::kable(head(tmp_ASV_27), format = "markdown")
| file | asv_id | sample | replicate | read_count | asv |
|---|---|---|---|---|---|
| 1_before_filter.csv | 27 | tpos1 | 1 | 131 | ACTATACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
| 1_before_filter.csv | 27 | tpos1 | 2 | 257 | ACTATACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
| 1_before_filter.csv | 27 | tpos1 | 3 | 89 | ACTATACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
| 2_Swarm_by_sample.csv | 27 | tpos1 | 1 | 178 | ACTATACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
| 2_Swarm_by_sample.csv | 27 | tpos1 | 2 | 349 | ACTATACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
| 2_Swarm_by_sample.csv | 27 | tpos1 | 3 | 117 | ACTATACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |

Get the history of an ASV by its sequence (feature asv).

tmp_replicate_1 <- HistoryBy(dir=outdir, 
                             feature="asv",
                             value="CCTTTATTTTATTTTCGGTATCTGGTCAGGTCTCGTAGGATCATCACTTAGATTTATTATTCGAATAGAATTAAGAACTCCTGGTAGATTTATTGGCAACGACCAAATTTATAACGTAATTGTTACATCTCATGCATTTATTATAATTTTTTTTATAGTTATACCAATCATAATT"
                             )

knitr::kable(tmp_replicate_1, format = "markdown")
| file | asv_id | sample | replicate | read_count | asv |
|---|---|---|---|---|---|
| 1_before_filter.csv | 1589 | tpos1 | 1 | 23 | CCTTTATTTTATTTTCGGTATCTGGTCAGGTCTCGTAGGATCATCACTTAGATTTATTATTCGAATAGAATTAAGAACTCCTGGTAGATTTATTGGCAACGACCAAATTTATAACGTAATTGTTACATCTCATGCATTTATTATAATTTTTTTTATAGTTATACCAATCATAATT |
| 1_before_filter.csv | 1589 | tpos1 | 2 | 23 | CCTTTATTTTATTTTCGGTATCTGGTCAGGTCTCGTAGGATCATCACTTAGATTTATTATTCGAATAGAATTAAGAACTCCTGGTAGATTTATTGGCAACGACCAAATTTATAACGTAATTGTTACATCTCATGCATTTATTATAATTTTTTTTATAGTTATACCAATCATAATT |
| 1_before_filter.csv | 1589 | tpos1 | 3 | 31 | CCTTTATTTTATTTTCGGTATCTGGTCAGGTCTCGTAGGATCATCACTTAGATTTATTATTCGAATAGAATTAAGAACTCCTGGTAGATTTATTGGCAACGACCAAATTTATAACGTAATTGTTACATCTCATGCATTTATTATAATTTTTTTTATAGTTATACCAATCATAATT |

Get the history of the sample (feature) tpos1 (value).

tmp_sample_tpos1 <- HistoryBy(dir=outdir, 
                              feature="sample", 
                              value="tpos1"
                              )

knitr::kable(head(tmp_sample_tpos1), format = "markdown")
| file | asv_id | sample | replicate | read_count | asv |
|---|---|---|---|---|---|
| 1_before_filter.csv | 7 | tpos1 | 1 | 6 | AATGTATCTAATCTTTGGAGGTTTTTCTGGTATTATTGGAACAGCTTTATCTATTTTAATCAGAATAGAATTATCGCAACCAGGAAACCAAATTTTAATGGGAAACCATCAATTATATAATGTAATTGTAACTTCTCACGCTTTTATTATGATTTTTTTTATGGTAATGCCAATTTTATTA |
| 1_before_filter.csv | 8 | tpos1 | 1 | 3 | ACCTTATTTTATTTTTGGTGCTTGATCAGGAATAGTAGGAACTTCTTTAAGAATTCTAATTCGAGCTGAATTAGGTCATGCCGGTTCATTAATTGGAGATGATCAAATTTATAATGTAATTGTAACTGCTCATGCTTTTGTAATAATTTTCTTTATAGTTATACCTATTTTAATT |
| 1_before_filter.csv | 9 | tpos1 | 1 | 1 | ACTACACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
| 1_before_filter.csv | 10 | tpos1 | 1 | 1 | ACTATACCTGATGTTTGCCTTATTCGCAGGTTTAGTAGGTACAGCATTTTCTGTACTTATTAGAATGGAATTAAGTGCACCAGGAGTTCAATACATCAGTGATAACCAGTTATATAATAGTATTATAACAGCTCACGCTATTGTTATGATATTCTTTATGGTTATGCCTGCCATGATT |
| 1_before_filter.csv | 11 | tpos1 | 1 | 1 | ACTATACCTTACCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGGAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
| 1_before_filter.csv | 12 | tpos1 | 1 | 1 | ACTATACCTTATCTTCGCAGTATTCTCAAGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |

SummarizeBy

This function scans all files in dir whose names start with a number (see the file names of the output files of the different filtering steps).

It will group each file by a variable (asv/asv_id/sample/replicate) and summarize a feature (asv/asv_id/sample/replicate/read_count). If the feature is read_count, it will give the sum of the read counts for each value of the variable in each file. Otherwise, it returns the number of distinct values of the feature for each value of the variable in each file.
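
For a single input file, the summary corresponds roughly to the following dplyr operations (a sketch of the semantics, not the package code; one_file_df stands for one file read from dir):

library(dplyr)
# feature="read_count", grouped_by="sample": sum of read counts per sample
one_file_df %>% group_by(sample) %>% summarise(read_count = sum(read_count))
# feature="asv", grouped_by="sample": number of distinct ASVs per sample
one_file_df %>% group_by(sample) %>% summarise(asv = n_distinct(asv))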

Examples

Get the number of reads of each sample after each filtering step.

From this data frame, we can see that the negative control sample becomes “clean” after the LFN filters, and that there is considerable variation among the read counts of the different real samples.

read_count_by_sample <- SummarizeBy(dir=outdir, 
                                    feature="read_count", 
                                    grouped_by="sample"
                                    )

knitr::kable(read_count_by_sample, format = "markdown")
| sample | 1_before_filter | 2_Swarm_by_sample | 3_LFNglobalReadCount | 4_FilterIndel | 5_FilterCodonStop | 6_FilterChimera | 7_FilterRenkonen | 8_FilterPCRerror | 9_LFNsampleReplicate | 10_FilterMinReplicate | 11_LFNvariant | 12_LFNreadCount | 13_poolLFN | 14_FilterMinReplicate | 15_PoolReplicates |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14ben01 | 28115 | 28115 | 28079 | 28079 | 28079 | 28067 | 28067 | 28067 | 27877 | 27799 | 27799 | 27799 | 27799 | 27799 | 9288 |
| 14ben02 | 20750 | 20750 | 20683 | 20664 | 20664 | 20664 | 20664 | 20664 | 20134 | 19206 | 19206 | 19206 | 19206 | 19206 | 6583 |
| tnegtag | 406 | 406 | 363 | 363 | 363 | 354 | 305 | 305 | 305 | 288 | 219 | 225 | 173 | 0 | 0 |
| tpos1 | 74259 | 74259 | 74111 | 74090 | 74090 | 73534 | 73534 | 73495 | 72468 | 72142 | 72142 | 72142 | 72142 | 72142 | 24170 |

Get the number of ASVs for each sample after each filtering step.

From this data frame, we can see that the negative control sample becomes “clean” after the LFN filters, and the mock sample has 10 ASVs at the end.

asv_by_sample <- SummarizeBy(dir=outdir, 
                             feature="asv", 
                             grouped_by="sample"
                             )

knitr::kable(asv_by_sample, format = "markdown")
| sample | 1_before_filter | 2_Swarm_by_sample | 3_LFNglobalReadCount | 4_FilterIndel | 5_FilterCodonStop | 6_FilterChimera | 7_FilterRenkonen | 8_FilterPCRerror | 9_LFNsampleReplicate | 10_FilterMinReplicate | 11_LFNvariant | 12_LFNreadCount | 13_poolLFN | 14_FilterMinReplicate | 15_PoolReplicates |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14ben01 | 2006 | 59 | 24 | 24 | 24 | 22 | 22 | 22 | 5 | 3 | 3 | 3 | 3 | 3 | 3 |
| 14ben02 | 3002 | 122 | 59 | 57 | 57 | 57 | 57 | 57 | 35 | 20 | 20 | 20 | 20 | 20 | 20 |
| tnegtag | 182 | 61 | 23 | 23 | 23 | 22 | 21 | 21 | 21 | 13 | 10 | 6 | 5 | 0 | 0 |
| tpos1 | 5579 | 212 | 76 | 74 | 74 | 56 | 56 | 51 | 13 | 10 | 10 | 10 | 10 | 10 | 10 |

Get the number of asv_ids for each replicate after each filtering step.

We can see that the numbers of ASVs are comparable in the different replicates.

asvid_by_replicate <- SummarizeBy(dir=outdir, 
                                  feature="asv_id", 
                                  grouped_by="replicate"
                                  )
#> [1] "WARNING: replicate variable is not present in vtamR_demo_out/15_PoolReplicates.csv"

knitr::kable(asvid_by_replicate, format = "markdown")
| replicate | 1_before_filter | 2_Swarm_by_sample | 3_LFNglobalReadCount | 4_FilterIndel | 5_FilterCodonStop | 6_FilterChimera | 7_FilterRenkonen | 8_FilterPCRerror | 9_LFNsampleReplicate | 10_FilterMinReplicate | 11_LFNvariant | 12_LFNreadCount | 13_poolLFN | 14_FilterMinReplicate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4392 | 211 | 115 | 114 | 114 | 95 | 90 | 87 | 37 | 28 | 28 | 28 | 28 | 28 |
| 2 | 4485 | 231 | 112 | 111 | 111 | 90 | 90 | 86 | 47 | 36 | 36 | 32 | 32 | 29 |
| 3 | 4821 | 160 | 92 | 88 | 88 | 77 | 77 | 76 | 35 | 33 | 33 | 26 | 26 | 26 |

UpdateASVlist

Pools unique asv - asv_id combinations from the input data frame (read_count_df) and from the input file (asv_list).

The input file is typically a CSV file containing ASVs seen in earlier data sets, with their asv_id. If there is a conflict within or between the input data, the function quits with an error message. Otherwise, it writes the updated ASV list to outfile.

The safest way to avoid incoherence between the asv_ids of earlier and present runs is to use the asv_list and updated_asv_list parameters of the Dereplicate function, as is done in this tutorial. This will synchronize the asv_ids of this run with earlier ones, and write an updated file with all ASVs from the earlier and the present data sets. In this case, calling the UpdateASVlist function is not necessary, since it is automatically called from Dereplicate.

However, if you are analyzing a large number of very large data sets, the complete asv_list will grow quickly and might cause memory issues. Most of the ASVs in this list are singletons that are filtered out during the analyses. Therefore, you can opt for homogenizing the asv_ids with earlier runs by the Dereplicate function without writing an updated_asv_list, and then update the asv_list after the first steps of filtering (e.g. Swarm, LFNglobalReadCount) to keep only ASVs that are more frequent and more likely to appear in future data sets. This is still a quite safe option, and it greatly reduces the number of ASVs kept in this file.

updated_asv_list <- file.path(outdir, 
                              "updated_ASV_list.csv"
                              )
UpdateASVlist(read_count_df,
              asv_list=asv_list, 
              outfile=updated_asv_list
              )

PoolDatasets

More than one overlapping marker

This function pools different data sets and is particularly useful if results should be pooled from more than one overlapping marker. In that case, ASVs identical in their overlapping regions are pooled into groups, and the different ASVs of the same group are represented by their centroid (the longest ASV of the group). Pooling can take the mean read counts of the ASVs (mean_over_markers=T; default) or their sum (mean_over_markers=F).

The function takes several input CSV files, each in long format containing asv_id, sample, read_count, and asv columns. The file names should be organized in a data frame, with the marker name for each file.

  • outfile: Name of the output CSV file with the pooled data set (asv_id, sample, read_count, asv). ASVs identical in their overlapping region are grouped on the same line, and only the centroids appear in the asv column.
  • asv_with_centroids: Name of the output CSV file containing each of the original ASVs (with samples, markers, and read_count) as well as their centroids.
  • The data frame returned by the function corresponds to the outfile.
files <- data.frame(file=c("vtamR_test/out_mfzr/15_PoolReplicates.csv",
                           "vtamR_test/test/15_PoolReplicates_ZFZR.csv"),
                    marker=c("MFZR", 
                             "ZFZR")
                    )

outfile <- file.path(outdir, "Pooled_datasets.csv") 
asv_with_centroids <- file.path(outdir, 
                                "Pooled_datasets_asv_with_centroids.csv"
                                ) 

read_count_pool <- PoolDatasets(files, 
                                outfile=outfile, 
                                asv_with_centroids=asv_with_centroids, 
                                mean_over_markers=T,
                                vsearch_path=vsearch_path
                                )

Only one marker

Pooling the results of different data sets of the same marker is very simple: the input files (in long format with asv, asv_id, sample, read_count columns) are simply concatenated. The PoolDatasets function also checks if sample names are unique. If not, it sums the read counts of the same sample and same ASV, but returns a warning.

The output file or data frame can be rearranged to wide format with the WriteASVtable function.
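
Here is a minimal sketch for pooling two runs of the same marker. The file paths are placeholders, and the commented WriteASVtable arguments are assumptions based on the conventions used elsewhere in this tutorial (check ?WriteASVtable):

files <- data.frame(file=c("run1/15_PoolReplicates.csv",  # placeholder paths
                           "run2/15_PoolReplicates.csv"),
                    marker=c("MFZR",
                             "MFZR")  # same marker for both files
                    )
outfile <- file.path(outdir, "Pooled_same_marker.csv")

read_count_pool <- PoolDatasets(files, 
                                outfile=outfile, 
                                vsearch_path=vsearch_path
                                )
# Rearrange to wide format; the argument names below are assumptions
# asv_table <- WriteASVtable(read_count_pool, outfile=file.path(outdir, "ASV_table.csv"))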

CountReadsDir

Counts the number of reads in the fasta or fastq files found in the input directory. Input files can be gz-compressed or uncompressed, but zip files are not supported.

The fastainfo and sortedinfo files contain the number of reads after Merge or SortReads, so there is no need to run CountReadsDir separately at those stages. This function can be useful for counting the number of reads in the input fastq files.

  • dir: Input directory containing the fasta or fastq files
  • file_type: [fasta/fastq]
  • pattern: Regular expression; only files whose names match the pattern are checked
dir <- "vtamR_test/data"
df <- CountReadsDir(dir, 
                    pattern="_fw.fastq.gz", 
                    file_type="fastq"
                    )

Barplot_ReadCountBySample

Makes a bar plot of read counts by sample (sample_replicate=F) or by sample-replicate (sample_replicate=T). Can use different colors for different sample types (real/mock/negative).

  • read_count_df: Input data frame
  • sample_types: data frame or CSV file containing info on sample types for each sample
sortedinfo <- file.path(sorted_dir, "sortedinfo.csv")
Barplot_ReadCountBySample(read_count_df=read_count_df, 
                          sample_replicate=F, 
                          sample_types=sortedinfo
                          )

Histogram_ReadCountByVariant

Histogram of read counts by ASV

  • read_count_df: Input data frame
  • min_read_count: Ignore variants with read counts below this value
  • binwidth: Width of bins
Histogram_ReadCountByVariant(read_count_df, 
                             min_read_count=10, 
                             binwidth=1000
                             )

Renkonen distances

Calculate the Renkonen distances among all replicates (compare_all=TRUE) or among replicates of the same sample (compare_all=FALSE) and plot them.

  • read_count_df: Input data frame.
  • sample_types: Data frame or CSV file containing info on sample types for each sample.
renkonen_within_df <- MakeRenkonenDistances(read_count_df, compare_all=FALSE)
Barplot_RenkonenDistance(renkonen_within_df, sample_types=sortedinfo)

DensityPlot_RenkonenDistance(renkonen_within_df)


renkonen_all_df <- MakeRenkonenDistances(read_count_df, compare_all=TRUE)
DensityPlot_RenkonenDistance(renkonen_all_df)

I/O files and data frames

fastqinfo

CSV file with information on input fastq files, primers, tags, samples with the following columns. Each line corresponds to a sample-replicate combination.

  • tag_fw: Sequence tag on the 5’ of the fw read (NA if file is already demultiplexed)
  • primer_fw: Forward primer (NA if primer has been trimmed)
  • tag_rv: Sequence tag on the 3’ of the rv read (NA if file is already demultiplexed)
  • primer_rv: Reverse primer (NA if primer has been trimmed)
  • sample: Name of the sample (alphanumeric)
  • sample_type: real/mock/negative
  • habitat: If real or mock samples are from different habitats that cannot contain the same type of organisms (e.g. terrestrial vs. marine), this information is used for detecting false positives. Use NA otherwise. Use NA for negative controls.
  • replicate: Numerical id of a replicate within sample (e.g. Sample1 can have replicate 1, 2 or 3)
  • fastq_fw: Forward fastq file
  • fastq_rv: Reverse fastq file

fastainfo

CSV file with information on input fasta files, primers, tags, samples with the following columns. Each line corresponds to a sample-replicate combination.

  • tag_fw: Sequence tag on the 5’ of the fw read (NA if file is already demultiplexed)
  • primer_fw: Forward primer (NA if primer has been trimmed)
  • tag_rv: Sequence tag on the 3’ of the rv read (NA if file is already demultiplexed)
  • primer_rv: Reverse primer (NA if primer has been trimmed)
  • sample: Name of the sample (alphanumeric)
  • sample_type: real/mock/negative
  • habitat: If real or mock samples are from different habitats that cannot contain the same type of organisms (e.g. terrestrial vs. marine), this information is used for detecting false positives. Use NA otherwise. Use NA for negative controls.
  • replicate: Numerical id of a replicate (e.g. Sample1 can have replicate 1, 2 or 3)
  • fasta: Fasta file
  • Read_count: Number of reads in the fasta file. Optional.

sortedinfo

CSV file with information on demultiplexed and primer trimmed fasta files and samples with the following columns. Each line corresponds to a sample-replicate combination.

  • sample: Name of the sample (alphanumeric)
  • sample_type: real/mock/negative
  • habitat: If real or mock samples are from different habitats that cannot contain the same type of organisms (e.g. terrestrial vs. marine), this information is used for detecting false positives. Use NA otherwise. Use NA for negative controls.
  • replicate: Numerical id of a replicate (e.g. Sample1 can have replicate 1, 2 or 3)
  • fasta: Fasta file
  • Read_count: Number of reads in the fasta file. Optional.

mock_composition

CSV file with the following columns.

  • sample: Name of the mock sample
  • action:
    • keep: Expected ASV in the mock, that should be kept in the data set
    • tolerate: ASV that can be present in a mock, but it is not essential to keep it in the data set (e.g. badly amplified organism)
  • asv: sequence of the ASV
  • taxon: Optional; Name of the organism
  • asv_id: Optional; If there is a conflict between asv and asv_id, the asv_id is ignored

known_occurrences

CSV file or data frame with the following columns.

  • sample: Name of the sample
  • action:
    • keep: Expected ASVs in a mock sample (corresponds to True Positives)
    • delete: False positive occurrences: unexpected ASVs in a mock sample; all occurrences in negative controls; occurrences in real samples corresponding to an incompatible habitat (e.g. an ASV mostly present in marine samples is unexpected in a freshwater sample)
  • asv_id
  • asv: sequence of the ASV

asv_list

This file lists ASVs and their IDs from earlier data sets. When it is provided (optional), identical ASVs in the present and earlier data sets have the same ID, and new ASVs (not present in asv_list) get unique IDs not yet used in asv_list. It is a CSV file with the following columns:

  • asv_id: Unique numerical ID of the ASV
  • asv: ASV sequence

read_count_df

Data frame with the following columns:

  • asv: Sequence of the ASV
  • asv_id: Numerical ID of the ASV
  • sample: Sample name
  • replicate: Replicate within sample (Numerical)
  • read_count: Number of reads of the ASV in the Sample-Replicate

Reference database for taxonomic assignments

A database is composed of two elements: a BLAST database (blast_db) and a taxonomy file.

blast_db can be produced using the makeblastdb command of BLAST (an example invocation from R is sketched after the argument list):

makeblastdb -dbtype nucl -in [FASTA_FILE] -parse_seqids -taxid_map [TAXID_FILE] -out [DB_NAME]
  • FASTA_FILE is a fasta file containing reference sequences.
  • TAXID_FILE is a tab separated file with sequence IDs and the corresponding numerical taxIDs.
  • DB_NAME is the name of the newly created BLAST database.
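
The command can also be run from R with system2(); the file and database names below are hypothetical placeholders:

# Hypothetical example: build a BLAST database from R
makeblastdb_path <- "makeblastdb"  # adapt to your installation
system2(makeblastdb_path,
        args = c("-dbtype", "nucl",
                 "-in", "my_references.fasta",
                 "-parse_seqids",
                 "-taxid_map", "my_taxids.tsv",
                 "-out", "my_COI_db"))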

taxonomy is a tab-separated file with the following columns:

  • tax_id: Numerical Taxonomic ID. It can be a valid NCBI taxID, or arbitrary negative numbers for taxa not in NCBI.
  • parent_tax_id: taxID of the closest parent of tax_id.
  • rank: taxonomic rank (e.g. species, genus, subgenus, no_rank).
  • name_txt: Scientific name of the taxon.
  • old_tax_id: taxIDs that have been merged to the tax_id by NCBI; if there is more than one for a given tax_id, make one line for each old_tax_id.
  • taxlevel: Integer associated with each major taxonomic rank (0 => root, 1 => domain, 2 => kingdom, 3 => phylum, 4 => class, 5 => order, 6 => family, 7 => genus, 8 => species). Ranks in between have 0.5 added to the level of the major rank above them (e.g. 5.5 for infraorder and for superfamily).

A ready-to-use COI database in BLAST format and the associated taxonomy file can be downloaded from https://osf.io/vrfwz/. It was created using mkCOInr, which can also be used to make a customized database. The files can be downloaded and extracted manually, or with the download_osf function of vtamR. See Installation.

Options

  • compress: If TRUE, output files of the Merge, RandomSeq and SortReads functions are compressed. This saves space, but compressing/decompressing can increase run time. I suggest avoiding compression for intermediate files and deleting or compressing them as soon as the analyses are finished (see the Troubleshooting section). The input fasta and fastq files of these functions can be compressed (.gz, .bz2, but NOT .zip) or uncompressed; compression is detected automatically from the file extension.

  • delete_tmp: If TRUE (default), intermediate files are deleted automatically.

  • sep: Separator used in csv files. Use the same separator in all CSV files except for the taxonomy file of the reference database, which is tab separated.

  • quiet: If TRUE (default), print as little information as possible to the terminal.

Glossary

  • ASV: Amplicon Sequence Variant. A unique sequence, characterized by its number of reads in each sample-replicate.
  • asv_id: Unique numerical ID of an ASV.
  • demultiplexing: Sorting the reads of a fasta (or fastq) file into sample-replicates according to the tags present at their extremities.
  • dereplication: The merged reads contain many identical sequences. Dereplication reduces the data set to unique sequences (ASVs) and counts the number of reads of each ASV in each sample-replicate.
  • data frame (or csv) in long format: The read_count_df contains one line for each occurrence, with asv_id, asv, sample, replicate and read_count columns. This is the standard format for storing occurrences throughout the analyses, and it is smaller than the wide format if many ASVs are present in only one or a few samples.
  • data frame (or csv) in wide format: The read_count_df can be rearranged into wide format, where lines are ASVs, columns are sample(-replicate)s, and cells contain read counts. It is a more human-friendly format and is the basis for writing an ASV table, where this information can be completed with taxonomic assignments and other information.
  • false negative occurrence: An expected ASV in a mock sample that is not found in the data.
  • false positive occurrence: Unexpected presence of an ASV in a sample:
    • all occurrences in all negative controls,
    • unexpected occurrences in mock samples,
    • presence of an ASV in an incompatible habitat (e.g. an ASV with a high read count in samples of habitat 1 and a low read count in habitat 2 is considered a FP in habitat 2).
  • habitat: Habitat type of the organisms in real or mock samples. Use this only if organisms of the different habitats cannot appear in another habitat of the data set. Use NA otherwise.
  • long format: Each row is a single measurement. Typically, there are columns like ASV, sample, and read_count. This format is tidy and works well with many tidyverse functions.
  • Low Frequency Noise (LFN): ASVs present in low frequencies, likely to be due to errors.
  • merge: Assemble a forward and reverse read pair into a single sequence.
  • mock sample: An artificial mix of DNA of known organisms.
  • negative sample: Negative control.
  • real sample: An environmental sample.
  • replicate: Technical or biological replicate of a sample. Replicates must have numerical identifiers (e.g. sample tops1 has replicates 1, 2 and 3).
  • sample: Name of the environmental or control sample. Must be alphanumerical, without spaces.
  • sample-replicate: Each sample can have technical or biological replicates. sample-replicate refers to one replicate of a given sample.
  • sample_type: real/mock/negative
  • singleton: ASV with a single read in the whole data set.
  • tag: Short sequence at the extremity of the amplicon. It is used at the demultiplexing step to identify the sample-replicate a read comes from.
  • taxID: Numerical taxonomic identifier. It can be a valid NCBI taxID, or an arbitrary negative number for taxa not in NCBI.
  • trimming: Cutting the extremities of the sequences. It can be based on sequence quality, or on the detection of a tag or a primer.
  • true positive occurrence: An expected occurrence in a mock sample.
  • wide format: In vtamR wide-format data frames, each row is an ASV and each column represents a sample(-replicate); read counts are in the cells (see the sketch after this glossary).
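
The long and wide formats can be interconverted in R. A minimal sketch with tidyr/dplyr, using the hypothetical read_count_df_example from the read_count_df section above (vtamR's own output format may differ):

library(tidyr)
library(dplyr)

# long -> wide: one row per ASV, one column per sample-replicate
wide_example <- read_count_df_example %>%
  unite("sample_replicate", sample, replicate, sep = "-") %>%
  pivot_wider(names_from = sample_replicate,
              values_from = read_count,
              values_fill = 0)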

Troubleshooting

tmp_FunctionName_######## in vtamR directory

These are temporary directories that are automatically deleted at the end of the function. If the function ends prematurely due to an error, the directory may not be deleted; in that case, just delete it manually.

LFNvariant eliminates most occurrences of a frequent ASV

LFNvariant will filter out all occurrences where

(number of reads of the ASV in a sample-replicate) / (total number of reads of the ASV) < cutoff

As a consequence, if an ASV is present in most samples, the data set contains many samples, and the cutoff is relatively high, most occurrences can fall below the cutoff, and the total read count of the ASV will decrease strongly at this step.
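
The criterion can be illustrated on a long-format read count data frame. A hedged sketch with dplyr (not the vtamR implementation; read_count_df_example and the cutoff value are hypothetical):

library(dplyr)

cutoff <- 0.001
filtered <- read_count_df_example %>%
  group_by(asv) %>%
  mutate(asv_total = sum(read_count)) %>%   # total reads of the ASV in the data set
  ungroup() %>%
  filter(read_count / asv_total >= cutoff) %>%  # keep occurrences at or above the cutoff
  select(-asv_total)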

If the read count of an ASV in the output, as a proportion of its read count in the input, is less than min_read_count_prop, vtamR prints a warning. Take a close look at the variant: it can be the result of a general contamination, in which case the whole ASV could/should be eliminated from the data set. Otherwise, you might want to reduce the cutoff value of LFNvariant.

Memory issues for running swarm on the whole dataset

Possible workarounds:

  • Try to run Swarm sample by sample (by_sample=TRUE).
  • Run global_read_count first to eliminate singletons, then run swarm.

Both solutions make swarm a bit less efficient, but they allow the analyses to run to completion.

Memory issues in Merge on Windows

If input fastq files are very large and do not go through Merge, try working with uncompressed input files.

Files to keep

In the first steps (Merge, RandomSeq, SortReads), many files are produced, and some of them can be very large, especially before dereplication (i.e. before Dereplicate).

Compressing and decompressing between the different steps can take a while and can cause memory issues in some cases (especially on Windows). Therefore, vtamR uses uncompressed files by default. This behavior can be changed with the compress option.

Here is a list of files I would tend to keep, delete or compress:

  • Merge: Delete the output fasta files, but keep fastainfo.csv for information on read counts after Merge.
  • RandomSeq: Compress the fasta files, but keep them to ensure reproducibility in case of re-analysis.
  • SortReads: Delete the output fasta files, but keep sortedinfo.csv for information on read counts after SortReads.
  • Dereplicate: Write the dereplicated info to a file (1_before_filter.csv in the tutorial) and compress it after the analyses. This is an important file, since you can restart the analyses from it without redoing the lengthy merging and demultiplexing steps. Compress and keep the updated ASV list (ASV_list_with_IDs.csv in the tutorial), since it can be used to synchronize asv_ids when analyzing subsequent data sets.
  • Output of different filtering steps: Compress them if they are too large. They are useful for tracing back the history of a sample or an ASV (see HistoryBy) or making summary files (see SummarizeBy).

No or few sequences passing Merge

If the amplicon is shorter than the reads, use fastq_allowmergestagger=TRUE.