According to your wetlab and sequencing protocol each fastq files can contain one or more sample-replicates, and sequences may or may not contain tags (for demultiplexing) and primer sequences. In the following sections I show 3 different scenarios to obtain the read_count_df data frame, which the input to the filtering steps.
Load library
library(vtamR)
Set path to third party programs
# Example for Windows
cutadapt_path <- "C:/Users/Public/cutadapt"
vsearch_path <- "C:/Users/Public/vsearch-2.23.0-win-x86_64/bin/vsearch"
# Example for Linux
cutadapt_path <- "~/miniconda3/envs/vtam/bin/cutadapt" # v3.4
vsearch_path <- "~/miniconda3/envs/vtam/bin/vsearch" # v2.15.1
Set general parameters
num_threads <- 8
sep <- ","
num_threads
: Number of CPUs for multithreaded programssep
: Separator used in csv filesIn this scenario, each pair of fasta files correspond to a sample (or a replicate of a sample if you have replicates), so no demultiplexing is necessary.
The reads has been trimmed from all artificial add-ons, such as adapters, tags, indices and also from primers.
Read pairs should be quality filtered, merged and written to fasta format. This can be done by the Merge
function.
See the help (?Merge
) for setting the correct parameters for quality filtering.
Set input
Merge
is the list of the fastq file pairs that should be merged. The tag_fw
, primer_fw
, tag_rv
, primer_rv
are irrelevant in this case, just fill them with NA
.fastq_dir
: Directory containing the input fastq files.Merge
. It is the updated version of fastqinfo, where fastq file names have been replaced by fasta file names and the read counts are included for each file.outdir
: Name of the output directory.The demo files below are included with the vtamR
package, which is why we use system.file()
to access them in this tutorial. When using your own data, simply provide the file and directory names (e.g. ~/vtamR/fastq
). Make sure there is no space in the path and filenames.
fastq_dir <- system.file("extdata/demo/fastq", package = "vtamR")
fastqinfo <- system.file("extdata/demo/fastqinfo1.csv", package = "vtamR")
outdir <- "vtamR_demo_case1"
merged_dir <- file.path(outdir, "merged")
Merge fastq file pairs and quality filter reads
sortedinfo_df <- Merge(fastqinfo,
fastq_dir=fastq_dir,
vsearch_path=vsearch_path,
outdir=merged_dir
)
Dereplicate
The fasta files produced by Merge
can be read to a data frame and be dereplicated by the Dereplicate
function. See the help (?Dereplicate
) and tutorial more more information.
outfile <- file.path(outdir, "1_before_filter.csv")
read_count_df <- Dereplicate(sortedinfo_df,
dir=merged_dir,
outfile=outfile
)
This is one of the most frequent case. Each pair of fasta files correspond to a sample (or a replicate of a sample if you have replicates), so no demultiplexing is necessary.
The reads has been trimmed from all artificial add-ons, such as adapters, tags, BUT they still have the primers.
Read pairs should be quality filtered, merged and written to fasta format by Merge
function as in the previous section.
Then the TrimPrimer
function will trim the primers from the reads. See the help (?TrimPrimer
) for setting the correct parameters for primer trimming.
Set input
Merge
is the list of the fastq file pairs that should be merged. The primer_fw
, primer_rv
columns are irrelevant in this case, just fill them with NA
.fastq_dir
: Directory containing the input fastq files.Merge
. It is the updated version of fastqinfo
, where fastq file names have been replaced by fasta file names.fasta_dir
: Directory containing the input fasta files for TrimPrimer
. This directory is created by Merge
.check_reverse
is TRUE, TrimPrimer
checks the reverse complementary strand as well.sortedinfo_df
: is updated version of fastainfo. This data frame and the files listed in it are the input for Dereplicate
.outdir
: Name of the output directory.The demo files below are included with the vtamR
package, which is why we use system.file()
to access them in this tutorial. When using your own data, simply provide the file and directory names (e.g. ~/vtamR/fastq
). Make sure there is no space in the path and filenames.
fastq_dir <- system.file("extdata/demo/fastq", package = "vtamR")
fastqinfo <- system.file("extdata/demo/fastqinfo2.csv", package = "vtamR")
outdir <- "vtamR_demo_case2"
merged_dir <- file.path(outdir, "merged")
Merge fastq file pairs and quality filter reads
fastainfo_df <- Merge(fastqinfo,
fastq_dir=fastq_dir,
vsearch_path=vsearch_path,
outdir=merged_dir
)
Trim primers
sorted_dir <- file.path(outdir, "sorted")
sortedinfo_df <- TrimPrimer(fastainfo_df,
fasta_dir=merged_dir,
outdir=sorted_dir,
cutadapt_path=cutadapt_path,
vsearch_path=vsearch_path,
check_reverse=T,
primer_to_end=F
)
Dereplicate
The fasta files produced by Merge
can be read to a data frame and be dereplicated by the Dereplicate
function. See the help (?Dereplicate
) and tutorial more more information.
outfile <- file.path(outdir, "1_before_filter.csv")
read_count_df <- Dereplicate(sortedinfo_df,
dir=sorted_dir,
outfile=outfile
)