Depending on your wet lab and sequencing protocol, each FASTQ file may contain one or more sample-replicates. In addition, the reads may still include tags (used for demultiplexing) and primer sequences.

In the following sections, we present three common scenarios for generating the read_count_df data frame, which serves as the input for the filtering steps.

Set up

Load library

library(vtamR)

Set path to third party programs

# Example for Windows
cutadapt_path <- "C:/Users/Public/cutadapt"
vsearch_path <- "C:/Users/Public/vsearch-2.23.0-win-x86_64/bin/vsearch"
pigz_path <- "C:/Users/Public/pigz-win32/pigz" # optional to speed file compression
#  Example for Linux
cutadapt_path <- "~/miniconda3/envs/vtam/bin/cutadapt" # v3.4
vsearch_path <- "~/miniconda3/envs/vtam/bin/vsearch" # v2.15.1
pigz_path <- "~/miniconda3/envs/vtam/bin/pigz" # optional to speed file compression

Adapt the path to third party programs according to your installation (See Installation).

If the third party programs are in your PATH (See Installation), simply omit the cutadapt_path, pigz_path and vsearch_path arguments when calling the vtamR functions.
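
To see which of the programs are already reachable through your PATH, a quick check with base R can help. This is only a sketch: the executable names cutadapt, vsearch and pigz are assumed to match your installation, and the value returned for a missing program is typically an empty string but can depend on the operating system.

```r
# Look up the three external tools on the PATH.
# Sys.which() typically returns "" for a program it cannot find.
tools <- c("cutadapt", "vsearch", "pigz")
on_path <- Sys.which(tools)
print(on_path)

# Tools that were not found need an explicit *_path argument (pigz is optional)
missing_tools <- names(on_path)[on_path == ""]
if (length(missing_tools) > 0) {
  message("Not found in PATH: ", paste(missing_tools, collapse = ", "))
}
```

For any program reported as not found, keep passing the corresponding path variable defined above.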

compress = FALSE: In this tutorial we use uncompressed files, but see To compress or not to compress for the best strategy for you. See also Files to keep for how to handle the large intermediate files, which should be compressed or deleted once the analyses are complete.

Case 1 - One Sample per Fastq - No Tag - No Primer

In this scenario, each pair of fastq files corresponds to a sample (or a replicate of a sample if you have replicates), so no demultiplexing is necessary.

All artificial add-ons, such as adapters, tags and indices, as well as the primers, have already been trimmed from the reads.

Read pairs should be quality filtered, merged and written to fasta format. This can be done by the merge_fastq_pairs function.

See the help (?merge_fastq_pairs) for setting the correct parameters for quality filtering.

Set Input

  • fastqinfo: Either a csv file, or a data frame. The key information for merge_fastq_pairs is the list of the fastq file pairs that should be merged. The tag_fw, primer_fw, tag_rv and primer_rv columns are irrelevant in this case, just fill them with NA.
  • fastq_dir: Directory containing the input fastq files.
  • sampleinfo_df: Output of merge_fastq_pairs. It is the updated version of fastqinfo, where fastq file names have been replaced by fasta file names and the read counts are included for each file.
  • outdir: Name of the output directory.
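
If you prefer to pass fastqinfo as a data frame rather than a csv file, it could be built along the following lines. This is a sketch under assumptions: only the tag_fw, primer_fw, tag_rv and primer_rv column names come from the text above; the sample, replicate, fastq_fw and fastq_rv column names and all file names are hypothetical, so compare them with the demo file fastqinfo1.csv before using this layout.

```r
# Hypothetical fastqinfo for Case 1: one fastq pair per sample-replicate.
# Column names other than tag_fw, primer_fw, tag_rv and primer_rv are assumptions.
fastqinfo_toy <- data.frame(
  tag_fw    = NA,                      # no tags in Case 1
  primer_fw = NA,                      # primers already trimmed
  tag_rv    = NA,
  primer_rv = NA,
  sample    = c("sampleA", "sampleA"),
  replicate = c(1, 2),
  fastq_fw  = c("sampleA_1_R1.fastq", "sampleA_2_R1.fastq"),
  fastq_rv  = c("sampleA_1_R2.fastq", "sampleA_2_R2.fastq"),
  stringsAsFactors = FALSE
)
print(fastqinfo_toy)
```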

The demo files below are included with the vtamR package, which is why we use system.file() to access them in this tutorial. When using your own data, simply provide the file and directory names (e.g. ~/vtamR/fastq). Make sure there are no spaces in the path or file names.
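
With your own data, the no-space requirement can be checked in base R before running anything. The paths below are made-up examples, not part of the demo data.

```r
# Hypothetical paths; replace them with your own directories and files
paths <- c("~/vtamR/fastq", "vtamR demo case1")

# TRUE for every path that contains at least one space
has_space <- grepl(" ", paths, fixed = TRUE)

if (any(has_space)) {
  message("Paths containing spaces: ", paste(paths[has_space], collapse = "; "))
}
```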

fastq_dir <- system.file("extdata/demo/fastq", package = "vtamR")
fastqinfo <-  system.file("extdata/demo/fastqinfo1.csv", package = "vtamR")

outdir <- "vtamR_demo_case1"
merged_dir <- file.path(outdir, "merged")

Merge Fastq File Pairs and Quality Filter Reads

sampleinfo_df <- merge_fastq_pairs(
  fastqinfo, 
  fastq_dir=fastq_dir, 
  vsearch_path=vsearch_path, # can be omitted if VSEARCH is in the PATH
  outdir=merged_dir,
  compress=FALSE
  )

Dereplicate reads

The fasta files produced by merge_fastq_pairs can be read into a data frame and dereplicated by the dereplicate function. See the help (?dereplicate) and the tutorial for more information.

outfile <- file.path(outdir, "1_before_filter.csv")

read_count_df <- dereplicate(
  sampleinfo_df, 
  dir=merged_dir, 
  outfile=outfile
  )
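
To make the idea of dereplication concrete, here is a toy illustration in base R. It is not the vtamR implementation, and the column names sample, sequence and read_count are chosen for this example only: identical sequences within a sample are collapsed into one row that carries their count.

```r
# Toy reads: one row per read (hypothetical column names)
reads <- data.frame(
  sample   = c("s1", "s1", "s1", "s2", "s2"),
  sequence = c("ACGT", "ACGT", "TTGA", "ACGT", "TTGA"),
  stringsAsFactors = FALSE
)

# Each read counts once; summing per sample and sequence collapses duplicates
reads$read_count <- 1L
toy_counts <- aggregate(read_count ~ sample + sequence, data = reads, FUN = sum)
print(toy_counts)
```

The result has one row per unique sequence in each sample, which is the general shape of information the filtering steps work on.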

Case 2 - One Sample per Fastq - Primer - No Tag

This is one of the most frequent cases. Each pair of fastq files corresponds to a sample (or a replicate of a sample if you have replicates), so no demultiplexing is necessary.

All artificial add-ons, such as adapters and tags, have been trimmed from the reads, but the reads still contain the primers.

Read pairs should be quality filtered, merged and written to fasta format by the merge_fastq_pairs function as in the previous section.

Then the trim_primers function will trim the primers from the reads. See the help (?trim_primers) for setting the correct parameters for primer trimming.

Set Input

  • fastqinfo: Either a csv file, or a data frame.
    The key information for merge_fastq_pairs is the list of the fastq file pairs that should be merged. The tag_fw and tag_rv columns are irrelevant in this case, just fill them with NA.
  • fastq_dir: Directory containing the input fastq files.
  • fastainfo_df: Output of merge_fastq_pairs. It is the updated version of fastqinfo, where fastq file names have been replaced by fasta file names containing the merged sequences.
  • fasta_dir: Directory containing the input fasta files for trim_primers. This directory is created by merge_fastq_pairs.
  • check_reverse: If TRUE, trim_primers checks the reverse complementary strand as well.
  • sampleinfo_df: Updated version of fastainfo_df. This data frame and the files listed in it are the input for dereplicate.
  • outdir: Name of the output directory.
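
For readers unfamiliar with the term used for check_reverse, the reverse complementary strand of a sequence can be computed in base R as sketched below. The helper revcomp is defined here for illustration only and is not a vtamR function; it handles the four unambiguous uppercase bases and leaves any other character unchanged.

```r
# Illustrative helper (not part of vtamR): reverse complement of a DNA string
revcomp <- function(s) {
  comp <- chartr("ACGT", "TGCA", s)                 # complement each base
  paste(rev(strsplit(comp, "")[[1]]), collapse = "") # reverse the order
}

revcomp("AACG")
```

A read sequenced from the opposite strand carries the primers in this reverse-complemented orientation, which is why checking both strands can recover additional reads.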

The demo files below are included with the vtamR package, which is why we use system.file() to access them in this tutorial. When using your own data, simply provide the file and directory names (e.g. ~/vtamR/fastq). Make sure there are no spaces in the path or file names.

fastq_dir <- system.file("extdata/demo/fastq", package = "vtamR")
fastqinfo <-  system.file("extdata/demo/fastqinfo2.csv", package = "vtamR")

outdir <- "vtamR_demo_case2"
merged_dir <- file.path(outdir, "merged")

Merge Fastq File Pairs and Quality Filter Reads

fastainfo_df <- merge_fastq_pairs(
  fastqinfo, 
  fastq_dir=fastq_dir, 
  vsearch_path=vsearch_path, # can be omitted if VSEARCH is in the PATH
  outdir=merged_dir,
  compress=FALSE
  )

Trim Primers


demultiplexed_dir <- file.path(outdir, "demultiplexed")
sampleinfo_df <- trim_primers(
  fastainfo_df, 
  fasta_dir=merged_dir, 
  outdir=demultiplexed_dir, 
  cutadapt_path=cutadapt_path, # can be omitted if CUTADAPT is in the PATH
  vsearch_path=vsearch_path, # can be omitted if VSEARCH is in the PATH
  check_reverse=TRUE,
  primer_to_end=FALSE,
  compress=FALSE
  )

Dereplicate

The fasta files produced by trim_primers can be read into a data frame and dereplicated by the dereplicate function. See the help (?dereplicate) and the tutorial for more information.

outfile <- file.path(outdir, "1_before_filter.csv")

read_count_df <- dereplicate(
  sampleinfo_df, 
  dir=demultiplexed_dir, 
  outfile=outfile
  )

Case 3 - Several Samples per Fastq - Tags - Primers

In this case, one pair of fastq files contains reads from multiple samples or sample-replicates, so it is necessary to demultiplex them and trim off the tags and primers.

Read pairs should be quality filtered, merged and written to fasta format as in the previous sections.

Then the demultiplex_and_trim function will demultiplex the fasta files according to the tag combinations and trim the primers from the reads.

See the help (?demultiplex_and_trim) for setting the correct parameters for demultiplexing and primer trimming.
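
The principle of demultiplexing by tag combination can be illustrated with a toy example in base R. This is a simplified sketch, not how demultiplex_and_trim works internally: it requires exact tag matches at the very ends of the read, assumes the reverse tag appears reverse-complemented at the 3' end of the merged read, and uses made-up tags, reads and sample names.

```r
# Illustrative helper (not part of vtamR): reverse complement of a DNA string
revcomp <- function(s) {
  comp <- chartr("ACGT", "TGCA", s)
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

# Hypothetical tag combinations and the sample each combination marks
tag_table <- data.frame(
  tag_fw = c("ACAC", "GTGT"),
  tag_rv = c("TCTC", "TCTC"),
  sample = c("sampleA", "sampleB"),
  stringsAsFactors = FALSE
)

# Return the sample whose forward tag starts the read and whose
# reverse-complemented reverse tag ends it; NA if no combination matches
assign_sample <- function(read, tag_table) {
  for (i in seq_len(nrow(tag_table))) {
    fw_tag <- tag_table$tag_fw[i]
    rv_tag <- revcomp(tag_table$tag_rv[i])
    if (startsWith(read, fw_tag) && endsWith(read, rv_tag)) {
      return(tag_table$sample[i])
    }
  }
  NA_character_
}

reads <- c("ACACTTTTGGGGGAGA", "GTGTCCCCAAAAGAGA", "TTTTCCCCAAAAGAGA")
assigned <- vapply(reads, assign_sample, character(1),
                   tag_table = tag_table, USE.NAMES = FALSE)
print(assigned)
```

In this toy data, two samples share the same reverse tag and are told apart by the forward tag alone, while the third read matches no combination and stays unassigned.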

Set Input

  • fastqinfo: Either a csv file, or a data frame. The key information for merge_fastq_pairs is the list of the fastq file pairs that should be merged.
  • fastq_dir: Directory containing the input fastq files.
  • fastainfo_df: Output of merge_fastq_pairs. It is the updated version of fastqinfo, where fastq file names have been replaced by fasta file names.
  • fasta_dir: Directory containing the input fasta files for demultiplex_and_trim. This directory is created by merge_fastq_pairs.
  • check_reverse: If TRUE, demultiplex_and_trim checks the reverse complementary strand as well.
  • sampleinfo_df: Updated version of fastainfo_df. This data frame and the files listed in it are the input for dereplicate.
  • outdir: Name of the output directory.

The demo files below are included with the vtamR package, which is why we use system.file() to access them in this tutorial. When using your own data, simply provide the file and directory names (e.g. ~/vtamR/fastq). Make sure there are no spaces in the path or file names.

fastq_dir <- system.file("extdata/demo/fastq", package = "vtamR")
fastqinfo <-  system.file("extdata/demo/fastqinfo_mfzr_plate1.csv", package = "vtamR")

outdir <- "vtamR_demo_case3"
merged_dir <- file.path(outdir, "merged")
demultiplexed_dir <- file.path(outdir, "demultiplexed")

Merge Fastq File Pairs and Quality Filter Reads

fastainfo_df <- merge_fastq_pairs(
  fastqinfo, 
  fastq_dir=fastq_dir, 
  vsearch_path=vsearch_path, # can be omitted if VSEARCH is in the PATH
  outdir=merged_dir,
  compress=FALSE
  )

Demultiplex, Trim off Tags and Primers

sampleinfo_df <- demultiplex_and_trim(
  fastainfo_df, 
  fasta_dir=merged_dir, 
  outdir=demultiplexed_dir, 
  check_reverse=TRUE, 
  cutadapt_path=cutadapt_path, # can be omitted if CUTADAPT is in the PATH
  vsearch_path=vsearch_path, # can be omitted if VSEARCH is in the PATH
  compress=FALSE
  )

Dereplicate

The fasta files produced by demultiplex_and_trim can be read into a data frame and dereplicated by the dereplicate function. See the help (?dereplicate) and the tutorial for more information.

outfile <- file.path(outdir, "1_before_filter.csv")
read_count_df <- dereplicate(
  sampleinfo_df, 
  dir=demultiplexed_dir, 
  outfile=outfile
  )