What is it and when to use it

The mock_composition file describes which ASVs are expected in each mock sample.

It is a CSV file with the following columns:

  • sample: Name of the mock sample
  • action:
    • keep: Expected ASV in the mock that should be kept in the data set
    • tolerate: ASV that may be present in a mock but does not need to be kept in the data set (e.g. a badly amplified organism)
  • asv: sequence of the ASV
  • taxon: Optional; Name of the organism
  • asv_id: Optional; If there is a conflict between asv and asv_id, the asv_id is ignored
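
As an illustration of this layout, the following sketch builds a small mock_composition table in base R and writes it to CSV. The sequences are truncated for readability, the choice of Baetis fuscatus as a tolerated taxon is purely illustrative, and the output file name is hypothetical; a real file needs the full ASV sequences.

```r
# Minimal sketch of a mock_composition table.
# Sequences are truncated placeholders, not complete ASVs.
mock_composition_df <- data.frame(
  sample = c("tpos1", "tpos1", "tpos1"),
  action = c("keep", "keep", "tolerate"),
  asv    = c("ACTATATTTTATTTTTGG", "CCTTTATCTTGTATTTGG", "TTTATATTTCATTTTTGG"),
  taxon  = c("Caenis pusilla", "Phoxinus phoxinus", "Baetis fuscatus"),
  stringsAsFactors = FALSE
)

# Basic sanity checks: required columns and allowed action values
required_cols <- c("sample", "action", "asv")
stopifnot(all(required_cols %in% colnames(mock_composition_df)))
stopifnot(all(mock_composition_df$action %in% c("keep", "tolerate")))

# Write the table as CSV (hypothetical file name, in a temporary directory)
mock_composition_file <- file.path(tempdir(), "mock_composition.csv")
write.csv(mock_composition_df, mock_composition_file, row.names = FALSE)
```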

The mock_composition file is essential for running the following functions, which are used either to determine the optimal parameter values for the LFN filters or to evaluate the filtering procedure in terms of precision (TP / (TP + FP)) and sensitivity (TP / (TP + FN)).

  • OptimizePCRerror
  • OptimizeLFNsampleReplicate
  • OptimizeLFNreadCountLFNvariant
  • MakeKnownOccurrences
  • ASVspecificCutoff
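
To make the two evaluation metrics mentioned above concrete, here is a minimal sketch in base R. The counts are invented for illustration and do not come from the demo data.

```r
# Hypothetical counts of true positives, false positives and false negatives
tp <- 18
fp <- 2
fn <- 2

# Precision: share of retained occurrences that are expected
precision <- tp / (tp + fp)

# Sensitivity: share of expected occurrences that are retained
sensitivity <- tp / (tp + fn)

precision    # 0.9
sensitivity  # 0.9
```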

The mock_composition file is also useful, although not essential, for the WriteASVtable function: it lets you add a column to the output that makes it easy to find the expected occurrences in each mock sample.

Constructing the mock_composition File

I suggest two different methods to construct the mock_composition file:

  1. Using the MakeMockCompositionLTG function
  2. Prefiltering the dataset and manually selecting the expected ASVs

1. Using MakeMockCompositionLTG

This function provides a convenient and fast way to build the mock_composition file. However, some expected occurrences may occasionally be missed. If that happens, you can manually select the expected ASVs from a prefiltered dataset (as described later).

Prerequisites

  • Reference sequences

    Collect a reference sequence for each species expected in the mock samples. Each reference should cover at least 70% of the region amplified by the primers. It may be longer or slightly shorter than the ASV and may differ slightly from the exact expected sequence.

    The reference sequences must be in FASTA format, and the FASTA headers should include a valid NCBI taxonomic identifier in the following format:

    >SequenceName taxID=12345

    The taxID must correspond to a valid entry in the NCBI Taxonomy database.

  • Taxonomy file

    The function requires a taxonomy file. The file distributed with the COInr database is suitable for all Eukaryotes, even if your marker is not COI (see TaxAssign reference database).

  • Read count data

    Provide a read_count_df data frame containing read counts for each mock sample. It is recommended to use a dataset that has already undergone initial filtering steps to remove artefactual ASVs and drastically reduce the number of ASVs (e.g. after denoising with SWARM, LFNglobalReadCount, FilterIndel, FilterCodonStop, FilterExternalContaminant, and FilterChimera).
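
Before running the function, it can help to check that the FASTA headers of the reference sequences follow the taxID format described above. The following base R sketch is only a suggestion and not part of vtamR. The header strings are made-up examples, apart from the format itself.

```r
# Example headers; the second one lacks a taxID and should be flagged
headers <- c(">SequenceName taxID=12345",
             ">AnotherSequence",
             ">ThirdSequence taxID=189839")

# TRUE when the header contains "taxID=" followed by digits
has_taxid <- grepl("taxID=[0-9]+", headers)

# Extract the numeric taxID (NA when it is missing)
taxid <- ifelse(has_taxid,
                sub(".*taxID=([0-9]+).*", "\\1", headers),
                NA_character_)
taxid <- as.integer(taxid)

# Summary table: one row per header
header_check <- data.frame(header = headers,
                           has_taxid = has_taxid,
                           taxid = taxid,
                           stringsAsFactors = FALSE)
header_check
```

With a real file, the headers could be read with grep("^>", readLines(reference_mock_fasta), value = TRUE). This check only validates the format; it does not confirm that the taxID exists in the NCBI Taxonomy database.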

Function Overview

The MakeMockCompositionLTG function performs the following steps:

  1. Creates a small BLAST database from the reference sequences corresponding to the expected species or closely related taxa.
  2. Assigns taxonomy to all ASVs detected in the mock samples using this custom database. Since the database is small, the assignment is fast.
  3. Selects the most abundant ASV for each taxon.
  4. Generates a mock_composition template file, which should be reviewed and, if necessary, edited by the user.
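
Step 3 can be illustrated on its own, without BLAST. The sketch below is not the package's implementation; it only shows the idea of keeping the most abundant ASV per taxon in base R. The column name read_count is an assumption made for this example, and the values are a small excerpt of the tpos1 read counts from the demo data.

```r
# Toy input: ASVs with their assigned taxon and read count
asv_counts <- data.frame(
  asv_id     = c(6, 6033, 531, 4, 5298),
  ltg_name   = c("Baetis rhodani", "Baetis rhodani", "Baetis rhodani",
                 "Hydropsyche pellucidula", "Hydropsyche pellucidula"),
  read_count = c(4303, 12, 3, 16292, 2),
  stringsAsFactors = FALSE
)

# Sort by taxon, then by decreasing read count
ordered <- asv_counts[order(asv_counts$ltg_name, -asv_counts$read_count), ]

# Keep the first (most abundant) row of each taxon
top_asv <- ordered[!duplicated(ordered$ltg_name), ]

top_asv$asv_id  # 6 4
```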

Output

The main output file is mock_composition_template_to_check.csv. This file serves as a template for the final mock_composition file.

It contains the most abundant sequence for each ltg_name identified in the taxonomic assignment output, repeated for each mock sample. If different mock samples have distinct compositions, you should remove the lines corresponding to taxa that are not expected in a particular sample.
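
Editing the template can be done in a spreadsheet or directly in R. The sketch below assumes that the template has the columns described at the beginning of this document (sample, action, asv, taxon). The second sample name (tpos2), the placeholder sequences, and the assumption that Phoxinus phoxinus is absent from tpos2 are all hypothetical.

```r
# Toy template: the same three taxa repeated for two mock samples
template <- data.frame(
  sample = rep(c("tpos1", "tpos2"), each = 3),
  action = "keep",
  asv    = rep(c("SEQ_A", "SEQ_B", "SEQ_C"), times = 2),  # placeholder sequences
  taxon  = rep(c("Caenis pusilla", "Phoxinus phoxinus", "Baetis rhodani"), times = 2),
  stringsAsFactors = FALSE
)

# Suppose (hypothetically) that Phoxinus phoxinus was not included in tpos2
not_expected <- template$sample == "tpos2" & template$taxon == "Phoxinus phoxinus"

# Drop the corresponding line and keep everything else
mock_composition_df <- template[!not_expected, ]

nrow(mock_composition_df)  # 5
```

With real data, the template would be read with read.csv() from the output directory instead of being built by hand.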

Note: The file does not include sequences for species in the custom database that did not show significant similarity to any ASV. This may occur if:

  • the species was not amplified in the mock samples, or
  • the reference sequence in the custom database is incorrect.

We will use some of the files created by the first part of the Tutorial (up to FilterChimera).

  • The demo files are included in the vtamR package, hence the use of system.file(). When using your own data, just enter your file names.
  • read_count_file is the output of FilterChimera in the Tutorial.
  • The taxonomy file and blast_path are set up as in the Tutorial.

library(vtamR)

read_count_file <- system.file("extdata/demo/7_FilterChimera.csv", package = "vtamR")
reference_mock_fasta <- system.file("extdata/demo/mock_ncbi.fasta", package = "vtamR")
sampleinfo <- system.file("extdata/demo/sampleinfo.csv", package = "vtamR")
taxonomy <- system.file("extdata/db_test/taxonomy_reduced.tsv", package = "vtamR")
blast_path <- "blastn" # Adapt this if BLAST is not in your PATH

outdir_mock <- "mock_composition"

mock_template <- MakeMockCompositionLTG(read_count = read_count_file,
                                        fas = reference_mock_fasta,
                                        taxonomy = taxonomy,
                                        sampleinfo = sampleinfo,
                                        outdir = outdir_mock,
                                        blast_path = blast_path)

Note: If BLAST is in your PATH (see Installation), you can omit the blast_path argument.

2. Prefiltering the dataset and manually picking the expected ASVs

The idea is to:

  • Prefilter your data set
  • Assign the ASVs to taxa
  • Examine the ASVs in the mock samples and their read counts, and pick the correct sequences.

I suggest that you start by filtering/denoising your data set using at least some of the filtering functions presented in the Tutorial. This will eliminate most of the erroneous ASVs, so it will be easier to identify the expected ASVs in your mock samples.

Set parameters and access the demo files

We will use some of the files created by the first part of the Tutorial (up to FilterChimera).

  • The demo files are included in the vtamR package, hence the use of system.file(). When using your own data, just enter your file names.
  • read_count_file is the output of FilterChimera in the Tutorial.
  • The blast_db and taxonomy are set up as in the Tutorial.

library(vtamR)
library(dplyr)

read_count_file <- system.file("extdata/demo/7_FilterChimera.csv", package = "vtamR")
taxonomy <- system.file("extdata/db_test/taxonomy_reduced.tsv", package = "vtamR")
sampleinfo <- system.file("extdata/demo/sampleinfo.csv", package = "vtamR")
blast_db <- system.file("extdata/db_test", package = "vtamR")
blast_db <- file.path(blast_db, "COInr_reduced")
blast_path <- "blastn" # Adapt this if BLAST is not in your PATH

Let’s limit the analyses to the mock samples.

read_count_df <- read.csv(read_count_file)
sampleinfo_df <- read.csv(sampleinfo)

# select mock samples in sampleinfo
mock_samples <- sampleinfo_df %>%
  filter(sample_type == "mock")
# select mock samples from read_count_df
read_count_mock <- read_count_df %>%
  filter(sample %in% mock_samples$sample)

Assign taxa to ASVs

TaxAssignLTG assigns all ASVs in the input CSV file or data frame (here, read_count_mock).

See more details of taxonomic assignment here.

Note: If BLAST is in your PATH (see Installation), you can omit the blast_path argument.

asv_tax <- TaxAssignLTG(asv=read_count_mock, 
                     taxonomy=taxonomy, 
                     blast_db=blast_db,
                     quiet=TRUE,
                     blast_path = blast_path 
                     )

Make an ASV table with taxonomic assignments

Make a data frame with ASVs and read counts in wide format and add their taxonomic assignments. This format is easier for humans to read than read_count_df.

See details of WriteASVtable here.


asv_table_mock <- WriteASVtable(read_count_mock, 
                               sampleinfo=sampleinfo, 
                               asv_tax=asv_tax,
                               pool_replicates=TRUE)

Sort the output by the taxon name and then by decreasing read count.

asv_table_mock <- asv_table_mock %>%
  arrange(ltg_name, desc(tpos1))

Let’s look at the ASVs present in tpos1.

knitr::kable(asv_table_mock, format = "markdown")
asv_id tpos1 ltg_taxid ltg_name ltg_rank ltg_rank_index domain_taxid domain kingdom_taxid kingdom phylum_taxid phylum class_taxid class order_taxid order family_taxid family genus_taxid genus species_taxid species pid pcov phit taxn seqn refres ltgres asv
219 132 6656 Arthropoda phylum 3.0 2759 Eukaryota 33208 Metazoa 6656 Arthropoda NA NA NA NA NA NA NA NA NA NA 97 70 70 1 1 8 8 ACTTTATTTCATTTTCGGAACATTTGCAGGAGTTGTAGGAACTTTACTTTCATTATTTATTCGTCTTGAATTAGCTTATCCAGGAAATCAATTTTTTTTAGGAAATCACCAACTTTATAATGTGGTTGTGACAGCACATGCTTTTATCATGATTTTTTTCATGGTTATGCCGATTTTAATC
211 83 6656 Arthropoda phylum 3.0 2759 Eukaryota 33208 Metazoa 6656 Arthropoda NA NA NA NA NA NA NA NA NA NA 97 70 70 1 1 8 8 ACTTTATTTCATTTTCGGAACATTTGCAGGAGTTGTAGGAACTTTACTTTCATTATTTATTCGACTAGAATTAGCTTATCCAGGAAATCAATTTTTTTTAGGAAATCACCAACTTTATAATGTGGTTGTGACAGCACATGCTTTTATCATGATTTTTTTCATGGTTATGCCGATTTTAATC
2197 19 6656 Arthropoda phylum 3.0 2759 Eukaryota 33208 Metazoa 6656 Arthropoda NA NA NA NA NA NA NA NA NA NA 90 70 70 3 3 7 8 TCTTTATTTCATTTTCGGAACATTTGCAGGAGTTGTCGGAACTTTACTTTCATTATTTATTCGTCTGGAATTAGCATACCCAGGAAATCAATTTTTTTTAGGAAACCACCAACTTTATAATGTAGTTGTAACAGCACATGCTTTTATTATGATTTTTTTTATGGTTATGCCAATTTTAATC
2264 442 1077837 Baetis fuscatus species 8.0 2759 Eukaryota 33208 Metazoa 6656 Arthropoda 50557 Insecta 30073 Ephemeroptera 172515 Baetidae 189838 Baetis 1077837 Baetis fuscatus 100 70 70 1 1 8 8 TTTATATTTCATTTTTGGTGCATGATCAGGTATGGTGGGTACTTCCCTTAGTTTATTAATTCGAGCAGAACTTGGTAATCCTGGTTCTTTGATTGGCGATGATCAGATTTATAACGTTATTGTCACTGCCCATGCTTTTATTATGATTTTTTTTATAGTGATACCTATTATAATT
6 4303 189839 Baetis rhodani species 8.0 2759 Eukaryota 33208 Metazoa 6656 Arthropoda 50557 Insecta 30073 Ephemeroptera 172515 Baetidae 189838 Baetis 189839 Baetis rhodani 100 70 70 1 1 8 8 TCTATATTTCATTTTTGGTGCTTGGGCAGGTATGGTAGGTACCTCATTAAGACTTTTAATTCGAGCCGAGTTGGGTAACCCGGGTTCATTAATTGGGGACGATCAAATTTATAACGTAATCGTAACTGCTCATGCCTTTATTATGATTTTTTTTATAGTGATACCTATTATAATT
6033 12 189839 Baetis rhodani species 8.0 2759 Eukaryota 33208 Metazoa 6656 Arthropoda 50557 Insecta 30073 Ephemeroptera 172515 Baetidae 189838 Baetis 189839 Baetis rhodani 97 70 70 1 1 8 8 TCTATATTTCATTTTTGGTGCTTGGGCAGGTATGGTAGGTACCTCATTAAGACTTTTAATTCGAGCCGAGTTGGGTAACCCGGGTTCATTAATTGGGGACGATCAAATTTATAACGTAATCGTAACTGATCATGGCTTTATTATGATTTTTTTTATAGTGATACCTATTATAATT
531 3 189839 Baetis rhodani species 8.0 2759 Eukaryota 33208 Metazoa 6656 Arthropoda 50557 Insecta 30073 Ephemeroptera 172515 Baetidae 189838 Baetis 189839 Baetis rhodani 100 70 70 1 1 8 8 ACTTTATTTTATTTTTGGTGCTTGGGCAGGTATGGTAGGTACCTCATTAAGACTTTTAATTCGAGCCGAGTTGGGTAACCCGGGTTCATTAATTGGGGACGATCAAATTTATAACGTAATCGTAACTGCTCATGCCTTTATTATGATTTTTTTTATAGTGATACCTATTATAATT
1 294 1592914 Caenis pusilla species 8.0 2759 Eukaryota 33208 Metazoa 6656 Arthropoda 50557 Insecta 30073 Ephemeroptera 197146 Caenidae 197147 Caenis 1592914 Caenis pusilla 100 70 70 1 1 8 8 ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC
7474 4 7149 Chironomidae family 6.0 2759 Eukaryota 33208 Metazoa 6656 Arthropoda 50557 Insecta 7147 Diptera 7149 Chironomidae NA NA NA NA 85 70 70 4 4 6 7 ACTATATTTTATTTTTGGGGCATGGTCAGGAATAGTTGGTACTTCCCTTAGTATCCTAATTCGAGCTGAACTAGGACATGCCGGCTCCCTAATTGGAGACGATCAAATTTATAATGTAATCGTTACTGCTCATGCTTTTGTAATAATTTTTTTTATAGTTATACCTATTTTAATT
6319 2 41828 Chironomoidea superfamily 5.5 2759 Eukaryota 33208 Metazoa 6656 Arthropoda 50557 Insecta 7147 Diptera NA NA NA NA NA NA 85 70 70 4 4 6 7 TTTATATTTTATTTTTGGTATTTGATCAGGTATAGTGGGTACTTCTTTGAGCTTAATAATTCGTACAGAATTAGGTCAGCCAGGTTATTTAATTGGAGATGACCAAATTTATAATGTTATTGTAACTGCTCATGCTTTTATTATAATTTTCTTTATAGTGATACCTATTATAATT
5760 4 33392 Endopterygota cohort 4.5 2759 Eukaryota 33208 Metazoa 6656 Arthropoda 50557 Insecta NA NA NA NA NA NA NA NA 85 70 70 4 4 6 7 CCTTTATTTTATTTTTGGTGCTTGATCTGGTATAGTTGGTACTTCTTTAAGAATGCTAATTCGAGCAGAATTAGGACGTCCAGGAACATTTATTGGAGATGACCAAGTTTATAATGTTATTGTAACAGCTCATGCTTTTATTATAATTTTTTTTATAGTTATACCTATTTTAATT
4 16292 869943 Hydropsyche pellucidula species 8.0 2759 Eukaryota 33208 Metazoa 6656 Arthropoda 50557 Insecta 30263 Trichoptera 41030 Hydropsychidae 50443 Hydropsyche 869943 Hydropsyche pellucidula 100 70 70 1 1 8 8 CCTTTATTTTATTTTCGGTATCTGATCAGGTCTCGTAGGATCATCACTTAGATTTATTATTCGAATAGAATTAAGAACTCCTGGTAGATTTATTGGCAACGACCAAATTTATAACGTAATTGTTACATCTCATGCATTTATTATAATTTTTTTTATAGTTATACCAATCATAATT
5298 2 869943 Hydropsyche pellucidula species 8.0 2759 Eukaryota 33208 Metazoa 6656 Arthropoda 50557 Insecta 30263 Trichoptera 41030 Hydropsychidae 50443 Hydropsyche 869943 Hydropsyche pellucidula 97 70 70 1 1 8 8 CCTTTATTTTATTTTCGGTATCTGATCAGGTCTCGTAGGATCATCACTTAGATTTATTATTCGAATAGAATTAAGAACTCCTGGTAGATTTATTGGCAACGACCAAATTTATAACGTAATCGTAACTGCTCATGCCTTTATTATAATTTTTTTTATAGTTATACCAATCATAATT
5756 11 43808 Orthocladiinae subfamily 6.5 2759 Eukaryota 33208 Metazoa 6656 Arthropoda 50557 Insecta 7147 Diptera 7149 Chironomidae NA NA NA NA 90 70 70 3 3 7 8 CCTTTATTTTATTTTTGGTGCTTGATCAGGGATAGTGGGAACTTCTTTAAGAATTCTTATTCGAGCTGAACTTGGTCATGCGGGATCTTTAATCGGAGACGATCAAATTTACAATGTAATTGTTACTGCACACGCCTTTGTAATAATTTTTTTTATAGTTATACCTATTTTAATT
390 6 43808 Orthocladiinae subfamily 6.5 2759 Eukaryota 33208 Metazoa 6656 Arthropoda 50557 Insecta 7147 Diptera 7149 Chironomidae NA NA NA NA 90 70 70 3 3 7 8 ACTTTATTTTATTTTTGGTGCTTGATCAGGAATAGTAGGAACTTCTTTAAGAATTCTAATTCGAGCTGAATTAGGTCATGCCGGTTCATTAATTGGAGATGATCAAATTTATAATGTAATTGTAACTGCTCATGCTTTTATTATAATTTTTTTTATAGTTATACCAATCATAATT
1677 2 43808 Orthocladiinae subfamily 6.5 2759 Eukaryota 33208 Metazoa 6656 Arthropoda 50557 Insecta 7147 Diptera 7149 Chironomidae NA NA NA NA 90 70 70 3 3 7 8 CTTATATTTTATTTTTGGTGCTTGATCAGGGATAGTGGGAACTTCTTTAAGAATTCTTATTCGAGCTGAACTTGGTCATGCGGGATCTTTAATCGGAGACGATCAAATTTACAATGTAATTGTTACTGCACACGCCTTTGTAATAATTTTTTTTATAGTGATACCTATTATAATT
5917 2 43808 Orthocladiinae subfamily 6.5 2759 Eukaryota 33208 Metazoa 6656 Arthropoda 50557 Insecta 7147 Diptera 7149 Chironomidae NA NA NA NA 90 70 70 3 3 7 8 TCTATATTTCATTTTTGGTGCTTGATCAGGGATAGTGGGAACTTCTTTAAGAATTCTTATTCGAGCTGAACTTGGTCATGCGGGATCTTTAATCGGAGACGATCAAATTTACAATGTAATTGTTACTGCACACGCCTTTGTAATAATTTTTTTTATAGTTATACCTATTTTAATT
1753 8 1437201 Pentapetalae clade 4.5 2759 Eukaryota 33090 Viridiplantae 35493 Streptophyta 3398 Magnoliopsida NA NA NA NA NA NA NA NA 100 70 70 1 1 8 8 TCTATATTTCATCTTCGGTGCCATTGCTGGAGTGATGGGCACATGCTTCTCAGTACTGATTCGTATGGAATTAGCACGACCCGGCGATCAAATTCTTGGTGGGAATCATCAACTTTATAATGTTTTAATAACGGCTCACGCTTTTTTAATGATCTTTTTTATGGTTATGCCGGCGATGATA
3 216 58324 Phoxinus phoxinus species 8.0 2759 Eukaryota 33208 Metazoa 7711 Chordata 186623 Actinopteri 7952 Cypriniformes 2743726 Leuciscidae 42662 Phoxinus 58324 Phoxinus phoxinus 100 70 70 1 1 8 8 CCTTTATCTTGTATTTGGTGCCTGGGCCGGAATGGTAGGGACCGCCCTAAGCCTTCTTATTCGGGCCGAACTAAGCCAGCCTGGCTCGCTATTAGGTGATAGCCAAATTTATAATGTTATTGTTACCGCCCACGCCTTCGTAATAATTTTCTTTATAGTCATGCCAATTCTCATT
27 212 33317 Protostomia clade 2.5 2759 Eukaryota 33208 Metazoa NA NA NA NA NA NA NA NA NA NA NA NA 80 70 70 4 4 6 7 ACTATACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT
2 1608 1042866 Rheocricotopus chalybeatus species 8.0 2759 Eukaryota 33208 Metazoa 6656 Arthropoda 50557 Insecta 7147 Diptera 7149 Chironomidae 611384 Rheocricotopus 1042866 Rheocricotopus chalybeatus 97 70 70 1 1 8 8 ACTTTATTTTATTTTTGGTGCTTGATCAGGAATAGTAGGAACTTCTTTAAGAATTCTAATTCGAGCTGAATTAGGTCATGCCGGTTCATTAATTGGAGATGATCAAATTTATAATGTAATTGTAACTGCTCATGCTTTTGTAATAATTTTCTTTATAGTTATACCTATTTTAATT
414 15 1042866 Rheocricotopus chalybeatus species 8.0 2759 Eukaryota 33208 Metazoa 6656 Arthropoda 50557 Insecta 7147 Diptera 7149 Chironomidae 611384 Rheocricotopus 1042866 Rheocricotopus chalybeatus 97 70 70 1 1 8 8 ACTTTATTTTATTTTTGGTGCTTGATCAGGAATAGTAGGAACTTCTTTAAGAATTCTAATTCGAGCTGAATTAGGTCATGCCGGTTCATTAATTGGAGATGATCAAATTTATAATGTAATTGTAACTGCTCATGCTTTTGTAATAATTTTCTTTATAGTTATACCAATCATAATT
1788 2 1042866 Rheocricotopus chalybeatus species 8.0 2759 Eukaryota 33208 Metazoa 6656 Arthropoda 50557 Insecta 7147 Diptera 7149 Chironomidae 611384 Rheocricotopus 1042866 Rheocricotopus chalybeatus 97 70 70 1 1 8 8 TCTATATTTCATTTTTGGTGCTTGATCAGGAATAGTAGGAACTTCTTTAAGAATTCTAATTCGAGCTGAATTAGGTCATGCCGGTTCATTAATTGGAGATGATCAAATTTATAATGTAATTGTAACTGCTCATGCTTTTGTAATAATTTTCTTTATAGTTATACCTATTTTAATT
2299 7 1216507 Simulium balcanicum species 8.0 2759 Eukaryota 33208 Metazoa 6656 Arthropoda 50557 Insecta 7147 Diptera 7190 Simuliidae 7191 Simulium 1216507 Simulium balcanicum 100 70 70 1 1 8 8 TTTATATTTTATTTTTGGAGCCTGAGCTGGAATAGTAGGTACTTCCCTTAGTATACTTATTCGAGCCGAATTAGGACACCCAGGCTCTCTAATTGGAGACGACCAAATTTATAATGTAATTGTTACTGCTCATGCTTTTGTAATAATTTTTTTTATAGTTATGCCAATTATAATT
2295 10 697243 Simulium lineatum species 8.0 2759 Eukaryota 33208 Metazoa 6656 Arthropoda 50557 Insecta 7147 Diptera 7190 Simuliidae 7191 Simulium 697243 Simulium lineatum 100 70 70 1 1 8 8 TTTATATTTTATTTTTGGAGCCTGAGCTGGAATAGTAGGTACTTCCCTTAGTATACTTATTCGAGCCGAATTAGGACACCCAGGATCTCTAATTGGAGACGACCAAATTTATAATGTAATTGTTACTGCTCATGCTTTTGTAATAATTTTTTTTATAGTTATACCAATTATAATT
569 2 1419339 Simulium pseudequinum species 8.0 2759 Eukaryota 33208 Metazoa 6656 Arthropoda 50557 Insecta 7147 Diptera 7190 Simuliidae 7191 Simulium 1419339 Simulium pseudequinum 100 70 70 1 1 8 8 ATTATATTTTATTTTTGGGGCCTGAGCAGGAATAGTAGGTACTTCCCTTAGTATACTTATTCGAGCTGAATTAGGACACCCAGGATCTTTAATTGGTGATGACCAAATTTATAATGTAATTGTTACAGCTCATGCTTTCGTAATAATTTTTTTTATAGTTATACCAATTATAATT
5 338 611678 Synorthocladius semivirens species 8.0 2759 Eukaryota 33208 Metazoa 6656 Arthropoda 50557 Insecta 7147 Diptera 7149 Chironomidae 611392 Synorthocladius 611678 Synorthocladius semivirens 97 70 70 1 1 8 8 CTTATATTTTATTTTTGGTGCTTGATCAGGGATAGTGGGAACTTCTTTAAGAATTCTTATTCGAGCTGAACTTGGTCATGCGGGATCTTTAATCGGAGACGATCAAATTTACAATGTAATTGTTACTGCACACGCCTTTGTAATAATTTTTTTTATAGTTATACCTATTTTAATT
597 156 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ATTGTACCTTATATTTGCCTTATTTTCAGGGCTATTAGGTACTGCTTTTTCTGTTTTAATAAGACTTGAATTATCAGGACCTGGTGTACAATACATAGCTGATAACCAACTTTATAACAGTATAATTACTGCACATGCAATACTTATGATTTTCTTCATGGTTATGCCTGCTATGATA
170 85 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ACTCTATTTAATATTTGCTGCATTTTCAGGGGTTATAGGAACAATATTTTCTATAATTATAAGAATGGAACTTGCTTATCCAGGTGATCAAATATTGAATGGTAATCACCAACTTTATAATGTTATTGTAACTGCTCATGCATTTGTAATGATTTTTTTTATGGTTATGCCTGCCTTGATT
557 48 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ATTATACCTTATCTTTTCTCTTTTCTCAGGTTTACTTGGAACAGCATTTTCAGTTTTAATAAGACTTGAATTATCTGGACCTGGTGTTCAGTACATAGCAGACAATCAGTTATACAATAGTATTATTACAGCACACGCAATATTAATGATTTTCTTTATGGTTATGCCAGCAATGATT
6207 44 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA TCTTTATATTATTTACGGCTTTCTCATAGGATTGGTTGGTACATTTTTTTCCGCTGTCATTCGTATTCAACTCATGTACCCTGGTTCGTTGTTTTTGGGTGGTAATTACCATTATTATAATACTGTAATTACAGCGCACGCACTTGTGATAATTTTTTTTATGGTCATACCAGTGTTGATT
2373 34 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA TCTTTACTTAATTTTTGGTGCTATTTCTGGTGTAGCTGGAACTGCTTTATCACTTTACATCAGATTTACATTATCTCAACCAAACTCGAGTTTTTTAGAATATAACCACCATTTATATAATGTAATTGTTACAGGACATGCACTTATAATGGTTTTTTTTGTAGTAATGCCTATTTTAATT
249 24 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ACTTTATTTTATTTTCGGAGCGTGGTCGGGGATGGTAGGCACATCTCTGAGTCTTTTAATTCGAGCCGAATTGGGTAATCCTGGTTCACTAATTGGGGATGACCAGATTTACAACGTTATTGTAACAGCCCATGCTTTTATTATGATTTTTTTTATAGTAATGCCAATTATGATT
7442 22 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA AATGTATCTAATATTTGCAATATTTGCAGGCATTGTTGGTGGACTAATGTCAGTGATACTCAGGCTAGAACTCGCACAACCTGGTAACCAGTTTTTAGGCGGCGATCATCAATTTTATAATGTTATGCTCACTGCTCACGCACTTGTCATGGTATTTTTTATGATTATGCCTGGGCTTTTC
8681 16 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA TTTATATTTAATATTTGGGCTAATAGCGGGTGTAATAGGAACGTTATTTTCGATATTAATTAGATTAGAATTAGCCTATCCAGGGAATCAATATTTTTTGGGAGATCATCAATTTTATAATGTTGTTGTTACATCACATGCGTTTATTATGATTTTTTTTATGGTAATGCCGGCATTTGTT
1711 15 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA CTTGTACATGATTTTCGGAACGTTGGCCGGAGTGGTCGGAACGACGTTGTCGGTATGGATGCGAATGGAATTGGCGGCACCGGGAGTGCAAGCATTGTCGGGAAACCATCAGTTGTATAACGTGATGGTGACGGCACATGCCTTCATCATGATTTTCTTCTTCGTGATGCCCTTTTTGATT
646 14 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA CCTATACCTAGTATTTGCAGTATTTGCAGGTATAATTGGTACAGCATTTTCAGTACTAATTCGTATGGAACTTGCAGCACCAGGAGTACAATATCTTAACGGAGATCACCAACTTTATAATGTAGTTATTACTGCACATGCGCTAATTATGATTTTCTTTATGGTTATGCCTGCTCTCGTG
1636 14 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA CTTATATCTATTATTTGCAGGCGTTTCGGGTATTGCCGGCACTGTTTTATCTTTATATATACGAGCTACACTAGCAACTCCTGCTTCCAATTTTTTAAGCAAAAATCATCACTTGTATAACGTAATAGTGACAGGCCATGCGTTTTTAATGATTTTTTTTTTAGTAATGCCTGCTCTTATA
626 11 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ATTGTACTTGATTTATGGGGGATTTGCTGGTTTAATTGGAACGATGTTCTCTGTTCTAATAAGAATGGAACTATCATCACCCGGTAATACTATACTAGCTGGTAACTATCAATACTATAATGTTATAGTAACTGCGCATGCTTTCATTATGATCTTCTTTTTTGTTATGCCTGCTATGATG
7469 10 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ACTATACTTAATCTTTGCATTATTTTCTGGATTATTAGGTACAGCGTTTTCTGTTCTTATAAGATTAGAATTAAGTGGGCCAGGTGTTCAATATATAGCGGACAATCAACTATACAACAGTGTTATTACAGCACACGCTATCTTAATGATATTCTTTATGGTTATGCCTGCAATGATA
4781 9 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ATTATACCTTATATTTTCTCTGTTTTCGGGTTTACTTGGAACCGCTTTTTCAGTTTTAATAAGACTTGAATTATCTGGACCTGGTGTTCAGTACATAGCAGATAACCAATTATACAATAGTATAATTACAGCACACGCGATACTTATGATTTTCTTTATGGTTATGCCAGCAATGATT
2190 8 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA TCTTTACCTGATCTTCGCCGTATTCTCAGGAATGATTGGTACAGCATTCAGTGTAATTATTCGAATGGAACTTGCTGCGCCCGGTGTGCAATACCTTCACGGTAACCACCAACTATATAACGTAATTATTACAGCCCACGCCTTCCTAATGATCTTTTTCATGGTTATGCCTGGTCTTGTG
7 6 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA AATGTATCTAATCTTTGGAGGTTTTTCTGGTATTATTGGAACAGCTTTATCTATTTTAATCAGAATAGAATTATCGCAACCAGGAAACCAAATTTTAATGGGAAACCATCAATTATATAATGTAATTGTAACTTCTCACGCTTTTATTATGATTTTTTTTATGGTAATGCCAATTTTATTA
571 6 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ATTATATTTTCTTTTCGGTACACTATCCGGTGTTATAGGAACAATTTTATCTTTACTTATACGCTTGGAATTAGCATATCCGGGAAATCAATTTTTTTTAGGTAATCATCAATTATACAATGTCGTAGTTACAGCCCATGCATTTTTAATGATTTTTTTTATGGTAATGCCTGTTTTAATT
3839 5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA TTTATACTTATTATTTGCTGTTTTAGCAGGAGTTGTAGGAACATATTTTTCTGCTTTAATCAGAATAGAGTTAGCATATCCTGGTAATGGAATTTTTAACGGTAATTTTCAACTTTATAATGTTGTAGTAACAGCGCATGCTTTTATTATGATTTTCTTTTTAGTAATGCCAGCAATGATT
4431 5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ACTATACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTATAGCTCACGCATTCCTAATGACCTTTTTCATGGTTATGCCAGGACTTGTT
8658 5 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA TCTTTATATTCTTTTTGGAGCTATTGCAGGAGTATGTGGTACTGCAGTCTCCGTAGCGATTAGATTAGAACTTGCTCAACCAGGTGCAGGTATACTATCGTCTAATCACCAGTTATACAATGTTTTTATTACAGCTCATGCTATTTTAATGATTTTTTTCATGGTCATGCCTATTCTTATA
180 4 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ACTGTATTTAATATTTGGTGGCTTTTCGGGTATTATAGGTACTATATTCTCTATGATTATAAGATTAGAATTGGCTGCGCCCGGCTCTCAAATATTAGGTGGTAATAGCCAACTTTATAATGTAATTATTACTGCGCATGCTTTTGTTATGATTTTCTTTTTTGTTATGCCTGTTATGATA
1731 4 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA TCTATACCTGATGTTTGCCTTATTCGCAGGTTTAGTAGGTACAGCATTTTCTGTACTTATTAGAATGGAATTAAGTGCACCAGGAGTTCAATACATCAGTGATAACCAGTTATATAATAGTATTATAACAGCTCACGCTATTGTTATGATATTCTTTATGGTTATGCCTGCTATGATC
5856 4 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA GTTATATTTAATATTTAGTATAATAGCAGGTTTAGTTGGTACGTGATTTTCAATAATGATAAGAACAGAATTAGCATATCCAGGTTTTCAATATTTTAATGGAGATTTACAACATTATAATGTGATAATTACAGGACATGCGTTCATTATGATATTTTTCATGGTAATGCCAGCATTAATT
566 3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ATTATATCTTATATTTGCAGCCTTCTCTGGTATAATAGGAACTATTTTTTCTATTATTATAAGAATGGAATTAGCATTTCCAGGAGATCAAGTTTTGGGCGGTAATCATCAACTTTATAATGTTATTGTCACTGCACACGCTTTTTTAATGATATTTTTTATGGTTATGCCCGCTCTTATT
610 3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ATTGTACCTTATATTTGCCTTATTTTCAGGGCTATTAGGTACTGCTTTTTCTGTTTTAATAAGACTTGAATTATCAGGACCTGGTGTACAATACATAGCTGATGACCAACTTTATAACAATATAATTACTGCACATGCAATACTTATGATTTTCTTCATGGTTATGCCTGCTATGATA
4787 3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ATTATATTTAATATTTGGGGGTATCTCAGGTGTAGCAGGGACTGTATTATCCTTATACATACGAATAACACTATCGCACCCAGAAGGAAATTTTTTAGAACACAATCACCACTTATACAATGTTATTGTAACAGGTCATGCTTTTGTTATGATTTTTTTTATGGTAATGCCTGTTCTTATC
10 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ACTATACCTGATGTTTGCCTTATTCGCAGGTTTAGTAGGTACAGCATTTTCTGTACTTATTAGAATGGAATTAAGTGCACCAGGAGTTCAATACATCAGTGATAACCAGTTATATAATAGTATTATAACAGCTCACGCTATTGTTATGATATTCTTTATGGTTATGCCTGCCATGATT
2192 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA TCTTTATCTTATATTTGCATTATTTTCAGGGCTTTTAGGTACAGCTTTTTCTGTTTTAATTAGACTAGAATTATCTGGACCTGGAGTACAATACATAGCAGACAACCAATTATACAACAGTATAATAACTGCGCATGCTATTCTGATGATATTTTTCATGGTAATGCCTGCAATGATA
2217 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA TTTATATATGATTTTTGCAGCCTTTTCAGGAATTGTAGGGACTGTATTTTCAATGTTAATTCGATTTGAATTAGCACATCCAGGACATCAAATTTTATCTGGAAATAACCAATTATACAACGTTATCGTAACGGCACATGCTTTTGTAATGATTTTCTTCATGGTAATGCCTGCATTAATT

In this mock sample, there should be the following 6 species:

  • Caenis pusilla
  • Rheocricotopus
  • Phoxinus phoxinus
  • Hydropsyche pellucidula
  • Synorthocladius semivirens
  • Baetis rhodani

We can see that, in spite of all the filtering we have done so far, there are still many unexpected occurrences in this sample. Most of them have low read counts and could be filtered out by the Low Frequency Noise (LFN) filters.
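
To give a rough feeling for how low these read counts are, the sketch below computes the share of reads of a few ASVs of tpos1 (asv_id and read counts copied from the table above). It is not the vtamR implementation of the LFN filters, the proportions are computed over this subset only, and the cutoff value is a hypothetical number chosen for illustration.

```r
# Read counts in tpos1 for a few ASVs (names are asv_id values)
reads_tpos1 <- c("4" = 16292, "6" = 4303, "2" = 1608,
                 "2264" = 442, "7474" = 4, "6319" = 2)

# Proportion of each ASV among the reads of this subset
prop <- reads_tpos1 / sum(reads_tpos1)

# Hypothetical cutoff, used only for illustration
cutoff <- 0.001

# ASVs whose share of reads falls below the cutoff
low_frequency <- names(prop)[prop < cutoff]
low_frequency
```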

Select the expected ASV and make mock_composition

You can now pick the correct sequences of the expected ASVs in each mock and make the mock_composition file.
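
One possible way to do this in R is sketched below. It uses a toy excerpt of the ASV table (asv_id, tpos1 read counts and ltg_name taken from the table above, sequences replaced by placeholders) and assumes that, after inspection, the most abundant ASV of each expected species was chosen. The selection of asv_id values and the output file name are illustrative assumptions; with real data, the same steps would be applied to asv_table_mock.

```r
# Toy excerpt of the ASV table; sequences are placeholders
asv_table_toy <- data.frame(
  asv_id   = c(1, 2, 3, 4, 5, 6, 2264),
  tpos1    = c(294, 1608, 216, 16292, 338, 4303, 442),
  ltg_name = c("Caenis pusilla", "Rheocricotopus chalybeatus", "Phoxinus phoxinus",
               "Hydropsyche pellucidula", "Synorthocladius semivirens",
               "Baetis rhodani", "Baetis fuscatus"),
  asv      = paste0("PLACEHOLDER_SEQ_", c(1, 2, 3, 4, 5, 6, 2264)),
  stringsAsFactors = FALSE
)

# asv_id of the sequences chosen as expected after inspecting the table
expected_ids <- c(1, 2, 3, 4, 5, 6)

# Keep only the chosen ASVs
picked <- asv_table_toy[asv_table_toy$asv_id %in% expected_ids, ]

# Build the mock_composition table with the documented columns
mock_composition_df <- data.frame(
  sample = "tpos1",
  action = "keep",
  asv    = picked$asv,
  taxon  = picked$ltg_name,
  asv_id = picked$asv_id,
  stringsAsFactors = FALSE
)

# Write the file (hypothetical file name, in a temporary directory)
write.csv(mock_composition_df,
          file.path(tempdir(), "mock_composition.csv"),
          row.names = FALSE)
```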