The mock_composition file defines which ASVs are expected in each mock sample and is a key input for the analysis.
It is provided as a CSV file with the following columns:
sample: name of the mock sample
action:
    keep: ASV expected in the mock that should be retained in the dataset
    tolerate: ASV that may be present in the mock but is not critical to retain (e.g. poorly amplified organism)
asv: ASV sequence
taxon: optional; organism name
asv_id: optional; if both asv and asv_id are provided and conflict, asv_id is ignored
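As a minimal sketch of the format described above, the following base R code builds a two-row mock_composition table and writes it to a CSV file. The taxon names, the shortened sequences, and the output path are invented for illustration; real ASV sequences are full-length.

```r
# Hypothetical example values; real ASV sequences are full-length.
mock_composition_df <- data.frame(
  sample = c("tpos1", "tpos1"),
  action = c("keep", "tolerate"),
  asv    = c("ACGTACGTAC", "TTGACCGGTA"),  # placeholder sequences
  taxon  = c("Species A", "Species B"),    # optional column
  asv_id = c(1L, 2L),                      # optional column
  stringsAsFactors = FALSE
)

# Basic sanity checks on the required columns and the allowed actions
stopifnot(all(c("sample", "action", "asv") %in% names(mock_composition_df)))
stopifnot(all(mock_composition_df$action %in% c("keep", "tolerate")))

# Write the table as CSV (temporary path used here for illustration)
mock_composition_file <- file.path(tempdir(), "mock_composition.csv")
write.csv(mock_composition_df, mock_composition_file, row.names = FALSE)
```

The two checks are only a convenience for catching typos in column names or actions before the file is passed to other functions.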
The mock_composition file is required to run the following functions:
suggest_pcr_error_cutoff
suggest_sample_cutoff
suggest_variant_readcount_cutoffs
compute_asv_specific_cutoff
classify_control_occurrences
These functions are used to classify occurrences in control samples (classify_control_occurrences) and to suggest or compute filtering cutoffs (suggest_pcr_error_cutoff, suggest_sample_cutoff, suggest_variant_readcount_cutoffs, compute_asv_specific_cutoff).
The mock_composition file is also useful, though not required, for the write_asv_table function, if you want to include an output column that highlights expected occurrences in each mock sample.
mock_composition File
I suggest two different methods to construct the mock_composition file:
Using the match_variants_to_mock_species function
Manually, by retrieving the expected ASVs from a prefiltered dataset
match_variants_to_mock_species
This function offers a quick and convenient way to generate a mock_composition file. However, it may occasionally miss some expected occurrences. In such cases, you can manually retrieve the expected ASVs from a prefiltered dataset (see below).
Reference sequences
Prepare one reference sequence for each species expected in the mock samples. Each sequence should span at least 70% of the region targeted by the primers. It can be longer or slightly shorter than the ASV and may include minor mismatches relative to the true sequence.
Reference sequences must be provided in FASTA format, with headers including a valid NCBI taxonomic identifier:
>SequenceName taxID=12345
The taxID must correspond to an entry in the NCBI Taxonomy
database.
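Before running the function, it can help to check that every FASTA header carries a taxID field. The sketch below uses only base R; find_headers_without_taxid is a hypothetical helper (not part of vtamR), and the pattern is based on the header format shown above. It checks the format only, not whether the identifier exists in the NCBI Taxonomy database.

```r
# Hypothetical helper (not part of vtamR): returns the header lines
# that lack a "taxID=<digits>" field.
find_headers_without_taxid <- function(fasta_lines) {
  headers <- fasta_lines[startsWith(fasta_lines, ">")]
  headers[!grepl("taxID=[0-9]+", headers)]
}

# Toy FASTA content; the second header is deliberately invalid.
fasta_lines <- c(
  ">SequenceName taxID=12345",
  "ACGTACGTACGT",
  ">AnotherSequence",
  "TTGACCGGTAAC"
)

bad_headers <- find_headers_without_taxid(fasta_lines)
print(bad_headers)
```

For a real file, fasta_lines could be obtained with readLines() on the path to your reference FASTA.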
Taxonomy file
A taxonomy file is required. The file distributed with the COInr database is appropriate, even for non-COI markers (see TaxAssign reference data base).
Read count data
Provide a read_count_df data frame containing read
counts for each mock sample. For better performance, it is recommended
to use data that has already undergone initial filtering to remove
artefactual ASVs and substantially reduce dataset size (e.g. after
Denoising with SWARM, LFNglobalReadCount,
FilterIndel, FilterCodonStop,
FilterExternalContaminant, and
FilterChimera).
Although this improves speed, the function can also be run on unfiltered data.
The match_variants_to_mock_species function performs a series of steps that ends with writing a mock_composition template file, which should be reviewed and edited if necessary.
match_variants_to_mock_species
We will use files generated in the first part of the Tutorial (up to filter_replicate).
The demo files are included in the vtamR package (hence the use of system.file()). When using your own data, simply replace these with your file paths.
read_count_file corresponds to the output of filter_replicate from the Tutorial.
blast_db and taxonomy are configured as described in the Tutorial.
library(vtamR)
read_count_file <- system.file("extdata/demo/8_filter_replicate.csv", package = "vtamR")
reference_mock_fasta <- system.file("extdata/demo/mock_ncbi.fasta", package = "vtamR")
sampleinfo <- system.file("extdata/demo/sampleinfo_mfzr_plate1.csv", package = "vtamR")
taxonomy <- system.file("extdata/db_test/taxonomy_reduced.tsv", package = "vtamR")
blast_path <- "blastn" # Adapt this if BLAST is not in your PATH
outdir_mock <- "mock_composition"
mock_template <- match_variants_to_mock_species(
  read_count = read_count_file,
  fas = reference_mock_fasta,
  taxonomy = taxonomy,
  sampleinfo = sampleinfo,
  outdir = outdir_mock,
  blast_path = blast_path
)
Note: If BLAST is available in your PATH (see Installation), you can omit the
blast_path argument.
The main output file is
mock_composition_template_to_check.csv,
which serves as a template for the final mock_composition
file.
mock_composition <- file.path(outdir_mock, "mock_composition_template_to_check.csv")
mock_composition_df <- read.csv(mock_composition)
knitr::kable(mock_composition_df, format = "markdown")
| sample | action | asv | taxon | asv_id |
|---|---|---|---|---|
| tpos1 | keep | TCTATATTTCATTTTTGGTGCTTGGGCAGGTATGGTAGGTACCTCATTAAGACTTTTAATTCGAGCCGAGTTGGGTAACCCGGGTTCATTAATTGGGGACGATCAAATTTATAACGTAATCGTAACTGCTCATGCCTTTATTATGATTTTTTTTATAGTGATACCTATTATAATT | Baetis rhodani | 5 |
| tpos1 | keep | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC | Caenis pusilla | 1 |
| tpos1 | keep | CCTTTATTTTATTTTCGGTATCTGATCAGGTCTCGTAGGATCATCACTTAGATTTATTATTCGAATAGAATTAAGAACTCCTGGTAGATTTATTGGCAACGACCAAATTTATAACGTAATTGTTACATCTCATGCATTTATTATAATTTTTTTTATAGTTATACCAATCATAATT | Hydropsyche pellucidula | 3 |
| tpos1 | keep | CCTTTATCTTGTATTTGGTGCCTGGGCCGGAATGGTAGGGACCGCCCTAAGCCTTCTTATTCGGGCCGAACTAAGCCAGCCTGGCTCGCTATTAGGTGATAGCCAAATTTATAATGTTATTGTTACCGCCCACGCCTTCGTAATAATTTTCTTTATAGTCATGCCAATTCTCATT | Phoxinus phoxinus | 6 |
| tpos1 | keep | ACTTTATTTTATTTTTGGTGCTTGATCAGGAATAGTAGGAACTTCTTTAAGAATTCTAATTCGAGCTGAATTAGGTCATGCCGGTTCATTAATTGGAGATGATCAAATTTATAATGTAATTGTAACTGCTCATGCTTTTGTAATAATTTTCTTTATAGTTATACCTATTTTAATT | Rheocricotopus chalybeatus | 2 |
| tpos1 | keep | CTTATATTTTATTTTTGGTGCTTGATCAGGGATAGTGGGAACTTCTTTAAGAATTCTTATTCGAGCTGAACTTGGTCATGCGGGATCTTTAATCGGAGACGATCAAATTTACAATGTAATTGTTACTGCACACGCCTTTGTAATAATTTTTTTTATAGTTATACCTATTTTAATT | Synorthocladius semivirens | 4 |
It contains the most abundant sequence for each ltg_name
identified during taxonomic assignment, replicated across all mock
samples. If mock samples differ in composition, you should remove
entries corresponding to taxa not expected in a given sample.
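As an illustration of this editing step, the sketch below removes one taxon from one mock sample in a toy template. The second sample (tpos2), the taxon names, and the sequences are hypothetical; with real data, you would first read the template CSV with read.csv().

```r
# Toy template: two taxa replicated across two mock samples
# (sample "tpos2", taxa and sequences are hypothetical).
template_df <- data.frame(
  sample = c("tpos1", "tpos1", "tpos2", "tpos2"),
  action = "keep",
  asv    = c("ACGTACGTAC", "TTGACCGGTA", "ACGTACGTAC", "TTGACCGGTA"),
  taxon  = c("Species A", "Species B", "Species A", "Species B"),
  stringsAsFactors = FALSE
)

# Assumption for this sketch: "Species B" was not included in tpos2.
not_expected <- template_df$sample == "tpos2" & template_df$taxon == "Species B"
mock_composition_df <- template_df[!not_expected, ]
print(mock_composition_df)
```

The edited data frame can then be saved with write.csv(..., row.names = FALSE) and used as the final mock_composition file.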
Note: The file does not include species that did not show significant similarity to any ASV.
The idea is to filter the dataset, assign the remaining ASVs of the mock samples to taxa, and then select the expected ASVs by inspecting the resulting table.
I suggest that you start by filtering and denoising your dataset with at least some of the functions used in the Tutorial. This will eliminate most of the erroneous ASVs, making it easier to identify the expected ASVs in your mock samples.
We will use some of the files created in the first part of the Tutorial (up to filter_replicate).
The demo files are included in the vtamR package, hence the use of system.file(). When using your own data, just enter your file names.
read_count_file is the output of filter_replicate from the Tutorial.
blast_db and taxonomy are set up as in the Tutorial.
library(vtamR)
library(dplyr)
read_count_file <- system.file("extdata/demo/8_filter_replicate.csv", package = "vtamR")
taxonomy <- system.file("extdata/db_test/taxonomy_reduced.tsv", package = "vtamR")
sampleinfo <- system.file("extdata/demo/sampleinfo_mfzr_plate1.csv", package = "vtamR")
blast_db <- system.file("extdata/db_test", package = "vtamR")
blast_db <- file.path(blast_db, "COInr_reduced")
blast_path <- "blastn" # Adapt this if BLAST is not in your PATH
Let’s limit the analyses to the mock samples.
read_count_df <- read.csv(read_count_file)
sampleinfo_df <- read.csv(sampleinfo)
# select mock samples in sampleinfo
mock_samples <- sampleinfo_df %>%
  filter(sample_type == "mock")
# select mock samples from read_count_df
read_count_mock <- read_count_df %>%
  filter(sample %in% mock_samples$sample)
assign_taxonomy_ltg will assign all ASVs in the input CSV file or data frame (here, read_count_mock).
See more details of taxonomic assignment here.
Note: If BLAST is in your PATH, you can omit the
blast_path argument.
asv_tax <- assign_taxonomy_ltg(
asv = read_count_mock,
taxonomy = taxonomy,
blast_db = blast_db,
quiet = TRUE,
blast_path = blast_path
)
Make a data frame with ASVs and read counts in the wide format and add their taxonomic assignment.
This format is easier to read for humans than
read_count_df.
See details of write_asv_table here.
asv_table_mock <- write_asv_table(
read_count_mock,
sampleinfo = sampleinfo,
asv_tax = asv_tax,
pool_replicates = TRUE
)
Sort the output by the taxon name and then by decreasing read count.
asv_table_mock <- asv_table_mock %>%
arrange(ltg_name, desc(tpos1))
Let’s see the ASVs present in tpos1.
knitr::kable(asv_table_mock, format = "markdown")
| asv_id | tpos1 | ltg_taxid | ltg_name | ltg_rank | ltg_rank_index | domain_taxid | domain | kingdom_taxid | kingdom | phylum_taxid | phylum | class_taxid | class | order_taxid | order | family_taxid | family | genus_taxid | genus | species_taxid | species | pid | pcov | phit | taxn | seqn | refres | ltgres | asv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 224 | 125 | 6656 | Arthropoda | phylum | 3.0 | 2759 | Eukaryota | 33208 | Metazoa | 6656 | Arthropoda | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 97 | 70 | 70 | 1 | 1 | 8 | 8 | ACTTTATTTCATTTTCGGAACATTTGCAGGAGTTGTAGGAACTTTACTTTCATTATTTATTCGTCTTGAATTAGCTTATCCAGGAAATCAATTTTTTTTAGGAAATCACCAACTTTATAATGTGGTTGTGACAGCACATGCTTTTATCATGATTTTTTTCATGGTTATGCCGATTTTAATC |
| 216 | 79 | 6656 | Arthropoda | phylum | 3.0 | 2759 | Eukaryota | 33208 | Metazoa | 6656 | Arthropoda | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 97 | 70 | 70 | 1 | 1 | 8 | 8 | ACTTTATTTCATTTTCGGAACATTTGCAGGAGTTGTAGGAACTTTACTTTCATTATTTATTCGACTAGAATTAGCTTATCCAGGAAATCAATTTTTTTTAGGAAATCACCAACTTTATAATGTGGTTGTGACAGCACATGCTTTTATCATGATTTTTTTCATGGTTATGCCGATTTTAATC |
| 2202 | 16 | 6656 | Arthropoda | phylum | 3.0 | 2759 | Eukaryota | 33208 | Metazoa | 6656 | Arthropoda | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 90 | 70 | 70 | 3 | 3 | 7 | 8 | TCTTTATTTCATTTTCGGAACATTTGCAGGAGTTGTCGGAACTTTACTTTCATTATTTATTCGTCTGGAATTAGCATACCCAGGAAATCAATTTTTTTTAGGAAACCACCAACTTTATAATGTAGTTGTAACAGCACATGCTTTTATTATGATTTTTTTTATGGTTATGCCAATTTTAATC |
| 2208 | 8 | 6656 | Arthropoda | phylum | 3.0 | 2759 | Eukaryota | 33208 | Metazoa | 6656 | Arthropoda | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 90 | 70 | 70 | 3 | 3 | 7 | 8 | TCTTTATTTCATTTTCGGAACATTTGCAGGAGTTGTCGGAACTTTACTTTCATTATTTATTCGTCTTGAATTAGCATACCCAGGAAATCAATTTTTTTTAGGAAACCACCAACTTTATAATGTAGTTGTAACAGCACATGCTTTTATTATGATTTTTTTTATGGTTATGCCAATTTTAATC |
| 2269 | 423 | 1077837 | Baetis fuscatus | species | 8.0 | 2759 | Eukaryota | 33208 | Metazoa | 6656 | Arthropoda | 50557 | Insecta | 30073 | Ephemeroptera | 172515 | Baetidae | 189838 | Baetis | 1077837 | Baetis fuscatus | 100 | 70 | 70 | 1 | 1 | 8 | 8 | TTTATATTTCATTTTTGGTGCATGATCAGGTATGGTGGGTACTTCCCTTAGTTTATTAATTCGAGCAGAACTTGGTAATCCTGGTTCTTTGATTGGCGATGATCAGATTTATAACGTTATTGTCACTGCCCATGCTTTTATTATGATTTTTTTTATAGTGATACCTATTATAATT |
| 5 | 4066 | 189839 | Baetis rhodani | species | 8.0 | 2759 | Eukaryota | 33208 | Metazoa | 6656 | Arthropoda | 50557 | Insecta | 30073 | Ephemeroptera | 172515 | Baetidae | 189838 | Baetis | 189839 | Baetis rhodani | 100 | 70 | 70 | 1 | 1 | 8 | 8 | TCTATATTTCATTTTTGGTGCTTGGGCAGGTATGGTAGGTACCTCATTAAGACTTTTAATTCGAGCCGAGTTGGGTAACCCGGGTTCATTAATTGGGGACGATCAAATTTATAACGTAATCGTAACTGCTCATGCCTTTATTATGATTTTTTTTATAGTGATACCTATTATAATT |
| 6026 | 12 | 189839 | Baetis rhodani | species | 8.0 | 2759 | Eukaryota | 33208 | Metazoa | 6656 | Arthropoda | 50557 | Insecta | 30073 | Ephemeroptera | 172515 | Baetidae | 189838 | Baetis | 189839 | Baetis rhodani | 97 | 70 | 70 | 1 | 1 | 8 | 8 | TCTATATTTCATTTTTGGTGCTTGGGCAGGTATGGTAGGTACCTCATTAAGACTTTTAATTCGAGCCGAGTTGGGTAACCCGGGTTCATTAATTGGGGACGATCAAATTTATAACGTAATCGTAACTGATCATGGCTTTATTATGATTTTTTTTATAGTGATACCTATTATAATT |
| 1625 | 3 | 189839 | Baetis rhodani | species | 8.0 | 2759 | Eukaryota | 33208 | Metazoa | 6656 | Arthropoda | 50557 | Insecta | 30073 | Ephemeroptera | 172515 | Baetidae | 189838 | Baetis | 189839 | Baetis rhodani | 100 | 70 | 70 | 1 | 1 | 8 | 8 | CCTTTATTTTATTTTTGGTGCTTGGGCAGGTATGGTAGGTACCTCATTAAGACTTTTAATTCGAGCCGAGTTGGGTAACCCGGGTTCATTAATTGGGGACGATCAAATTTATAACGTAATCGTAACTGCTCATGCCTTTATTATGATTTTTTTTATAGTGATACCTATTATAATT |
| 1 | 288 | 1592914 | Caenis pusilla | species | 8.0 | 2759 | Eukaryota | 33208 | Metazoa | 6656 | Arthropoda | 50557 | Insecta | 30073 | Ephemeroptera | 197146 | Caenidae | 197147 | Caenis | 1592914 | Caenis pusilla | 100 | 70 | 70 | 1 | 1 | 8 | 8 | ACTATATTTTATTTTTGGGGCTTGATCCGGAATGCTGGGCACCTCTCTAAGCCTTCTAATTCGTGCCGAGCTGGGGCACCCGGGTTCTTTAATTGGCGACGATCAAATTTACAATGTAATCGTCACAGCCCATGCTTTTATTATGATTTTTTTCATGGTTATGCCTATTATAATC |
| 7452 | 4 | 7149 | Chironomidae | family | 6.0 | 2759 | Eukaryota | 33208 | Metazoa | 6656 | Arthropoda | 50557 | Insecta | 7147 | Diptera | 7149 | Chironomidae | NA | NA | NA | NA | 85 | 70 | 70 | 4 | 4 | 6 | 7 | ACTATATTTTATTTTTGGGGCATGGTCAGGAATAGTTGGTACTTCCCTTAGTATCCTAATTCGAGCTGAACTAGGACATGCCGGCTCCCTAATTGGAGACGATCAAATTTATAATGTAATCGTTACTGCTCATGCTTTTGTAATAATTTTTTTTATAGTTATACCTATTTTAATT |
| 6311 | 2 | 41828 | Chironomoidea | superfamily | 5.5 | 2759 | Eukaryota | 33208 | Metazoa | 6656 | Arthropoda | 50557 | Insecta | 7147 | Diptera | NA | NA | NA | NA | NA | NA | 85 | 70 | 70 | 4 | 4 | 6 | 7 | TTTATATTTTATTTTTGGTATTTGATCAGGTATAGTGGGTACTTCTTTGAGCTTAATAATTCGTACAGAATTAGGTCAGCCAGGTTATTTAATTGGAGATGACCAAATTTATAATGTTATTGTAACTGCTCATGCTTTTATTATAATTTTCTTTATAGTGATACCTATTATAATT |
| 5753 | 4 | 33392 | Endopterygota | cohort | 4.5 | 2759 | Eukaryota | 33208 | Metazoa | 6656 | Arthropoda | 50557 | Insecta | NA | NA | NA | NA | NA | NA | NA | NA | 85 | 70 | 70 | 4 | 4 | 6 | 7 | CCTTTATTTTATTTTTGGTGCTTGATCTGGTATAGTTGGTACTTCTTTAAGAATGCTAATTCGAGCAGAATTAGGACGTCCAGGAACATTTATTGGAGATGACCAAGTTTATAATGTTATTGTAACAGCTCATGCTTTTATTATAATTTTTTTTATAGTTATACCTATTTTAATT |
| 3 | 15419 | 869943 | Hydropsyche pellucidula | species | 8.0 | 2759 | Eukaryota | 33208 | Metazoa | 6656 | Arthropoda | 50557 | Insecta | 30263 | Trichoptera | 41030 | Hydropsychidae | 50443 | Hydropsyche | 869943 | Hydropsyche pellucidula | 100 | 70 | 70 | 1 | 1 | 8 | 8 | CCTTTATTTTATTTTCGGTATCTGATCAGGTCTCGTAGGATCATCACTTAGATTTATTATTCGAATAGAATTAAGAACTCCTGGTAGATTTATTGGCAACGACCAAATTTATAACGTAATTGTTACATCTCATGCATTTATTATAATTTTTTTTATAGTTATACCAATCATAATT |
| 5293 | 2 | 869943 | Hydropsyche pellucidula | species | 8.0 | 2759 | Eukaryota | 33208 | Metazoa | 6656 | Arthropoda | 50557 | Insecta | 30263 | Trichoptera | 41030 | Hydropsychidae | 50443 | Hydropsyche | 869943 | Hydropsyche pellucidula | 97 | 70 | 70 | 1 | 1 | 8 | 8 | CCTTTATTTTATTTTCGGTATCTGATCAGGTCTCGTAGGATCATCACTTAGATTTATTATTCGAATAGAATTAAGAACTCCTGGTAGATTTATTGGCAACGACCAAATTTATAACGTAATCGTAACTGCTCATGCATTTATTATAATTTTTTTTATAGTTATACCAATCATAATT |
| 5749 | 11 | 43808 | Orthocladiinae | subfamily | 6.5 | 2759 | Eukaryota | 33208 | Metazoa | 6656 | Arthropoda | 50557 | Insecta | 7147 | Diptera | 7149 | Chironomidae | NA | NA | NA | NA | 90 | 70 | 70 | 3 | 3 | 7 | 8 | CCTTTATTTTATTTTTGGTGCTTGATCAGGGATAGTGGGAACTTCTTTAAGAATTCTTATTCGAGCTGAACTTGGTCATGCGGGATCTTTAATCGGAGACGATCAAATTTACAATGTAATTGTTACTGCACACGCCTTTGTAATAATTTTTTTTATAGTTATACCTATTTTAATT |
| 395 | 6 | 43808 | Orthocladiinae | subfamily | 6.5 | 2759 | Eukaryota | 33208 | Metazoa | 6656 | Arthropoda | 50557 | Insecta | 7147 | Diptera | 7149 | Chironomidae | NA | NA | NA | NA | 90 | 70 | 70 | 3 | 3 | 7 | 8 | ACTTTATTTTATTTTTGGTGCTTGATCAGGAATAGTAGGAACTTCTTTAAGAATTCTAATTCGAGCTGAATTAGGTCATGCCGGTTCATTAATTGGAGATGATCAAATTTATAATGTAATTGTAACTGCTCATGCTTTTATTATAATTTTTTTTATAGTTATACCAATCATAATT |
| 1682 | 2 | 43808 | Orthocladiinae | subfamily | 6.5 | 2759 | Eukaryota | 33208 | Metazoa | 6656 | Arthropoda | 50557 | Insecta | 7147 | Diptera | 7149 | Chironomidae | NA | NA | NA | NA | 90 | 70 | 70 | 3 | 3 | 7 | 8 | CTTATATTTTATTTTTGGTGCTTGATCAGGGATAGTGGGAACTTCTTTAAGAATTCTTATTCGAGCTGAACTTGGTCATGCGGGATCTTTAATCGGAGACGATCAAATTTACAATGTAATTGTTACTGCACACGCCTTTGTAATAATTTTTTTTATAGTGATACCTATTATAATT |
| 5910 | 2 | 43808 | Orthocladiinae | subfamily | 6.5 | 2759 | Eukaryota | 33208 | Metazoa | 6656 | Arthropoda | 50557 | Insecta | 7147 | Diptera | 7149 | Chironomidae | NA | NA | NA | NA | 90 | 70 | 70 | 3 | 3 | 7 | 8 | TCTATATTTCATTTTTGGTGCTTGATCAGGGATAGTGGGAACTTCTTTAAGAATTCTTATTCGAGCTGAACTTGGTCATGCGGGATCTTTAATCGGAGACGATCAAATTTACAATGTAATTGTTACTGCACACGCCTTTGTAATAATTTTTTTTATAGTTATACCTATTTTAATT |
| 1758 | 8 | 1437201 | Pentapetalae | clade | 4.5 | 2759 | Eukaryota | 33090 | Viridiplantae | 35493 | Streptophyta | 3398 | Magnoliopsida | NA | NA | NA | NA | NA | NA | NA | NA | 100 | 70 | 70 | 1 | 1 | 8 | 8 | TCTATATTTCATCTTCGGTGCCATTGCTGGAGTGATGGGCACATGCTTCTCAGTACTGATTCGTATGGAATTAGCACGACCCGGCGATCAAATTCTTGGTGGGAATCATCAACTTTATAATGTTTTAATAACGGCTCACGCTTTTTTAATGATCTTTTTTATGGTTATGCCGGCGATGATA |
| 6 | 205 | 58324 | Phoxinus phoxinus | species | 8.0 | 2759 | Eukaryota | 33208 | Metazoa | 7711 | Chordata | 186623 | Actinopteri | 7952 | Cypriniformes | 2743726 | Leuciscidae | 42662 | Phoxinus | 58324 | Phoxinus phoxinus | 100 | 70 | 70 | 1 | 1 | 8 | 8 | CCTTTATCTTGTATTTGGTGCCTGGGCCGGAATGGTAGGGACCGCCCTAAGCCTTCTTATTCGGGCCGAACTAAGCCAGCCTGGCTCGCTATTAGGTGATAGCCAAATTTATAATGTTATTGTTACCGCCCACGCCTTCGTAATAATTTTCTTTATAGTCATGCCAATTCTCATT |
| 32 | 205 | 33317 | Protostomia | clade | 2.5 | 2759 | Eukaryota | 33208 | Metazoa | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 80 | 70 | 70 | 4 | 4 | 6 | 7 | ACTATACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTACAGCTCACGCATTCCTAATGATCTTTTTCATGGTTATGCCAGGACTTGTT |
| 2 | 1562 | 1042866 | Rheocricotopus chalybeatus | species | 8.0 | 2759 | Eukaryota | 33208 | Metazoa | 6656 | Arthropoda | 50557 | Insecta | 7147 | Diptera | 7149 | Chironomidae | 611384 | Rheocricotopus | 1042866 | Rheocricotopus chalybeatus | 97 | 70 | 70 | 1 | 1 | 8 | 8 | ACTTTATTTTATTTTTGGTGCTTGATCAGGAATAGTAGGAACTTCTTTAAGAATTCTAATTCGAGCTGAATTAGGTCATGCCGGTTCATTAATTGGAGATGATCAAATTTATAATGTAATTGTAACTGCTCATGCTTTTGTAATAATTTTCTTTATAGTTATACCTATTTTAATT |
| 419 | 9 | 1042866 | Rheocricotopus chalybeatus | species | 8.0 | 2759 | Eukaryota | 33208 | Metazoa | 6656 | Arthropoda | 50557 | Insecta | 7147 | Diptera | 7149 | Chironomidae | 611384 | Rheocricotopus | 1042866 | Rheocricotopus chalybeatus | 97 | 70 | 70 | 1 | 1 | 8 | 8 | ACTTTATTTTATTTTTGGTGCTTGATCAGGAATAGTAGGAACTTCTTTAAGAATTCTAATTCGAGCTGAATTAGGTCATGCCGGTTCATTAATTGGAGATGATCAAATTTATAATGTAATTGTAACTGCTCATGCTTTTGTAATAATTTTCTTTATAGTTATACCAATCATAATT |
| 1793 | 2 | 1042866 | Rheocricotopus chalybeatus | species | 8.0 | 2759 | Eukaryota | 33208 | Metazoa | 6656 | Arthropoda | 50557 | Insecta | 7147 | Diptera | 7149 | Chironomidae | 611384 | Rheocricotopus | 1042866 | Rheocricotopus chalybeatus | 97 | 70 | 70 | 1 | 1 | 8 | 8 | TCTATATTTCATTTTTGGTGCTTGATCAGGAATAGTAGGAACTTCTTTAAGAATTCTAATTCGAGCTGAATTAGGTCATGCCGGTTCATTAATTGGAGATGATCAAATTTATAATGTAATTGTAACTGCTCATGCTTTTGTAATAATTTTCTTTATAGTTATACCTATTTTAATT |
| 2304 | 7 | 1216507 | Simulium balcanicum | species | 8.0 | 2759 | Eukaryota | 33208 | Metazoa | 6656 | Arthropoda | 50557 | Insecta | 7147 | Diptera | 7190 | Simuliidae | 7191 | Simulium | 1216507 | Simulium balcanicum | 100 | 70 | 70 | 1 | 1 | 8 | 8 | TTTATATTTTATTTTTGGAGCCTGAGCTGGAATAGTAGGTACTTCCCTTAGTATACTTATTCGAGCCGAATTAGGACACCCAGGCTCTCTAATTGGAGACGACCAAATTTATAATGTAATTGTTACTGCTCATGCTTTTGTAATAATTTTTTTTATAGTTATGCCAATTATAATT |
| 2300 | 10 | 697243 | Simulium lineatum | species | 8.0 | 2759 | Eukaryota | 33208 | Metazoa | 6656 | Arthropoda | 50557 | Insecta | 7147 | Diptera | 7190 | Simuliidae | 7191 | Simulium | 697243 | Simulium lineatum | 100 | 70 | 70 | 1 | 1 | 8 | 8 | TTTATATTTTATTTTTGGAGCCTGAGCTGGAATAGTAGGTACTTCCCTTAGTATACTTATTCGAGCCGAATTAGGACACCCAGGATCTCTAATTGGAGACGACCAAATTTATAATGTAATTGTTACTGCTCATGCTTTTGTAATAATTTTTTTTATAGTTATACCAATTATAATT |
| 574 | 2 | 1419339 | Simulium pseudequinum | species | 8.0 | 2759 | Eukaryota | 33208 | Metazoa | 6656 | Arthropoda | 50557 | Insecta | 7147 | Diptera | 7190 | Simuliidae | 7191 | Simulium | 1419339 | Simulium pseudequinum | 100 | 70 | 70 | 1 | 1 | 8 | 8 | ATTATATTTTATTTTTGGGGCCTGAGCAGGAATAGTAGGTACTTCCCTTAGTATACTTATTCGAGCTGAATTAGGACACCCAGGATCTTTAATTGGTGATGACCAAATTTATAATGTAATTGTTACAGCTCATGCTTTCGTAATAATTTTTTTTATAGTTATACCAATTATAATT |
| 4 | 314 | 611678 | Synorthocladius semivirens | species | 8.0 | 2759 | Eukaryota | 33208 | Metazoa | 6656 | Arthropoda | 50557 | Insecta | 7147 | Diptera | 7149 | Chironomidae | 611392 | Synorthocladius | 611678 | Synorthocladius semivirens | 97 | 70 | 70 | 1 | 1 | 8 | 8 | CTTATATTTTATTTTTGGTGCTTGATCAGGGATAGTGGGAACTTCTTTAAGAATTCTTATTCGAGCTGAACTTGGTCATGCGGGATCTTTAATCGGAGACGATCAAATTTACAATGTAATTGTTACTGCACACGCCTTTGTAATAATTTTTTTTATAGTTATACCTATTTTAATT |
| 602 | 151 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | ATTGTACCTTATATTTGCCTTATTTTCAGGGCTATTAGGTACTGCTTTTTCTGTTTTAATAAGACTTGAATTATCAGGACCTGGTGTACAATACATAGCTGATAACCAACTTTATAACAGTATAATTACTGCACATGCAATACTTATGATTTTCTTCATGGTTATGCCTGCTATGATA |
| 175 | 85 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | ACTCTATTTAATATTTGCTGCATTTTCAGGGGTTATAGGAACAATATTTTCTATAATTATAAGAATGGAACTTGCTTATCCAGGTGATCAAATATTGAATGGTAATCACCAACTTTATAATGTTATTGTAACTGCTCATGCATTTGTAATGATTTTTTTTATGGTTATGCCTGCCTTGATT |
| 562 | 47 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | ATTATACCTTATCTTTTCTCTTTTCTCAGGTTTACTTGGAACAGCATTTTCAGTTTTAATAAGACTTGAATTATCTGGACCTGGTGTTCAGTACATAGCAGACAATCAGTTATACAATAGTATTATTACAGCACACGCAATATTAATGATTTTCTTTATGGTTATGCCAGCAATGATT |
| 6199 | 38 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | TCTTTATATTATTTACGGCTTTCTCATAGGATTGGTTGGTACATTTTTTTCCGCTGTCATTCGTATTCAACTCATGTACCCTGGTTCGTTGTTTTTGGGTGGTAATTACCATTATTATAATACTGTAATTACAGCGCACGCACTTGTGATAATTTTTTTTATGGTCATACCAGTGTTGATT |
| 2378 | 30 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | TCTTTACTTAATTTTTGGTGCTATTTCTGGTGTAGCTGGAACTGCTTTATCACTTTACATCAGATTTACATTATCTCAACCAAACTCGAGTTTTTTAGAATATAACCACCATTTATATAATGTAATTGTTACAGGACATGCACTTATAATGGTTTTTTTTGTAGTAATGCCTATTTTAATT |
| 254 | 24 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | ACTTTATTTTATTTTCGGAGCGTGGTCGGGGATGGTAGGCACATCTCTGAGTCTTTTAATTCGAGCCGAATTGGGTAATCCTGGTTCACTAATTGGGGATGACCAGATTTACAACGTTATTGTAACAGCCCATGCTTTTATTATGATTTTTTTTATAGTAATGCCAATTATGATT |
| 7427 | 19 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | AATGTATCTAATATTTGCAATATTTGCAGGCATTGTTGGTGGACTAATGTCAGTGATACTCAGGCTAGAACTCGCACAACCTGGTAACCAGTTTTTAGGCGGCGATCATCAATTTTATAATGTTATGCTCACTGCTCACGCACTTGTCATGGTATTTTTTATGATTATGCCTGGGCTTTTC |
| 8492 | 16 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | TTTATATTTAATATTTGGGCTAATAGCGGGTGTAATAGGAACGTTATTTTCGATATTAATTAGATTAGAATTAGCCTATCCAGGGAATCAATATTTTTTGGGAGATCATCAATTTTATAATGTTGTTGTTACATCACATGCGTTTATTATGATTTTTTTTATGGTAATGCCGGCATTTGTT |
| 1716 | 15 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | CTTGTACATGATTTTCGGAACGTTGGCCGGAGTGGTCGGAACGACGTTGTCGGTATGGATGCGAATGGAATTGGCGGCACCGGGAGTGCAAGCATTGTCGGGAAACCATCAGTTGTATAACGTGATGGTGACGGCACATGCCTTCATCATGATTTTCTTCTTCGTGATGCCCTTTTTGATT |
| 651 | 14 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | CCTATACCTAGTATTTGCAGTATTTGCAGGTATAATTGGTACAGCATTTTCAGTACTAATTCGTATGGAACTTGCAGCACCAGGAGTACAATATCTTAACGGAGATCACCAACTTTATAATGTAGTTATTACTGCACATGCGCTAATTATGATTTTCTTTATGGTTATGCCTGCTCTCGTG |
| 1641 | 14 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | CTTATATCTATTATTTGCAGGCGTTTCGGGTATTGCCGGCACTGTTTTATCTTTATATATACGAGCTACACTAGCAACTCCTGCTTCCAATTTTTTAAGCAAAAATCATCACTTGTATAACGTAATAGTGACAGGCCATGCGTTTTTAATGATTTTTTTTTTAGTAATGCCTGCTCTTATA |
| 631 | 11 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | ATTGTACTTGATTTATGGGGGATTTGCTGGTTTAATTGGAACGATGTTCTCTGTTCTAATAAGAATGGAACTATCATCACCCGGTAATACTATACTAGCTGGTAACTATCAATACTATAATGTTATAGTAACTGCGCATGCTTTCATTATGATCTTCTTTTTTGTTATGCCTGCTATGATG |
| 4784 | 9 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | ATTATACCTTATATTTTCTCTGTTTTCGGGTTTACTTGGAACCGCTTTTTCAGTTTTAATAAGACTTGAATTATCTGGACCTGGTGTTCAGTACATAGCAGATAACCAATTATACAATAGTATAATTACAGCACACGCGATACTTATGATTTTCTTTATGGTTATGCCAGCAATGATT |
| 2195 | 8 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | TCTTTACCTGATCTTCGCCGTATTCTCAGGAATGATTGGTACAGCATTCAGTGTAATTATTCGAATGGAACTTGCTGCGCCCGGTGTGCAATACCTTCACGGTAACCACCAACTATATAACGTAATTATTACAGCCCACGCCTTCCTAATGATCTTTTTCATGGTTATGCCTGGTCTTGTG |
| 7447 | 8 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | ACTATACTTAATCTTTGCATTATTTTCTGGATTATTAGGTACAGCGTTTTCTGTTCTTATAAGATTAGAATTAAGTGGGCCAGGTGTTCAATATATAGCGGACAATCAACTATACAACAGTGTTATTACAGCACACGCTATCTTAATGATATTCTTTATGGTTATGCCTGCAATGATA |
| 12 | 6 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | AATGTATCTAATCTTTGGAGGTTTTTCTGGTATTATTGGAACAGCTTTATCTATTTTAATCAGAATAGAATTATCGCAACCAGGAAACCAAATTTTAATGGGAAACCATCAATTATATAATGTAATTGTAACTTCTCACGCTTTTATTATGATTTTTTTTATGGTAATGCCAATTTTATTA |
| 576 | 6 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | ATTATATTTTCTTTTCGGTACACTATCCGGTGTTATAGGAACAATTTTATCTTTACTTATACGCTTGGAATTAGCATATCCGGGAAATCAATTTTTTTTAGGTAATCATCAATTATACAATGTCGTAGTTACAGCCCATGCATTTTTAATGATTTTTTTTATGGTAATGCCTGTTTTAATT |
| 3844 | 5 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | TTTATACTTATTATTTGCTGTTTTAGCAGGAGTTGTAGGAACATATTTTTCTGCTTTAATCAGAATAGAGTTAGCATATCCTGGTAATGGAATTTTTAACGGTAATTTTCAACTTTATAATGTTGTAGTAACAGCGCATGCTTTTATTATGATTTTCTTTTTAGTAATGCCAGCAATGATT |
| 4436 | 5 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | ACTATACCTTATCTTCGCAGTATTCTCAGGAATGCTAGGAACTGCTTTTAGTGTTCTTATTCGAATGGAACTAACATCTCCAGGTGTACAATACCTACAGGGAAACCACCAACTTTACAATGTAATCATTATAGCTCACGCATTCCTAATGACCTTTTTCATGGTTATGCCAGGACTTGTT |
| 185 | 4 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | ACTGTATTTAATATTTGGTGGCTTTTCGGGTATTATAGGTACTATATTCTCTATGATTATAAGATTAGAATTGGCTGCGCCCGGCTCTCAAATATTAGGTGGTAATAGCCAACTTTATAATGTAATTATTACTGCGCATGCTTTTGTTATGATTTTCTTTTTTGTTATGCCTGTTATGATA |
| 1736 | 4 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | TCTATACCTGATGTTTGCCTTATTCGCAGGTTTAGTAGGTACAGCATTTTCTGTACTTATTAGAATGGAATTAAGTGCACCAGGAGTTCAATACATCAGTGATAACCAGTTATATAATAGTATTATAACAGCTCACGCTATTGTTATGATATTCTTTATGGTTATGCCTGCTATGATC |
| 5849 | 4 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | GTTATATTTAATATTTAGTATAATAGCAGGTTTAGTTGGTACGTGATTTTCAATAATGATAAGAACAGAATTAGCATATCCAGGTTTTCAATATTTTAATGGAGATTTACAACATTATAATGTGATAATTACAGGACATGCGTTCATTATGATATTTTTCATGGTAATGCCAGCATTAATT |
| 8470 | 4 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | TCTTTATATTCTTTTTGGAGCTATTGCAGGAGTATGTGGTACTGCAGTCTCCGTAGCGATTAGATTAGAACTTGCTCAACCAGGTGCAGGTATACTATCGTCTAATCACCAGTTATACAATGTTTTTATTACAGCTCATGCTATTTTAATGATTTTTTTCATGGTCATGCCTATTCTTATA |
| 571 | 3 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | ATTATATCTTATATTTGCAGCCTTCTCTGGTATAATAGGAACTATTTTTTCTATTATTATAAGAATGGAATTAGCATTTCCAGGAGATCAAGTTTTGGGCGGTAATCATCAACTTTATAATGTTATTGTCACTGCACACGCTTTTTTAATGATATTTTTTATGGTTATGCCCGCTCTTATT |
| 615 | 3 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | ATTGTACCTTATATTTGCCTTATTTTCAGGGCTATTAGGTACTGCTTTTTCTGTTTTAATAAGACTTGAATTATCAGGACCTGGTGTACAATACATAGCTGATGACCAACTTTATAACAATATAATTACTGCACATGCAATACTTATGATTTTCTTCATGGTTATGCCTGCTATGATA |
| 4790 | 3 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | ATTATATTTAATATTTGGGGGTATCTCAGGTGTAGCAGGGACTGTATTATCCTTATACATACGAATAACACTATCGCACCCAGAAGGAAATTTTTTAGAACACAATCACCACTTATACAATGTTATTGTAACAGGTCATGCTTTTGTTATGATTTTTTTTATGGTAATGCCTGTTCTTATC |
| 15 | 2 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | ACTATACCTGATGTTTGCCTTATTCGCAGGTTTAGTAGGTACAGCATTTTCTGTACTTATTAGAATGGAATTAAGTGCACCAGGAGTTCAATACATCAGTGATAACCAGTTATATAATAGTATTATAACAGCTCACGCTATTGTTATGATATTCTTTATGGTTATGCCTGCCATGATT |
| 2197 | 2 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | TCTTTATCTTATATTTGCATTATTTTCAGGGCTTTTAGGTACAGCTTTTTCTGTTTTAATTAGACTAGAATTATCTGGACCTGGAGTACAATACATAGCAGACAACCAATTATACAACAGTATAATAACTGCGCATGCTATTCTGATGATATTTTTCATGGTAATGCCTGCAATGATA |
| 2222 | 2 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | TTTATATATGATTTTTGCAGCCTTTTCAGGAATTGTAGGGACTGTATTTTCAATGTTAATTCGATTTGAATTAGCACATCCAGGACATCAAATTTTATCTGGAAATAACCAATTATACAACGTTATCGTAACGGCACATGCTTTTGTAATGATTTTCTTCATGGTAATGCCTGCATTAATT |
In this mock sample, there should be the following 6 species:
Baetis rhodani
Caenis pusilla
Hydropsyche pellucidula
Phoxinus phoxinus
Rheocricotopus chalybeatus
Synorthocladius semivirens
We can see that, in spite of all the filtering done so far, there are still many unexpected occurrences in this sample.
Most of them have low read counts and could be filtered out by Low Frequency Noise Filters such as filter_occurrence_read_count, filter_occurrence_sample, and filter_occurrence_variant.
You can now pick the correct sequences of the expected ASVs in each mock and make the mock_composition file.
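One possible way to do this selection in code is sketched below. It assumes that the expected ASV of each species is the most abundant ASV assigned to that species, which is the same heuristic as the one described for the template file. The toy table reuses a few asv_id values, read counts, and taxon names from the table above; the sequences are shortened placeholders, and expected_species is an illustrative subset.

```r
# Toy subset of the wide ASV table; asv_id, read counts and names are taken
# from the table above, sequences are shortened placeholders.
asv_table_toy <- data.frame(
  asv_id   = c(5, 6026, 1, 2269, 6),
  tpos1    = c(4066, 12, 288, 423, 205),
  ltg_name = c("Baetis rhodani", "Baetis rhodani", "Caenis pusilla",
               "Baetis fuscatus", "Phoxinus phoxinus"),
  asv      = c("TCTATA", "TCTATT", "ACTATA", "TTTATA", "CCTTTA"),
  stringsAsFactors = FALSE
)
expected_species <- c("Baetis rhodani", "Caenis pusilla", "Phoxinus phoxinus")

# Keep ASVs assigned to an expected species, sort by taxon name and
# decreasing read count, then keep the first (most abundant) ASV per species
candidates <- asv_table_toy[asv_table_toy$ltg_name %in% expected_species, ]
candidates <- candidates[order(candidates$ltg_name, -candidates$tpos1), ]
picked <- candidates[!duplicated(candidates$ltg_name), ]

# Assemble the mock_composition table for this sample
mock_composition_df <- data.frame(
  sample = "tpos1",
  action = "keep",
  asv    = picked$asv,
  taxon  = picked$ltg_name,
  asv_id = picked$asv_id,
  stringsAsFactors = FALSE
)
print(mock_composition_df)
```

A manual check of the result remains advisable, since an expected ASV may have been assigned only to a higher taxonomic rank and would then be missed by a species-name match.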