Emese Meglécz; Software

vtamR: Validation and Taxonomic Assignation of Metabarcoding data in R

Overview

vtamR is an R-based reimplementation of the VTAM metabarcoding pipeline, designed to provide greater accessibility and modularity.

It includes all core steps — from raw fastq processing to validated ASV tables — while optimizing filtering parameters using mock and negative controls. Built on widely used tools (e.g., BLAST, VSEARCH, CUTADAPT, SWARM), vtamR offers flexible workflows.

Fully cross-platform, vtamR enables reproducible and adaptable metabarcoding analyses for diverse datasets.

Documentation

Reference

Meglécz, E. (2025). vtamR: An R Package for Reproducible and Modular Metabarcoding Analysis. FROGSDays 2025, Toulouse, December 2025. Poster

COInr and mkCOInr

Overview

mkCOInr is a series of Perl scripts for creating COInr, a large, comprehensive COI database built from NCBI-nucleotide and BOLD.

COInr is freely available and can be easily downloaded from Zenodo.

mkCOInr also allows users to customize the database.

Major features of the creation of COInr:

Mass download of sequences and their taxonomic lineages from NCBI-nucleotide and BOLD databases
TaxIDs are used to avoid problems with homonyms and synonyms
Creation of a coherent taxID system. The hierarchical structure of the NCBI taxIDs is completed if necessary with new, negative taxIDs.
When adding sequences with unknown taxIDs, taxon names are matched to already existing taxonomic lineages in the database to identify a correct existing taxID, or to assign a new one.
Taxonomically aware dereplication
Creation of a ready-to-use database in BLAST, RDP_classifier, QIIME or a FULL tsv format
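The idea of completing the taxID system with new, negative taxIDs can be illustrated with a minimal Python sketch. It is not the mkCOInr code (which is written in Perl and matches whole taxonomic lineages rather than bare names); the class name `TaxidRegistry`, its method and the example taxIDs are hypothetical and chosen only for illustration.

```python
class TaxidRegistry:
    """Toy registry: reuse a known taxID, otherwise hand out a new negative one.

    Assumption: taxa are looked up by name only. A real implementation would
    compare lineages so that homonyms are not merged.
    """

    def __init__(self, name_to_taxid):
        self.name_to_taxid = dict(name_to_taxid)
        lowest = min(self.name_to_taxid.values(), default=0)
        # New taxIDs continue below the lowest negative taxID already in use.
        self._next_new = min(lowest, 0) - 1

    def get_or_create(self, taxon_name):
        if taxon_name not in self.name_to_taxid:
            self.name_to_taxid[taxon_name] = self._next_new
            self._next_new -= 1
        return self.name_to_taxid[taxon_name]


# Made-up taxID for illustration only.
registry = TaxidRegistry({"Proterorhinus": 1000})
known = registry.get_or_create("Proterorhinus")
new_first = registry.get_or_create("Unlisted genus")
new_second = registry.get_or_create("Another unlisted genus")
```

Because positive numbers are left to NCBI, negative values make it obvious which taxIDs were created locally.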

COInr

Is not specific to a particular region of the COI gene (sequences can be partial).
All cellular organisms are included, even Bacteria.
Sequences with incomplete lineages (e.g. assigned to a family without further precision) are present in the database
Taxa are taken into account only with correct Latin name formats (e.g. instead of 'Proterorhinus sp. BOLD:EUFWF4948-19', the sequence is assigned to the genus Proterorhinus without a species name)
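The last point, keeping only the part of a name that follows a correct Latin format, can be sketched in Python as follows. This is an illustration under stated assumptions, not the rule set used by mkCOInr: the regular expressions, the `PLACEHOLDERS` list and the function name `clean_taxon_name` are hypothetical.

```python
import re

# Assumed formats: a capitalised first word, optionally followed by a lowercase epithet.
NAME_RE = re.compile(r"^[A-Z][a-z]+$")
EPITHET_RE = re.compile(r"^[a-z]+(-[a-z]+)?$")
# Open-nomenclature markers that do not name a species (illustrative list).
PLACEHOLDERS = {"sp", "spp", "cf", "aff", "nr", "gen", "indet"}


def clean_taxon_name(raw_name):
    """Return the longest prefix of raw_name that looks like a Latin name.

    The result is 'Genus species', a single capitalised name, or None when
    even the first word does not look like a Latin name.
    """
    words = raw_name.strip().split()
    if not words or not NAME_RE.match(words[0]):
        return None
    first = words[0]
    if len(words) == 1:
        return first
    second = words[1]
    if second.rstrip(".") in PLACEHOLDERS or not EPITHET_RE.match(second):
        return first
    return f"{first} {second}"


cleaned = clean_taxon_name("Proterorhinus sp. BOLD:EUFWF4948-19")
```

With the example from the list above, `cleaned` holds only the genus name, so the sequence would be attached at genus level.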

The database can be used directly for similarity-based taxonomic assignations of metabarcoding data with any COI marker (primer pairs) of any geographical regions or target group.

Alternatively, the database can be used as a starting point to create smaller, more specific custom databases. Sequences can be selected for:

A particular gene region (amplicon of a given primer pair)
List of taxa (sequences of a taxon list can be either selected or eliminated)
User-defined minimal taxonomic resolution

Reference

Meglécz, E. 2023. COInr and mkCOInr: Building and customizing a non-redundant barcoding reference database from BOLD and NCBI using a semi-automated pipeline. Molecular Ecology Resources, 23, 933-945. Paper

NSDPY: NCBI Sequence Downloader

Overview

Downloading large batches of DNA sequences can be useful to create custom databases containing for example sequences of a particular genomic region or a group of organisms. These sequences can be found on NCBI databases and accessed via a web browser (GUI) or directly via NCBI API. While the GUI is user-friendly, it lacks certain functionalities. On the other extreme, the use of the API is flexible but requires coding knowledge.

NSDPY is a Python package that combines flexibility and ease of use for downloading large numbers of DNA sequences. It includes several taxonomic and filtering options, such as batch downloading sequences for a list of taxa, downloading sequences together with their taxonomic lineage, or filtering CDS sequences for a specific gene.

NSDPY is available on PyPI. It is written to minimize dependencies on other packages and to be run directly from the terminal with simple command lines, so that most users can use it without prior coding experience.

Reference

Hebert, R., & Meglécz, E. 2022. NSDPY: A python package to download DNA sequences from NCBI. SoftwareX, 18, 101038. doi: 10.1016/j.softx.2022.101038

VTAM: Validation and Taxonomic Assignation of Metabarcoding data

Overview

Metabarcoding studies should be carefully designed to minimize false positive and false negative occurrences. The use of internal controls, replicates, and several overlapping markers is expected to improve the bioinformatics data analysis.

VTAM is a tool to perform all steps of data curation from raw fastq data to taxonomically assigned ASV (Amplicon Sequence Variant or simply variant) table. It addresses all known technical error types and includes other features rarely present in existing pipelines for validating metabarcoding data: Filtering parameters are obtained from internal control samples; cross-sample contamination and tag-jump are controlled; technical replicates are used to ensure repeatability; it handles data obtained from several overlapping markers.
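How technical replicates can be used to enforce repeatability is sketched below in Python. This is a simplified illustration, not VTAM's implementation: the function name `filter_by_replicates`, the `min_replicates` parameter and the layout of the input dictionary are assumptions made for the example.

```python
from collections import defaultdict


def filter_by_replicates(read_counts, min_replicates=2):
    """Keep occurrences of a variant in a sample only if repeatable.

    read_counts maps (variant, sample, replicate) -> read count.
    An occurrence is kept when the variant has reads in at least
    min_replicates replicates of the same sample.
    """
    replicates_seen = defaultdict(set)
    for (variant, sample, replicate), count in read_counts.items():
        if count > 0:
            replicates_seen[(variant, sample)].add(replicate)
    return {
        key: count
        for key, count in read_counts.items()
        if len(replicates_seen[(key[0], key[1])]) >= min_replicates
    }


# Toy data: v1 is seen in two replicates of s1, v2 in only one.
counts = {
    ("v1", "s1", "rep1"): 120,
    ("v1", "s1", "rep2"): 95,
    ("v2", "s1", "rep1"): 3,
}
kept = filter_by_replicates(counts, min_replicates=2)
```

In this toy dataset only the two `v1` occurrences remain in `kept`; the single-replicate occurrence of `v2` is treated as non-repeatable and dropped.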

Two datasets were analysed by VTAM and the results were compared to those obtained with a pipeline based on DADA2. The false positive occurrences in samples were considerably higher when curated by DADA2, which is likely due to the lack of control for tag-jump and cross-sample contamination.

VTAM is a robust tool to validate metabarcoding data and improve traceability, reproducibility, and comparability between runs and datasets.

Reference

González, A., Dubut, V., Corse, E., Mekdad, R., Dechatre, T., Castet, U., Hebert, R., & Meglécz, E. (2023). VTAM: A robust pipeline for validating metabarcoding data using controls. Computational and Structural Biotechnology Journal, 21, 1151–1156. Paper

QDD

Overview

In microsatellite development, high-throughput sequencing has replaced the classical cloning-based methods. In this process, the first two versions of QDD played an important role by handling the essential bioinformatics steps leading from raw sequences to primer design.

The original version of QDD aimed to extract the best (putative) markers from a few Megabases (5-500 Mb) of 454 pyrosequences, since back in 2010 only this sequencing platform provided sufficiently long reads (300-500 bases).

QDD1 treated all bioinformatics steps from raw sequences all the way to obtaining PCR primers: sorting sequences by tag, removing adapters/vectors, detection of microsatellites, detection of redundancy/possible mobile element association, and primer design.

QDD2 relaxed the primer design conditions, and let the users choose among more markers based mainly on the target region pattern.

QDD3 (current version) aims to improve primer design in the following ways:

The current read length of Illumina sequencing (100-250 bp) is still slightly short for microsatellite marker development. However, sequencing hundreds of Gbases of DNA has become affordable, and thus, at least for genomes that are not too large or repetitive, sequences can be assembled into contigs. QDD3 does NOT do de novo assembly, but it can take contigs (scaffolds/chromosomes) as input and extracts microsatellites with their flanking regions for primer design.
Although the all-against-all comparison of the sequences in QDD versions 1 and 2 could pinpoint some of the putative interspersed or tandem repetitive regions, a comparison to known repetitive elements via RepeatMasker is now also available in the Linux command line version and in the virtual machine.
In previous versions, the automatic selection of the 'best' primer pair was based arbitrarily on the penalty score of the primer pairs calculated by Primer3 (http://primer3.sourceforge.net/). However, based on our wet lab results, this score is not a good indicator of PCR success. The choice of one primer pair per locus now depends on the results of our wet lab tests and provides a more meaningful selection.
Multi-threading for BLAST and RepeatMasker is possible.
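Extracting microsatellites together with their flanking regions from a contig can be illustrated with a short Python sketch. It is not the QDD code: it detects only perfect repeats with a regular expression, and the function name `find_microsatellites` and the parameters `min_repeats` and `flank` are hypothetical.

```python
import re


def find_microsatellites(seq, min_repeats=5, flank=20):
    """Return (motif, start, end, left_flank, right_flank) tuples.

    Only perfect tandem repeats of 1-6 bp motifs are reported. The lazy
    quantifier makes the regex try the shortest motif first, so 'AAAAAA'
    is reported as a repeat of 'A' rather than of 'AA' or 'AAA'.
    """
    seq = seq.upper()
    pattern = re.compile(r"([ACGT]{1,6}?)\1{%d,}" % (min_repeats - 1))
    results = []
    for match in pattern.finditer(seq):
        start, end = match.span()
        left = seq[max(0, start - flank):start]
        right = seq[end:end + flank]
        results.append((match.group(1), start, end, left, right))
    return results


# Toy contig: six AC repeats with short flanks.
hits = find_microsatellites("GGT" + "AC" * 6 + "TTG", min_repeats=5, flank=2)
```

The flanking sequences returned for each hit are the regions in which primers would later be designed.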

QDD3 is currently available in two different forms:

A command line version that can be run on both Linux and Windows operating systems.
QDD integrated into a Galaxy server to provide an easy-to-use interface and many useful sequence treatment tools.

QDD scripts can either be downloaded directly, in which case they should be installed together with all the third-party programs, or be downloaded pre-installed (together with the third-party programs) in a ready-to-use virtual machine (net.imbe.fr/~emeglecz/qdd_download). In both cases, they can be run from Galaxy or from a command line.

Reference

Meglécz, E., Costedoat, C., Dubut, V., Gilles, A., Malausa, T., Pech, N., Martin, J.F. 2010. QDD: a user-friendly program to select microsatellite markers and design primers from large sequencing projects. Bioinformatics, 26: 403-404. Paper
Meglécz, E., Pech, N., Gilles, A., Dubut, V., Hingamp, P., Trilles, A., Grenier, R. and Martin, JF. 2014. QDD version 3.1: A user friendly computer program for microsatellite selection and primer design revisited: experimental validation of variables determining genotyping success rate. Molecular Ecology Resources, doi: 10.1111/1755-0998.12271. Abstract

SESAME

Overview

SESAME is a user-friendly web application package for analyzing amplicon sequences obtained through NGS technologies. It was tested extensively with Mozilla Firefox 3.5 and Chrome web browsers. It is designed to provide individual amplicon alignments so that users can easily validate alleles and distinguish them from sequencing errors and artifacts. It includes automatic sequence assignation to multiple loci and samples via oligonucleotide tags. An assistant guides the user through input data upload and the automatic sequence analysis steps. It provides an intuitive point-and-click interface to validate sequences as alleles from amplicon sequence alignments. All data are stored in a relational database that users can query or filter through an intuitive interface. The results are exported as genotypes or sequences of genetic variants.

Reference

Meglécz, E., Piry, S., Desmarais, E., Galan, M., Gilles, A., Guivier, E., Pech, N. and Martin, JF. 2011. SESAME (SEquence Sorter & AMplicon Explorer): Genotyping based on high-throughput multiplex amplicon sequencing. Bioinformatics, 27: 277-278. Paper

MicroFamily

Overview

Microsatellite flanking regions are not necessarily unique sequences, but they may group into sequence families. This phenomenon seems to be widespread in Lepidoptera and also occurs in many other insect species (Meglécz et al. 2004). These microsatellites are likely to give multiple banding patterns during PCR amplifications, which can be very difficult to interpret. Therefore, identifying sequences that cluster together prior to primer design can save considerable time and money.

MicroFamily is a program designed for identifying flanking region similarities between different microsatellite sequences obtained from screening partial genomic libraries.

As a preparation for sequence comparison, sequences are edited by (i) replacing all characters other than ACGT with N, (ii) deleting the extremities if they contain more than two Ns in the ten most extreme base pairs, and (iii) removing vector contamination if the sequence produces a BLAST hit against the UniVec vector database of NCBI (ftp://ftp.ncbi.nih.gov/pub/UniVec/). This last step is not designed for precise removal of all vectors, but to avoid artificial similarities caused by vector contamination. The E-value (a parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size) for the BLAST against the vector database can be specified by the user. Microsatellites with motifs of 1-6 bp are identified. The minimal number of repeats for single base pair motifs and for all other motifs can be defined by the user.
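Editing steps (i) and (ii) can be sketched in Python as follows. The description does not state how much of an N-rich extremity is deleted, so the sketch assumes one possible reading: bases are trimmed one at a time until the ten outermost bases hold at most two Ns. The function name `clean_sequence` and its parameters are hypothetical, and vector removal (step iii) is not covered.

```python
import re


def clean_sequence(seq, window=10, max_n=2):
    """Mask non-ACGT characters with N, then trim N-rich extremities.

    Assumption: an extremity is trimmed base by base until its outermost
    `window` bases contain no more than `max_n` Ns.
    """
    seq = re.sub(r"[^ACGT]", "N", seq.upper())
    # Trim the start while its first `window` bases hold too many Ns.
    while seq and seq[:window].count("N") > max_n:
        seq = seq[1:]
    # Same rule for the end of the sequence.
    while seq and seq[-window:].count("N") > max_n:
        seq = seq[:-1]
    return seq


masked = clean_sequence("acgtRacgtacgtacgt")
trimmed = clean_sequence("NNN" + "ACGT" * 4)
```

In the first call the ambiguous base R is masked and nothing is trimmed; in the second, one leading N is removed so that the first ten bases contain only two Ns.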

The flanking regions of clean microsatellite-containing sequences are then compared by an all-against-all BLASTn analysis. The E-value can be specified by the user. Sequences are sorted into four categories based on the results of the BLASTn. They are classified as (i) Unique if no similarity was observed to any other sequence of the same dataset, (ii) Redundant if the identity to another sequence was higher than 95% along the whole flanking sequence, or (iii) UnBLASTable if the sequence had no hits at all (i.e. no significant similarity to any sequence, not even to itself). This is the case if the flanking region is too short or semi-repetitive (it resembles a microsatellite but there are not enough uninterrupted repeats). All non-redundant sequences that produced a significant hit with at least one different sequence are classified as (iv) Grouped.
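The four categories can be expressed as a small decision function. The Python sketch below assumes that the significant BLASTn hits of each sequence have already been parsed into tuples; the function name `classify_flank`, the tuple layout and the `covers_whole_flank` flag are hypothetical and not taken from MicroFamily itself.

```python
def classify_flank(seq_id, hits, redundancy_identity=95.0):
    """Classify one sequence from its all-against-all BLASTn hits.

    hits is a list of (subject_id, percent_identity, covers_whole_flank)
    tuples for significant hits of seq_id, including the hit to itself.
    """
    if not hits:
        # No significant hit at all, not even the self hit.
        return "UnBLASTable"
    others = [hit for hit in hits if hit[0] != seq_id]
    if not others:
        return "Unique"
    if any(identity > redundancy_identity and whole
           for _, identity, whole in others):
        return "Redundant"
    return "Grouped"


category = classify_flank("seq1", [("seq1", 100.0, True), ("seq7", 88.5, False)])
```

Here `seq1` has a significant but partial hit to a different sequence, so it falls into the Grouped category.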

Reference

Meglécz, E. 2007. MicroFamily: A computer program for detecting flanking region similarities among different microsatellite loci. Molecular Ecology Notes, 7 : 18-20 Abstract