QDD - version 3
A user-friendly program to select microsatellite
markers and design primers from large sequencing projects.
Windows and Linux
Command line and Galaxy server version
Emese Meglécz and Jean-François Martin
Aix-Marseille Université, CNRS, IRD, UMR 7263 – IMBE, Equipe EGE Case 36, 3 Place Victor Hugo, 13331 Marseille Cedex 3, France
Montpellier SupAgro, INRA, CIRAD, IRD, Centre de Biologie et de Gestion des Populations, Campus International de Baillarguet, CS30016, 34988 Montferrier-sur-Lez, France
emese.meglecz(at)imbe.fr
http://net.imbe.fr/~emeglecz/qdd.html
Overview
In microsatellite development, high throughput sequencing has replaced the classical cloning-based methods. The first two versions of QDD played an important role in this process by handling the essential bioinformatics steps leading from raw sequences to primer design.
The original version of QDD aimed to extract the best (putative) markers from a few Megabases (5-500 Mb) of 454 pyrosequences, since back in 2010 only this sequencing platform provided sufficiently long reads (300-500 bases). QDD1 treated all bioinformatics steps from raw sequences all the way to obtaining PCR primers: sorting sequences by tag, removing adapters/vectors, detection of microsatellites, detection of redundancy/possible mobile element association, and primer design.
QDD2 relaxed the primer design conditions, and let the users choose among more markers based mainly on the target region pattern.
QDD3 (current version) aims to improve primer design in the following ways:
- Current Illumina read lengths (100-250 bp) are still slightly short for microsatellite marker development. However, sequencing hundreds of Gbases of DNA has become affordable, so sequences can be assembled into contigs, at least for genomes that are not too large or repetitive. QDD3 does NOT do de novo assembly, but it can take contigs (scaffolds/chromosomes) as input and extract microsatellites with their flanking regions for primer design.
- Although the all-against-all comparison of the sequences in QDD versions 1 and 2 could pinpoint some of the putative interspersed or tandem repetitive regions, a comparison to known repetitive elements via RepeatMasker is now also available in the Linux command line version and in the virtual machine.
- In previous versions, the automatic selection of the 'best' primer pair was arbitrarily based on the penalty score that Primer3 calculates for each primer pair. However, our wet lab results show that this score is not a good indicator of PCR success. The choice of one primer pair per locus now depends on our wet lab tests and provides a more meaningful selection.
- Multi-threading for BLAST and RepeatMasker is possible.
- Fastq files can also be used as input (the default format is fasta).
QDD3 is currently available in two different forms:
- A command line version that can be run on both Linux and Windows operating systems.
- QDD integrated into a Galaxy server to provide an easy-to-use interface and many useful sequence treatment tools.
QDD scripts can either be downloaded directly, in which case they must be installed together with all the third party programs, or be obtained pre-installed (together with the third party programs) in a ready-to-use virtual machine. In both cases, they can be run from Galaxy or from a command line.
Citation
- Meglécz, E., Costedoat, C., Dubut, V., Gilles, A., Malausa, T., Pech, N. and Martin, J-F. 2010. QDD: a user-friendly program to select microsatellite markers and design primers from large sequencing projects. Bioinformatics, 26(3) 403–404.
- Meglécz, E., Pech, N., Gilles, A., Dubut, V., Hingamp, P., Trilles, A., Grenier, R. and Martin, J-F. 2014. QDD version 3.1: A user friendly computer program for microsatellite selection and primer design revisited: experimental validation of variables determining genotyping success rate. Molecular Ecology Resources, doi: 10.1111/1755-0998.12271.
General Description
QDD is composed of four parts. They can be run separately or all at once as a pipeline.
PIPE1: Sequence preparation and microsatellite detection
The input sequences can be either assembled sequences (contigs, scaffolds or chromosomes) or non-assembled sequence reads.
- Assembled sequences (contigs, scaffold or chromosomes): Microsatellites are extracted with a user defined flanking region on both sides. Adapter clipping is irrelevant in this case.
- Non-assembled sequences: Adapters can be removed (optional), short reads are eliminated and only sequences with microsatellites are kept for further analysis.
The input file can be in either fasta or fastq format. Fastq files are converted into fasta files without quality filtering. It is thus essential to appropriately trim low quality regions before starting QDD.
Input reads should be 100-1000 bp long. QDD3 is not designed to process Gbases of short Illumina reads.
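As an illustration of the extraction step for assembled sequences, here is a minimal Python sketch that finds pure microsatellites (a 2-6 bp motif repeated at least 5 times, as defined in the Glossary) and cuts them out with a flanking region on both sides. It is not QDD's own code: the function name extract_microsatellites, the flank parameter and the regular expression are assumptions made for this example, and compound microsatellites are not handled.

```python
import re

# A 2-6 bp motif followed by at least 4 further copies (5 repeats in total).
# The lazy quantifier makes the regex prefer the shortest possible motif.
MICROSAT = re.compile(r"([ACGT]{2,6}?)\1{4,}")


def extract_microsatellites(contig, flank=200):
    """Yield (start, end, motif, repeats, fragment) for each pure microsatellite.

    'flank' is a hypothetical parameter: the number of bases kept on each
    side of the repeat, truncated at the ends of the contig.
    """
    contig = contig.upper()
    for match in MICROSAT.finditer(contig):
        motif = match.group(1)
        if len(set(motif)) == 1:
            # A motif made of one base is a homopolymer, not a microsatellite.
            continue
        start, end = match.span()
        left = max(0, start - flank)
        right = min(len(contig), end + flank)
        yield start, end, motif, (end - start) // len(motif), contig[left:right]


if __name__ == "__main__":
    demo = "TTGACCGTAG" + "AC" * 8 + "GGCTTAGCAT"
    for hit in extract_microsatellites(demo, flank=5):
        print(hit)
```

The flank length is the user-defined quantity mentioned above. A value long enough to leave room for primers on both sides of the repeat is the natural choice.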
PIPE2: Sequence similarity detection
Sequences are compared by an all-against-all BLAST and sorted according to their similarities into the following categories:
- Singletons => Singletons (the only BLAST hit is an autohit)
- Nohit css => Low complexity sequences (no BLAST hit to itself)
- Multihit => Putative minisatellites (more than one hit between a pair of sequences)
- Grouped => Sequences (including consensus sequences) that had BLAST hits to other sequences, with an identity over the overlapping region below the limit. Regions covered by BLAST hits are masked.
- Consensus => All unique (no hit to grouped sequences) consensus sequences based on the alignment of reads whose pairwise identity was at least 95% over the overlapping region. Consensus sequences are marked as polymorphic if polymorphism in the microsatellite length is detected among the aligned reads.
The main output file (the input file for pipe3) contains all unique sequences (singletons and consensus sequences).
If the sequences were extracted from assemblies, making consensus sequences does not make sense. Pipe2 input sequences are still compared by an all-against-all BLAST, but only unique sequences (singletons) are kept for primer design.
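The first three categories above can be illustrated with a small Python sketch. It assumes the all-against-all BLAST result has already been reduced to a list of (query, subject) pairs, one per reported hit. The function name classify and the label "has_partner" are hypothetical. Separating grouped sequences from consensus candidates would also need the identity of each hit, which is left out here.

```python
from collections import Counter


def classify(seq_ids, hits):
    """Assign a rough category to every sequence.

    hits: iterable of (query_id, subject_id) tuples, one per BLAST hit
    (an assumed, simplified view of a tabular BLAST output).
    """
    pair_counts = Counter(hits)
    categories = {}
    for sid in seq_ids:
        # All query/subject pairs in which this sequence takes part.
        own = {pair: n for pair, n in pair_counts.items() if sid in pair}
        if (sid, sid) not in own:
            categories[sid] = "nohit"        # no hit even to itself
        elif any(n > 1 for n in own.values()):
            categories[sid] = "multihit"     # several hits between one pair
        elif len(own) == 1:
            categories[sid] = "singleton"    # the autohit is the only hit
        else:
            categories[sid] = "has_partner"  # grouped or consensus candidate
    return categories


if __name__ == "__main__":
    ids = ["s1", "s2", "s3"]
    demo_hits = [("s1", "s1"), ("s2", "s2"), ("s2", "s2")]
    print(classify(ids, demo_hits))
```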
PIPE3: Primer design
Primers are designed in an iterative way for each sequence.
- By increasing size of the PCR products (To force design of primers that have longer products)
- From best to worst scenario for the target region (one single perfect microsatellite => multiple microsatellites, homopolymers, nanosatellites)
PIPE4: Contamination check and comparison to known transposable elements
- Contamination check: An optional contamination check can be done by BLASTing all sequences with successful primer design against the nt database of NCBI and checking the taxonomic classification of the best hit. This step can be done either by a local BLAST or by a remote BLAST. The first option requires downloading the nt database from NCBI (ca. 15 Gb) and is faster than the second. The second option relies on a good internet connection; it is much slower and connection timeouts can be frequent.
This step does not pick out particular sequences as contaminant, but can warn the users of serious general contamination (or mixing up samples) if taxonomic groups of the best hits do not match the target species.
- Comparison to known transposable elements: Sequences with successful primer design can be compared to known transposable elements by running RepeatMasker from QDD (available in the VM version or in the command line version run on a Linux system).
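The contamination check gives a global warning rather than a per-sequence verdict. A simple way to read its result is to tally the taxonomic groups of the best hits, as in the Python sketch below. The taxon labels, the function name summarize_best_hits and the expected_group argument are all hypothetical. QDD reports the taxonomy in the primer table, and the exact format is not reproduced here.

```python
from collections import Counter


def summarize_best_hits(best_hit_taxa, expected_group):
    """Count best-hit taxa and return the fraction outside the expected group.

    best_hit_taxa: one taxon label per sequence that had a BLAST hit
    (hypothetical labels, e.g. a class or phylum name).
    """
    counts = Counter(best_hit_taxa)
    total = sum(counts.values())
    off_target = total - counts[expected_group]
    fraction = off_target / total if total else 0.0
    return counts, fraction


if __name__ == "__main__":
    taxa = ["Insecta"] * 8 + ["Bacteria"] * 2
    counts, fraction = summarize_best_hits(taxa, "Insecta")
    print(counts.most_common(), fraction)
```

A high off-target fraction suggests general contamination or a sample mix-up. A few unexpected hits are not enough to flag individual sequences as contaminants.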
Main output file(s): Primer table
The main output files are the primer tables produced by pipe3 and completed by pipe4. They are named xxx_pipe[3,4]_primers.tabular in the command line version and 'Table with primers' in Galaxy.
Each line corresponds to a primer pair, and several primer pairs are designed for each sequence.
For each primer pair information is given on
- the sequence
- the target region (number, type, motif, length of microsatellite),
- the primers (position, length, annealing temperature...).
Pipe4 completes the output table of pipe3 with
- information on the best hit against Genbank (accession, e-value, score, taxonomy) and
- best hit to a known transposable element.
Take your time to understand the information in the different columns in this table, since it helps you to choose markers and primer pairs out of the many designed. (see detailed description of the columns in Output files section)
How to choose primers from the primer table.
The following suggestions are based on our lab tests (Meglécz et al submitted) or simply on common sense. The title of the column in the output file relevant for each selection criterion is indicated in capital letters.
- ONE_PRIMER_FOR_EACH_SEQ: An automatic selection of 1 primer pair for each locus, based on our lab tests. (See detailed description of the columns in Output files section)
- Avoid primers with high alignment score to the sequence (PCR_PRIMER_ALIGNSCORE, annealing sites are not considered for calculation of the alignment score)
- A pure microsatellite is better than a compound one (PURE/COMPOUND)
- Microsatellites with more repeats are more likely to be polymorphic (TARGET_MS_LENGTH_IN_REPEAT_NUMBER)
- Avoid microsatellite motifs that can form hairpins (e.g. (AT)n; MOT_TRANS)
- Choose markers in different ranges for PCR product length to facilitate multiplexing (PCR_PRODUCT_SIZE)
- Avoid primers that are very close (<20 bp) to the target microsatellite (MIN_PRIMER_TARGET_DIST)
- Choose compatible annealing temperatures for the primers if you have changed the default values of Primer3. With the default values, all primer Tm values can vary between 57 and 63 °C, which makes most primer pairs compatible for multiplexing (PRIMER_LEFT_TM, PRIMER_RIGHT_TM)
- If input sequences were contigs, avoid selecting markers that are near to each other on the same contig (FIRST_POS_ON_CONTIG, CONTIG_CODE)
- If you have run RepeatMasker, avoid primers with good hits to transposable elements (RM_score; a high score indicates a good alignment between a TE and your sequence)
- Consensus sequences based on read numbers much higher than expected coverage are probably derived from different loci of a repetitive element, thus should be avoided (NUMBER_OF_READS).
- Prefer target regions that do not contain multiple microsatellites, nanosatellites or homopolymers (DESIGN A, B)
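Several of the criteria above are numeric and can be applied with a short script. The Python sketch below reads a tab-separated primer table, keeps primer pairs that satisfy two of the criteria, and sorts them by alignment score. The column names are the ones quoted in the list above. The LOCUS column, the function name shortlist and the threshold of 8 repeats are assumptions made for the example, and the real table contains many more columns.

```python
import csv
import io


def shortlist(table_text, min_dist=20, min_repeats=8):
    """Filter and rank primer pairs from a tab-separated primer table.

    min_dist follows the 20 bp suggestion above; min_repeats is an
    arbitrary illustrative threshold, not a QDD default.
    """
    rows = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    kept = [
        row for row in rows
        if int(row["MIN_PRIMER_TARGET_DIST"]) >= min_dist
        and int(row["TARGET_MS_LENGTH_IN_REPEAT_NUMBER"]) >= min_repeats
    ]
    # Lower alignment score first: primers that align less to the sequence.
    kept.sort(key=lambda row: float(row["PCR_PRIMER_ALIGNSCORE"]))
    return kept


if __name__ == "__main__":
    # LOCUS is a hypothetical identifier column used only in this demo.
    demo = (
        "LOCUS\tMIN_PRIMER_TARGET_DIST\tTARGET_MS_LENGTH_IN_REPEAT_NUMBER\tPCR_PRIMER_ALIGNSCORE\n"
        "locus1\t35\t12\t4.0\n"
        "locus2\t5\t15\t2.0\n"
    )
    for row in shortlist(demo):
        print(row["LOCUS"])
```

To process a real file, read its contents into a string and pass it as table_text. Check the header line of your own table first, since the exact spelling of the column titles may differ.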
Glossary
- Perfect (pure) microsatellite: Microsatellite composed of one single motif of 2-6 bp length with no interruption. The minimum number of repetitions is arbitrarily set to 5.
- Nanosatellite: 3-4 tandem repetitions of a 2-6 bp motif.
- Homopolymer: At least 5 tandem repetitions of a single base.
- Compound microsatellite: Pure micro- and nanosatellites are pooled into a compound microsatellite if the distance between them is equal to or less than the length of the longer of the two motifs. Homopolymers are never pooled with micro- or nanosatellites.
- Target microsatellite: Pure or compound microsatellite with at least 5 uninterrupted repetitions of a 2-6 bp motif.
- Target region: The region of the read that should be between the primers. There can be one or more target microsatellites in a target region.
- Genomic multicopies: Loci present more than once in the genome. They can be either the results of duplication events or transposition.
- Flanking region: The whole sequence apart from the target microsatellites. This simple definition can be applied, since the lengths of the reads are compatible with PCR, thus it is not necessary to define a maximum for length of a flanking region.
- Soft masking in BLAST: BLAST prevents seeding (starting the alignment by a perfect match of a predefined length) in masked regions, but allows alignment extension through them if soft masking is applied.
- Tag: A short DNA stretch added at the 5’-end of the DNA fragment to be sequenced for identification. Different tags can be added to DNA from different sources (e.g. species) and the pooled DNA is loaded on a non-fractioned PicoTiter plate, thus gaining space and quantities of reads. Sequences coming from different sources are identified according to their tag.
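The pooling rule in the compound microsatellite definition can be sketched in Python as follows. The sketch assumes that each repeat is given as (start, end, motif) with an exclusive end coordinate, and that the "distance" is the number of bases between the end of one repeat and the start of the next. Both conventions, and the function name pool_compound, are assumptions and not QDD's own implementation.

```python
def pool_compound(repeats):
    """Group neighbouring micro- and nanosatellites into compound repeats.

    repeats: list of (start, end, motif) tuples, end exclusive.
    Returns (pooled, homopolymers): two lists of groups, where each group
    is a list of repeats. Homopolymers always stay in their own group.
    """
    pooled, homopolymers = [], []
    for rep in sorted(repeats):
        start, end, motif = rep
        if len(motif) == 1:
            homopolymers.append([rep])   # never pooled with other repeats
            continue
        if pooled:
            _, prev_end, prev_motif = pooled[-1][-1]
            gap = start - prev_end
            # Pool when the gap does not exceed the longer motif length.
            if gap <= max(len(motif), len(prev_motif)):
                pooled[-1].append(rep)
                continue
        pooled.append([rep])
    return pooled, homopolymers


if __name__ == "__main__":
    demo = [(0, 10, "AC"), (12, 27, "GAT"), (40, 52, "TG"), (53, 60, "A")]
    print(pool_compound(demo))
```

In the demo, the (AC)n and (GAT)n repeats are 2 bases apart. This is within the longer motif length of 3, so they form one compound microsatellite, while the (TG)n repeat and the homopolymer remain separate.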
Disclaimer: The software on this page is free to download and use, and thus comes with no warranty of any kind. While it hasn't caused us any problem, the current version of QDD is still considered as a beta version and you are responsible for any damages or loss of data you may sustain while using this software.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
Correspondence, comments and bug reports about this program should be addressed to Emese Meglécz