Introduction
SolexaQA is a Perl-based software package that calculates quality statistics and creates visual representations of data quality from FASTQ files generated by Illumina second-generation sequencing technology (“Solexa”).
This package also contains DynamicTrim, which trims each sequence of a FASTQ file individually to the longest contiguous read segment for which the quality score at each base is greater than a user-supplied quality cutoff (or, alternatively, to the read segment returned by the BWA trimming algorithm). It also contains LengthSort, which sorts dynamically trimmed reads into subset files based on a user-defined length cutoff. DynamicTrim and LengthSort should typically be used in combination to remove poor-quality bases and reads.
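The default trimming rule described above can be illustrated with a minimal Python sketch. This is not the DynamicTrim source code; the function names (`longest_good_segment`, `trim_read`) are hypothetical, the input is assumed to be a list of per-base error probabilities, and ties between equally long segments are resolved in favour of the first one, which is an assumption.

```python
def longest_good_segment(error_probs, cutoff=0.05):
    """Return (start, end) of the longest contiguous run of bases whose
    error probability is below the cutoff. `end` is exclusive.

    A quality score greater than the cutoff corresponds to an error
    probability smaller than the cutoff probability.
    """
    best_start = best_end = 0
    run_start = None
    for i, p in enumerate(error_probs):
        if p < cutoff:
            if run_start is None:
                run_start = i  # a new run of acceptable bases begins here
            if (i + 1 - run_start) > (best_end - best_start):
                best_start, best_end = run_start, i + 1
        else:
            run_start = None  # a poor base ends the current run
    return best_start, best_end


def trim_read(sequence, error_probs, cutoff=0.05):
    """Trim a read to its longest acceptable segment (assumed tie rule: first wins)."""
    start, end = longest_good_segment(error_probs, cutoff)
    return sequence[start:end]


if __name__ == "__main__":
    probs = [0.5, 0.01, 0.02, 0.2, 0.001, 0.001, 0.003, 0.9]
    print(trim_read("ACGTACGT", probs))
```

A read in which no base passes the cutoff is trimmed to an empty string in this sketch; how the real program reports such reads is not covered here.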
SolexaQA and DynamicTrim now automatically detect input FASTQ file formats. The package can accommodate FASTQ files in Sanger, Solexa and Illumina formats. These file formats are defined in:
Cock, P.J.A., C.J. Fields, N. Goto, M.L. Heuer, and P.M. Rice. 2010. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research 38:1767-1771.
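For orientation, the three FASTQ variants defined in the paper above differ in their ASCII offset and, for Solexa, in the score formula. The Python sketch below shows one common range-based heuristic for guessing the variant and for decoding quality characters into error probabilities. It is an illustration only and is not necessarily how SolexaQA detects formats; the function names are hypothetical. The heuristic can misclassify files that contain only high-quality bases.

```python
def guess_fastq_encoding(quality_strings):
    """Guess the FASTQ variant from the lowest quality character seen.

    Sanger uses ASCII offset 33; Solexa uses offset 64 with scores from -5
    (ASCII 59); Illumina 1.3+ uses offset 64 with scores from 0 (ASCII 64).
    """
    lowest = min((ord(ch) for line in quality_strings for ch in line), default=None)
    if lowest is None:
        raise ValueError("no quality characters supplied")
    if lowest < 59:
        return "sanger"
    if lowest < 64:
        return "solexa"
    return "illumina"


def error_probabilities(quality_string, encoding):
    """Decode a quality string into per-base error probabilities."""
    offset = 33 if encoding == "sanger" else 64
    probs = []
    for ch in quality_string:
        q = ord(ch) - offset
        if encoding == "solexa":
            # Solexa scores are log-odds: Q = -10 * log10(p / (1 - p))
            probs.append(1.0 / (1.0 + 10 ** (q / 10)))
        else:
            # Phred scores: Q = -10 * log10(p)
            probs.append(10 ** (-q / 10))
    return probs
```

The command-line flags -sanger, -solexa and -illumina remain the reliable choice whenever the format is already known.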
All three programs can handle paired-end data. Just supply multiple FASTQ files on the programs’ command lines.
Citation
A paper describing this software has been published in BMC Bioinformatics. If you use this software, please cite:
Cox, M.P., D.A. Peterson, and P.J. Biggs. 2010. SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 11:485.
Mailing List
A Google groups mailing list is available for users of the SolexaQA package. You are encouraged to join the mailing list here.
Members can discuss the SolexaQA package by sending emails to <solexaqa-users@googlegroups.com>. Help will be provided by others in the user community, as well as by the program developers.
Versions
As of version 1.6+, SolexaQA, DynamicTrim and LengthSort have improved handling of Illumina HiSeq data. Note that HiSeq data is represented as 32 (or 48) virtual tiles, roughly a quarter (or two-fifths) of the 120 tiles in GAIIx data. Therefore, statistics are calculated over a larger number of reads, and HiSeq data quality often appears better than that of sequences generated on the GAIIx. In some instances, this apparent improvement in data quality may in fact be misleading. A similar pattern is observed with MiSeq data.
As of version 1.7+, DynamicTrim and LengthSort can handle Ion Torrent and 454 data.
As of version 1.8+, SolexaQA, DynamicTrim and LengthSort can handle the FASTQ file format changes implemented in Cassava 1.8. Most important among these are the altered header lines and the default Sanger quality score encoding.
As of version 1.9+, SolexaQA can handle HiSeq 32- and 48-tile data.
Version 1.10 is a minor bug fix.
Version 1.11 is a minor change to header code allowing the software to be used on MiSeq data.
Version 1.12 lets users assign an output directory for created files.
Requirements
The SolexaQA package works across a range of platforms, but is primarily designed for high-performance UNIX machines.
SolexaQA requires working installations of Perl, R and matrix2png. DynamicTrim and LengthSort require only a working installation of Perl.
Example datasets are available from the download page. Example plots produced by SolexaQA can be viewed here.
The SolexaQA package is released under the GNU General Public License version 3.
Usage Information
- SolexaQA
perl SolexaQA.pl FASTQ_input_files [-p|probcutoff 0.05] [-h|phredcutoff 13] [-v|variance] [-m|minmax] [-s|sample 10000] [-b|bwa] [-d|directory path] [-sanger -solexa -illumina]
Optional command line flags:
-p or -probcutoff #    probability value (between 0 and 1) at which base-calling error is considered too high (default; P = 0.05)
or
-h or -phredcutoff #   Phred score (between 0 and 40) at which base-calling error is considered too high (default; Q = 13) *
-v or -variance        calculate variance statistics (note: approximately doubles run time)
-m or -minmax          calculate minimum and maximum probabilities for each read position of each tile (note: increases run time by ~25%)
-s or -sample #        integer number of sequences to be sampled per tile for statistics estimates (default; s = 10,000)
-b or -bwa             use BWA trimming algorithm
-d or -directory       path to directory where output files are saved
-sanger                Sanger FASTQ format (bypasses automatic format detection)
-solexa                Solexa FASTQ format (bypasses automatic format detection)
-illumina              Illumina FASTQ format (bypasses automatic format detection)
* Although quality cutoff values can be entered using either the -p or -h flags, user-supplied Phred Q values are automatically converted to their corresponding probability values. All output is represented as probabilities of base-calling error because interpreting summaries of log probabilities is problematic. (Consider, for instance, how the variance of log probabilities should be interpreted.) The user-supplied quality cutoff is used only to calculate the length of the best read segment; all other statistics are unaffected by this parameter.
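The relationship between the two cutoff flags follows the standard Phred definition, Q = -10 log10(P). The short Python sketch below shows the conversion and illustrates why summaries of log-scaled scores can mislead: averaging Phred scores is not the same as averaging error probabilities. The function names are hypothetical and are not part of the SolexaQA package.

```python
import math


def phred_to_prob(q):
    """Convert a Phred quality score to a probability of base-calling error."""
    return 10 ** (-q / 10)


def prob_to_phred(p):
    """Convert a probability of base-calling error to a Phred quality score."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # The two default cutoffs are (approximately) the same threshold.
    print(phred_to_prob(13), prob_to_phred(0.05))

    # Two bases with Q = 10 and Q = 30: the mean of the probabilities
    # differs markedly from the probability implied by the mean Phred score.
    scores = [10, 30]
    mean_of_probs = sum(phred_to_prob(q) for q in scores) / len(scores)
    prob_of_mean_q = phred_to_prob(sum(scores) / len(scores))
    print(mean_of_probs, prob_of_mean_q)
```

In this example the mean error probability is about five times larger than the probability implied by the mean Phred score, which is one reason to summarise probabilities directly.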
- DynamicTrim
perl DynamicTrim.pl FASTQ_input_files [-p|probcutoff 0.05] [-h|phredcutoff 13] [-b|bwa] [-d|directory path] [-sanger -solexa -illumina] [-454]
Optional command line flags:
-p or -probcutoff #    probability value (between 0 and 1) at which base-calling error is considered too high (default; P = 0.05)
or
-h or -phredcutoff #   Phred score (between 0 and 40) at which base-calling error is considered too high (default; Q = 13)
-b or -bwa             use BWA trimming algorithm
-d or -directory       path to directory where output files are saved
-sanger                Sanger FASTQ format (bypasses automatic format detection)
-solexa                Solexa FASTQ format (bypasses automatic format detection)
-illumina              Illumina FASTQ format (bypasses automatic format detection)
-454                   select this option if trimming Roche 454 or Ion Torrent data in FASTQ format (experimental feature)
If no quality cutoff value is given, DynamicTrim defaults to P = 0.05.
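For readers unfamiliar with the alternative selected by the -b flag, the following Python sketch outlines the general idea of BWA-style quality trimming: working from the 3' end, it keeps the prefix that maximises the accumulated sum of (threshold - quality) over the removed bases. This is a simplified illustration based on the approach used by BWA; it omits BWA's minimum read length, the function name is hypothetical, and it may differ in detail from the DynamicTrim implementation.

```python
def bwa_trim_length(quals, threshold=13):
    """Return the number of bases to keep from the 5' end of a read.

    `quals` is a list of Phred scores. Scanning backwards from the 3' end,
    the running sum of (threshold - quality) is tracked; the read is cut
    where that sum peaks, and the scan stops once the sum becomes negative.
    """
    running = 0
    best = 0
    keep = len(quals)  # default: keep the whole read
    for i in range(len(quals) - 1, -1, -1):
        running += threshold - quals[i]
        if running < 0:
            break  # good bases now outweigh the poor tail
        if running > best:
            best = running
            keep = i  # cutting here removes bases i..end
    return keep


if __name__ == "__main__":
    print(bwa_trim_length([30, 30, 30, 30, 5, 5, 2]))
```

Unlike the default longest-segment rule, this approach only removes bases from the 3' end and can tolerate an isolated good base within a poor-quality tail.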
Due to frequent requests, DynamicTrim is now able to trim 454 and Ion Torrent data if given in FASTQ format. The trimming algorithm is functionally identical to that used for Sanger-style FASTQ data. Although 454 and Ion Torrent read trimming appears to work, we do not know how useful this feature will be. Trimming appears to occur most frequently at homopolymer repeats, which is where 454 and Ion Torrent data quality is typically reduced. This function is currently provided as an experimental feature.
Note that SolexaQA, which works quite differently to DynamicTrim, cannot accommodate 454 or Ion Torrent data.
Due to code improvements, version 1.2+ is 48% faster than earlier versions.
- LengthSort
perl LengthSort.pl one single-end or two paired-end FASTQ files [-l|length 25] [-d|directory path]
Optional command line flags:
-l or -length          length cutoff [defaults to 25 nucleotides]
-d or -directory       path to directory where output files are saved
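The sorting step can be pictured with the small Python sketch below. It is a guess at the general logic rather than the LengthSort source: the category labels ("paired", "single", "discard") and function names are hypothetical, and treating a read whose length equals the cutoff as passing is an assumption.

```python
def classify_single(length, cutoff=25):
    """Classify one trimmed single-end read by its length."""
    return "single" if length >= cutoff else "discard"


def classify_pair(length1, length2, cutoff=25):
    """Classify a trimmed read pair by the lengths of both mates.

    Assumed behaviour: a pair is kept together only if both mates pass;
    if exactly one passes, the pair is broken and that mate is kept alone.
    """
    keep1 = length1 >= cutoff
    keep2 = length2 >= cutoff
    if keep1 and keep2:
        return "paired"
    if keep1 or keep2:
        return "single"
    return "discard"


if __name__ == "__main__":
    for pair in [(76, 60), (76, 12), (3, 12)]:
        print(pair, classify_pair(*pair))
```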
Authors
SolexaQA was developed by Murray Cox, Patrick Biggs and Daniel Peterson at Massey University, Palmerston North, New Zealand.
Code improvements for DynamicTrim were suggested by Douglas Scofield at McGill University, Canada.
Please direct questions to either Murray Cox <m.p.cox@massey.ac.nz> or Patrick Biggs <p.biggs@massey.ac.nz>.