SolexaQA

SolexaQA 2

Introduction

SolexaQA is a software package to calculate sequence quality statistics and create visual representations of data quality for Illumina’s second-generation sequencing technology (historically known as “Solexa”). Running directly on Illumina FASTQ files, the package contains three component programs:

SolexaQA — the primary quality analysis and visualization tool. Designed to run on unmodified FASTQ files obtained directly from Illumina sequencers.

DynamicTrim — a read trimmer that individually crops each read to its longest contiguous segment for which quality scores are greater than a user-supplied quality cutoff (or alternatively, to the read segment returned by the BWA trimming algorithm).

LengthSort — a program to separate high quality from low quality reads. LengthSort assigns trimmed reads to paired-end, singleton and discard files based on a user-defined length cutoff.

DynamicTrim and LengthSort are typically used in combination to remove poor quality bases and/or reads from high throughput sequence data. Both programs should work on any FASTQ file (modified or unmodified).

The SolexaQA package automatically detects input FASTQ file formats (i.e., the Sanger, Illumina and Solexa formats described by Cock et al. 2010). All three programs are designed to run on single-end or paired-end data, including reads from the latest versions of the HiSeq and MiSeq machines. High quality graphics are produced by interfacing automatically with R.
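
As an illustration of how automatic format detection can work in principle, the sketch below guesses the encoding from the lowest quality character observed, using the ASCII ranges given by Cock et al. (2010). It is a simplified heuristic and not necessarily the method SolexaQA itself uses; the function name `guess_fastq_format` is hypothetical.

```python
def guess_fastq_format(quality_strings):
    """Guess the FASTQ quality encoding from the lowest quality character.

    Assumed ranges (Cock et al. 2010):
      Sanger   : ASCII 33-126, Phred scores, offset 33
      Solexa   : ASCII 59-126, Solexa scores (from -5), offset 64
      Illumina : ASCII 64-126, Phred scores, offset 64
    """
    codes = [ord(ch) for qual in quality_strings for ch in qual]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest = min(codes)
    if lowest < 59:
        return "sanger"    # characters below ';' only occur with offset 33
    if lowest < 64:
        return "solexa"    # ';' to '?' encode Solexa scores -5 to -1
    return "illumina"      # nothing below '@' was seen


if __name__ == "__main__":
    print(guess_fastq_format(["IIII#III", "IIIIIIII"]))
    print(guess_fastq_format(["hhhh;hhh"]))
    print(guess_fastq_format(["hhhhBhhh"]))
```

A heuristic of this kind can misclassify files that contain only high quality scores, which is one reason the -sanger, -illumina and -solexa flags are useful for bypassing detection.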

The SolexaQA package is freely released under the GNU General Public License version 3.


Citation

If you use this software, please cite:

Cox, M.P., D.A. Peterson, and P.J. Biggs. 2010. SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 11:485.


Mailing List

We encourage users to join the (low traffic) Google Groups mailing list.

Members can discuss the SolexaQA package by sending emails to <solexaqa-users@googlegroups.com>. Help will be provided by the software developers, as well as by the user community.


Software Versions

Release 2.0 is a substantial rewrite of the SolexaQA package.

The software has been heavily optimized, and for most datasets, release 2.0 now runs 2-3 times faster than earlier versions. The graphical outputs have been completely reworked; they are now more informative than ever and more visually appealing. This version no longer requires a working installation of matrix2png. All graphics are generated automatically using the award-winning graphical and statistical software R.

Release 2.1 is a minor code change. Heat maps are now returned as PDF files. This resolves issues writing PNG files using some versions of R under UNIX.

Release 2.2 is a minor code change. 'Empty' sequences can now be read by the BWA algorithm of DynamicTrim.


Requirements

The SolexaQA package works across a range of platforms, but is primarily designed for high performance UNIX machines. The code works without modification on OS X.

SolexaQA needs working (base) installations of Perl and R. Most UNIX systems will already have these programs installed. No additional packages or modules are required.


Usage Information

SolexaQA
perl SolexaQA.pl FASTQ_input_files [-p|probcutoff 0.05] [-h|phredcutoff] [-v|variance] [-m|minmax] [-s|sample 10000] [-b|bwa] [-d|directory path] [-sanger -illumina -solexa]

Command line arguments:
-p or -probcutoff # probability value (between 0 and 1) at which base-calling error is considered too high (default; P = 0.05, or Q ≈ 13) or
-h or -phredcutoff # Phred quality score (between 0 and 41) at which base-calling error is considered too high
-v or -variance calculate variance statistics (note: this approximately doubles the run time)
-m or -minmax calculate minimum and maximum probabilities for each read position for each tile (note: this increases the run time by ~25%)
-s or -sample # integer number of sequences to be sampled per tile for statistics estimates (default; s = 10,000)
-b or -bwa use BWA trimming algorithm
-d or -directory path to directory where output files are saved
-sanger Sanger FASTQ format (bypasses automatic format detection)
-illumina Illumina FASTQ format (bypasses automatic format detection)
-solexa Solexa FASTQ format (bypasses automatic format detection)

Quality cutoff values can be entered using either the -p or -h flags. If no quality cutoff value is given, SolexaQA defaults to P = 0.05. User-supplied Phred Q values are automatically converted to their corresponding probability value.
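
The conversion between Phred scores and error probabilities follows the standard definition Q = -10 log10(P). A minimal sketch is shown below; the function names are illustrative and are not taken from the SolexaQA source code.

```python
import math


def phred_to_prob(q):
    """Convert a Phred quality score to a probability of base call error."""
    return 10 ** (-q / 10)


def prob_to_phred(p):
    """Convert a probability of base call error to a Phred quality score."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in the interval (0, 1]")
    return -10 * math.log10(p)


if __name__ == "__main__":
    # The default cutoff P = 0.05 corresponds to Q of roughly 13.
    print(prob_to_phred(0.05))
    print(phred_to_prob(13))
```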

All output is expressed as probabilities of base call error rather than as Phred scores. This is because summary statistics of log-scaled values can be problematic to interpret. For instance, consider the difficulty of interpreting the variance of a set of log probabilities.
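
A small worked example shows why the scale matters. Averaging two Phred scores and averaging their error probabilities give different answers; the scores used here are made up for illustration.

```python
def phred_to_prob(q):
    """Convert a Phred quality score to a probability of base call error."""
    return 10 ** (-q / 10)


# Two hypothetical base calls: one poor (Q = 10), one good (Q = 30).
scores = [10, 30]

# Mean of the error probabilities: (0.1 + 0.001) / 2.
mean_of_probs = sum(phred_to_prob(q) for q in scores) / len(scores)

# Error probability implied by the mean Phred score (Q = 20).
prob_of_mean_score = phred_to_prob(sum(scores) / len(scores))

if __name__ == "__main__":
    print(mean_of_probs, prob_of_mean_score)
```

The mean error probability is about 0.05, whereas the mean Phred score of 20 implies an error probability of 0.01, a fivefold underestimate of the expected error rate.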

The user-supplied quality cutoff is used only to calculate the length of the best read segment; all other statistics are unaffected by this parameter.

Seven output files are created per input FASTQ file. These have the name of the original FASTQ file with the following suffixes appended:

*.quality — a tab-delimited text file with mean probabilities of base call error for each read position (rows) for each tile (columns). The first column represents the global data (average base call error rate across all tiles). If variance, minimum, and maximum options are selected, these statistics are also displayed in the rightmost columns of this file.

*.quality.pdf — a line graph showing the mean probability of error by nucleotide position along the read. Each tile is represented by a dotted line with a red line indicating the global mean.

*.matrix.pdf — a heat map depicting the mean probability of error for each nucleotide position and tile. Each row represents a different tile, and each column corresponds to a nucleotide position. The color scale progresses from white to black through yellow and orange, with pure white representing an error probability of 0 and pure black an error probability of 0.75 (i.e., the called base is no more likely to be correct than any of the other three bases).

*.matrix — a tab-delimited text file showing mean probabilities of base calling error for each nucleotide position (columns) and each tile (rows). The arrangement of this matrix corresponds exactly to the layout of the heat map shown in the *.matrix.pdf file.

*.segments — contains metrics on the longest contiguous segment of each read in which the quality of each base is greater than the user-defined quality cutoff (defaults to P = 0.05). This tab delimited text file displays the proportion of all reads that contain a segment of a given length.

*.segments.hist.pdf — a histogram showing the distribution of the longest contiguous segment of each read in which the quality of each base is greater than the user-defined quality cutoff (defaults to P = 0.05). This figure draws upon the data in the *.segments file.

*.segments.cumulative.pdf — a line graph showing the cumulative frequency of trimmed read lengths. This figure also draws upon the data in the *.segments file.
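
The relationship between the segment proportions and the cumulative plot can be sketched as follows. This is an illustration only: the input is assumed to be a plain list of longest-segment lengths (one per read), the function names are hypothetical, and the exact column layout of the *.segments file is not reproduced.

```python
from collections import Counter
from itertools import accumulate


def segment_proportions(segment_lengths, read_length):
    """Proportion of reads whose longest good segment has each length 0..read_length."""
    if not segment_lengths:
        raise ValueError("no segment lengths supplied")
    counts = Counter(segment_lengths)
    total = len(segment_lengths)
    return [counts.get(n, 0) / total for n in range(read_length + 1)]


def cumulative_proportions(proportions):
    """Running total of the proportions, as plotted in a cumulative frequency graph."""
    return list(accumulate(proportions))


if __name__ == "__main__":
    # Four hypothetical reads of length 4 with longest segments of 0, 2, 2 and 4 bases.
    props = segment_proportions([0, 2, 2, 4], read_length=4)
    print(props)
    print(cumulative_proportions(props))
```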


DynamicTrim
perl DynamicTrim.pl FASTQ_input_files [-p|probcutoff 0.05] [-h|phredcutoff] [-b|bwa] [-d|directory path] [-sanger -illumina -solexa] [-454]

Command line arguments:
-p or -probcutoff # probability value (between 0 and 1) at which base-calling error is considered too high (default; P = 0.05, or Q ≈ 13) or
-h or -phredcutoff # Phred quality score (between 0 and 41) at which base-calling error is considered too high
-b or -bwa use BWA trimming algorithm
-d or -directory path to directory where output files are saved
-sanger Sanger FASTQ format (bypasses automatic format detection)
-illumina Illumina FASTQ format (bypasses automatic format detection)
-solexa Solexa FASTQ format (bypasses automatic format detection)
-454 select this option if trimming Roche 454 or Ion Torrent data in FASTQ format (experimental feature)

Quality cutoff values can be entered using either the -p or -h flags. If no quality cutoff value is given, DynamicTrim defaults to P = 0.05. User-supplied Phred Q values are automatically converted to their corresponding probability value.

Three output files are created per input FASTQ file. These have the name of the original FASTQ file with the following suffixes appended:

*.trimmed — a FASTQ file containing the trimmed reads.

*.segments — contains metrics on the longest contiguous segment of each read in which the quality of each base is greater than the user-defined quality cutoff (defaults to P = 0.05). This tab delimited text file displays the proportion of all reads that contain a segment of a given length.

*.segments.hist.pdf — a histogram showing the distribution of the longest contiguous segment of each read in which the quality of each base is greater than the user-defined quality cutoff (defaults to P = 0.05). This figure draws upon the data in the *.segments file.
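
The default trimming rule, which keeps the longest contiguous segment of bases that pass the cutoff, can be sketched as below. This is not the DynamicTrim source code: the function names are hypothetical, Sanger encoding (offset 33) is assumed, a base is assumed to pass when its error probability is strictly below the cutoff, and ties are resolved in favor of the first segment. The BWA trimming option is not covered.

```python
def longest_good_segment(error_probs, cutoff=0.05):
    """Return (start, end) of the longest run with error probability below cutoff.

    The end index is exclusive; (0, 0) is returned when no base passes.
    """
    best_start, best_end = 0, 0
    run_start = None
    for i, p in enumerate(error_probs):
        if p < cutoff:
            if run_start is None:
                run_start = i          # a new run of good bases begins here
            if i + 1 - run_start > best_end - best_start:
                best_start, best_end = run_start, i + 1
        else:
            run_start = None           # a poor base ends the current run
    return best_start, best_end


def trim_read(seq, qual, cutoff=0.05, offset=33):
    """Crop a read and its quality string to the longest good segment."""
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in qual]
    start, end = longest_good_segment(probs, cutoff)
    return seq[start:end], qual[start:end]


if __name__ == "__main__":
    # 'I' encodes Q = 40 and '#' encodes Q = 2 under Sanger encoding.
    print(trim_read("ACGTACGT", "II#IIII#"))
```

In this example the read is cropped to the four bases between the two low quality positions. A read in which no base passes the cutoff is returned empty.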

Due to frequent requests, DynamicTrim is now able to trim 454 and Ion Torrent data if provided in FASTQ format. The trimming algorithm is functionally identical to that used for Sanger-style FASTQ data. Although 454 and Ion Torrent read trimming appears to work, we do not know how useful this feature will be. Trimming seems to occur most frequently at homopolymer repeats, which is where 454 and Ion Torrent data quality is typically most reduced. This function is currently provided as an experimental feature.

Note that SolexaQA, which works very differently to DynamicTrim, cannot accommodate 454 or Ion Torrent data. LengthSort can accommodate the output files produced by DynamicTrim on 454 or Ion Torrent data.


LengthSort
perl LengthSort.pl one single-end or two paired-end FASTQ files [-l|length 25] [-d|directory path]

Command line arguments:
-l or -length length cutoff [defaults to 25 nucleotides]
-d or -directory path to directory where output files are saved

LengthSort assigns trimmed reads to files based on a user-defined length cutoff. Single-end data is sorted into reads that pass the length threshold (*.single) and reads smaller than the length threshold (*.discard). Paired-end data is sorted into two usable paired read files (*.paired1 and *.paired2), usable single reads (*.single) and non-usable reads that do not pass the length threshold (*.discard).
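
The sorting rule described above can be expressed as a short decision function. This is a sketch of the logic only, not the LengthSort source code; the function names are hypothetical and the returned strings simply mirror the output file suffixes.

```python
def sort_single(read_length, cutoff=25):
    """Destination suffix for a single-end read."""
    return "single" if read_length >= cutoff else "discard"


def sort_pair(length1, length2, cutoff=25):
    """Destination suffixes for the forward and reverse reads of a pair."""
    ok1 = length1 >= cutoff
    ok2 = length2 >= cutoff
    if ok1 and ok2:
        return "paired1", "paired2"    # both mates pass: keep them as a pair
    # Otherwise each mate is judged on its own length.
    return ("single" if ok1 else "discard",
            "single" if ok2 else "discard")


if __name__ == "__main__":
    print(sort_pair(30, 30))
    print(sort_pair(30, 10))
    print(sort_pair(10, 10))
```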

Either three or five output files are created, depending on whether single-end or paired-end data is supplied. These have the name of the original FASTQ file with the following suffixes appended:

*.single — reads that are larger than or equal to the length cutoff. If paired-end data is being analyzed, these are reads that fulfill the length requirement, while their paired read does not.

*.discard — reads that are smaller than the length cutoff.

*.summary.txt — a tab-delimited text file listing the number of reads assigned to the *.single and *.discard files (and if run on paired-end data, the *.paired1 and *.paired2 files as well).

Paired-end data only:

*.paired1 — forward reads that are larger than or equal to the length cutoff. The paired reverse read also passes the length cutoff and is present in the *.paired2 file.

*.paired2 — reverse reads that are larger than or equal to the length cutoff. The paired forward read also passes the length cutoff and is present in the *.paired1 file.

Authors

SolexaQA was developed at Massey University, New Zealand by Murray Cox, Patrick Biggs and Daniel Peterson. Release 2.0 was substantially reworked by Mauro Truglio.

Please direct questions to Murray Cox <m.p.cox@massey.ac.nz>, Patrick Biggs <p.biggs@massey.ac.nz> or Mauro Truglio <m.truglio@massey.ac.nz>.