Frequently Asked Questions

How does the SolexaQA package trim reads?
How does the SolexaQA package compare to other trimming methods?
Can the SolexaQA package handle paired-end data?
How does DynamicTrim deal with paired-end data?
Can the SolexaQA package use Illumina 1.5+ and 1.8+ pipeline data?
Can the SolexaQA package be used on 454 and Ion Torrent data?
Can the SolexaQA package handle different header line formats?
How does the SolexaQA package trim reads?

By default, the SolexaQA package extracts the longest contiguous sequence where all bases have a quality score higher than the user-defined threshold (defaults to P = 0.05, or equivalently, Q = 13).

Consider the following read with a poorly called 3′ base:

BaseACACCTGCCG
Quality score30303030303030303010

Using its default threshold, the SolexaQA package would trim this read to:

BaseACACCTGCC 
Quality score303030303030303030 

Now consider a read also containing a poorly called internal base:

BaseACACCTGCCG
Quality score30103030303030303010

Using its default threshold, the SolexaQA package would trim this read to:

Base        ACCTGCC 
Quality score        30303030303030 

How does the SolexaQA package compare to other trimming methods?

Most trimming algorithms are poorly documented. However, one alternative approach is implemented in the read mapper BWA.

For any given read, BWA maximizes the following equation:

BWA trimming equation

if ql < INT where l is the original read length and INT is a user-defined quality threshold (Q score).

So how does this differ from the SolexaQA package? As far as we can tell, BWA and the SolexaQA package trim approximately the same way for reads with poorly called 3′ bases. Consider the read used as an example above:

BaseACACCTGCCG
Quality score30303030303030303010

For this read, the BWA equation is maximized at x = 10. BWA will therefore remove the final base, exactly the same as the SolexaQA package:

BaseACACCTGCC 
Quality score303030303030303030 

However, now consider the read containing an internal error:

BaseACACCTGCCG
Quality score30103030303030303010

The BWA equation is still maximized at x = 10. BWA will therefore trim off only the final base, leaving the poor quality base at position 2 intact:

BaseACACCTGCC 
Quality score301030303030303030 

The BWA algorithm has a number of distinguishing characteristics:

  • Reads can only be trimmed from the 3′ end.
  • Error prone base calls may be retained.
  • Calculations are performed on quality scores (i.e., log probabilities of error). Summing these log error values is the equivalent of multiplying the actual probabilities. These two measurements are not linearly related, and summing the log values therefore places increasingly less weight on poorly called bases.

In comparison, the SolexaQA package:

  • Permits trimming from either end of the read.
  • Removes all bases with qualities worse than the user-defined threshold.
  • Trims on actual probabilities of error (not their log values).

The trimming approach implemented in the SolexaQA package appears more conservative than that of BWA (e.g., it removes all poor quality bases). Whether this suits your needs will largely depend on exactly what analyses you want to do with your trimmed data.

Dynamic trimming appears to improve read mapping rates, as well as improving validation rates of SNP calls. Because de novo assemblers have their own error correction mechanisms, the effect of read trimming on assembly quality appears to be strongly dataset dependent. We encourage robust testing of all methods on both trimmed and untrimmed data.


Can the SolexaQA package handle paired-end data?

Yes, the SolexaQA package was specifically designed for paired end data. Just supply multiple FASTQ files on the programs’ command lines.


How does DynamicTrim deal with paired-end data?

This question only arises when one of the reads is so bad that it has to be removed. We opted to use dummy reads, as described in the following example:

Before trimming:

Forward read file:
@GATC-23_0002:1:1:1142:10040#0/1
GGCTGTTATATATTAGTGATATTTTGGCA
+GATC-23_0002:1:1:1142:10040#0/1
^aaa^a`ab`fffffffddddddddabBB

Reverse read file:
@GATC-23_0002:1:1:1142:10040#0/2
TTGTGTGGTCTTTCAGAAGCTGGGAAACC
+GATC-23_0002:1:1:1142:10040#0/2
BBBBBBBBBBBBBBBBBBBBBBBBBBBBB

After trimming:

Forward read file:
@GATC-23_0002:1:1:1142:10040#0/1
GGCTGTTATATATTAGTGATATTTTGG
+GATC-23_0002:1:1:1142:10040#0/1
^aaa^a`ab`fffffffddddddddab

Reverse read file:
@GATC-23_0002:1:1:1142:10040#0/2
N
+GATC-23_0002:1:1:1142:10040#0/2
B

The dummy read in the reverse read file maintains read order between the forward and reverse read files. This seems preferable to i) having different numbers of unpaired reads in the forward and reverse files, or ii) leaving empty lines in the output file, which many downstream applications are not able to handle.

In terms of downstream applications, de novo assembly software like ABySS or Velvet will simply exclude any reads smaller than the k-mer, so these output files can be used as they are. Alternately, LengthTrim will sort reads into files based on a user-defined cutoff value (say, reads no smaller than 25 nucleotides). Which of these options you choose to employ will depend on the exact nature of your experiment.


Can the SolexaQA package use Illumina 1.5+ and 1.8+ pipeline data?

Yes. Some sources suggest that the Illumina 1.5+ pipelines have redefined the use of quality scores 0 to 2. Values 0 and 1 are no longer used. Value 2, encoded by ASCII value 66 (i.e., character “B”), is used only at the end of reads as a Read Segment Quality Control Indicator. Consequently, it is suggested that the “B” character should not be treated as a quality score.

However, we are yet to see any documentation of this change from an official Illumina source. (According to Illumina Technical Support, there has been no change in the ASCII encoding between versions 1.3 and 1.5). We also believe that there is an alternative interpretation of these low quality scores.

Imagine that we have no information about a nucleotide position, and thus, we assign it a base randomly (say, G). The probability that this base really is a G is 0.25. The probability that it is not a G is therefore 0.75. It is these probabilities of error that are recorded in FASTQ files as ASCII characters.

However, at these high rates of error, ASCII characters encode error probabilities only very sparsely:

ASCII characterQuality scoreError probability
@01.000
A10.794
B20.631
C30.501

Given rounding error, a randomly assigned base should presumably be assigned the ASCII character “B”. Probabilities of error greater than 0.75 make little sense, and we would not expect to observe them. Indeed, given that the “@” character is used in FASTQ header lines, Illumina has a very good reason to exclude them.

In short, we are yet to be convinced whether there is actually a 1.5+ variant of Illumina quality scores. Even if there is, the ASCII characters (and associated quality scores) still seem to retain their natural interpretation — as actual probabilities of error. Until this situation is clarified, we have decided not to make any alterations to the SolexaQA code.

In the 1.8+ pipeline, Illumina have extended their quality scores from 40 to 41. The Phred scores retain their natural interpretation, and the SolexaQA package can accommodate these larger Q values.


Can the SolexaQA package be used on 454 and Ion Torrent data?

Yes, the DynamicTrim and LengthSort programs can handle 454 and Ion Torrent data.

As of June 2011, both 454 and Ion Torrent data have been made available in Sanger FASTQ format. DynamicTrim and LengthSort can therefore process these datasets. Note that SolexaQA itself is restricted to data from Illumina systems.

BGI has released example Ion Torrent data for Escherichia coli isolate TY-2482. Here are untrimmed and trimmed datasets for comparison.

The following graph shows the distribution of read lengths before (black) and after (red) dynamic read trimming (P = 0.05). Note that early release Ion Torrent data is still subject to relatively high error rates, particularly around homopolymer repeats.

Ion Torrent Data Quality

Can the SolexaQA package handle different header line formats?

Yes, all three programs in this package are being continually updated to handle different file formats released by Illumina, including changes in the header lines.

A major change occurred with the release of Illumina’s CASAVA 1.8. Header lines that formerly looked similar to this:

@EAS139-23_0002:1:100:1142:10040#ATCACG/1

changed to something like this:

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

As of CASAVA 1.8, Illumina has also released data solely in Sanger FASTQ format.

Note that Illumina is perversely fond of changing its read header lines (and number of tiles), often for no obviously good reason. To date, we have encountered — and changed the SolexaQA code to accommodate — the following header line variants:

@AGRF-23_0002:1:100:1142:10040#0/1
@114222772_110210mm_lane7.txt_SN603_WA003:7:1:2036.50:19999.80#0/1
@114222772_110210mm_lane7:7:1:2036.50:19999.80#0/1
@txt_SN603_WA003:7:1:2036.50:19999.80#0/1
@7.t:7:1:2036.50:19999.80#0/1
@HWUSI-EAS1758R:18:62RU0AAXX:5:1:1030:952 1:N:0:
@EAS139:136:FC706VJ:2:1101:15343:197393 1:Y:18:ATCACG
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
@HWI-ST938:49:B01D4ACXX:5:1101:2169:194234
@HWI-EAS209_0025_FC427:6:1:1041:14884#ACAGTG/2
@M00143:1:000000000-A0AH7:1:1:16598:1320 1:N:0:1

We are aware of three further variants that SolexaQA does not currently support:

@8:1:670:774
@USI-EAS45-42B86AAXX:7:1:1793:1105:0:1:0
@SRR797058.1 HWI-ST600:227:C0WR4ACXX:7:1101:16297:2000/1

Note that these variants have no unique identifiers to distinguish between them, and considered together, there is no way to identify the correct tile number for all of these variants (either 1 or 1101 in the examples above). Consequently, the ability of the SolexaQA package to analyze all FASTQ variants ever released by Illumina is limited.

Data from the Sequence Read Archive (SRA) is a special case. SolexaQA will run these files if extraneous characters added by the SRA are stripped out ("SRR797058.1 " in the example above).