The fastq File Format
Fastq files are pervasive in the sequencing world and are quite useful. Some may even say elegant. A single Illumina read entry might look something like this:
For those unfamiliar, the format is as such:
1. Sequence information line (beginning with @)
   - K00358 The sequencer id
   - 46 The run id
   - HHW3GBBXX The flowcell id
   - 8 The lane id (each flowcell has 8 lanes)
   - 1101 The tile id (each lane has 50-60 some tiles)
   - 2686 x-coordinate of the read within the tile
   - 1297 y-coordinate of the read within the tile
   - 1 The read pair. Forward (1); Reverse (2)
   - N Is the read filtered out? Yes (Y) or No (N)
   - 0 control bit (don't worry about this)
   - ACAGTGGT The sample index assigned during DNA preparation
2. The DNA sequence (may be forward or reverse complemented relative to the reference genome/sequence)
3. A + sign, optionally followed by the sequence information line again
4. The quality scores, where each character represents a quality value that corresponds to the base in the same position of line 2. The higher a quality value, the more confident the sequencer was in calling the corresponding base
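The header fields listed above can be pulled apart in a few lines of Python. This is only a minimal sketch: it assumes the usual Illumina layout, in which the fields are joined by colons and the two groups are separated by a single space. The header string is reassembled from the example values in the list, and `IlluminaHeader` and `parse_header` are names made up for illustration.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    """Fields of an Illumina-style sequence information line (hypothetical helper)."""
    instrument: str
    run_id: int
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    read: int
    filtered: bool
    control: int
    index: str


def parse_header(line: str) -> IlluminaHeader:
    """Split an '@' line into named fields, assuming the colon-separated layout."""
    if not line.startswith("@"):
        raise ValueError("a fastq header line must start with '@'")
    # Assumption: one space separates the location half from the read half.
    location, description = line[1:].strip().split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x, y = location.split(":")
    read, filtered, control, index = description.split(":")
    return IlluminaHeader(
        instrument, int(run_id), flowcell, int(lane), int(tile),
        int(x), int(y), int(read), filtered == "Y", int(control), index,
    )


# Header rebuilt from the example field values in the list above.
header = parse_header("@K00358:46:HHW3GBBXX:8:1101:2686:1297 1:N:0:ACAGTGGT")
print(header.flowcell, header.lane, header.filtered)  # HHW3GBBXX 8 False
```

Headers from other instruments or older pipelines may not follow this layout, in which case the unpacking raises a ValueError instead of guessing.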
Line 4 is particularly elegant because a single character can represent any value in a range of quality scores (usually between 0 and 41). This is cool because our decimal system requires two digits after the number 9, so writing the scores as decimal numbers would un-align the quality values from the bases and create ambiguity without a delimiter (e.g. is 41 a quality score of 41 for one base or quality scores of 4 and 1 for two bases?). So to keep a single-character representation, we add a fixed offset to each score (33 in the common Sanger/Illumina 1.8+ encoding, which skips the non-printable start of the ASCII table) and write the ASCII character with that code. A score of 0 becomes !, and a score of 41 becomes J.
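To make the remapping concrete, here is a small sketch of the conversion in both directions. It assumes the Phred+33 offset used by Sanger and Illumina 1.8+ files (older Illumina files used a different offset), and the function names and the sample quality string are invented for illustration.

```python
PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ encoding


def decode_quality(quality_line):
    """Turn a line of quality characters into a list of integer scores."""
    return [ord(char) - PHRED_OFFSET for char in quality_line]


def encode_quality(scores):
    """Turn integer scores back into one character per base."""
    return "".join(chr(score + PHRED_OFFSET) for score in scores)


scores = decode_quality("!5;AJ")  # made-up quality string
print(scores)                     # [0, 20, 26, 32, 41]
print(encode_quality(scores))     # !5;AJ
```

Because each score is exactly one character, position i in the quality line always lines up with base i in the sequence line.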
There's a lot going on with the fastq format and you can read more here if you'd like.
That said, the format has two annoyances:
Annoyance #1: Line 3 is usually just the plus sign, and visually you want to use this line as a separator between read entries. But that's not what it is: the quality scores for the current entry are on the next line! Why didn't the fastq file designers reverse the order of these two lines?! Probably because the file specification evolved past its original intention...
Annoyance #2: As mentioned above, quality scores are elegant, but they're also hard to read and visualize. Is ; a higher quality value than A? I can't remember...
A New Visual Quality Score System
The Wikipedia page for fastq shows how to convert between quality characters and quality scores. In order to address annoyance #2, I created a visual bar representation for quality scores, where a taller bar means a higher score. To use it, simply type the following command in your terminal:
sed -e 'n;n;n;y/!"#$%&'\''()*+,-.\/0123456789:;<=>?@ABCDEFGHIJKL/▁▁▁▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇██████/' myfile.fastq
Make sure you replace myfile.fastq with an actual file name, and the command will print the new visual representation to your terminal. You can also redirect the output to a new file by appending
The above example before and after the conversion looks like this:
Now you can easily see the relative quality scores for each base! I call this new format clinto format. Maybe I'll address annoyance #1 someday, but until then, enjoy!
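For anyone whose sed does not handle multi-byte characters in y/// commands, the same mapping can be sketched in Python. This is an illustrative equivalent rather than part of the original one-liner: it assumes Phred+33 quality characters from ! through L and strict four-line records, and the names `to_bars`, `QUALITY_CHARS`, and `BARS`, along with the sample sequence and quality string, are made up for the example.

```python
# Quality characters '!' (score 0) through 'L' (score 43), as in the sed command.
QUALITY_CHARS = "".join(chr(code) for code in range(33, 77))

# One bar per quality character: 8 of the shortest bar, 5 each of the next
# six heights, and 6 of the full block, matching the sed replacement string.
BARS = "▁" * 8 + "▂" * 5 + "▃" * 5 + "▄" * 5 + "▅" * 5 + "▆" * 5 + "▇" * 5 + "█" * 6

TO_BARS = str.maketrans(QUALITY_CHARS, BARS)


def to_bars(lines):
    """Yield fastq lines with every fourth line (the quality line) drawn as bars."""
    for number, line in enumerate(lines, start=1):
        yield line.translate(TO_BARS) if number % 4 == 0 else line


# Made-up record: only the quality line (line 4) is changed.
record = [
    "@K00358:46:HHW3GBBXX:8:1101:2686:1297 1:N:0:ACAGTGGT\n",
    "ACGT\n",
    "+\n",
    "!5AJ\n",
]
print("".join(to_bars(record)), end="")
```

To convert a real file, the same generator can be fed an open file handle in place of the `record` list.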