Completed on 7 Jun 2013 by Aaron Darling .
Login to endorse this review.
The pdf of this review as well as associated scripts are available from here: https://figshare.com/articles/Review_of_bwa_mem/714096
This is a review of Heng Li “Aligning sequence reads, clone sequences and assembly contigs with BWAMEM”, v2, available from http://arxiv.org/abs/1303.3997
The review was written by Aaron Darling and presented at the Bioinformatics Lunch Bytes meeting at the University of Technology Sydney.
(First of all, I am really grateful to Aaron Darling for his open review. I concur with him in that open review, at least optional open review, helps to improve the peer-review system. The following is my response. Part of it is quite informal, meaning that I would not say it in a formal response, but sometimes what the authors are really thinking behind the words may be also interesting.)
Review of the manuscript:
The manuscript introduces the “mem” module of bwa, a read mapping and alignment program that is widely used in the high throughput sequence analysis field. The manuscript is written in the bioinformatics app note format, which has a 2 page limit. Even though the page limit is highly restrictive the manuscript manages to include some helpful introduction and discussion. I found the methods section to be lacking important detail though (see below).
1. Why does the reseeding approach work? The manuscript does not give any insight into why this helps improve accuracy. The issue seems to be that the longest matches (the SMEM) for a query sequence might not be part of the true alignment. By allowing shorter matches to be alignment seeds it can still find the true alignment in the occasional case where the longest MEM matches the wrong place. That makes sense, but why require
higher multiplicity and restrict only to matches covering the middle base in the SMEM? Also, why is the reseeding threshold 28nt? I can wager a guess that this seemed like a good speed/accuracy tradeoff on a particular dataset but it would be nice to have some idea of whether this is a good heuristic for a wide range of datasets or if it is really tuned for, e.g. human genomes. Sequence analysis has too many magic numbers already. Finally, how frequently is the SMEM not part of the true chain in real data? One way to evaluate this would be to test accuracy with the reseeding disabled.
As Aaron has noticed, reseeding is to improve accuracy in case the optimal hit does not contain any SMEMs. Given 100bp simulated reads, about 0.1% of perfect full-length matches are wrong. If we find SMEMs only, we can only find the perfect matches and thus we can hardly break the 0.1% mark. When we start to find short seeds, we will know most of these reads also have many 1-mismatch hits and assign more proper mapping quality to them.
By searching higher multiplicity covering the middle base of a SMEM, bwa-mem guarantees to find all seeds no shorter than half of the length of the SMEM. It is very rare that the optimal mapping contains seeds less than half of SMEMs (though this still happens). Bwa-mem uses at least 19bp seeds by default. It does reseeding when the SMEM is longer than 19*1.5. This is where 28bp comes. It is a heuristic. Mappers that cannot guarantee to find all optimal hits have something similar.
Bwa-mem can be tuned to only reseed when the optimal hit is a perfect match (option -r100). In this mode, it is about 80% faster than the default (i.e. nearly halved run time). The single-end accuracy is obviously lower. The paired-end is a little lower.
2. How does the seed ranking work? The manuscript says “We rank a seed by length of chain it belongs to and then by the seed length” but this plain language description lacks the precision required to reproduce the method. Is chain length defined as the span of the chain in the query, the reference sequence, the sum of constituent seed lengths or something else? This should be explained.
The length of a chain is the span on the read or on the reference, whichever is shorter. For the vast majority of reads, seeds are non-overlapping on both the read and the reference. (Informal note: chaining has more "magic numbers" about how long to chain. These numbers have little effect on speed/accuracy, but we have to make an arbitrary choice.)
3. The program evaluation is quite limited, in that only a single substitution & indel rate is used to compare aligners. Comparing a wide range of sequence divergences is a pretty substantial computational undertaking so it’s ok to keep the benchmarking simple, but the interpretation of results needs to be strongly qualified in the manuscript. I am specifically concerned that the SMEM based alignment seeding strategy will fall apart at higher
substitution or indel error rates because it depends on finding long exact matches. What happens to reads with dense polymorphisms or sequencing data generated on instruments with high error rates such as the Pacific Biosciences instrument?
The rate of mis-seeding is determined by the chance of finding a 19bp exact seed on a read. Even if the error rate reaches 10%, there will still be a very high chance to see a 19bp seed given a 1000bp read. The simulated data set uses high error rate and evenly distributed errors. It is already harder than typical human resequencing data. On real Illumina human data, the bwa-mem seeding should be an even less concern.
I have not tried bwa-mem on PacBio reads, and I doubt it will perform well. Bwa-mem is not designed with PacBio in mind. Bwa-sw may be more accurate on PacBio data.
4. There is one glaring omission in the comparison set of aligners and that is LAST (http://last.cbrc.jp). LAST can do some clever things such as quality aware gapped alignment scoring and approximate Bayesian aligned read pairing, in addition to long sequence alignment. If there is some reason that a comparison to LAST isn’t possible or appropriate it should be stated so rather than leaving us to wonder about it.
I have tested LAST before the submission. You can find the timing and the evaluation here. In general, LAST is compared favorably to others, but it is less accurate than bwa-mem on my data set. The key reason that I have not included in the ROC-like plot is its questionable implementation. LAST itself is reasonably fast. However, it generates a huge MAF file (for bacteria, this is not a problem) and its MAF-to-SAM generator is slower than alignment. If LAST were able to output the final SAM directly, I would recommend it and include it in the plot. Right now, it is a good mapper for bacteria, but not for mammals.
(As an informal note, when I review manuscripts on NGS mappers, I almost never ask the authors to include additional mappers; when I do raise a request, I usually mark it as a minor/optional comment. Having a couple of more mappers of similar functionality helps little on the overall quality of such manuscripts except satisfying my own curiosity. As to this Application Note, it already evaluates more mappers than most other papers on NGS mapping.)
1. There are numerous minor grammatical and spelling errors which could be easily fixed and would improve readability
I would love to fix these grammatical errors.
2. There seems to be a single mismatch penalty used for the gapped alignment, however, some nucleotide substitutions happen much more frequently than others. At a minimum the ability to treat transitions differently than transversions would be nice, though being able to use a 4x4 substitution matrix (and/or estimating it directly from the data) would be ideal.
For typical resequencing projects, especially for human resequencing, the sequencing error rate is much higher than the rate of variations. It is not necessary to distinguish transitions and transversions. For cross-species alignment where variations are frequent, I agree specifying a scoring matrix is preferred. But even in that case, it will have little effect on mapping accuracy (more on alignment accuracy).
3. No mention is made of whether or how the per base quality estimates produced by sequencing instruments are used during gapped alignment.
Base quality is not used because base quality is frequently inaccurate and it does not seem to bring a substantial improvement on accuracy. More importantly, mismappings caused by sequencing errors tend not to be recurrent; mismappings caused by variants tend to be recurrent, but considering base quality does not help. Only recurrent mismappings have impact on downstream analyses.
4. The value of the “SW rescuing” approach is unclear. The manuscript says the 2nd best SW score is recorded what happens when there are many equally good alignments? The alignment with the 2nd best score might be minimally different. Or is this referring to the best alignment to a different region of the reference sequence? If so, the explanation needs to be improved.
SW is applied to a small window only, typically a few hundred bp. The 2nd best score is the suboptimal score in this window. It helps to identify local tandem repeats.
5. What is a “standing seed” in section 3 paragraph 2?
Seeds that are likely to be on the optimal alignments.
6. The discussion comment that “only bwa mem scales well to large genomes” is misleading since there is a well developed body of literature on methods and software for pairwise and multiple genome alignment covering large genomes and some of them scale very well to large genomes. bwa mem and nucmer are not the only two programs in this space.
The full sentence is "Note that although nucmer is faster in the evaluation, only BWA-MEM scales well to large genomes". It is a little clearer that we are comparing nucmer and bwa-mem here. Nonetheless, I agree it would be good to clarify.
1. The description of how pairing scores are calculated was pleasantly understandable (to me, anyway) and succinct. One question: if each read in a pair has many possible alignment locations then a large number of possible combinations arise. Are all of these scored or does bwa bound the search somehow?
As I understand, LAST essentially takes this approach for pairing, too. When pairing, bwa-mem sorts all hits from both ends together. It then does a linear scan over the list with a window whose size is determined by the insert size distribution. It is not necessary to inspect all pairs.
2. “is more performant” > performs better
It will be changed.
Review of the software:
I obtained bwa0.7.5a from http://sourceforge.net/projects/biobwa/files/ After downloading I used the following commands to build and run the program:
>tar xjf bwa0.7.5a.tar.bz2
>cp bwa $HOME/bin/
Running the software with no arguments produces a help message with enough detail to get started, and running the mem subcommand without arguments gives all the options for bwa mem alignment. I was pleased to see the fine grain control over the alignment parameters available for power users, though I’m sure 99% of users will stick with defaults here because it can be very hard to determine sensible settings for these parameters.
One parameter question: the manuscript describes a Z dropoff which is like BLAST’s Xdrop but allows the gap in either the reference or query to be penalty free. The program help for bwa mem has:
-d INT off-diagonal X-dropoff 
Is this the same thing? If so it would help to improve the variable naming consistency.
>bwa index C_botulinum_F_Langeland.fasta
>bwa mem C_botulinum_F_Langeland.fasta cbot_mem1.fq cbot_mem2.fq > cbot_mem.sam
Based on my comments above, I decided to run my own comparison of bwa to lastal with the approximate Bayesian paired read mapping module. To do this I simulated paired end reads from the Clostridium botulinum F Langeland genome with artsim:
>art_illumina --paired -l 100 -f 1 -m 400 -s 50 -i C_botulinum_F_Langeland.fasta -o cbot_mem
The above command generates about 1x simulated coverage of the C. botulinum reference. I then ran bwa mem as above. last193 was run as (paramters taken verbatim from their documentation at http://last.cbrc.jp/doc/lastpairprobs.html):
>time lastal -i1 -e120 -Q1 cbot_last cbot_mem1.fq > cbot_lastal1.maf
>time lastal -i1 -e120 -Q1 cbot_last cbot_mem2.fq > cbot_lastal2.maf
>time last-pair-probs.py cbot_lastal1.maf cbot_lastal2.maf > lastal_pairs.maf
>time maf-convert.py sam lastal_pairs.maf > lastal_pairs.sam
accuracy was evaluated using a short perl script (see file bwa_mem_eval.tar.bz2 in figshare):
>./accuracy.pl cbot_mem.sam < cbot_mem.both.aln
precision: 0.992127563556135 recall: 0.998284561049445
>./accuracy.pl lastal_pairs.sam < cbot_mem.both.aln
precision: 1 recall: 0.988837162737148
So lastal has remarkably good precision but aligns slightly fewer of the reads with the recommended settings than bwa mem. If we turn these numbers into an F1 score (http://en.wikipedia.org/wiki/F1_score) we get lastal: 0.98883 bwamem: 0.99519 so if you like the balance of precision and recall provided by F1, bwa mem seems like a good choice. For things like identifying rare variants, I will probably stick with lastal when possible since the
occasional misalignments could potentially end up creating a lot of rare variants. It might also be possible to improve lastal’s recall with a lower evalue threshold, I did not explore this.
This evaluation has a flaw: it includes mapQ=0 mappings from bwa-mem but not from LAST (because LAST does not report them by default). Most mapQ=0 mapping are wrong. A better way is:
./accuracy.pl <(awk '/^@/||$5>=5' cbot_mem.sam) < cbot_mem.both.aln
precision: 0.99987314474185 recall: 0.986507797441738
./accuracy.pl <(awk '/^@/||$5>=30' cbot_mem.sam) < cbot_mem.both.aln
precision: 1 recall: 0.983455974370526
At threshold mapQ=5, bwa-mem gives 5 mismapping, including two pairs and one end, all with mapQ<30. For one pair, bwa-mem chooses the position with fewer mismatches. For the other pair, bwa-mem clips the alignment because there are excessive mismatches (~7 in 33bp). If we reduce the mismatch penalty (e.g. with -B2), the simulated position will be recovered. The single end mismapping is also caused by excessive mismatches (18 mismatches in 100bp). Bwa-mem could not find a 19bp seed. Reducing seed length helps to recover the simulated position.
In all, LAST indeed performs better on this data set, but only marginally. Bwa-mem gives 5 mismappings at Q5 because 1) the simulated sequencing error rate is very high, about 8.5% according to the LAST alignment, with errors about evenly distributed along reads. Using shorter seeds and smaller mismatch penalty is preferred. 2) LAST makes use of base quality. I speculate that it does not report the first mismapped pair because critical mismatches are on low quality bases. In practice, mismappings caused by sequencing errors tend not to recur. Such mismappings usually do not lead to false variant calls.
Furthermore, on my data set, bwa-mem is the more accurate. For the most reliable 180k mappings, bwa-mem makes no mistakes, while LAST gives 105 mismappings; at 195k, bwa-mem reports 182 mismappings, while LAST 7802 (novoalign 63).
One difficulty I encountered in benchmarking these is an apparent bug in artsim where it reports the location of one read in each pair seemingly at random, while the other read’s location was given correctly. I introduced some slop into the accuracy script to work around this, in particular that an alignment would be called correct if the mapped location was within 550nt of the true location.
Review of the code:
The program is implemented in C, with a minimalist but functional Makefile provided. I only evaluated the bwamem part of the code. The header file bwamem.h is well documented, with understandable comments describing what the variables, structures, and (most) function interfaces are doing. Use of doxygen structured documentation is nice. The implementation in bwamem.c has much less code commenting. Superficially it seems relatively modular with sensible enough function and variable names that a seasoned and motivated C coder could understand and modify it. I didn’t notice any obvious coding faux pas such as large copied blocks of code, although a few smaller instances could be refactored into functions to remove duplication. In a thorough code review I would probably suggest denser commenting and unpacking or at least description of gems like this:
>if (which && ((p>cigar[p>n_cigar1]&0xf) == 4 || (p>cigar[p>n_cigar1]&0xf) == 3)) qb += p>cigar[p>n_cigar1]>>4;
One little annoyance is the use of hardcoded hexadecimal constants in mem_aln2sam(). While their meaning may be obvious to the creator of the SAM format it is not very readable for the rest of us. On the whole though it looks pretty good. Better than the code I write (though I can not claim to be a model coder)
(I really appreciate that Aaron has evaluated bwa-mem by himself. In my view, if a reviewer wants to reject a manuscript on mappers/assemblers or to give very negative reviews, he/she should at least try the tool to make sure the negative words fully justified. It only takes hours to reject a manuscript written in months. Comments leading to a rejection need to be taken seriously. Unfortunately, very few reviewers run tools nowadays. Admittedly, even I have given very negative reviews once or twice without running the tools.)