Last update: 18th January 2026.
This is a very simple code in Java which implements a single iteration of the common cluster reduction on two rooted binary phylogenetic trees, for computing a rooted maximum agreement forest.
The code has one optional command line switch, -v, which produces slightly more verbose output. Whether you use -v or not, the code produces many comments, which are lines prefixed by //. There will only be 8 lines of output that are not comments, and these are the 4 pairs of trees discussed below. (Note that if there is no non-trivial cluster to be reduced, the code reports that).
I have produced this code because for people not used to working in phylogenetics, there can be quite some confusion about the cluster reduction. That's because it comes in various different guises (for different variations of the agreement forest problem) and because part of the literature that looks at the rooted maximum agreement forest problem on two rooted binary phylogenetic trees, studies it through the lens of the almost equivalent problem of computing rooted subtree prune and regraft distance. The two problems are indeed almose entirely equivalent, except for the addition of an extra taxon \rho which helps deal with a situation concerning the root which can cause the two optima to differ by exactly 1.
Here, I look at the problem purely through the lens of the rooted maximum agreement forest problem, not rooted subtree prune and regraft, and so there is no \rho. Yes, this is the version of the problem considered in the PACE2026 contest (heuristic and lower bound tracks).
The code/functionality is fairly self explanatory. Note that I consider this only a `pedagogical' implementation because in a full solving framework you might be apply the cluster reduction not just once, but multiple times/recursively.
Here is an example of an execution trace. Suppose t1.txt contains the following two trees. Note that they have the common cluster {3,4,5,6}.
((1,2),(3,(4,(5,6)))); ((1,2),((4,6),(3,5)));We type:
java -jar ClusterReduction < t1.txt // ClusterReduction.jj Version 1.0 by Steven Kelk, January 9th 2026 // Applies the cluster reduction ONCE, preferring a common cluster // that has size closest to (n/2) where n is the number of taxa. // Note that it ignores clusters of size 1, 2, n-1 and n. // =============================================================== // Original trees: // ((1,2),(3,(4,(5,6)))); // ((1,2),((4,6),(3,5))); // =============================================================== // Most balanced common cluster had size 4 // =============================================================== // Tree pair 1: the bottom part of the trees with a taxon to mark the root of the cluster ((3,(4,(5,6))),X_DOWN); (((4,6),(3,5)),X_DOWN); // Tree pair 2: the bottom part of the trees WITHOUT a taxon to mark the root of the cluster: (3,(4,(5,6))); ((4,6),(3,5)); // Tree pair 3: the top part of the trees with a placeholder taxon for the cluster. ((1,2),X_UP); ((1,2),X_UP); // Tree pair 4: the top part of the trees WITHOUT a placeholder taxon for the cluster. (1,2); (1,2); // =============================================================== // Cheat sheet: // Let K = OPT(TP2) + OPT(TP4) where TP means 'tree pair'. // Is [OPT(TP1) = OPT(TP2)] AND [OPT(TP3)=OPT(TP4)]? Then OPT = K-1. // Else, OPT = K. // =============================================================== // Here, the OPTs are the minimum number of components in a rooted agreement forest.As you will have noticed: the code produces 4 new pairs of trees and, yes, in the worst case you have to solve all 4 of them in order to conclude what the size of the optimal solution of the original trees is. This sounds onerous but as with all divide-and-conquer strategies (for parameterized/exponential-time algorithms) the hope is that the 4 smaller instances all have significantly smaller optima, thus ensuring an exponential speed up even though you have to solve 4 instances rather than 1. The high-level idea is that a solution of size K = OPT(T2) + OPT(T4) is always possible, but that it is sometimes possible to reduce this by 1. This reduction is possible if and only if the introduction of the two placeholder taxa X_UP and X_DOWN does not cause the optimal solutions to T2 and T4 to grow in size. In formally, this is all the same as asking: does the original pair of trees have an optimal solution that contains a component that has leaves both inside and outside the common cluster? For much more detailed background on the cluster reduction I refer you to...
@article{linz2011cluster,
title={A cluster reduction for computing the subtree distance between phylogenies},
author={Linz, Simone and Semple, Charles},
journal={Annals of combinatorics},
volume={15},
number={3},
pages={465--484},
year={2011},
publisher={Springer}
}
and
@inproceedings{li2017computing,
title={Computing maximum agreement forests without cluster partitioning is folly},
author={Li, Zhijiang and Zeh, Norbert},
booktitle={25th Annual European Symposium on Algorithms (ESA 2017)},
pages={56--1},
year={2017},
organization={Schloss Dagstuhl--Leibniz-Zentrum f{\"u}r Informatik}
}
Note that Chris Whidden in his rSPR solver makes use of the
cluster reduction to great effect! See the 'software' section at the
foot of Chris' homepage here.
Final comment. Note that by default the code calls the placeholder taxa X_UP and X_DOWN. If you apply the code iteratively/recursively, to one of the 4 pair of trees, it will happen at some point that you apply it to a tree pair that already contains the taxon X_UP and/or X_DOWN. In that case it uses X_UP_k and X_DOWN_k as the placeholder taxa, selecting k as small as possible to avoid conflict with other taxa that have the prefix X_UP or X_DOWN.
Why does this all work? A sketch proof is given below that communicates the main ideas. Note that none of these proofs are original: they are simply a re-articulation of ideas that appear in Linz and Semple (see the references above) and elsewhere. Apart from basic facts about agreement forests, the argument below is pretty much self-contained. Recall that we define, K = OPT(TP2) + OPT(TP4) Our goal is to prove the correctness of the following cheat sheet. // =============================================================== // Cheat sheet: // Let K = OPT(TP2) + OPT(TP4) where TP means 'tree pair'. // Is [OPT(TP1) = OPT(TP2)] AND [OPT(TP3)=OPT(TP4)]? Then OPT = K-1. // Else, OPT = K. // =============================================================== // Here, the OPTs are the minimum number of components in a rooted agreement forest. Let X be the set of taxa in the original trees. Let C be the taxa in the common cluster. Let F be an agreement forest of the original pair of trees. We say that a component B of F `bridges' C if B contains at least one taxon from C and at least one taxon from outside C, i.e. at least one taxon from (X \setminus C). We say that F is `bridging' for C if it contains a component that bridges C. (Note that at most one component of F can bridge C). We write |F| to denote the number of components in F. * Fact 1. OPT <= K Argument: Any agreemeent forest for TP2 and any agreement forest for TP4 can be combined to obtain an agreement forest for the original pair of trees. * Fact 2. If there exists an optimal agreement forest for the original pair of trees that is NOT bridging for C, then OPT = K. Argument: From Fact 1 we have OPT <= OPT(TP2) + OPT(TP4) = K. To see that equality holds, observe that any optimal solution to the original pair of trees, that is not bridging for C, necessarily induces optimal solutions to TP2 and TP4 individually. * Fact 3. Consider an agreement forest F for the original pair of trees that is bridging for C. Let B be the bridging component. Let p be the number of components that are entirely contained inside C. Let q be the nunber of components that are entirely contained inside (X \setminus C). (So |F| = 1 + p + q). By splitting B we obtain an agreement forest for TP2 with (p+1) components and an agreement forest for TP4 with (q+1) components. Similarly, by splitting B we obtain an agreement forest for TP1 with (p+1) components and an agreement forest for TP3 with (q+1) components. Hence, OPT(TP2) <= (p+1), OPT(TP4) <= (q+1), OPT(TP1) <= (p+1) and OPT(TP3) <= (q+1). * Fact 4. OPT >= K-1. Argument: if Fact 2 holds, we are done because OPT=K. Otherwise, fix any optimal agreement forest for the original pair of trees, this is necessarily bridging for C. We write OPT = 1 + p + q, in the sense defined by Fact 3. Then, by Fact 3, OPT(TP2) <= (1+p) and OPT(TP4) <= 1+q. Hence, K = OPT(TP2) + OPT(TP4) <= p + q + 2 <= OPT+1, so OPT >= K-1. From Fact 1 and Fact 4, we have that OPT is either K or K-1. So, if we can prove the following statement, we will have established the correctness of the cheat sheet. Let us call this the main claim: Main claim: [OPT(TP1) = OPT(TP2)] and [OPT(TP3)=OPT(TP4)] (if and only if) OPT = K-1. Before proving the main claim, a few more facts are useful. * Fact 5. OPT(TP2) <= OPT(TP1) <= OPT(TP2) + 1 * Fact 6. OPT(TP4) <= OPT(TP3) <= OPT(TP4) + 1 * Fact 7. Any agreement forest of TP1 (let x denote its size) and any agreement forest of TP3 (let y denote its size) can be combined to obtain an agreement forest of size x+y-1 for the original trees. Argument: you merge the component of TP1 that contains X_DOWN with the component of TP3 that contains X_UP (and then delete the X_UP and X_DOWN taxa, which are not part of the original trees). Now, we prove the main claim. (=>) Suppose OPT(TP1)=OPT(TP2) and OPT(TP3)=OPT(TP4), i.e. that the LHS of the main claim holds. Then from Fact 7 we get OPT <= OPT(TP1) + OPT(TP3) - 1. Combining with OPT(TP1)=OPT(TP2) and OPT(TP3)=OPT(TP4) we obtain OPT <= OPT(TP2) + OPT(TP4) - 1. Combining with Fact 4 establishes OPT = OPT(TP2) + OPT(TP4) - 1. (<=) Suppose OPT = K-1. Then, to avoid contradicting Fact 2, there must exist an optimal agreement forest that is bridging. Write OPT = 1 + p + q in the sense defined by Fact 3. So K-1 = OPT = 1 + p + q. From Fact 3, OPT(TP1) <= (p+1) and OPT(TP3) <= q+1. In fact, OPT(TP1) = p+1 and OPT(TP3)=q+1. Why? Suppose it is not true, because (say) OPT(TP1) < (p+1). Hence, OPT(TP1) <= p. By applying Fact 7 to an optimal solution to TP1 and the size (q+1) solution to TP3, we obtain an agremeent forest for the original pair of trees of size at most p + (1+q) - 1 = p+q. But this is less than 1 + p + q, which we said earlier was equal to OPT. Contradiction! A symmetrical argument holds if OPT(TP3) < (q+1). So, OPT(TP1)=p+1 and OPT(TP3)=q+1. We have K-1 = OPT = 1 + p + q = OPT(TP1) + OPT(TP3) - 1 Hence K = OPT(TP1) + OPT(TP3) So OPT(TP2) + OPT(TP4) = OPT(TP1) + OPT(TP3). From Fact 5 and 6 we have OPT(TP2) <= OPT(TP1) and OPT(TP4) <= OPT(TP3). Combining with the equality in the previous line gives OPT(TP1) = OPT(TP2) and OPT(TP3)=OPT(TP4). We are done, \qed.
Here are a couple of other interesting facts, which are not necessary to prove the cheat sheet, but can be useful depending on which algorithm you use to compute optimal agreement forests. * Fact 8: OPT(TP1) != OPT(TP2) (if and only if) there is an optimal agreement forest for TP1 that contains X_DOWN as a singleton component. * Fact 9: OPT(TP3) != OPT(TP4) (if and only if) there is an optimal agreement forest for TP1 that contains X_UP as a singleton component.