CLUSTISTIC (and even TREETISTIC)

CLUSTISTIC

Last updated: 13th January 2012

CLUSTISTIC is an algorithm that, given a set T of binary trees on the same set of taxa, constructs (all) networks with a minimum number of reticulations that represent Cl(T), the set of clusters induced by the trees in T. It is based on ideas presented in the article On the elusiveness of clusters by Kelk, Scornavacca and Van Iersel. At the moment CLUSTISTIC assumes that the taxa have names 1,...,n where n is the number of taxa.

The main feature of CLUSTISTIC is that it is built on top of the algorithm SIMPLISTIC, which constructs networks from rooted triplets. It is an exercise in demonstrating how software developed for one piecewise construction method (triplets) can easily be "bootstrapped" to solve a similar problem under a different method (clusters). Indeed: it took only a few hours of coding to transform SIMPLISTIC into CLUSTISTIC, demonstrating how exploiting similarities between models can facilitate rapid software prototyping.

CLUSTISTIC version 0.00001 can be downloaded here. This version requires Java, a UNIX shell and a program that can convert DOT files into images. A platform-independent version is forthcoming. To test the code try typing,

./CLUSTISTIC.SH four.tree

where four.tree is a file (included in the distribution) containing four binary trees in Newick format. These are the four binary trees that break CASS. CLUSTISTIC should conclude that a network with at least 3 reticulations is required to represent these clusters, and then stop. The algorithm should produce the following output (where the datestamp will probably differ from what is shown here):

---------------------------------------------
  Welcome to CLUSTISTIC, build 1st Dec 15:55
---------------------------------------------
These are the trees that we will work with:
((((5,1),(6,7)),8),(2,(3,4)));
(((5,8),(6,7)),(2,(4,(1,3))));
(((1,2),(3,4)),(8,(7,(6,5))));
((((5,1),(3,4)),2),(8,(6,7)));
---------------------------------------------
First we will convert the trees to a set of clusters...done.
Now we will convert the trees to a set of triplets...done.
Running main CLUSTISTIC algorithm...full output will be copied to file four.tree                            
.net
// CLUSTISTIC: Going to try construct a network with exactly 1 reticulations.
// CLUSTISTIC: Couldn't find any networks with exactly 1 reticulations.
// CLUSTISTIC: Going to try construct a network with exactly 2 reticulations.
// CLUSTISTIC: Couldn't find any networks with exactly 2 reticulations.
// CLUSTISTIC: Going to try construct a network with exactly 3 reticulations.
// CLUSTISTIC: We found a network!
// CLUSTISTIC: Already found a network, so stopping.
Done
The network(s) can be viewed by running the following command (under UNIX):
dot -Tps four.tree.net -o four.tree.net.ps

If one supplies the optional extra argument --all, i.e.

./CLUSTISTIC.SH four.tree --all

CLUSTISTIC will generate all the networks with 1 reticulation that represent the input, then all networks with 2 reticulations that represent the input, then ... 3 ... and so on. After generating all networks with k reticulations CLUSTISTIC will interactively ask the user if he/she wishes to continue looking for solutions with (k+1) reticulations.

Note that, to ensure consistency with the definition of phylogenetic network given in the accompanying article, CLUSTISTIC will not generate solutions with multi-edges. (It will, however, generate solutions with more complicated redundant substructures as long as these fit into the definition of a phylogenetic network).

TREETISTIC: Computing hybridization number for two or more binary trees

Given that CLUSTISTIC generates all networks (with a given number of reticulations) that represent all the clusters in the input trees, it can also be used to generate all networks (with a given number of reticulations) that display the input trees themselves. In other words CLUSTISTIC can be modified to compute hybridization number for two or more binary trees, and generate all hybridization networks with minimum reticulation number. One simply has to discard all the solutions produced by CLUSTISTIC, which do not display the input trees. (Of course, the code will be very slow, but it might be useful to compare its output to future programs which generate minimum and/or all hybridization networks).

Here is the code (early 2012): TREETISTIC.
Here is a slightly more updated version (nov 2012): TREETISTIC.

As I state above, this code is only experimental. I would only suggest using it for trees with a very small number of taxa. An algorithm based on essentially the same idea could be made much faster simply by bypassing the reduction to triplets/clusters and coding directly for trees. Optimizing the data structures would also help enormously. At the moment TREETISTIC takes time approximately 2^{r}.poly(n).n^{3r} to compute the hybridization number, where r is the hybridization number, n is the number of taxa and poly(n) is a polynomial that does not depend on r. So it is polynomial for fixed r, irregardless of the number of input trees. An easy way to speed things up a bit would be to make the poly(n) term as small as possible; TREETISTIC has an unnecessarily large polynomial here (because of the rather inefficient reduction to triplets and clusters). Any comments/questions, feel free to contact me. (9 feb 2012)