TerminusEst

Update 30th March 2015. I have just uploaded Version 3 of TerminusEst to the source directory. This is only a cosmetic change from V2. Now the default option is that hashing is switched on. To switch it off, you need to specify the -nohash switch. I have done this because various people have been using the code without realising that they have to actively switch hashing on. The practical performance of TerminusEst is analyzed in the article, A practical approximation algorithm for solving massive instances of hybridization number for binary and nonbinary trees by Leo van Iersel, Steven Kelk, Nela Lekic, and Celine Scornavacca, BMC Bioinformatics (2014). (For the theory behind the algorithm I refer to the original TCBB article, see below).

Update 9th September 2013. I have just uploaded Version 2 of TerminusEst to the source directory. The main difference from the original (i.e. 31 august 2012) version is that if you specify the -hash switch on the command line, TerminusEst now uses a look-up table to store intermediate solutions. This improves the performance of the algorithm considerably (because it ensures that the algorithm does not explore the same branch of the search tree twice) but at the expense of exponential memory use. If you do not specify the -hash switch, Version 2 performs identically to the original version. Note also that the output format of Version 2 is slightly changed, to include running-time information and suchlike.

Original, 31st august 2012 version. This is TerminusEst, an experimental implementation of the algorithm described in the article A simple fixed parameter tractable algorithm for computing the hybridization number of two (not necessarily binary) trees by Teresa Piovesan and Steven Kelk (TCBB, 2013). This current version is only a proof-of-concept, and has not been particularly optimized (but it does have a genuinely FPT running time). Also, if you discover any bugs then please report them to us, as the code is still under development.

The algorithm takes an input consisting of two not-necessarily binary trees on the same set of taxa X and produces a binary network with minimimum hybridization number that displays binary refinements of both the trees in the input. (As described in the article, the fact that the output is binary does not sacrifice generality). The algorithm is a faithful implementation of the algorithm described in the article, with the exception that it additionally checks for the existence of two conflicting size-2 clusters, allowing it to restrict its guess (at that iteration) to the three taxa involved.

Download the latest version of the code from here. Compile and run it like this (you'll need the JavaCC parser and the JavaC compiler installed):

javacc TerminusEst.jj
javac TerminusEst.java
java TerminusEst < twotrees.txt

...where twotrees.txt is a file containing two trees, in newick format, e.g.

((a,b,g,d),((c,e),f));
((a,b),(c,d),(e,f,g));

If you specify the -nonetwork option it only computes hybridization number, it doesn't compute a network.

If the whole thing seems to have hung it's probably because you are not using the '<' operator on the command line: the code reads from stdin (i.e. the standard input).

The output should look something like this:

// This is TERMINUSEST, version 31st august 2012.
// Usage: java TerminusEst < treeFile.txt
// (If the program seems to have hung, you have probably
// forgotten the '<' operator).
// -------------------------------------
// We saw 7 taxa in total.
// Trying r=0
// Trying r=1
// Trying r=2
// -----------------------------
// HYBRIDIZATION NUMBER = 2
// -----------------------------
strict digraph G0 {
1000 [shape=point];
1001 [shape=point];
1002 [shape=point];
1003 [shape=point];
1004 [shape=circle, width=0.3, label="a"];
1005 [shape=circle, width=0.3, label="b"];
1006 [shape=point];
1007 [shape=circle, width=0.3, label="d"];
1008 [shape=point];
1009 [shape=circle, width=0.3, label="g"];
1010 [shape=point];
1011 [shape=point];
1012 [shape=circle, width=0.3, label="f"];
1013 [shape=point];
1014 [shape=circle, width=0.3, label="e"];
1015 [shape=point];
1016 [shape=circle, width=0.3, label="c"];
1003 -> 1004
1003 -> 1005
1002 -> 1003
1006 -> 1007
1006 -> 1015
1002 -> 1006
1001 -> 1002
1008 -> 1009
1001 -> 1008
1000 -> 1001
1011 -> 1012
1013 -> 1014
1015 -> 1016
1013 -> 1015
1011 -> 1013
1010 -> 1011
1010 -> 1008
1000 -> 1010
}
strict digraph G1 {
1000 [shape=point];
1001 [shape=point];
1002 [shape=point];
1003 [shape=point];
1004 [shape=circle, width=0.3, label="a"];
1005 [shape=circle, width=0.3, label="b"];
1006 [shape=point];
1007 [shape=circle, width=0.3, label="d"];
1008 [shape=point];
1009 [shape=circle, width=0.3, label="g"];
1010 [shape=point];
1011 [shape=point];
1012 [shape=circle, width=0.3, label="f"];
1013 [shape=point];
1014 [shape=circle, width=0.3, label="e"];
1015 [shape=point];
1016 [shape=circle, width=0.3, label="c"];
1003 -> 1004 [color=blue]
1003 -> 1005 [color=blue]
1002 -> 1003 [color=blue]
1006 -> 1007 [color=blue]
1006 -> 1015 [color=red]
1002 -> 1006 [color=blue]
1001 -> 1002 [color=blue]
1008 -> 1009 [color=blue]
1001 -> 1008 [color=blue]
1000 -> 1001 [color=blue]
1011 -> 1012 [color=blue]
1013 -> 1014 [color=blue]
1015 -> 1016 [color=blue]
1013 -> 1015 [color=blue]
1011 -> 1013 [color=blue]
1010 -> 1011 [color=blue]
1010 -> 1008 [color=red]
1000 -> 1010 [color=blue]
}
strict digraph G2 {
1000 [shape=point];
1001 [shape=point];
1002 [shape=point];
1003 [shape=point];
1004 [shape=circle, width=0.3, label="a"];
1005 [shape=circle, width=0.3, label="b"];
1006 [shape=point];
1007 [shape=circle, width=0.3, label="d"];
1008 [shape=point];
1009 [shape=circle, width=0.3, label="g"];
1010 [shape=point];
1011 [shape=point];
1012 [shape=circle, width=0.3, label="f"];
1013 [shape=point];
1014 [shape=circle, width=0.3, label="e"];
1015 [shape=point];
1016 [shape=circle, width=0.3, label="c"];
1003 -> 1004 [color=blue]
1003 -> 1005 [color=blue]
1002 -> 1003 [color=blue]
1006 -> 1007 [color=blue]
1006 -> 1015 [color=blue]
1002 -> 1006 [color=blue]
1001 -> 1002 [color=blue]
1008 -> 1009 [color=blue]
1001 -> 1008 [color=red]
1000 -> 1001 [color=blue]
1011 -> 1012 [color=blue]
1013 -> 1014 [color=blue]
1015 -> 1016 [color=blue]
1013 -> 1015 [color=red]
1011 -> 1013 [color=blue]
1010 -> 1011 [color=blue]
1010 -> 1008 [color=blue]
1000 -> 1010 [color=blue]
}
// ((((a,b),(d,#H2)),(g)#H1),((f,(e,(c)#H2)),#H1))root;
// First tree displayed by network!
// Second tree displayed by network!
// -----------------------------
// HYBRIDIZATION NUMBER = 2
// -----------------------------

The output is in dot format and can be visualized by feeding it to the Linux command of the same name, or any software capable of supporting the dot format (see the DOT wikipedia page). Note that in dot format all lines that begin with with // are ignored.

The second and third networks show how the trees are displayed. The file here output.pdf shows an example of what the output from the program dot should look like.

Just before the end of the output there is also a line of eNewick produced, which can be viewed (for example) with Dendroscope or at the NetTest Server of Arenas et al.

Technical note, August 2013: When the input consists of two binary trees the code can easily be adapted to give a running time of (2^r).(r!).poly(n). The details are in the MSc thesis of Teresa Piovesan (2012), available on request. The core idea is to include only terminals that occur in size-2 clusters, as potential guesses.

Why the name TerminusEst? Because of the use of terminals in the underlying algorithm. It is also a nod to the Book of The New Sun by Gene Wolfe - my favourite book.