A practical approximation algorithm for solving massive instances of hybridization number. Current version: 1.00 (7th may 2012), source here.

On this webpage you can find the source code and instructions for the algorithm CycleKiller described in A practical approximation algorithm for solving massive instances of hybridization number by Leo van Iersel, Steven Kelk, Nela Lekic and Celine Scornavacca. (This article is an applied follow-up to the more theoretical article, Cycle killer...qu'est-ce que c'est? On the comparative approximability of hybridization number and directed feedback vertex set). The default mode of operation for CycleKiller is a 2-approximation, the faster mode is a 4-approximation.

What you need

The algorithm itself consists of several Java programs chained together which do some processing themselves and also call several external programs (by other authors). At the current time all the external programs are platform independent in the sense that they are either available for multiple platforms (GLPK) or there is C++ source-code available for them (rSPR). Here we describe setting up Cycle Killer under Linux and Windows, although there is no fundamental reason why it cannot be adjusted to work on other platforms.

To get CycleKiller running you will need:

Putting it all together

The simplest way to get CycleKiller running is to compile everything in the same directory. Specifically:

  1. Compile rspr. We henceforth assume that the executable has (if necessary) been renamed to rspr.
  2. Compile the CycleKiller and NELA Java files, if necessary. To do this you will first need to compile the .jj file as follows: javacc NewickToDFVS.jj. (This dynamically creates several .java files). Then you can compile all the .java files simultaneously with javac *.java.
  3. If you wish you can put the GLPK glpsol binary in the same directory (if you do this make sure you have all the necessary libraries there too) or, alternatively, you can tell CycleKiller where glpsol is on your system. You can do this by adjusting the -glpkpath switch inside the or cylekiller.bat scripts, discussed next. After the -glpkpath switch you should give the path to glpsol, the path should end with the glpsol executable itself. If you don't provide the -glpkpath switch then CycleKiller will simply try and execute glpsol without specifying where it is, on some operating systems this might actually work better than explicitly giving the path.
  4. Download an appropriate script file which glues all the above together. For Linux we have provided and for Windows cyclekiller.bat. Under Linux make sure that the .sh file is executable, using the chmod command if necessary. Note that you should add/adjust/remove the -glpkpath switch, as required (see previous point). At the moment the cyclekiller.bat file includes an example of where glpsol is on my windows system.

Running CycleKiller

Everything should now be ready to go. Here is a simple test file, test_trees.txt comprising two trees on eight taxa. If you type cyclekiller.bat test_trees.txt (under Windows) everything should now spark into life. Note that there will be a huge amount of diagnostic information produced. The most important information is at the very end. The line CORE: RETICULATIONS = 3 tells you that CycleKiller has computed a network with 3 reticulations. In the file test_trees_network.txt there will be a network (in eNewick format) that you can view using Dendroscope. There are some other small examples here, based on the dataset discussed here. (Note that this dataset used to be a ``benchmark'' for the hybridization number problem but algorithmic advances are such that neither CycleKiller nor the most recent exact solvers have any problem solving them).

Note that the standard option for CycleKiller is to compute an optimal (i.e. maximum) agreement forest using rspr, which (assuming we solve the DFVS instance optimally) gives us a 2-approximation. For extremely large trees this will be too slow. For such trees we recommend using CycleKiller's -approx switch e.g. cyclekiller.bat test_trees.txt -approx. This asks rspr to construct a 3-approximation to an optimal agreement forest, which it can do extremely quickly. Again assuming we solve the DFVS instance optimally, this gives a 4-approximation.

Finally, please note that for extremely large trees (say, ten thousand taxa) it will be necessary to increase the amount of memory available to your Java virtual machine.

Work in progress

In due course we would like to add support for alternative DFVS solution methods. An obvious option is to add support for CPLEX -- the ILP that we use GLPK to solve is already written in CPLEX format -- and for more intelligent ILP solution methods (e.g. separation-based). It would also be good to add support for non-ILP solution methods and FPT-style "kernelization" techniques, although there are actually very few robust non-ILP implementations in circulation. One interesting possibility to explore is the DFVS solver in the AGAPE algorithm library. Although this is an exponential time algorithm they do apply kernelization-style pre-processing.

Last updated: 10th may 2012