CycleKiller
A practical approximation algorithm for solving massive instances of hybridization
number. Current version: 1.00 (7th may 2012), source here.
On this webpage you can find the source code and instructions for the algorithm CycleKiller
described in A practical
approximation algorithm for solving massive instances of
hybridization number
by Leo van Iersel, Steven Kelk, Nela Lekic and Celine Scornavacca. (This article is
an applied follow-up to the more theoretical article, Cycle killer...qu'est-ce
que c'est? On the comparative approximability of hybridization number and directed feedback
vertex set). The default mode of operation for CycleKiller is a
2-approximation, the faster mode is a 4-approximation.
What you need
The algorithm itself consists of several Java programs chained together which do some processing
themselves and also call several external programs (by other authors). At the current time all the
external programs are platform independent in the sense that they are
either available for
multiple platforms (GLPK) or there is C++ source-code available for them (rSPR). Here we describe setting up Cycle
Killer under Linux and
Windows, although there is no fundamental reason why it cannot be adjusted to work on other platforms.
To get CycleKiller running you will need:
- A Java Virtual Machine (JVM). Specifically: you should be able to execute Java programs by typing
java at the command line. To compile
the Java you will need access to the Java compiler javac and
the Java parser javacc.
- An executable version of Chris Whidden's rspr
software, v1.03. Given two binary trees on the same set of taxa rspr outputs an agreement forest. One
option of rspr is to output an optimum (i.e. maximum) agreement forest, and another option is to
output a 3-approximation to the maximum agreement forest. These are the two options that CycleKiller
uses.
Note: due to slight changes between versions of rSPR we cannot
guarantee the behaviour of CycleKiller if versions of rSPR earlier than v1.03 are
used. You will need a C++ compiler to compile rspr on your platform.
- The GNU Linear Programming Kit (GLPK). In particular you will need
access to the executable glpsol, this is the heart of the GLPK solver. CycleKiller was tested under several
versions (4.45, 4.47) of GLPK, but the main version used in development
was version 4.47. Source-code for GLPK is freely
available although it is often already installed (or very easy to install) on Linux systems and there are also Windows binaries available.
- Our program NELA which converts an acyclic
agreement forest into a phylogenetic network. This is written in Java and was developed in parallel
with CycleKiller.
- If you wish to view the phylogenetic network created (rather than just knowing its reticulation
number) you will need a program capable of
visualising eNewick, such as Dendroscope.
Putting it all together
The simplest way to get CycleKiller running is to compile everything in the same directory. Specifically:
- Compile rspr. We henceforth assume that the executable has (if necessary) been renamed to rspr.
- Compile the CycleKiller and NELA Java files, if necessary. To do this you will first need to compile the
.jj file as follows: javacc NewickToDFVS.jj. (This dynamically creates several .java
files). Then you can compile all the .java files simultaneously with javac *.java.
- If you wish you can put the GLPK glpsol binary in the same directory (if you do this make
sure you have all the necessary libraries there too) or, alternatively, you can tell CycleKiller where
glpsol is on your system. You can do this by adjusting the -glpkpath switch inside the
cyclekiller.sh or cylekiller.bat scripts, discussed next. After the -glpkpath switch you should give the
path to glpsol, the path should end with the glpsol executable itself. If you don't provide the
-glpkpath switch then CycleKiller will simply try and execute glpsol without specifying where it is,
on some operating systems this might actually work better than explicitly giving the path.
- Download an appropriate script file which glues all the above together. For Linux we have
provided cyclekiller.sh
and for Windows cyclekiller.bat.
Under Linux make sure that the .sh file is executable, using the chmod command if necessary.
Note that you should add/adjust/remove the -glpkpath switch, as required (see previous point). At the moment
the cyclekiller.bat file includes an example of where glpsol is on my windows system.
Running CycleKiller
Everything should now be ready to go. Here is a simple test file,
test_trees.txt
comprising two trees on eight taxa. If you type cyclekiller.bat
test_trees.txt (under Windows) everything should now spark into life.
Note that there will be a huge amount of diagnostic information produced.
The most important information is at the very end. The line CORE:
RETICULATIONS = 3 tells you that CycleKiller has computed a network
with 3 reticulations. In the file test_trees_network.txt there
will be a network (in eNewick format) that you can view using Dendroscope.
There are some other small examples here, based
on the dataset discussed here. (Note
that this dataset used to be a ``benchmark'' for the hybridization number problem but algorithmic
advances are such that neither CycleKiller nor the most recent exact solvers have any problem solving
them).
Note that the standard option for CycleKiller is to compute an
optimal (i.e. maximum) agreement forest using rspr, which (assuming
we solve the DFVS instance optimally) gives us a 2-approximation.
For extremely
large trees this will be too slow. For such trees we recommend
using CycleKiller's -approx switch e.g. cyclekiller.bat
test_trees.txt -approx. This asks rspr to construct a 3-approximation
to an optimal agreement forest, which it can do extremely quickly. Again
assuming we solve the DFVS instance optimally, this gives a
4-approximation.
Finally, please note that for extremely large trees (say, ten thousand
taxa) it will be necessary to increase the amount of memory available
to your Java virtual machine.
Work in progress
In due course we would like to add support for alternative DFVS solution methods. An obvious option is to
add support for CPLEX -- the ILP that we use GLPK to solve is already written in CPLEX format --
and for more intelligent ILP solution methods (e.g. separation-based). It would also be good to
add support for non-ILP solution methods and FPT-style "kernelization"
techniques, although there are actually very few robust non-ILP implementations
in circulation. One interesting possibility to explore is the DFVS solver in the AGAPE algorithm library. Although this
is an exponential time algorithm they do apply kernelization-style pre-processing.
Last updated: 10th may 2012