CONVEXMPDIST
------------
This, CONVEXMPDIST, is the second piece of software associated with the
paper "A note on convex characters, Fibonacci numbers and exponential-time
algorithms", by Kelk et al.
It is an exponential-time algorithm for computing the maximum parsimony
distance between two binary phylogenetic trees on n taxa. The trees
should be provided in Newick format.
The first piece of code from the article (CONVEXCOUNT), which counts,
samples and lists convex characters where each state occurs on at
least k taxa, can be found at:
http://skelk.sdf-eu.org/convexcount
RUNNING THE CODE
----------------
You will need Java to be installed on your computer (these days it is
automatically installed on most platforms).
To use the already-compiled .jar file, simply type (for example):
java -jar ConvexMPDist.jar < taxa20.txt
The output should look similar to the file taxa20.out.
Note that, although maximum parsimony distance is oblivious to the
presence or absence of a root, the Newick that you feed to
the software should be 'rooted'. That is, at the top level of the
parantheses there should be 2 children, not 3. So, if your tree has the
form (A,B,C) this should be recoded for example as (A,(B,C)). This does
not sacrifice generality or change the maximum parsimony distance.
COMPILING THE CODE
------------------
To compile the code from scratch, you'll need to have both the Java
Development Kit and the JavaCC parser software (used for parsing
the input trees) installed on your system.
Once installed type,
javacc ConvexMPDist.jj
javac ConvexMPDist.java
Then you can run the software by typing,
java ConvexMPDist < taxa20.txt
OTHER ISSUES
------------
* If you specify --quiet on the command line the only output of the
program will be the optimum. For example,
java -jar ConvexMPDist.jar --quiet < taxa20.txt
should return simply the number 10.
* The running time of the code is \Theta( \phi^{n} . poly(n) )
where \phi is the golden ratio (roughly 1.618) and n is the number of taxa
in the trees. As in the paper the poly(n) term is O(n^2).
* In the article the convex characters are generated recursively by
following the dynamic programming downwards from the root. Here we
generate them in a non-recursive fashion to optimize the polynomial part
of the running time. In particular, we observe that when Fitch's
algorithm is applied to a convex character (where each state occurs at
least twice) this generates extensions with a specific form. We loop
through a superset of these extensions. The superset naturally has size
O(3^n). Not all these extensions generate convex characters in which
each state appears at least twice, and some generate the same character.
We have special transition rules to ensure that these invalid characters
are excluded and no time is wasted looping through exponential-size
subsets of invalid characters. This guarantees the \Theta( \phi^{n} .
poly(n) ) running time.
* The code has been used in the experimental part of a second paper,
to appear in Theoretical Computer Science (2016), entitled
"Reduction rules for the maximum parsimony distance on phylogenetic
trees" by Kelk et al. The arXiv version of the article is available at
http://arxiv.org/abs/1512.07459
For more information don't hesitate to contact me at my Maastricht
University email address.
Kind regards,
Steven Kelk, 10th August 2015 and 23rd July 2016