SIMPLISTIC (SIMPLe network heurISTIC)

Introduction

SIMPLISTIC is a heuristic based on the ideas described by Leo van Iersel and Steven Kelk in the article, Constructing the Simplest Possible Phylogenetic Network from Triplets. SIMPLISTIC has many command-line options with which its functionality can be adjusted, but the core of the idea is as follows. Given a dense set of triplets SIMPLISTIC always returns some network consistent with 100% of the dense input triplets. It does this by repeatedly constructing simple networks (networks with only trivial cut-edges) of minimum level consistent with the set of triplets induced by maximal SN-Sets. It is well known that, for a (dense) set of triplets on n leaves, there exists a level-(n-1) network consistent with those triplets, so the program always terminates. The fundamental features of SIMPLISTIC can be summarised as follows.

  1. If we bound (with a constant) the maximum level that the program considers, the program runs in polynomial time.
  2. If the triplets are consistent with a tree we will find that tree.
  3. If the triplets are consistent with one or more level-1 networks we will be able to find such a network.
  4. If the triplets are consistent with one or more level-2 networks we will be able to find such a network (if the SN-set splitting option is switched on).
  5. If there is a network consistent with precisely the set of input triplets then we will find such a network (if the reflectivity option is switched on) that minimises both the level and the number of reticulations.
  6. If there exists a simple level-k network consistent with the input triplets then we can find all such simple level-k networks
  7. The program always finds some network consistent with all the input triplets.

The code was written, and is maintained, by Steven Kelk.

Latest versions

How to run the algorithm

The following instructions assume a UNIX environment. There are no fundamental reasons however why the algorithm will not work under another operating system environment as long as Java and the graph drawing package DOT are available in that environment. (Please contact Steven Kelk if you are interested in an output format for a package other than DOT.)

We begin by showing how SIMPLISTIC can be run without any complicated command-line options.

  1. It is advisable that you first compile Simplistic.java on your own computer, using the Java compiler javac. This will create a file Simplistic.class (amongst others) in the directory.
  2. Create a file of triplet data. SIMPLISTIC assumes that the species are numbered 1 through n and that there are no missing species.
  3. Run the algorithm with and direct the output to a file. For example, with the command java Simplistic yourtripletdat.txt > output
  4. The file output can be fed into the graph drawing package DOT, which in turn can produce images in various formats. (The file output will also contain various comments - lines beginning with // - to inform the user about what is going on. DOT ignores such commented lines.) For example, to get DOT to produce a postscript file from your data, try dot -Tps < output > mynetwork.ps. Other DOT output formats are possible by adjusting the -T switch.
  5. The code is still in development. If for some reason you encounter a bug or a problem, please inform the author! At the moment internal error reporting is minimal, many error messages simply report an error and an error number, this will be improved in future versions.

We now explain the meaning of the various command-line options that SIMPLISTIC accepts. The command-line has the general form java Simplistic yourtripletdat.txt [options].

--simpleForces the program to only look for simple networks. If this option is not used SIMPLISTIC will try and compute a general network i.e. will (recursively) compute the set of maximal SN-sets and try to compute a simple network consistent with the set of triplets induced by the maximal SN-sets.
--all(This should only be used in conjunction with --simple.) Produces all simple networks consistent with the input triplets. At the moment there is no isomorphism elimination implemented so the output can be rather large when --all is used.
--enewickProduces networks in eNewick format instead of DOT format.
--snsplit [num]Normally SIMPLISTIC computes the set of triplets induced by the set of maximal SN-sets and tries to build a simple network of a particular level consistent with those induced triplets. If it can't do that it then searches for a simple network of higher level, and so on. With the option --snsplit [num] ([num] should be greater than or equal to 0) SIMPLISTIC tries splitting all subsets of size [num] before moving on to the next higher level. 'Splitting' means replacing an SN-set S with the SN-sets that are maximal when the set of triplets is restricted to the leaf set S. We note that specifying --snsplit 1 guarantees that, if there is a level-2 network consistent with the input triplets, that such a network will be found. (If --snsplit is not used and there exists a level-1 network consistent with the input triplets, then SIMPLISTIC will find such a network.) It might be that using --snplit 2 or --snsplit 3 is enough to ensure that a level-3 network is found (if the input triplets are consistent with some level-3 network) but this remains an open question.
--beginlev [num](This should only be used in conjunction with --simple.) Starts at level [num] when searching for simple networks consistent with the input. The argument [num] should be greater than or equal to 1.
--stop(This should only be used in conjunction with --simple.) Normally SIMPLISTIC does not stop once it has found a network. Indeed, even if it finds a solution of level-k then it will continue looking for solutions of level-(k+1) and so on, up to and including level-(n-1). If the --stop option is specified then it will not continue to higher levels.
--percentageIf this option is switched on SIMPLISTIC will report (when constructing simple networks) how close to 100% consistency it came. Using this option heavily slows the code, however. This happens because SIMPLISTIC normally throws a candidate solution away as soon as it is inconsistent with 1 or more input triplets, but with --percentage specified all such solutions need to be carried through to the end. Note that the reported percentages relate to the set of triplets being considered at that moment: this in general will not be the original input set of triplets (it could be an induced set of triplets, for example.)
--reflectMakes SIMPLISTIC check for reflectivity, not consistency. A network reflects a set of triplets if and only if the set of all triplets consistent with the network is equal to the set of input triplets. (Consistency in contrast demands only that the former set is a superset of the latter.) With --reflect specified SIMPLISTIC becomes an implementation of the algorithm MINPITS from the paper Constructing the Simplest Possible Phylogenetic Network from Triplets. In other words, it will find a network of minimum level (and which minimises the number of reticulation vertices) that reflects the input triplets, if such a network exists. Note that it is still not known at what level (as a function of n) we can stop in the search for reflecting networks. Unless otherwise specified using --maxlev the search will be terminated after level-(n-1).
--alltriplets If this option is specified the file you provide as input is not a file of triplets, but a network in DOT format. SIMPLISTIC computes the set of all triplets consistent with this network and then uses that as the input. It is not necessary to be 100% in accordance with DOT format, only the edges have to be specified. So it suffices to provide a file where every line is of the form v1 -> v2 where v1 and v2 are the endpoints of the edge in question. It is, however, essential that leaves are numbered 1...n and that all non-leaves (root, split vertices, reticulation vertices) have numbers greater than or equal to 1000. (This matches with the conventions used by the DOT networks that SIMPLISTIC, MARLON and LEVEL2 output. So those networks can be fed back into SIMPLISTIC if desired.)
--maxlev [num]Normally the program doesn't go further than level-(n-1) but with this option it continues until level max(n-1,[num]). (Should only really be used with --simple.)
--helpPrints a short help message.

Examples

Various examples will appear here in due course. In the meantime please see the webpages for MARLON and LEVEL2 (see below) for some data sets.

Acknowledgements

The code uses the triplet consistency checking algorithm proposed in Worst-case optimal approximation algorithms for maximizing triplet consistency within phylogenetic networks by Jaroslaw Byrka, Pawel Gawrychowski, Katharina T. Huber and Steven Kelk. The code to generate combinations was taken from http://www.merriampark.com/perm.htm.

Relevant links