S-Kademlia and max-flow-min-cost

This post assumes you know what Kademlia is, and roughly how its distance metric and routing algorithms work. It is a much more developed version of an earlier post, with a newer and better algorithm, and more detailed explanations at both the high and low level. The old post may be considered obsolete.

Actually, there are many possible routing algorithms based on any given distance metric, and while we know how to analyse various properties of a routing algorithm, the field has not yet devised a single objective way to compare two routing algorithms, nor even defined what it means to be “the best” routing algorithm. Most of the reasonable ones follow a similar pattern though, roughly:

  1. each node maintains a routing table, which contains ~k “close” entries, ~k “further” entries, ~k “even further” entries etc, for super-linearly expanding notions of “further away”
  2. routing is done by choosing the neighbour in your routing table that is closest to the target, and then iterating to their neighbour, etc etc.
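
For illustration, the bucket structure in point 1 falls out of the XOR metric: the bucket a peer belongs in is determined by the highest differing bit of its ID, so each successive bucket covers a region of ID space twice as large as the last. A minimal sketch (the function name is ours):

```python
def bucket_index(self_id: int, peer_id: int) -> int:
    """Index of the routing-table bucket that peer_id belongs in,
    relative to self_id: the position of the highest differing bit.
    Bucket i covers XOR distances in [2**i, 2**(i+1)), so each
    successive bucket spans twice as much of the ID space."""
    distance = self_id ^ peer_id
    assert distance != 0, "a node does not store itself"
    return distance.bit_length() - 1

# a peer differing only in the lowest bit is "close" (bucket 0);
# peers at distances 8..15 all land in the larger bucket 3
print(bucket_index(0b1000, 0b1001))  # bucket 0
print(bucket_index(0b0000, 0b1111))  # bucket 3
```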

The devil is in the details of course, and one major detail is that in practice you want to do this in parallel - but how, precisely? The original paper defines a parallel routing algorithm roughly as follows:

  1. query the closest d neighbours
  2. out of the combined set of their results, select the d closest as the next round
  3. loop until you’ve reached your target(s), or run out of patience
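
Under some simplifying assumptions (integer node IDs, a `query` callback standing in for the network round-trip, and strict rounds instead of pipelined requests - the names here are ours), the loop above can be sketched as:

```python
def naive_lookup(start_peers, target, query, d=3, max_rounds=20):
    """Sketch of the original Kademlia parallel lookup. `query(peer)`
    asks a peer for the closest nodes it knows to `target`. Note that
    all results are pooled into one set, losing track of which peer
    reported which result - the flaw discussed below."""
    dist = lambda n: n ^ target          # XOR distance metric
    seen = set(start_peers)
    frontier = sorted(start_peers, key=dist)[:d]
    for _ in range(max_rounds):
        # 1. query the closest d peers; 2. pool all of their results
        results = {r for p in frontier for r in query(p)}
        new = results - seen
        seen |= results
        # 3. keep only the d closest of everything seen, and loop
        frontier = sorted(seen, key=dist)[:d]
        if not new:                      # converged, or out of patience
            break
    return sorted(seen, key=dist)[:d]
```

A real implementation also tracks which peers were already queried and skips them; the sketch re-queries for brevity.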

This is an insecure parallel routing algorithm - if even one of the neighbours is malicious, it can trivially attack the victim by giving d results that are fake nodes “very close” to the victim’s target, but are in fact all controlled by the attacker. From this point onwards the attacker can lead the victim on a wild goose chase, repeatedly giving them bad results that are again controlled by the attacker. This is an example of the general class of attacks known as “eclipse attacks”.

S-Kademlia improvements

The attack described just above comes from the fact that (A) we combine everyone’s results together, losing information about where each result came from, then filter the combination using (B) a criterion that is easily gamed, i.e. the node-ID XOR distance.

(B) is partly due to the context of operating in a decentralised network, where new IDs can be generated freely. Securing this aspect of the system is a hard topic we won’t go into here, and it is best done separately from the routing algorithm. We do note that S-Kademlia also proposes an improvement in this area, but it is actually insecure and does not even improve the situation over regular Kademlia.

(A) is what we’ll focus on. S-Kademlia proposes a reasonable improvement, but as we’ll explain, this improvement still has some bad corner cases, and we take it to its logical conclusion with a fully-general version that covers all corner cases.

S-Kademlia makes two main modifications to the routing algorithm. First and foremost, to the lookup traversal:

To quote: “The initiator starts a lookup by taking the k closest nodes to the destination key from his local routing table and distributes them into d independent lookup buckets. From there on the node continues with d parallel lookups similar to the traditional Kademlia lookup. The lookups are independent, except the important fact, that each node is only used once during the whole lookup process to ensure that the resulting paths are really disjoint.”

In other words:

  1. Keep track of d separate lookups for a given target lookup. As before, each lookup will hop through a bunch of different peers.
  2. When the results of a lookup are returned, select the next hop for this lookup out of these results, but ignore any nodes you’ve already previously queried in any of the other lookups.

In this way, all the lookups query entirely different sets of peers, and therefore a malicious peer in one lookup cannot take over another lookup, since the other lookups will ignore this peer.
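
A minimal sketch of this bookkeeping, with names of our own choosing (the real algorithm also tracks per-path state, timeouts, and so on):

```python
def disjoint_lookup_step(path_results, used, target):
    """One selection step of the S-Kademlia d-path lookup (our sketch).
    `path_results` are the nodes just returned on one of the d paths;
    `used` is the set of nodes already queried by ANY of the paths.
    The next hop for this path is its closest unused result, and
    claiming it in `used` is what keeps the d paths disjoint."""
    fresh = [n for n in path_results if n not in used]
    if not fresh:
        return None   # this path is starved of candidates
    nxt = min(fresh, key=lambda n: n ^ target)
    used.add(nxt)
    return nxt
```

The `None` branch is exactly where the corner cases discussed next arise: a path can be starved by the arbitrary choices other paths made earlier.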

Secondly, to the end of the algorithm, the result collection part:

To quote: “the lookup doesn’t converge at a single node but terminates on d close-by neighbors, which all know the complete s siblings for the destination key. Hence a lookup is still successful even if k - 1 of the neighbors are adversarial.”

Explanation: s here refers to the storage replication factor, which S-Kademlia separates from k, the routing replication factor. In original Kademlia, s = k, and the routing algorithm returns the k closest peers observed during the routing. In S-Kademlia, the routing algorithm instead finishes with d peers, one from every disjoint path, and each “final peer” gives s results. The union of all of these results is then returned as the result of the overall lookup. In other words, the caller gets between s and d*s results, all of which must be used for any level of security.

After trying to implement these algorithms, which forced us to think through many specific cases in detail, including randomly generated test cases, we found that S-Kademlia’s specific algorithms contain corner cases that degrade its performance in a real network. Furthermore, the way it constructs its final result set is awkward to work with, in terms of concretely using its security guarantees. The details of these are technical and given in the appendix. To give a summary here, S-Kademlia:

  • is unable to backtrack properly to find global constraint solutions, when the disjoint-paths constraint cannot be satisfied locally, and this is dependent on the order in which results are received. Our algorithm always finds the best global solution, regardless of the order in which results are received.

  • does not allow disjoint paths to terminate in the middle of other disjoint paths; all nodes in a path are unavailable for other paths to make use of. In a non-adversarial setting this can reduce the result set too much, in a way that significantly disrupts the lookup; conversely, relaxing this in an adversarial setting does not reduce security, as we only terminate in the middle of another path if the first path has no closer options anyway.

  • does not apply its disjoint-paths security logic to the result sets themselves - instead of selecting 1 result from each peer at the end of each of the d disjoint paths, it uses all s results from each peer, potentially giving s*d results in total. Callers are then forced to use all the results, even if half of them are anomalous - information about the anomalies is lost by the algorithm.

All these corner cases came up during our attempted implementation of S-Kademlia, and they are solved cleanly by our improved algorithm below.

Defining the security property

First we’ll describe a high-level way of thinking about the derivation of the following algorithm, then explain how the algorithm works, then explain why it works. To note: this “high-level thinking” was only our second step towards coming up with the below - our first step was the realisation and understanding of the corner cases in the appendix, but that takes some time working through the technical details and makes for harder reading. So for brevity and structure we skip that here, and start with this second step.

Having found some issues with S-Kademlia’s specific approach, we begin to salvage it by thinking about how to separate what it does from how it does it - in other words, the properties it’s trying to achieve, versus the algorithm trying to achieve those properties. For example, “shortest path” is a property; “Dijkstra’s algorithm” is, obviously, an algorithm. In the S-Kademlia paper these two things are somewhat mixed together, and neither aspect is formalised, so some level of interpretive licence is required. In our view, the property that S-Kademlia wants to achieve can be described as:

  • Suppose we have d initial peers and we want to traverse them to d final peers. We must ensure that no 2 final peers were reached only via some single common node.

We state this in its briefest form so that it is easier to refer back to later, and now we explain it in more detail: as we perform a query, we make requests to nodes and they send us back a set of nodes in their response. We can model this as nodes in a graph telling us about their successors, and so during the course of our query we are building up a directed graph [1] from queried peers to result peers. Importantly, a single result peer might have “come from” multiple different queried peers.

In terms of this query graph then, the S-Kademlia property is that for any pair of final peers, there must exist two disjoint paths from some two initial peers, one path to each final peer. If this were not the case, then there would be a common node that is the only node that (directly or indirectly) gave you those two peers, and this node could be an attacker that gave you fake peers. So that’s why this property helps with security. By induction, the pairwise form of the property implies the full form of the property - i.e. that there must be d disjoint paths to every set of d final peers.

Note that this is different from what the S-Kademlia algorithm is doing! The algorithm attempts to satisfy the property by assigning peers to paths during the course of the traversal. This is an entirely greedy approach to solving the constraint; it is analogous to attempting to solve the shortest-paths problem by only following the locally shortest edge at each step, and this is why it fails for certain cases. It also throws away the information needed to perform the traversal correctly, namely which result peers came from which queried peers.

By thinking about the security property in more detail, and working our way through more corner cases, we eventually realised that the property can be translated into a max-flow-min-cost problem. Algorithms for solving these are well-known, and efficient. The core of our work then was to figure out how to do the translation most appropriately, which we give below.

Max-flow-min-cost routing

Our modified S-Kademlia query algorithm covers both the lookup traversal stage, and the result collection stage. In both cases, we translate the query graph into a flow graph (with associated costs), then run max-flow-min-cost to obtain our results. The translations are described in the next sections. They are each parameterised on an idea of a “candidate set”, and the second is further parameterised on an idea of a “query set”; this is explained afterwards. We also describe how these fit together in the context of the whole lookup algorithm.

[1] By filtering out results that are further away than the source peer, we can remove any cycles in the graph. But in fact with our max-flow-min-cost approach, cycle-removal is unnecessary and actually undesirable for the final part of the query - the final peer in each path is supposed to give you their s neighbours closest to the lookup target, some of which may be further away than themselves. Notwithstanding, a prudent implementation should still filter out results that are “too much” further away from the target than the queried node.

Max-flow-min-cost for queries

  1. Initialise an empty flow graph (with associated costs).
  2. For every node in the query graph, including ourselves, it will be converted into two nodes in the flow graph, “in” and “out”.
    a. All incoming edges instead connect to “in”
    b. All outgoing edges instead connect from “out”
    c. Connect “in” to “out”
  3. Mark the “in” node for our own node as the “source” of the flow
  4. Add a new node to be the “sink” of the flow
  5. For every node in our “candidate-set”, connect its “in” node to the sink
  6. Every edge has capacity 1, except:
    a. The edge from our own “in” to our own “out” node has capacity d
  7. Every edge has cost 0, except:
    a. Every edge to the sink has cost equal to the distance-to-our-query-target of the source-node of that edge
  8. Run max-flow-min-cost on this flow graph. From the max-flow, the nodes adjacent to the sink are the overall result of this algorithm.
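
To make the construction concrete, here is a sketch in Python: a textbook successive-shortest-paths min-cost max-flow solver, plus the translation of steps 1-8. All names are ours, and this is illustrative rather than production code:

```python
from collections import defaultdict, deque

class MinCostMaxFlow:
    """Textbook successive-shortest-paths min-cost max-flow (SPFA)."""
    def __init__(self):
        self.adj = defaultdict(list)   # node -> indices into self.edges
        self.edges = []                # [to, residual capacity, cost]

    def add_edge(self, u, v, cap, cost):
        # forward/reverse edges are stored as an even/odd pair, so edge
        # i's partner in the residual graph is always i ^ 1
        self.adj[u].append(len(self.edges)); self.edges.append([v, cap, cost])
        self.adj[v].append(len(self.edges)); self.edges.append([u, 0, -cost])

    def run(self, s, t):
        flow = cost = 0
        while True:
            dist, prev, inq = {s: 0}, {}, {s}
            q = deque([s])
            while q:                   # cheapest augmenting path (SPFA)
                u = q.popleft(); inq.discard(u)
                for ei in self.adj[u]:
                    v, cap, c = self.edges[ei]
                    if cap > 0 and dist[u] + c < dist.get(v, float("inf")):
                        dist[v], prev[v] = dist[u] + c, ei
                        if v not in inq:
                            inq.add(v); q.append(v)
            if t not in dist:
                return flow, cost
            f, v = float("inf"), t     # bottleneck along the path
            while v != s:
                f = min(f, self.edges[prev[v]][1]); v = self.edges[prev[v] ^ 1][0]
            v = t                      # push f units of flow
            while v != s:
                self.edges[prev[v]][1] -= f; self.edges[prev[v] ^ 1][1] += f
                v = self.edges[prev[v] ^ 1][0]
            flow += f; cost += f * dist[t]

def query_results(query_graph, self_id, candidates, d, distance):
    """Steps 1-8: translate the query graph and solve."""
    g, sink, sink_edge = MinCostMaxFlow(), "sink", {}
    nodes = {self_id} | set(query_graph) | {v for vs in query_graph.values() for v in vs}
    for n in nodes:                    # steps 2, 6, 6a: split each node in two
        g.add_edge(("in", n), ("out", n), d if n == self_id else 1, 0)
    for u, vs in query_graph.items():  # steps 2a/2b: who told us about whom
        for v in vs:
            g.add_edge(("out", u), ("in", v), 1, 0)
    for c in candidates:               # steps 4, 5, 7a: cost = distance to target
        sink_edge[c] = len(g.edges)
        g.add_edge(("in", c), sink, 1, distance(c))
    g.run(("in", self_id), sink)       # step 3: we are the source
    # step 8: candidates whose sink edge carries flow are the result
    return sorted(c for c, ei in sink_edge.items() if g.edges[ei][1] == 0)
```

On the “redundant routes” scenario from the appendix (we are node 7, target 0, d = 3, with neighbours {4, 5, 6} replying {1, 2, 3}, {1, 2, 3} and {4, 3, 2}), the solver finds three disjoint paths and returns [1, 2, 3], where the greedy per-path assignment can get stuck with only two.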

Max-flow-min-cost for results

  1. Determine the successors of each node in our “query set”, that belong to our candidate set. We call these the “relevant successors”.
  2. For simplicity, assume all the successor sets are the same size, call this N. (We have a more generalised version in the code listing.)
  3. Initialise a flow graph (with associated costs) with a new source node.
  4. For every node in the query set:
    a. Add an edge from the source to this query node, with capacity N.
    b. Add an edge from this query node to each of its relevant successors, with capacity 1.
    (If a query node is also a relevant successor, treat these as separate nodes in the flow graph when constructing it.)
  5. For every relevant successor:
    a. Add an edge from this node to the flow sink, with capacity N.
  6. As with the first algorithm, every edge has cost 0, except:
    a. Every edge to the sink has cost equal to the distance-to-our-query-target of the source-node of that edge
  7. Run max-flow-min-cost on this flow graph. From the max-flow, the nodes adjacent to the sink are the overall result of this algorithm, along with their flow values (ranging from 1 to N).
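
When every terminus returns a result set of the same size N and d ≤ N, the edges from steps 4a and 5a never bind, so the max flow must saturate every capacity-1 edge from step 4b, and each result’s flow is simply the number of distinct termini that reported it. That special case can be sketched without a flow solver (the function and variable names are ours):

```python
from collections import Counter

def result_flows(successor_sets):
    """Flow per result in the special case where all d termini return
    the same number N of relevant successors and d <= N: every unit
    edge of step 4b is saturated, so a result's flow equals how many
    termini reported it. The general case (differently-sized sets)
    needs the full max-flow-min-cost solver."""
    flows = Counter()
    for succs in successor_sets.values():
        for r in set(succs):   # each terminus contributes at most 1 flow
            flows[r] += 1
    return dict(flows)

# three termini, one of which (C) slips in a bogus result, node 9:
# result_flows({"A": {1, 2, 3}, "B": {1, 2, 3}, "C": {1, 2, 9}})
# gives nodes 1 and 2 flow 3, node 3 flow 2, and the bogus node 9 flow 1
```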

How it fits together

a. On query startup, and whenever we update the query graph, we run the first algorithm with:

  • candidate-set = nodes we haven’t got a result/failure/timeout from

This will give us up to d results. At most one of these results will be a peer we haven’t queried yet [2]; this will be the next peer to query. The rest will be peers that we are currently in the process of querying.

If a peer times out or gives an error, we add them to the failed set, add them to our query graph with an empty successor set, and run (a) as normal.

b. To check whether the query may be terminated, we run the first algorithm with:

  • candidate-set = nodes we haven’t got a failure/timeout from. This should be equivalent to the disjoint union of:

    a. nodes we’ve already gotten a query result from
    b. nodes we’re in the process of querying
    c. nodes we found out about and could potentially query

If the output is itself a subset of (a), then the query may be terminated, and results may be collected via the second algorithm, see (c) next. Alternatively, the caller may choose to continue with the query, in case the traversal finds even better nodes later, but this is unlikely assuming Kademlia actually works.

c. When we want to collect results from the query, we run the second algorithm with:

  • query-nodes = nodes we got from a successful run of (b)
  • candidate-set = nodes we haven’t got a failure/timeout from

This will give us all the results from those closest query-nodes, in sorted order, first by their flow and then by their distance. Suppose the caller believes that up to a fraction f of nodes are faulty, then roughly speaking they can use the result nodes that have flow > f*d and ignore the rest. If an insufficient number of nodes (e.g. < s, the storage replication factor) meet this condition, the caller immediately knows this and can take a higher-level action such as alerting any relevant network operators.
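
The thresholding described above is mechanical enough to sketch; here `flows` is the result-to-flow map produced by the second algorithm, and the function name and None-return convention are our own:

```python
def usable_results(flows, f, d, s):
    """Keep results whose flow exceeds f*d, i.e. results vouched for by
    more disjoint paths than the assumed faulty fraction f of peers
    could control. Returns None if fewer than s results survive, which
    signals the caller to take some higher-level action."""
    good = sorted(r for r, fl in flows.items() if fl > f * d)
    return good if len(good) >= s else None
```

For example, with d = 3, s = 3 and an assumed faulty fraction f = 1/3, a result set with flows {1: 3, 2: 3, 3: 2, 9: 1} keeps [1, 2, 3] and discards the weakly-vouched node 9.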

Why does this work? The flow graph we generate expresses the same constraints as S-Kademlia’s disjoint-paths approach:

  • By (6a) we have total in-capacity of d. Therefore the max-flow-min-cost algorithm will give us up to d results.

  • In (2) we only allow each node to carry one unit of flow (unlike a typical max-flow problem, where nodes have infinite capacity, restricted only by their edges); this means that every possible max-flow consists of d disjoint paths, each with flow 1. This satisfies the S-Kademlia “disjoint paths” property.

  • By adding edges from every candidate to the sink, we allow these nodes to be used as the terminus of a path, even if they are part of another path. This helps with the special case we mentioned earlier and discuss in more detail in the appendix.

  • By associating costs with each candidate node, we add the constraint that we want the query to progress towards our target, whilst retaining the “disjoint paths” security property.

Why does this algorithm work better than S-Kademlia?

  • Iterative P2P queries are effectively graph traversal algorithms that try to solve certain constraints. As with similar constraint-solving problems, backtracking is often required. S-Kademlia’s disjoint-paths approach adopts a particular set of constraints; however, it attempts to solve these constraints by effectively running a graph-traversal algorithm that does not do any (global) backtracking - in fact it throws away the information needed to do this. And so, depending on the arbitrary choices it makes during the traversal (i.e. which peer to query next, based on the available set), it can find the best solution local to each “disjoint path”, but then be unable to find the best global solution in the next step, when more data is added to the problem context.

    This algorithm avoids this by retaining all the information, and casting the problem as a global constraint-solving problem. It is thus always able to find the best solution, even if it is not built directly on top of the best solution from the previous step.

Why do we need two separate algorithms?

  • While the max-flow-min-cost algorithm is efficient compared to more general NP-hard constraint-solving problems, one characteristic is that it sometimes gives a max-flow that is split across equally-good candidates. This is sometimes desirable and sometimes not.

    In the first algorithm, we are looking for no more than d results. So we supply a limited amount of flow, and by giving every edge capacity 1 we ensure that splitting never occurs.

    In the second algorithm, we are interested in comparing all our results. So we supply a lot of flow, allow splitting to happen, and use this to help our comparison.

[2] Proof to follow; to give a sketch, this can be done by comparing the differences between results on each successive call to the algorithm. Essentially, since we restrict node-flows to 1, each new result set (added to the graph) can only add at-most-one peer to the “next peers to query” set.

Appendix - bad cases & corner cases

In the below, for simplicity all our lookups are for target key “0”, which makes the XOR distance the same as the node id.

Backtracking over redundant routes

Suppose d=k=s=3, and we issue a query to our neighbours {4,5,6}. They reply, in order:

  1. node 4 replies {1,2,3}
  2. node 5 replies {1,2,3}
  3. node 6 replies {4,3,2} - note that this is perfectly legitimate in the general case since node 6 can’t be expected to know that we also have node 4 as a neighbour

(click to view animation)

In S-Kademlia, one possible choice of paths would be {4,3} and {5,2}, the red and blue path in the diagram. Once you have made these choices, there is nothing further to do when you get a reply back from node 6 (green) because all its results have already been used up by other paths.

This effectively reduces your query from this point onwards to be 2 parallel lookups instead of 3. In other words, 2 peers now control the eventual lookup result, instead of 3 - which is what we specifically want to avoid with disjoint paths. By the input parameters, we have the resources to execute 3 lookups in parallel, but we end up under-utilising this due to too-restrictive rules on S-Kademlia’s part.

Let’s take a step back however. Why did we choose {4,3} and {5,2}? These are arbitrary choices. If we had chosen something else, namely {4,1} and {5,3}, we could have then chosen {6,2} and this would still be within S-Kademlia’s own rules.

The problem arises from greedily choosing to traverse one part of a graph optimising a constraint (disjoint path + close distance), then immediately discarding the rest of the graph. This prevents us from backtracking later when our chosen path prevents us from solving our constraints. The max-flow-min-cost algorithm handles this case gracefully - going from step 2 to step 3 it will recalculate what the max-flow is, finding the correct solution as shown in our diagram.

Backtracking over failed routes

Another more complex example, this one involving multiple levels of backtracking and failed queries.

Suppose d=k=s=3, and we issue a query to our neighbours {10,11,12}.

(click to view animation)

Nodes 1 and 2 both fail to respond. We then want to reassign the paths and select 7 as the next peer to query, even though it’s unrelated to the “10 -> 5 -> 1” path, because 6 was already selected previously.

Good results case

(click to view animation)

This demonstrates one consequence of being too strict with disjoint paths. In a real-world situation, several paths may end up wanting to traverse the same set of nodes near the target. However, the disjoint-paths rule prevents any part of a path from being traversed twice, effectively hiding those nodes from our results. Furthermore, at the end of each path, the results are typically going to be further away from the target than the final queried node itself - otherwise we would have queried the closer nodes as part of the path.

These aspects do not play nicely with each other, and often we find that we miss results such as in this example - node #50 gives us back #100 and #160 as the results, yet node #200 knows of a closer node #140.

Therefore we relax the S-Kademlia disjoint-paths rule slightly, by allowing paths to terminate in the middle of other paths. The other rules about single-unit flow still apply, so two paths cannot terminate at the same node - which would give that node more weight if it were indeed malicious. And if there are closer alternatives, the path will follow those as usual and this exception won’t be needed.

This relaxation was already deemed to be necessary by other implementers, before we came up with our max-flow-min-cost algorithm.

Final results

With the S-Kademlia approach: if you ask for s results, you will be given between s and d*s results. If at least 1 of the final queried nodes (in the paths you visited) is honest, then at least s of the results will be correct. However, we cannot perform reasoning more fine-grained than this, and so the caller must use (i.e. contact) all of these results to retain even this minimal level of security.

In the good case, the majority of result sets (from each final queried peer, a.k.a. terminus) should overlap, and so we’ll get around s results. In the worst case, there is no overlap and we are forced to consume d*s results; no protection can get around this, since the entire data set is faulty. However, in a more realistic attack scenario, most of your peers are honest and give you overlapping result sets, but a few of them are attackers and give you bogus results. In this case, you will be forced to consume 2*s or 3*s results.

This may not be a big deal, but with just a little more sophistication we can easily take care of this, by running the max-flow algorithm. This demonstrates that with a more theoretically-principled approach, we do not have to make compromises on small-but-awkward matters like this.

(click to view animation)

As shown in the diagram, the max-flow-min-cost algorithm can distinguish between result nodes with more or less flow coming from the final queried nodes.

The example is simple for brevity; its result can be recreated with a much simpler algorithm that merely counts the in-degree of each result node. However a max-flow-min-cost solution generalises better, e.g. if the final queried nodes gave different numbers of results, we don’t want the larger set to dominate the smaller sets.