Fun with Distributed Hash Tables
May 13, 2009
The keen reader might have previously noticed my interest in distributed hash tables (DHTs). This interest is strong enough to motivate me to build a DHT library in Haskell, which I discuss here at an absurdly high level.
Distributed Hash Table Crash Course
(skip if you’ve read anything remotely good on DHTs)
A DHT is a sort of peer to peer network typically characterized with no central or “master” node, random node addresses, and uses a form of content addressable storage. Usually implemented as an overlay network and depicted as a ring, we say each node participating in the DHT “has an address on the ring”.
To locate data first you need an address; addresses are generated in different ways for any given DHT but is commonly a hash of either the description (“Fedora ISO”), file location/name (“dht://news”), or even a hash of the file contents themselves. The lookup message is then sent to this address and in doing so will get routed to the node with the closest matching address. The routing is fish-eye: nodes have more knowledge about the nodes with closer addresses and sparse knowledge of further addresses. The result is that the average number of hops to locate a node is logarthmic to the size of the network but so are the size of any one nodes routing table, so the burden isn’t too much.
To ensure correct operation, DHTs keep track of the closest addressed nodes (typically the closest 8 on each side). These nodes make up the ‘leaf set’ and are often used for special purposes such as redundantly storing data if the DHT is for file sharing/storage. It’s “easy” to keep this correct because the final step when joining a ring is to contact the node who’s currently closest to your desired address.
Functionallity of the Haskell DHT Library
Originally I worked toward Pastry like functionallity, but then I read a paper on churn and opted to implement Chord like policies on join, stabilize and key space management.
I’ve implemented the basic operations of routing, periodic leaf set matainence, join operations requiring an atomic operation on only a single (successor) node, and IPv4 support via network-data. This is built on Control-Engine, so you can instruct nodes to route or deliver using as many haskell threads as you want. Beyond that, adding hooks is what Control-Engine is built for, so its easily to plug in modules for load conditioning, authentication, statistics gathering, and arbitrary message mutation.
Using the DHT Library
The main job in starting a node is building the ‘Application’ definition. The application must provide actions to act on delivered messages and notifications about leaf set changes and forwarded messages (giving it the opportunity to alter forwarded messages). Additionally, the app is provided with a ‘route’ method to send messages of its own. I won’t go in to depth on this right now as I’m not yet releasing the library.
Using the hook instructions (and a custom Application definition) I’ve instrumented the code to log when it deals with join requests (sending, forwarding, finishing) and to show the leafset when it changes. Using the resulting logs I produced graphical representations of the ring state for various simulations (graphical work was part of a Portland State course on Functional Languages).
My student site has several simulations, but the most instructive one is LeafSet50 (22MB OGG warning!). Joining nodes are shown in the top area, active nodes are displayed in the ring at the center, thick lines are join requests being forwarded around, and groups of thin lines show the latest leaf set. Aside from revealing corrupt states caused by a broken stabilize routine, you can see some interesting facts for such a crude rendering:
A) Some areas are completely void while others are so dense that four or five nodes are overlapping almost perfectly. This tells us that, at least for sparsly populated DHTs, random assignment is horrible and can result in nodes having many orders of mangnitude larger area of responsibility than their counterparts. If I had bothered to read papers about applications based on DHT libraries then I might have known exactly how bad the situation could be, but it’s interesting to see this visually as well.
B) The simulated storm of joining nodes combined with periodic stabilization results in leaf sets being massively out of date. This concerns me less given the less-than realisic cause and the fact that everything eventually settles down (in ~120 seconds), but might bite me when I start thinking about join performance.
While I know that simulating 500 nodes takes 40% of my CPU time in the steady state (there is lots of sharing of leaf sets to make sure nothing has changed), this can be dramatically decreased by making the LeafSet data structure more efficient. Other than that, I’m just now considering the types of tests I desire to run. There are no serious benchmarks yet, but I hope to understand much more about the network performance such as:
1) Number of messages for joins
2) Number of bytes used for joins
3) Bandwidth needed for the steady state
4) Message rate in the steady state
Its not named! So job one is for me to figure out a name.
Code changes needed:
Remove nodes that don’t respond to stabilize messages (they aren’t deleted currently – they just persist).
Check the liveness of nodes appearing in the route table (only leaf set nodes are checked right now)
Generalize about the network – don’t mandate IPv4!
Polymorphic ‘Message a’ type – don’t assume H-DHT does the serialization to/from lazy bytestrings!
Stabilize is rudementary – check the sequence number to make sure responses are current!
Basic security/correctness mechanisms are still needed. When other nodes send route table rows and leaf sets we just add those entries into our own structure without any confirmation.
Implement Pastry proximity work – this will require the application to provide a function “proximity :: (NodeId, IPv4,Port) -> IO Int”.
Don’t always route to the optimal node, implement some jitter. Perhaps use the provably competative adaptive routing proposed by Awerbuch et al. for MANETs.