GDB with Multi-Threaded Rust

24 Oct 2015

While working on a networked, multi-threaded application written in Rust, I recently found myself in a situation where execution of some serialization code I had written was causing one of the network nodes to hang for some inexplicable reason. I am writing this up in hopes that someone will find the Serde or GDB-specific parts useful, but also to wax poetic about Rust and open source libraries.

The troublesome serialization code I had written was as follows:

impl ser::Serialize for NetNodeTable {
    fn serialize<S: ser::Serializer>(&self, s: &mut S)
          -> Result<(), S::Error> {
        s.visit_struct("NetNodeTable", NetNodeTableVisitor {
            value: self,
            state: 0
        })
    }
}

struct NetNodeTableVisitor<'a> {
    value: &'a NetNodeTable,
    state: u8,
}

impl<'a> ser::MapVisitor for NetNodeTableVisitor<'a> {
    fn visit<S: ser::Serializer>(&mut self, s: &mut S)
            -> Result<Option<()>, S::Error> {
        Ok(Some(try!(s.visit_struct_elt("MQUINN_KEY", "MQUINN_VALUE"))))
    }
}

The Serde library is being used here; specifically, dependencies serde=0.6.1 and serde_json=0.6.0. This code starts with the implementation of the Serialize trait for the NetNodeTable struct, but to keep it as simple as possible, I didn’t actually visit the elements of that struct in the implementation of the MapVisitor trait for NetNodeTableVisitor. Rather, I simply “visited” a hard-coded key of "MQUINN_KEY" and value of "MQUINN_VALUE". My expectation was that the following simple JSON:

{
  "MQUINN_KEY": "MQUINN_VALUE"
}

would be logged at level INFO by the following calling code:

let ref node_table = *node_table.read().unwrap();
let s = serde_json::to_string_pretty(&node_table).unwrap();
info!("FSM: sync'ed node table to disk: {}", s);

I thought this was straightforward code and had no idea why it would cause the entire application to hang, with the info! call never being reached. At first I thought a write lock for node_table was still in scope somewhere, but ruled that out with some well-placed print statements. I narrowed the culprit down to the NetNodeTableVisitor::visit function, which for some reason was refusing to return.

At this point I realized that I needed a debugger, and was glad to learn that there’s a GDB wrapper for Rust called rust-gdb, which comes by default with the Rust binary. I opened two terminals, one for each of two networked nodes, and ran the following commands in each terminal after cd‘ing to the target/debug subdirectory of the application’s Cargo tree:

  • $ rust-gdb <application-name>
  • (gdb) r --config ../../conf/<node>.conf

at which point both network nodes started and spawned the necessary threads, visible in GDB as:

Starting program: node --config ../../conf/nodeB.conf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff6381700 (LWP 2032)]
[New Thread 0x7ffff53ff700 (LWP 2033)]
[New Thread 0x7ffff51fe700 (LWP 2034)]
[New Thread 0x7ffff4ffd700 (LWP 2035)]
[New Thread 0x7ffff47ff700 (LWP 2036)]
[New Thread 0x7ffff45fe700 (LWP 2037)]
[New Thread 0x7fffe73ff700 (LWP 2039)]
[New Thread 0x7fffe71fe700 (LWP 2040)]
[New Thread 0x7fffe6ffd700 (LWP 2041)]
[New Thread 0x7fffe6dfc700 (LWP 2042)]
[New Thread 0x7fffe6bfb700 (LWP 2043)]
[Thread 0x7ffff51fe700 (LWP 2034) exited]
[New Thread 0x7ffff51fe700 (LWP 2060)]

I reproduced the situation where one of the nodes would hang when executing my serialization code, then used Ctrl+C to interrupt the instance of GDB debugging that particular node. However, the backtrace GDB provided for the main thread:

Program received signal SIGINT, Interrupt.
0x00007ffff68244db in pthread_join (threadid=140737324259072, thread_return=0x0)
    at pthread_join.c:92
92  pthread_join.c: No such file or directory.

is not useful here, because we need the backtrace for the thread running the serialization code. To have GDB display the backtrace for all threads, I ran (gdb) thread apply all backtrace and scrolled through the threads until I found the relevant one:

...
#15 0x000055555559da63 in certchain::ser::Serializer<W, F>.ser::Serializer::visit_map<certchain::network::NetNodeTableVisitor> (self=0x7ffff637ef78, 
    visitor=NetNodeTableVisitor = {...})
    at /home/mquinn/.cargo/registry/src/github.com-0a35038f75765ae4/serde_json-0.6.0/src/ser.rs:266
#16 0x000055555559d77a in certchain::ser::Serializer::visit_struct<serde_json::ser::Serializer<&mut collections::vec::Vec<u8>, serde_json::ser::PrettyFormatter>,certchain::network::NetNodeTableVisitor> (self=0x7ffff637ef78, _name="NetNodeTable", 
    visitor=NetNodeTableVisitor = {...})
    at /home/mquinn/.cargo/registry/src/github.com-0a35038f75765ae4/serde-0.6.1/src/ser/mod.rs:274
#17 0x000055555559d726 in certchain::network::NetNodeTable.ser::Serialize::serialize<serde_json::ser::Serializer<&mut collections::vec::Vec<u8>, serde_json::ser::PrettyFormatter>>
    (self=0x7ffff542f088, s=0x7ffff637ef78) at src/network.rs:142
...

I’ve elided the irrelevant parts of the trace. At frame #17, you can see my call to serialize an instance of NetNodeTable. When I call visit_struct, that results in a call to serde’s ser module (frame #16), which in turn calls visit_map on my NetNodeTableVisitor argument, resulting in a call to serde_json’s re-exposed ser module in frame #15. Frame #15 points us to line 266 of that module, which turns out to be:

while let Some(()) = try!(visitor.visit(self)) { }

Serde continues to call visit on the visitor the as long as it returns a Some(()) value, which my code above always does. To resolve my hanging issue, I needed to change my visitor to return a None value after “visiting” my hard-coded element:

impl<'a> ser::MapVisitor for NetNodeTableVisitor<'a> {
    fn visit<S: ser::Serializer>(&mut self, s: &mut S)
            -> Result<Option<()>, S::Error> {
        try!(s.visit_struct_elt("x", "MQUINN"));
        Ok(None)
    }
}

Although I can’t find a mention of this nuance in the official Serde documentation, the examples in the README.md of Serde’s GitHub repo do return Ok(None). If I had read that closer I probably could have avoided this rabbit hole, but apparently I had to learn this particular lesson the hard way.

This lesson was a relatively painless one as far as debugging situations go, and it has only increased my enthusiasm for Rust and the open source libraries being written in it. I was able to use a GDB backtrace to identify lines in a third-party library relevant to my issue, then work backwards from there to determine what I was doing wrong. Thanks to Cargo, and of course to the fact that Serde is open source, I didn’t have to go hunting on the Internet for the relevant source code, it was all right there in my home directory. The core developers of Rust have shepherded many complex details into a coherent whole, and experiences such as the one detailed in this post only make me more impressed.