Managing Goroutines
It’s surprisingly easy to start goroutines. Unfortunately, it isn’t quite as easy to orchestrate their cleanup. Avoiding deadlocks is also challenging. Most often this boils down to an ordering problem, where a goroutine receiving on a go-chan exits before the upstream goroutines sending on it.
Why care at all, though? Simply put, an orphaned goroutine is a memory leak. Memory leaks in long-running daemons are bad, especially when the expectation is that your process will be stable when all else fails.
To further complicate things, a typical nsqd process has many goroutines involved in message delivery. Internally, message “ownership” changes often. To be able to shut down cleanly, it’s incredibly important to account for all intraprocess messages.
Although there aren’t any magic bullets, the following techniques make it a little easier to manage…
WaitGroups
The sync package provides sync.WaitGroup, which can be used to perform accounting of how many goroutines are live (and provide a means to wait on their exit).
To reduce the typical boilerplate, nsqd uses this wrapper:
type WaitGroupWrapper struct {
	sync.WaitGroup
}

func (w *WaitGroupWrapper) Wrap(cb func()) {
	w.Add(1)
	go func() {
		cb()
		w.Done()
	}()
}

// can be used as follows:
wg := WaitGroupWrapper{}
wg.Wrap(func() { n.idPump() })
...
wg.Wait()
Exit Signaling
The easiest way to trigger an event in multiple child goroutines is to provide a single go-chan that you close when ready. All pending receives on that go-chan will activate, rather than having to send a separate signal to each goroutine.
func work() {
	exitChan := make(chan int)
	go task1(exitChan)
	go task2(exitChan)
	time.Sleep(5 * time.Second)
	close(exitChan)
}

func task1(exitChan chan int) {
	<-exitChan
	log.Printf("task1 exiting")
}

func task2(exitChan chan int) {
	<-exitChan
	log.Printf("task2 exiting")
}
Synchronizing Exit
It was quite difficult to implement a reliable, deadlock-free exit path that accounted for all in-flight messages. A few tips:
- Ideally the goroutine responsible for sending on a go-chan should also be responsible for closing it.
- If messages cannot be lost, ensure that pertinent go-chans are emptied (especially unbuffered ones!) to guarantee senders can make progress.
- Alternatively, if a message is no longer relevant, sends on a single go-chan should be converted to a select with the addition of an exit signal (as discussed above) to guarantee progress.
- The general order should be:
  - Stop accepting new connections (close listeners)
  - Signal exit to child goroutines (see above)
  - Wait on WaitGroup for goroutine exit (see above)
  - Recover buffered data
  - Flush anything left to disk
Logging
Finally, the most important tool at your disposal is to log the entrance and exit of your goroutines! It makes it far easier to identify the culprit in the case of deadlocks or leaks.
nsqd log lines include information to correlate goroutines with their siblings (and parent), such as the client’s remote address or the topic/channel name.
The logs are verbose, but not verbose to the point where the log is overwhelming. There’s a fine line, but nsqd leans towards the side of having more information in the logs when a fault occurs rather than trying to reduce chattiness at the expense of usefulness.