Go Patterns: Retries

by about Go, Software Engineering in Technology

When working on microservices or any network-related code, retries are a must. Go has a few neat features that help with creating an easy-to-use retry library.

Let’s face it: networks are unreliable. When working in our cozy dev environment, everything seems fine. We can just send off an HTTP call to another service without a care in the world. The only reason the call can fail is because either we messed up (HTTP 400), or the service messed up (HTTP 500).

However, out in the wild, in the production world, things are different, partly because networks are inherently unreliable: when a network link is near capacity it can start dropping packets, routing issues can cause connections to break, a service could have a brief (unintentional) outage during an upgrade, DNS could fail, or a sysadmin could accidentally loosen the network cable so that it randomly disconnects. (Yes, I’ve done that in my sysadmin days, which cost the owner of that server a fair few gray hairs.)

You’d think these problems occur so infrequently that they are not even worth dealing with. In small systems with a few hosts this might even be true. However, when you start hitting half a dozen servers, work on geographically distributed systems, or start working with third-party APIs, you’ll start encountering these problems a lot. You might not even see them in the logs, because network problems often have knock-on effects due to missing error handling.

If you want to make your system more robust you may want to be on the lookout for code like this:

func FetchInformation() (io.Reader, error) {
    resp, err := http.Get("http://example.com/")
    if err != nil {
        return nil, err
    }
    return resp.Body, nil
}

There are no retries and barely any error handling. If this HTTP call fails for any reason it is not retried. If this call is part of a complex process, the process will most likely be stuck in the middle. (What’s worse, this code doesn’t even check the HTTP status code in the response.)
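
To make the status code remark concrete, here is a minimal sketch of what such a check could look like. The names checkStatus, fetchChecked and statusAccepted are made up for this illustration, and treating every non-2xx response as an error is an assumption that may not fit every API.

```go
package main

import (
    "fmt"
    "io"
    "net/http"
)

// checkStatus returns an error for any response outside the 2xx range.
// (Assumption for this sketch: only 2xx counts as success.)
func checkStatus(resp *http.Response) error {
    if resp.StatusCode < 200 || resp.StatusCode > 299 {
        return fmt.Errorf("unexpected status code: %d", resp.StatusCode)
    }
    return nil
}

// fetchChecked is a hypothetical variant of FetchInformation that checks
// the status and closes the body when it is not going to return it.
func fetchChecked(url string) (io.ReadCloser, error) {
    resp, err := http.Get(url)
    if err != nil {
        return nil, err
    }
    if err := checkStatus(resp); err != nil {
        _ = resp.Body.Close()
        return nil, err
    }
    return resp.Body, nil
}

// statusAccepted reports whether checkStatus lets a response with the
// given status code through. It needs no network access.
func statusAccepted(code int) bool {
    return checkStatus(&http.Response{StatusCode: code}) == nil
}

func main() {
    fmt.Println(statusAccepted(http.StatusOK), statusAccepted(http.StatusInternalServerError))
    // prints: true false
}
```

Returning an io.ReadCloser instead of an io.Reader lets the caller close the body when it is done reading.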

We are all human by nature and just want to get things done. This is why most developers (often yours truly too) code for the happy path. This is especially true when something seems fairly complex and bothersome to add in.

Basic retries

Before we go straight to the elegant solution, let’s take a look at a basic example of retries. We’ll extend the previous example with a for loop and track the number of retries. We’ll also wait 5 seconds between retries.

func FetchInformation() (io.Reader, error) {
    tries := 0
    for {
        resp, err := http.Get("http://example.com/")
        if err == nil {
            return resp.Body, nil
        }
        tries++
        if tries > 3 {
            return nil, fmt.Errorf("failed to fetch information (last error: %w)", err)
        }
        <-time.After(5 * time.Second)
    }
}

This example is fairly simple, but you can already see why coding for the happy path is so common: it’s a lot of boilerplate code. Don’t worry, we’ll take a look at a much more elegant solution soon, but let’s explore a little further first.

Using contexts

The previous example only counted the number of retries. What if we wanted to add a timeout instead? In Go, contexts are a standardized way to pass on context information. We’ll add the context parameter to our function and then use the select block to stop when either the retry timer runs out, or the context expires.

func FetchInformation(
    ctx context.Context,
) (io.Reader, error) {
    for {
        resp, err := http.Get("http://example.com/")
        if err == nil {
            return resp.Body, nil
        }
        select {
        case <-ctx.Done():
            return nil, fmt.Errorf("timeout while fetching information (last error: %w)", err)
        case <-time.After(5 * time.Second):
        }
    }
}

We can then call the FetchInformation() function like this:

ctx, cancel := context.WithTimeout(
    context.Background(),
    10 * time.Second,
)
defer cancel()
reader, err := FetchInformation(ctx)

Always cancel your contexts!
Be sure to always call cancel() on your contexts. A timeout context holds on to a timer and other resources until it expires or is cancelled, and these will linger longer than necessary if you don’t. Golangci-lint can help you discover these issues in your code.
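
One thing the loop above leaves open is that http.Get() itself knows nothing about ctx, so a request that is already in flight keeps running after the context expires. A possible way to close that gap, sketched here with the hypothetical helpers fetchOnce and cancelledFetchFails, is to build the request with http.NewRequestWithContext():

```go
package main

import (
    "context"
    "fmt"
    "net/http"
)

// fetchOnce performs a single GET request that is bound to ctx. When ctx
// is cancelled or times out, the request is aborted as well.
func fetchOnce(ctx context.Context, url string) (*http.Response, error) {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    return http.DefaultClient.Do(req)
}

// cancelledFetchFails shows the effect: with an already cancelled context
// the call returns an error instead of a response.
func cancelledFetchFails() bool {
    ctx, cancel := context.WithCancel(context.Background())
    cancel()
    resp, err := fetchOnce(ctx, "http://example.com/")
    if err == nil {
        _ = resp.Body.Close()
    }
    return err != nil
}

func main() {
    fmt.Println(cancelledFetchFails())
}
```

Inside the retry loop, fetchOnce(ctx, ...) would take the place of the plain http.Get() call.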

Creating a reusable retry function

“All right, there is no way I’m adding that everywhere!” I hear you say. And you are right! Adding this everywhere would produce a lot of boilerplate code.

Let’s create a reusable retry function that we can just elegantly plop into our code to add retries without much hassle. Next to the context parameter we’ll request a function to run. This function will return only an error.

In our loop we’ll run this function and handle the error like in the previous example.

func retry(
    ctx context.Context,
    what func() error,
) error {
    for {
        err := what()
        if err == nil {
            return nil
        }
        select {
        case <-ctx.Done():
            return fmt.Errorf("timeout (%w)", err)
        case <-time.After(5 * time.Second):
        }
    }
}

Now we can use this function in our previous example. First, we’ll add named returns to the function to make our lives easier. Then we call the retry function and pass an anonymous function to it.

In the anonymous function we’ll do our work and even throw in some extra error handling for bad status codes. If everything went well, we’ll set the result return variable of the outer function.

func FetchInformation(
    ctx context.Context,
) (result io.Reader, err error) {
    err = retry(
        ctx,
        func() error {
            resp, err := http.Get("http://example.com/")
            if err != nil {
                return err
            }
            if resp.StatusCode != 200 {
                _ = resp.Body.Close()
                return fmt.Errorf(
                    "invalid status code from example call: %d",
                    resp.StatusCode,
                )
            }
            result = resp.Body
            return nil
        })
    return
}

This might be a bit tricky to wrap your head around. We want our retry() to be generic, so we can’t include return parameters. That’s why we are setting the return variable of the outer function from the anonymous inner function.
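
If your project can use Go 1.18 or newer, type parameters offer an alternative to the named-return trick. The following is only a sketch of that idea, not the approach used in the rest of this article; retryValue, its delay parameter and the demo function are invented for the example.

```go
package main

import (
    "context"
    "errors"
    "fmt"
    "time"
)

// retryValue is a hypothetical generic sibling of retry(): it calls what()
// until it succeeds or ctx is done, and hands back the successful value.
func retryValue[T any](
    ctx context.Context,
    delay time.Duration,
    what func() (T, error),
) (T, error) {
    for {
        value, err := what()
        if err == nil {
            return value, nil
        }
        select {
        case <-ctx.Done():
            var zero T
            return zero, fmt.Errorf("timeout (%w)", err)
        case <-time.After(delay):
        }
    }
}

// demo retries a function that fails twice before succeeding. It stands in
// for a flaky network call.
func demo() (int, int, error) {
    calls := 0
    value, err := retryValue(context.Background(), time.Millisecond,
        func() (int, error) {
            calls++
            if calls < 3 {
                return 0, errors.New("temporary failure")
            }
            return 42, nil
        })
    return value, calls, err
}

func main() {
    fmt.Println(demo())
    // prints: 42 3 <nil>
}
```

The trade-off is that the function to retry has to return exactly one value plus an error, while the closure-based version works for any number of results.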

Adding some charm

So far so good, that retry() function really makes it easy to add multiple tries to your calls. However, it seems a bit of a hassle to always deal with contexts. What if we added flexible retry options?

If you think about it, there are three points in our retry code:

  1. After running the function, we need to decide if the error can be retried.
  2. We need to wait for either a backoff timer to expire, or a timeout to happen.
  3. Depending on if the backoff timer expired, or a timeout happened, we either retry or exit.

So, let’s write an interface for it. We create the CanRetry function to decide if the error can be retried. Then we’ll add a function that returns a wait timer. This function will return a channel that becomes readable when the timer expires, either because a value is sent on it (which is how time.After works) or because it is closed (which is how ctx.Done() works). Since we want to support multiple channel types we’ll return interface{}. Finally, we’ll add a hook that lets the strategy decide what to do if its timer expires first. It can either return an error to abort the retry loop or nil to do another retry.
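
The interface definition itself does not appear in the text. Going by the description and by how the methods are called in the retry function, it might look roughly like the sketch below. The method signatures are reconstructed, and fixedDelay is a made-up strategy that only serves to show the interface in use.

```go
package main

import (
    "errors"
    "fmt"
    "time"
)

// RetryStrategy is reconstructed from the description: one method per
// decision point in the retry loop.
type RetryStrategy interface {
    // CanRetry reports whether the error is worth retrying at all.
    CanRetry(err error) bool
    // WaitTimer returns a channel (of any element type) that becomes
    // readable when this strategy's timer expires, or nil for no timer.
    WaitTimer(err error) interface{}
    // OnWaitTimerExpired is called when this strategy's timer fired first.
    // Returning an error aborts the loop, returning nil retries.
    OnWaitTimerExpired(err error) error
}

// fixedDelay is a hypothetical strategy: every error is retryable and the
// next attempt happens after a constant delay.
type fixedDelay struct{ delay time.Duration }

func (f fixedDelay) CanRetry(error) bool            { return true }
func (f fixedDelay) WaitTimer(error) interface{}    { return time.After(f.delay) }
func (f fixedDelay) OnWaitTimerExpired(error) error { return nil }

// Compile-time check that fixedDelay satisfies the interface.
var _ RetryStrategy = fixedDelay{}

func main() {
    var s RetryStrategy = fixedDelay{delay: time.Millisecond}
    err := errors.New("boom")
    <-s.WaitTimer(err).(<-chan time.Time) // wait for the timer to fire
    fmt.Println(s.CanRetry(err), s.OnWaitTimerExpired(err))
    // prints: true <nil>
}
```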

Let’s add that to our retry function. We’ll take the strategies as variadic parameters and loop over them if an error happens. If any of the strategies indicates that the error shouldn’t be retried, the loop exits.

Next, we compile a list of channels from the WaitTimer() function of the passed strategies. We will use a special form of the Go select clause from the reflection library. Once we have compiled the list of channels, we’ll call select and receive whichever strategy returned first. Finally, we’ll call the OnWaitTimerExpired() function on the strategy as discussed before.
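
Before the full function, here is the reflect.Select() mechanism in isolation. The helper names firstReady and demo are invented for this sketch; the point is that the list of channels is built at run time and may mix element types, which a regular select statement cannot do.

```go
package main

import (
    "fmt"
    "reflect"
    "time"
)

// firstReady blocks until one of the given channels can be received from
// and returns the index of that channel.
func firstReady(channels ...interface{}) int {
    cases := make([]reflect.SelectCase, 0, len(channels))
    for _, ch := range channels {
        cases = append(cases, reflect.SelectCase{
            Dir:  reflect.SelectRecv,
            Chan: reflect.ValueOf(ch),
        })
    }
    chosen, _, _ := reflect.Select(cases)
    return chosen
}

// demo waits on three channels of different types. Only the last one is
// ready, because receiving from a closed channel never blocks.
func demo() int {
    never := make(chan struct{}) // nothing is ever sent here
    done := make(chan struct{})
    close(done)
    return firstReady(never, time.After(time.Hour), done)
}

func main() {
    fmt.Println(demo())
    // prints: 2
}
```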

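func retry(
    what func() error,
    strategies ...RetryStrategy,
) error {
    for {
        err := what()
        if err == nil {
            return nil
        }
        // Stop right away if any strategy considers the error non-retryable.
        for _, strategy := range strategies {
            if !strategy.CanRetry(err) {
                return err
            }
        }
        // Collect the wait channels and remember which strategy owns each one,
        // since strategies without a timer are skipped.
        var timerChannels []reflect.SelectCase
        var waiting []RetryStrategy
        for _, strategy := range strategies {
            timerChannel := strategy.WaitTimer(err)
            if timerChannel != nil {
                timerChannels = append(timerChannels, reflect.SelectCase{
                    Dir:  reflect.SelectRecv,
                    Chan: reflect.ValueOf(timerChannel),
                    Send: reflect.Value{},
                })
                waiting = append(waiting, strategy)
            }
        }
        // Wait for the first timer to fire and let its strategy decide.
        firstChannelNumber, _, _ := reflect.Select(timerChannels)
        strategy := waiting[firstChannelNumber]
        if err := strategy.OnWaitTimerExpired(err); err != nil {
            return err
        }
    }
}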

For a real world scenario you’d want to add some default strategies, but more on that later. Let’s implement the previously discussed context expiry as a strategy.

In this strategy we will always return true in the CanRetry() function since we only want to abort if the context expires. We simply return the channel from ctx.Done() and return an error if this strategy is the first to finish.

type contextStrategy struct {
    ctx context.Context
}

func (c *contextStrategy) CanRetry(err error) bool {
    return true
}
func (c *contextStrategy) WaitTimer(_ error) interface{} {
    return c.ctx.Done()
}
func (c *contextStrategy) OnWaitTimerExpired(err error) error {
    return fmt.Errorf("timeout (%w)", err)
}
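
As an idea of what one such default strategy could look like, here is a self-contained sketch of a strategy that waits a fixed delay between attempts and gives up after a set number of tries, much like the counter in the basic example. The maxTries type and the attempts helper are hypothetical, and the RetryStrategy interface is reconstructed from the description above.

```go
package main

import (
    "errors"
    "fmt"
    "reflect"
    "time"
)

// RetryStrategy as reconstructed from the article's description.
type RetryStrategy interface {
    CanRetry(err error) bool
    WaitTimer(err error) interface{}
    OnWaitTimerExpired(err error) error
}

// maxTries is a hypothetical strategy: it waits delay between attempts
// and aborts once limit attempts have failed.
type maxTries struct {
    limit int
    delay time.Duration
    used  int
}

func (m *maxTries) CanRetry(error) bool         { return true }
func (m *maxTries) WaitTimer(error) interface{} { return time.After(m.delay) }
func (m *maxTries) OnWaitTimerExpired(err error) error {
    m.used++
    if m.used >= m.limit {
        return fmt.Errorf("giving up after %d tries (%w)", m.used, err)
    }
    return nil
}

// retry runs what() until it succeeds or a strategy aborts the loop.
func retry(what func() error, strategies ...RetryStrategy) error {
    for {
        err := what()
        if err == nil {
            return nil
        }
        for _, strategy := range strategies {
            if !strategy.CanRetry(err) {
                return err
            }
        }
        var cases []reflect.SelectCase
        var waiting []RetryStrategy
        for _, strategy := range strategies {
            if ch := strategy.WaitTimer(err); ch != nil {
                cases = append(cases, reflect.SelectCase{
                    Dir:  reflect.SelectRecv,
                    Chan: reflect.ValueOf(ch),
                })
                waiting = append(waiting, strategy)
            }
        }
        chosen, _, _ := reflect.Select(cases)
        if err := waiting[chosen].OnWaitTimerExpired(err); err != nil {
            return err
        }
    }
}

// attempts retries a function that always fails and reports how often it
// was called before the strategy gave up.
func attempts(limit int) (int, error) {
    calls := 0
    err := retry(
        func() error { calls++; return errors.New("still failing") },
        &maxTries{limit: limit, delay: time.Millisecond},
    )
    return calls, err
}

func main() {
    fmt.Println(attempts(3))
    // prints: 3 giving up after 3 tries (still failing)
}
```

A strategy like this can be passed alongside a contextStrategy; whichever timer fires first decides what happens next.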

There are still a few things to be implemented for a production-ready retry function. If you want a more complete example you can take a look at our example on GitHub. There are, of course, a plethora of libraries on GitHub that do this too.