DynamoDB Backups with AWS Lambda and Go, Part One

This is the first post in a series about creating a Go-based AWS Lambda to back up DynamoDB tables. This post outlines the process of limiting the read capacity consumed by the backup. The resulting project is available on GitHub.

Inspiration

Perhaps the most important of all organizational best practices: backups, backups, backups. As it turns out, creating a backup system for a DynamoDB table is quick and easy thanks to AWS' DataPipeline product — but it's not cheap. Running a nightly backup of several tables can generate a surprisingly high cost, as each backup invocation requires EC2 and Hadoop resources and, regardless of the size of your tables, you'll be billed at least two hours for EC2 resources and one hour for Hadoop resources. For an organization I work with, DataPipeline backups were responsible for over half the total AWS spend!

Solution

After some research and several hours of work, I was able to create a Go-based solution to back up DynamoDB tables quickly, inexpensively, and, most importantly, reliably.

Prerequisites

At the time of writing, AWS Lambda does not natively support Go programs. Thankfully, the industrious folks at Eawsy have created the AWS Lambda Go Shim, a fantastic tool that handles the compilation and packaging of your Go program in a Lambda-friendly way. Unsurprisingly, the program also requires the AWS Go SDK to interact with AWS services.

Code

For small tables, or for tables with large provisioned read capacities, requesting objects from DynamoDB is simple:

input := &dynamodb.ScanInput{
  TableName: aws.String("dynamodb-test-table"),
  ConsistentRead: aws.Bool(true),
}

output, err := connection.Scan(input)
// connection is a *dynamodb.DynamoDB
// output is a *dynamodb.ScanOutput
// output.Items is a []map[string]*dynamodb.AttributeValue
// err should be checked before output is used

For larger data sets, it may be important to limit the consumed read capacity of the backup process to prevent throttling or otherwise impacting the performance of the table. At this point, I recommend taking a moment to read up on how AWS calculates throughput capacity consumption.

We can do this by inspecting the value of output.ConsumedCapacity.CapacityUnits (of type *float64). By default, this float64 pointer will be nil, and attempting to dereference it will cause an unplanned early exit of our program; when running Go programs on Lambda using the AWS Lambda Go Shim, a nil pointer dereference ends the program silently. To ensure that the value is available to use, we have to explicitly request that DynamoDB return it in our dynamodb.ScanInput struct:

input := &dynamodb.ScanInput{
  TableName: aws.String("dynamodb-test-table"),
  ConsistentRead: aws.Bool(true),
  ReturnConsumedCapacity: aws.String("TOTAL"),
}
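
Even with ReturnConsumedCapacity set, it's cheap insurance to check the pointer before dereferencing it. Here's a minimal sketch of the guard; the returned error and its message are just placeholders for whatever handling fits your program:

if output.ConsumedCapacity == nil || output.ConsumedCapacity.CapacityUnits == nil {
  // the capacity information we asked for didn't come back; treat this as a failed
  // scan instead of dereferencing a nil pointer
  return errors.New("scan response missing consumed capacity") // uses the standard "errors" package
}
consumed := *output.ConsumedCapacity.CapacityUnits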

To keep our consumed capacity below our self-imposed limit, we'll break the database scan operation into multiple smaller scans. DynamoDB calculates consumed capacity on a per-second basis, so it's convenient to use Go's time.Tick() function to create a channel that emits a time.Time value every second, rate limiting our requests. The loop that will be doing the heavy lifting looks like this:

for range time.Tick(time.Second) {  
  // request additional objects
  // serialize new objects
  // add serialized objects to backup data
}

As we're now breaking the scan operation into multiple smaller scans, we're faced with two new challenges: ensuring we're retrieving all available objects at the time of the backup (and retrieving them only once), and making an effort to remain under our consumed capacity limit.

The first challenge turns out to be simple enough: we can set the ExclusiveStartKey field on our dynamodb.ScanInput struct to tell DynamoDB the last item returned by the previous scan operation. In response, DynamoDB will return a set of objects beginning with the next object in the table. The dynamodb.ScanOutput struct returned by the Scan() method contains a LastEvaluatedKey field (of type map[string]*dynamodb.AttributeValue), populated whenever a scan stops before reaching the end of the table, that you can pass directly along to your next dynamodb.ScanInput struct, like so:

input := &dynamodb.ScanInput{
  TableName: aws.String("dynamodb-test-table"),
  ConsistentRead: aws.Bool(true),
  ReturnConsumedCapacity: aws.String("TOTAL"),
  ExclusiveStartKey: output.LastEvaluatedKey,
}
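
When LastEvaluatedKey comes back empty, the table has been fully scanned, which gives the loop its exit condition. Here's a rough sketch of the pagination loop built on the ticker above; error handling is elided, and the "collect" step is a placeholder for the serialization work covered in the next post:

for range time.Tick(time.Second) {
  output, err := connection.Scan(input)
  if err != nil {
    break // a real program should surface this error
  }

  // collect output.Items here

  if len(output.LastEvaluatedKey) == 0 {
    break // an empty LastEvaluatedKey means the table has been fully scanned
  }
  input.ExclusiveStartKey = output.LastEvaluatedKey
}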

Frustratingly, we can't request a maximum consumed capacity on DynamoDB operations, which makes the second challenge a bit more complicated: we'll have to inspect the consumed capacity of the previous scan and scale our next request accordingly. We can set a limit on the number of objects returned by a scan operation using the aptly-named Limit field.

I found it helpful to initially limit the scans to a single object: this is, obviously, the minimum number of objects that can be returned, and it allows us to scale up to experimentally find our capacity limit. Assuming we keep the returned object limit in a variable called scanLimit, our dynamodb.ScanInput struct is now initialized like so:

input := &dynamodb.ScanInput{
  TableName: aws.String("dynamodb-test-table"),
  ConsistentRead: aws.Bool(true),
  ReturnConsumedCapacity: aws.String("TOTAL"),
  ExclusiveStartKey: output.LastEvaluatedKey,
  Limit: &scanLimit,  // scanLimit is an int64
}

Calculating the new limit is simple enough using the dynamodb.ScanOutput struct returned by the Scan operation:

ratio := targetCapacity / *output.ConsumedCapacity.CapacityUnits  
// targetCapacity is a float64

scanLimit = int64(math.Floor(float64(scanLimit) * ratio))  
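
One detail worth guarding against, though it isn't shown above: if a single scan happens to consume more than the target capacity, the floor in this calculation can drive scanLimit to zero, and DynamoDB requires Limit to be at least 1. A small clamp covers it:

if scanLimit < 1 {
  scanLimit = 1 // DynamoDB requires Limit to be at least 1
}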

It's important to note here that this is where DynamoDB makes things difficult: despite returning output.ConsumedCapacity.CapacityUnits as a float64 pointer, the returned values are only ever integers represented as float64 with a value of at least 1. This reduces the precision of our limit calculation, as a scan of even a single small object will return a consumed capacity of 1.0. For this reason, I recommend using a minimum target capacity of at least 2. Here's an example of how scanLimit will scale in response to the returned consumed capacity value with a target value of 2:

Iteration 0
  scanLimit = 1
  *output.ConsumedCapacity.CapacityUnits = 1.0
  new scanLimit = floor(1 * (2.0 / 1.0)) = 2

Iteration 1
  scanLimit = 2
  *output.ConsumedCapacity.CapacityUnits = 1.0
  new scanLimit = floor(2 * (2.0 / 1.0)) = 4

Iteration 2
  scanLimit = 4
  *output.ConsumedCapacity.CapacityUnits = 2.0
  new scanLimit = floor(4 * (2.0 / 2.0)) = 4

In this example, a target read capacity of 2.0 will return 4 objects per second.
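
Putting the pieces together, a rough sketch of the whole rate-limited, paginated scan loop looks something like the following. It's illustrative rather than production-ready: targetCapacity is set to the example value of 2.0, the items slice is a stand-in for however your program accumulates results, and error handling is minimal.

scanLimit := int64(1)
targetCapacity := 2.0
var items []map[string]*dynamodb.AttributeValue

input := &dynamodb.ScanInput{
  TableName: aws.String("dynamodb-test-table"),
  ConsistentRead: aws.Bool(true),
  ReturnConsumedCapacity: aws.String("TOTAL"),
  Limit: &scanLimit,
}

for range time.Tick(time.Second) {
  output, err := connection.Scan(input)
  if err != nil {
    break // a real program should surface this error
  }

  items = append(items, output.Items...)

  if len(output.LastEvaluatedKey) == 0 {
    break // the table has been fully scanned
  }
  input.ExclusiveStartKey = output.LastEvaluatedKey

  // scale the next scan toward the target consumed capacity;
  // because Limit points at scanLimit, updating scanLimit adjusts the next request
  if output.ConsumedCapacity != nil && output.ConsumedCapacity.CapacityUnits != nil {
    ratio := targetCapacity / *output.ConsumedCapacity.CapacityUnits
    scanLimit = int64(math.Floor(float64(scanLimit) * ratio))
    if scanLimit < 1 {
      scanLimit = 1
    }
  }
}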

Pitfalls

The program described here is not appropriate for all DynamoDB tables. We're running this backup program as a Lambda, and with that come certain limitations:

  • Limited compute time (maximum five minutes)
  • Limited compute memory (maximum 1536 MB)
  • Read-only filesystem

As I've mentioned above, we're executing database scan operations at a rate of one per second, which means we can complete no more than 300 scan operations before the Lambda times out. By varying the target consumed capacity, we can vary the number of objects retrieved before a Lambda timeout. If a table's backup requires more execution time than that, or an unreasonably high target read capacity, that table should be backed up using AWS' out-of-the-box DataPipeline table backup system.

A Lambda's read-only filesystem requires all serialized objects to exist in memory. This, combined with the limited compute memory available, places a hard upper limit on the size of a table that can be backed up by this system. If a table's backup process fails due to a lack of free memory, that table should be backed up using AWS' out-of-the-box DataPipeline table backup system.

Next Steps

In the next post, I'll discuss the steps taken to serialize the data collected by our rate-limited DynamoDB scans.