S3 — Simple Storage Service — Is it Really?

Zach Saw
5 min read · Jul 2, 2019

Note: This article is now outdated and only serves as a reminder of the struggles we went through when S3 was eventually consistent. S3 as of December 2020 offers strong read-after-write consistency — https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/

S3 is a versatile storage service — in fact, Jeff Bezos wanted it to be the malloc of the internet when the service was first created. In other words, it was going to be the RAM of the cloud. That is a noble aspiration and a logical one, since distributed storage systems are not dissimilar to the CPU cache + RAM hierarchy we have on a multicore, multisocket server machine.

Lambda (serverless) cemented this further by making S3 the shared memory between different Lambda functions, or between separate invocations of the same function. This comes up a lot when you’re building Step Functions workflows on AWS.

At this point, I expect reactions from 2 groups of readers.

First group: This makes sense. S3 is the most natural place to store shared data for Lambda.
Second group: Alarm bells ringing. Using S3 by itself for shared data is a big no-no!

Most people belong to the first group. We naively did too, thinking “if you want a Simple Storage Service on AWS, use S3.”

S3 is NOT Simple

Here is where the name S3 oversells and underdelivers. A name is a powerful marketing tool, and “simple” implies that it is simple from the user’s perspective. In S3’s case, it is simple from AWS’ perspective: simple because it doesn’t have to deal with the complexities of strong consistency (which the CAP theorem tells us is notoriously hard to achieve in a distributed system). In fact, the first version of S3 did not offer strong consistency even for PUTs of new objects, unlike the version we have today. In Bezos’ RAM analogy, this is roughly equivalent to a weak memory model.

In the CPU world, an OS running on a weak-memory-model architecture would crash before it even finished booting if the architecture didn’t provide FENCE instructions. Without going into too much detail, a fence essentially asks the CPU to commit any outstanding writes to system memory so that other CPUs can see the updates. The S3 equivalent would be sending a bunch of objects and asking the service to commit them before returning 200 OK. Except there’s no such thing on S3.

Eventual Consistency

Therein lies the eventual consistency problem. Since you can’t issue a FENCE instruction:

  • You won’t know if the object you are GET-ting after an update is stale.
  • You can’t LIST to get the latest objects (LIST is eventually consistent too).
  • For versioned objects, you can get a 404 Not Found even if you have methodically made sure you don’t delete older versions until you’ve successfully PUT new ones.

The “Quick Explanation of the S3 Consistency Model” article explains this in greater detail. Note that neither the article nor the official AWS documentation says anything about ordering guarantees (i.e. it could still be eventually consistent with in-order propagation), but we raised a ticket with AWS Support and had it confirmed that S3 is out-of-order eventually consistent: the weakest kind of eventual consistency, and the cause of the behaviour described in the last bullet point.

This means that across different Lambda invocations, you need to design around the fact that the persisted data returned from S3 will sometimes be stale (as an engineer, you should assume it is stale all the time, even if S3 could give you a p99 propagation latency of 1 ms).
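To make the Lambda scenario concrete, here is a minimal boto3 sketch (the bucket and key names are hypothetical) of the naive read-after-overwrite pattern that looked correct but was unsafe under the old consistency model:

```python
import boto3

s3 = boto3.client("s3")

# Invocation A overwrites an existing key with new state...
s3.put_object(Bucket="my-state-bucket", Key="state.json", Body=b'{"step": 2}')

# ...and invocation B reads it back moments later. Under the old eventually
# consistent model this could still return the previous body ('{"step": 1}'),
# with a 200 OK and no indication that the data is stale.
body = s3.get_object(Bucket="my-state-bucket", Key="state.json")["Body"].read()
```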

How do we work around this problem?

The world has collectively spent countless man-hours working around S3’s eventual consistency problem, AWS included: Amazon Elastic MapReduce’s consistent view was derived from Netflix’s own effort, S3mper. Unfortunately, there’s no universal solution to this problem. Every use case requires a different solution if you want to keep your costs reasonable, on AWS at least (Azure Storage, by contrast, has a strong consistency model).

In cases where you can guarantee your payload will be less than 400KB, DynamoDB is your best bet (ed: 400K of memory is more than anyone will ever need on the cloud?). Keep in mind that it can be really costly though: storage on DynamoDB is more than an order of magnitude more expensive than S3. If you can’t, you’ll have to combine DynamoDB with S3 to get strong consistency. The latter is the solution we went for, since our payloads can be bigger than 400KB, and why not be cost conscious while we’re at it?

DynamoDB to the rescue

Step-by-step Workaround

  1. Update the object on S3 by creating a new version instead of overwriting it in place
  2. Store the new version ID in DynamoDB
  3. Delete the old version from S3 (to keep storage costs down)
  4. When GET-ting the object from S3, first read the version ID from DynamoDB (with the strong consistency flag set to true), then GET that specific version
  5. (Optional) Set a lifecycle rule on the bucket to expire noncurrent versions, in case the delete fails due to a network issue or an S3 outage

Note: a PUT of a versioned object in S3 is always strongly consistent when we GET it with the version ID returned by the PUT request; it is equivalent to a GET of a new object, which is strongly consistent.
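Below is a minimal boto3 sketch of the whole pattern under these assumptions: the bucket has versioning enabled, and the bucket, table and attribute names are hypothetical. Error handling is deliberately left out here (more on that below).

```python
import boto3

s3 = boto3.client("s3")
ddb = boto3.client("dynamodb")

BUCKET = "my-versioned-bucket"  # hypothetical; versioning must be enabled
TABLE = "object-pointers"       # hypothetical; partition key is the S3 key

def put_consistent(key: str, body: bytes) -> str:
    # Steps 1-2: write a new S3 version, then record its ID in DynamoDB.
    new_version = s3.put_object(Bucket=BUCKET, Key=key, Body=body)["VersionId"]
    old = ddb.put_item(
        TableName=TABLE,
        Item={"pk": {"S": key}, "version_id": {"S": new_version}},
        ReturnValues="ALL_OLD",  # gives us the superseded version ID, if any
    )
    # Step 3: delete the superseded version to keep storage costs down.
    if "Attributes" in old:
        s3.delete_object(Bucket=BUCKET, Key=key,
                         VersionId=old["Attributes"]["version_id"]["S"])
    return new_version

def get_consistent(key: str) -> bytes:
    # Step 4: strongly consistent read of the pointer, then a GET of that
    # exact version, which S3 serves consistently.
    item = ddb.get_item(TableName=TABLE, Key={"pk": {"S": key}},
                        ConsistentRead=True)["Item"]
    return s3.get_object(Bucket=BUCKET, Key=key,
                         VersionId=item["version_id"]["S"])["Body"].read()
```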

Extra Cost

This, however, means that we incur costs on both S3 and DynamoDB. Fortunately, DynamoDB’s GET/PUT request costs (RCU/WCU costs) come to only around a quarter of S3’s, since we only need 1 RCU/WCU per GET/PUT: we are reading/storing just the object’s version ID, not the whole object. That puts the combined request bill at roughly 1.25x S3 alone, i.e. only ~25% extra to guarantee strong consistency.

While the additional AWS cost was not significant, the entire exercise was very costly in man-hours and engineering effort. This stems from the fact that DynamoDB does not support transactions that span S3, so error handling can never be fully robust. We had to come up with fairly convoluted logic to cover all the possible failure permutations. Even then, it is still not 100% robust, and it never will be. That will have to do for now.
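To give a flavour of what that error handling looks like, here is a hedged sketch of just one failure path, reusing the clients and names from the sketch above. It is nowhere near the full permutation matrix we ended up covering.

```python
from botocore.exceptions import ClientError

def put_consistent_with_cleanup(key: str, body: bytes) -> None:
    new_version = s3.put_object(Bucket=BUCKET, Key=key, Body=body)["VersionId"]
    try:
        old = ddb.put_item(
            TableName=TABLE,
            Item={"pk": {"S": key}, "version_id": {"S": new_version}},
            ReturnValues="ALL_OLD",
        )
    except ClientError:
        # The pointer was never updated, so readers still see the old version.
        # Best effort: remove the orphaned new S3 version, then re-raise.
        s3.delete_object(Bucket=BUCKET, Key=key, VersionId=new_version)
        raise
    # If this delete fails instead, the superseded version lingers until the
    # bucket's noncurrent-version expiry rule (step 5) eventually reaps it.
    if "Attributes" in old:
        s3.delete_object(Bucket=BUCKET, Key=key,
                         VersionId=old["Attributes"]["version_id"]["S"])
```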

TL;DR

To answer the titular question: S3’s eventual consistency model makes it very convoluted to persist data across Lambda function invocations. While there are several ways around this problem (including some not covered in this article, such as a Redis cache in place of DynamoDB, or Aurora in place of S3), we chose to store our S3 objects’ version IDs in DynamoDB so that we can rely on S3’s strong consistency for PUTs and GETs of new object versions to keep the data we read strongly consistent.
