Technical Memo | Roblox Engineering | Postmortem

DataStore Write Correctness: Handling Timeout-But-Committed Operations

A two-step lesson from production. The fix I shipped in March seemed obviously correct. About seven weeks later it produced a confirmed silent dupe.

In Roblox's DataStore system, write operations can return an error to the caller while having committed successfully on the backend. This single fact is the source of an entire class of state-corruption bugs — duplicated items, inflated currency, ghost transactions — that surface most often in transactional contexts: purchases, trades, mailboxes, anything that removes from one place and adds to another.

This article is the story of fixing one such vector twice. The first fix looked correct, passed code review, and shipped to production. About seven weeks later it produced a confirmed silent dupe at 2026-05-07 23:37:25 UTC. What follows is what I got wrong, why the error stayed invisible until the right circumstances aligned, and what I shipped instead.

The Original Problem

Our mailbox send flow looked like the canonical transactional pattern: remove the item from the sender locally, register a refund callback, then make the cross-account DataStore write. If anything failed downstream, the refund fired and put the item back. The implicit assumption was that an error response means the write didn't happen.

-- Take the item from the sender locally
itemTx:RemoveExact(item)
itemTx:Commit()

-- Register a refund: if anything fails downstream, give it back
tx:addCallback(function()
    refundTx:AddRelaxed(item:Clone())
    refundTx:Commit()
end)

-- Cross-account write: insert mail into recipient's mailbox
local success, _ = Datastore.Update(recipientMailboxKey, transformFn)
if not success then
    return "Send failed!"  -- transaction fails → refund fires
end

refundAvailable = false

In December 2025 this assumption produced a months-long dupe campaign. Roblox's :UpdateAsync had returned errors for writes that had actually committed — network glitches, gateway timeouts, 5xx responses from API Services — and our refund-on-failure had restored the sender's item while the recipient also held it. Hundreds of thousands of duplicated pet UIDs entered circulation before we identified the pattern.
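
To make the failure mode concrete, here is a minimal sketch in plain Lua with no Roblox APIs. `FakeStore`, its `failAfterCommit` flag, and `naiveSend` are hypothetical names invented for illustration. They model only the one behavior that matters here: the write lands, and then the caller is told it failed.

```lua
-- Hypothetical in-memory store, not a Roblox API: update() can commit
-- the new value and still report failure to the caller.
local FakeStore = {}
FakeStore.__index = FakeStore

function FakeStore.new()
    return setmetatable({ data = {}, failAfterCommit = false }, FakeStore)
end

function FakeStore:update(key, transformFn)
    local current = self.data[key] or { Inbox = {} }
    self.data[key] = transformFn(current)   -- the write commits here
    if self.failAfterCommit then
        return false, "502: reply lost"     -- but the caller sees an error
    end
    return true
end

-- Naive send: treats any error as "the write did not happen".
local function naiveSend(store, senderItems, recipientKey, item)
    senderItems[item.uid] = nil                 -- remove locally
    local ok = store:update(recipientKey, function(mailbox)
        table.insert(mailbox.Inbox, item)
        return mailbox
    end)
    if not ok then
        senderItems[item.uid] = item            -- refund on failure
        return "Send failed!"
    end
    return "Sent"
end

local store = FakeStore.new()
store.failAfterCommit = true
local pet = { uid = "pet-1" }
local sender = { [pet.uid] = pet }
local result = naiveSend(store, sender, "mailbox/B", pet)

local senderHasPet = sender[pet.uid] ~= nil
local recipientHasPet = store.data["mailbox/B"].Inbox[1] == pet
print(result, senderHasPet, recipientHasPet)
```

The send reports failure, yet both `senderHasPet` and `recipientHasPet` end up true: one item, two holders.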

The fix was clear in principle: don't trust an error response. Verify whether the write actually committed before allowing the refund.

Fix v1: Verify Before Refund

I added a verification step. Before the refund proceeded, do a follow-up read of the destination and check whether the write actually committed. If the mail was found in the recipient's inbox, block the refund. If it wasn't, the refund was safe.

local transformCommitted = false
local success, _ = Datastore.Update(recipientMailboxKey, function(oldMailbox)
    table.insert(oldMailbox.Inbox, mail)
    transformCommitted = true
    return oldMailbox
end)

if not success and transformCommitted then
    -- Verify before refunding
    -- Options-table form assumed here; `noCache=true` in call position is not valid Lua
    local recipientMailbox = Datastore.Get(recipientMailboxKey, { noCache = true })
    local foundInMailbox = false
    if recipientMailbox then
        for _, entry in ipairs(recipientMailbox.Inbox) do
            if entry.uuid == mail.uuid then
                foundInMailbox = true
                break
            end
        end
    end
    if foundInMailbox then
        refundAvailable = false  -- write committed, don't refund
    end
end

This shipped to production in March 2026. Code review approved it. Internal testing didn't reproduce a dupe. I treated the December campaign as closed.

The pattern looks defensible. The noCache=true option bypasses the local DataStore cache, so we should see the latest state. If the write committed, the verify finds the mail and blocks the refund. If the write genuinely didn't land, the verify won't find it and the refund proceeds.

It looks correct. It is not.

What I Missed

The verify is racing against an invisible clock.

noCache=true bypasses Roblox's local DataStore cache on the calling server. It does not bypass the backend's read-replica replication lag. When :UpdateAsync returns an error after a 5xx response, the underlying write may have committed on the primary without yet having propagated to the read replica that Datastore.Get consults. A verification read issued in the next few milliseconds can then return stale data and report "not committed" even though the write will be visible seconds later.

The verify is structurally unable to distinguish "the write didn't commit" from "the write committed but I can't see it yet." Both look identical to the calling code: a successful read that doesn't contain the expected entry.
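
The ambiguity can be reproduced with a toy model. This is a sketch under a stated assumption: `LaggyStore` is a hypothetical two-copy store invented for illustration, not a description of Roblox's actual topology. Writes go to a primary, reads come from a replica, and the replica only catches up when `replicate()` is called.

```lua
-- Hypothetical model: writes land on a primary, reads are served from a
-- replica that stays stale until replicate() runs.
local LaggyStore = {}
LaggyStore.__index = LaggyStore

function LaggyStore.new()
    return setmetatable({ primary = {}, replica = {} }, LaggyStore)
end

function LaggyStore:write(key, uuid)
    self.primary[key] = self.primary[key] or {}
    table.insert(self.primary[key], uuid)
end

function LaggyStore:read(key)
    return self.replica[key] or {}      -- never consults the primary
end

function LaggyStore:replicate()
    for key, inbox in pairs(self.primary) do
        local copy = {}
        for i, uuid in ipairs(inbox) do copy[i] = uuid end
        self.replica[key] = copy
    end
end

local function inboxContains(inbox, uuid)
    for _, entry in ipairs(inbox) do
        if entry == uuid then return true end
    end
    return false
end

local store = LaggyStore.new()
store:write("mailbox/B", "mail-123")    -- committed on the primary

-- A committed-but-unreplicated write and a write that never happened
-- produce the same verify result.
local seenImmediately = inboxContains(store:read("mailbox/B"), "mail-123")
local neverWritten = inboxContains(store:read("mailbox/C"), "mail-999")

store:replicate()                       -- the lag window closes
local seenLater = inboxContains(store:read("mailbox/B"), "mail-123")

print(seenImmediately, neverWritten, seenLater)
```

The first two results are both false, which is exactly the indistinguishability described above; only the read after replication returns true.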

Under normal load the eventual-consistency window is short enough that this rarely fires. But Roblox's DataStore is not always under normal load. During brief 5xx storms — internal API Services hiccups, dyno restarts, Mongo failovers — the window opens wider. And those are exactly the moments when timeouts are spiking.

I didn't know any of this in March. The pattern was discovered by independent telemetry I'd added for an unrelated reason: a 60-second post-refund re-verify that was meant to catch a different bug class. It instead caught fix v1 producing a dupe in production.
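
A post-refund re-verify of that kind might look roughly like the sketch below. The names `scheduleReverify`, `readInbox`, and `report` are hypothetical, and the delay function is injected so the example runs outside Roblox; in a Roblox server script, `task.delay` would be the natural thing to pass in.

```lua
-- Hypothetical helper: after a refund, re-read the recipient's inbox later
-- and report if the "failed" mail turned up after all.
-- delayFn(seconds, callback) is injected so the sketch runs without Roblox.
local function scheduleReverify(delayFn, readInbox, mailUuid, report, seconds)
    delayFn(seconds or 60, function()
        for _, entry in ipairs(readInbox()) do
            if entry.uuid == mailUuid then
                report("refunded_but_delivered", mailUuid)  -- a silent dupe
                return
            end
        end
        report("refund_confirmed", mailUuid)
    end)
end

-- Usage with a fake timer that queues callbacks instead of waiting.
local pending, events = {}, {}
local function fakeDelay(_, callback) table.insert(pending, callback) end
local function report(kind, uuid) table.insert(events, kind .. ":" .. uuid) end

local inbox = {}                            -- what the late read will see
scheduleReverify(fakeDelay, function() return inbox end, "mail-123", report)

table.insert(inbox, { uuid = "mail-123" })  -- the write becomes visible
for _, callback in ipairs(pending) do callback() end   -- "60 seconds later"

print(events[1])
```

The hook changes no state. It only records that a refund and a delivery both happened for the same mail, which is what makes it usable as independent evidence.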

The Production Incident

At 2026-05-07 23:37:25 UTC, account A sent a Huge Egg Piggy to account B via the in-game mailbox. The trace:

1. Sender's local item-removal committed.
2. Datastore.Update for the recipient's mailbox was called. Inside the transform, the mail entry was appended and transformCommitted was set.
3. The outer call returned: 502: API Services rejected request with error. Error code: 0 Reason: An internal server error occurred.
4. Fix v1's verify-before-refund block activated. It called Datastore.Get on recipientMailboxKey with noCache=true, looped through the returned inbox, and did not find the mail UUID.
5. With foundInMailbox = false, the refund proceeded. The sender's pet was restored.
6. Sixty seconds later, the independent re-verify hook re-read the recipient's mailbox and found the mail UUID present. The write had eventually settled.

The sender retained the pet. The recipient received the same UID. A confirmed silent dupe.

The verify did the right thing with the data it had. The data it had was wrong.

I'd hit the pitfall the entire fix had been designed to avoid. The refund-on-failure pattern from December had become refund-on-failed-verify, and the failed verify was lying to me.

Fix v2: Trust the Transform, Not the Verify

The flaw in verify-before-refund is that the verify is a probabilistic signal, not a deterministic one. A negative verify means "either the write didn't commit, or it did but I can't see it." Fix v1 was treating it as the first when it could be the second.

The signal that is deterministic is whether the transform itself ran. Inside :UpdateAsync, the transform function executes only if the request reached the backend and the read of the current value succeeded. The flag I already had — transformCommitted — was a stronger signal than the verify could ever be:

transformCommitted = false → the request never made it past the read; safe to refund.
transformCommitted = true → the transform ran and returned a value; the backend was attempting to write your new state.

When transformCommitted = true and the outer call still reports failure, the safe move is to assume the write committed regardless of what a verify read would say. The verify is too racy to be authoritative; the conservative refund-block is not.

local transformCommitted = false
local success, _ = Datastore.Update(recipientMailboxKey, function(oldMailbox)
    table.insert(oldMailbox.Inbox, mail)
    transformCommitted = true
    return oldMailbox
end)

if not success and transformCommitted then
    -- Conservative: the transform ran. The write probably committed, even
    -- if a verify read can't see it yet (eventual consistency lag). Don't
    -- refund — accept the risk that we occasionally lose an item to a
    -- genuine rollback in exchange for never producing a silent dupe.
    refundAvailable = false
end
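
For readers without a wrapper like Datastore.Update, here is one way the flag could be captured directly around an UpdateAsync-style call. This is a sketch, not the production code: `updateWithFlag` is a hypothetical helper, and the two fake stores stand in for a real DataStore object, which raises an error on failure and is therefore wrapped in `pcall`. One caveat to check against the DataStore documentation: the transform may be invoked more than once when writes conflict, which is harmless for a flag that only ever goes from false to true.

```lua
-- Hypothetical wrapper. `dataStore` is anything with an UpdateAsync(key, fn)
-- method that raises on failure; the fakes below stand in for the real thing.
local function updateWithFlag(dataStore, key, mutate)
    local transformRan = false
    local ok, result = pcall(function()
        return dataStore:UpdateAsync(key, function(oldValue)
            transformRan = true         -- set the moment the transform executes
            return mutate(oldValue)
        end)
    end)
    return ok, transformRan, result
end

-- Fake store: runs the transform, then raises as if the reply were lost.
local fakeLostReply = {}
function fakeLostReply:UpdateAsync(_, transformFn)
    transformFn({ Inbox = {} })
    error("502: API Services rejected request")
end

-- Fake store: raises before the transform is ever called.
local fakeNeverReached = {}
function fakeNeverReached:UpdateAsync()
    error("request failed before read")
end

local function addMail(mailbox)
    table.insert(mailbox.Inbox, { uuid = "mail-123" })
    return mailbox
end

local ok1, ran1 = updateWithFlag(fakeLostReply, "mailbox/B", addMail)
local ok2, ran2 = updateWithFlag(fakeNeverReached, "mailbox/B", addMail)
print(ok1, ran1, ok2, ran2)
```

Both calls fail, but only the first has `transformRan` set, and that difference is the whole basis for blocking the refund in one case and allowing it in the other.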

The verify can still be useful as telemetry — log what it returns so you can measure how often the eventual-consistency window bites you — but its result no longer drives the refund decision.
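
One possible shape for that split, with the verify demoted to a logged field, is sketched below. `decideRefund` and `log` are hypothetical names; the point is only that the return value depends on `success` and `transformRan` and never on what the verify read saw.

```lua
-- Hypothetical helper. The refund decision depends only on `success` and
-- `transformRan`; the verify result is recorded, never acted on.
local function decideRefund(success, transformRan, verifyFoundMail, log)
    if success then
        return false                    -- write confirmed: nothing to refund
    end
    if not transformRan then
        return true                     -- request never reached the transform
    end
    -- Ambiguous outcome: block the refund, keep the verify as a metric only.
    log({ outcome = "refund_blocked", verifyFoundMail = verifyFoundMail })
    return false
end

local logged = {}
local function log(entry) table.insert(logged, entry) end

local refundOnCleanFailure = decideRefund(false, false, false, log)
local refundOnAmbiguous = decideRefund(false, true, false, log)
local refundOnSuccess = decideRefund(true, true, true, log)

print(refundOnCleanFailure, refundOnAmbiguous, refundOnSuccess, #logged)
```

Counting the logged entries where `verifyFoundMail` is false and comparing them against a later re-read gives a direct measure of how often the verify would have been wrong.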

This shipped on 2026-05-08, the same day the silent dupe was identified and the day after it occurred. Zero silent dupes since.

The Trade-off

This pattern is conservative on purpose. It chooses one well-defined cost over a worse-defined one.

The cost we accept: when a write genuinely rolls back — the transform ran, the response said failed, and the write actually didn't commit — the sender visibly loses the item. From their perspective the send "failed" but their item is gone. They can ask support to restore it. This case is rare; a true rollback after a successful transform requires the backend to abort mid-write, which is much less common than a 5xx response that loses the reply but commits the data.

The cost we refuse: a silent dupe. Two players hold the same UID, the in-game economy inflates by exactly one rare item, and the bug is invisible to both players involved. The sender thinks the send failed and was refunded. The recipient thinks the send succeeded and they received it. The harm spreads through the trading market for as long as the duped UID stays in circulation.

The asymmetry is the point. A visibly-failed send with a support ticket is recoverable. A silent dupe is not.

What I'd Tell Past Me

Two things I wish I'd internalized before shipping fix v1:

A read after a write is not guaranteed to see that write right away. This sounds basic, and on a single-replica system it rarely matters. On any system with read replicas, Roblox DataStore included, the write-visibility lag is a real number that varies under load. Any "did my write commit?" check that happens within that window is racing. You can't fix this with cache-bypass options; cache-bypass only addresses local caching, not backend replication.
The deterministic signal is upstream of the verify. The transform either ran or it didn't. That fact is captured at the moment the transform body executes, not after the response comes back. If you have a way to record that the transform ran (a flag set inside the function passed to :UpdateAsync), that signal is harder to fool than any post-hoc read.

When you see "verify before compensating" as a pattern, ask: is the verify happening within the eventual-consistency window of the system you're verifying against? If yes, the verify is noise.

Why This Generalizes

The pattern is not specific to Roblox DataStore. Any system where you take action against state, observe an error response, and consider a compensating action faces the same hazard.

The HTTP equivalents are the most common. A POST to a remote service that returns a 5xx may have committed; a refund based on the assumption it didn't is the same dupe shape. The same transformCommitted-style signal applies wherever you can observe whether your request reached the server's processing path: a database transaction's pre-commit phase, an idempotency-key write that survived the reply loss, a queue's pre-acknowledgment receipt.

Idempotency keys are the more robust long-term solution: if every request carries a unique key and the server dedupes on it, retries become safe and the entire timeout-but-committed class of bugs goes away. But idempotency keys require coordinating both sides; the conservative refund-block is a one-sided fix that you can ship without upstream changes.
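
As a rough illustration of the server-side half, the sketch below dedupes deliveries on a caller-supplied key. `Mailbox`, `deliver`, and the key format are hypothetical; a real implementation would also need to persist the applied keys atomically with the inbox write, which this in-memory toy does not attempt.

```lua
-- Hypothetical receiving side: remembers which idempotency keys it has
-- already applied, so a retried request is acknowledged but not re-applied.
local Mailbox = {}
Mailbox.__index = Mailbox

function Mailbox.new()
    return setmetatable({ inbox = {}, appliedKeys = {} }, Mailbox)
end

function Mailbox:deliver(idempotencyKey, item)
    if self.appliedKeys[idempotencyKey] then
        return "duplicate"              -- already applied; safe to ack again
    end
    self.appliedKeys[idempotencyKey] = true
    table.insert(self.inbox, item)
    return "applied"
end

local mailbox = Mailbox.new()
local first = mailbox:deliver("send-42", { uid = "pet-1" })
-- The reply to the first call is lost, so the sender retries with the same key.
local retry = mailbox:deliver("send-42", { uid = "pet-1" })

print(first, retry, #mailbox.inbox)
```

With deduplication in place the sender can keep retrying until it gets a definitive answer, instead of having to choose between refunding and blocking on an ambiguous error.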

If you operate transactional state at scale on Roblox DataStore — or anywhere with eventual-consistency reads — audit your refund paths for two things: whether they have any verify-then-decide logic that could race against backend lag, and whether they have a transformCommitted-style flag inside the transform that can drive the conservative-block decision. The first is a bug waiting for the right load conditions; the second is the fix.

Production source: BIG Games, Pet Simulator 99. Fix v1 deployed 2026-03-18. Silent dupe identified 2026-05-08 from independent post-refund instrumentation. Fix v2 deployed 2026-05-08.
