Single Cosmos DB for MongoDB collection unreachable — 503/SubStatus 21005 on management plane, MongoDB error 50 (MaxTimeMSExpired) on data plane, sibling collections healthy
TL;DR
On a serverless Azure Cosmos DB for MongoDB account in West Europe, exactly one collection is unreachable from every surface (portal, ARM, az CLI, pymongo data plane) while every other collection in the same database responds normally. Account-level health is green (0 throttles, 0% normalized RU consumption). The fault looks partition- or metadata-scoped to that one collection. I have a support request open in parallel; posting here in case anyone has seen this signature or can sanity-check the diagnosis.
Environment
- API: Azure Cosmos DB for MongoDB
- Capacity mode: Serverless (`EnableMongo` + `EnableServerless`)
- Region: West Europe, single region, no multi-write
- Backup policy: Periodic, 4h interval, 8h retention, Geo-redundant (no PITR)
- Affected collection: `mails`
- Driver: pymongo 4.13.1, `AsyncMongoClient`
- Connection settings: `serverSelectionTimeoutMS=60000`, `connectTimeoutMS=60000`, `socketTimeoutMS=120000`, `maxIdleTimeMS=120000`, `retryWrites=False`, `minPoolSize=1`
- TLS: default (TLS 1.2)
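For reproducibility, here is the exact client configuration as the kwargs we pass to `AsyncMongoClient` (connection string redacted; `build_client_kwargs` is a helper written for this post, not our production code):

```python
# Helper (for this post only) returning the keyword arguments we pass
# to pymongo.AsyncMongoClient; host and credentials are redacted.
def build_client_kwargs() -> dict:
    return {
        "serverSelectionTimeoutMS": 60000,  # 60 s server-selection deadline
        "connectTimeoutMS": 60000,          # 60 s TCP/TLS connect deadline
        "socketTimeoutMS": 120000,          # 120 s per-operation socket deadline
        "maxIdleTimeMS": 120000,            # recycle idle pooled connections
        "retryWrites": False,               # required by Cosmos DB for MongoDB
        "minPoolSize": 1,                   # keep one warm connection
    }

# In the application (pymongo 4.13.1):
#   from pymongo import AsyncMongoClient
#   client = AsyncMongoClient("<redacted connection string>", **build_client_kwargs())
```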
Symptoms
1. Management plane — fails
`GET /dbs/<db-name>/colls/mails`:
HTTP 503 ServiceUnavailable, SubStatus 21005; backend: "Server returned a 410 response."
36.6 s total across 3 retries (515 ms / 5008 ms cancelled / 30065 ms timeout).
`az cosmosdb mongodb collection throughput show -n mails`:
HTTP 504 GatewayTimeout, "Gateway did not receive a response from Microsoft.DocumentDB within the specified time period."
The Azure portal cannot open the collection (same backend error).
2. Data plane — fails
`find_one_and_update` upserts on `mails` time out server-side with MongoDB error code 50 (MaxTimeMSExpired). pymongo surfaces it as a `NetworkTimeout` after the 60 s deadline:
`<account>-westeurope.mongo.cosmos.azure.com:10255: timed out (configured timeoutMS: 60000.0ms, connectTimeoutMS: 60000.0ms)`
3. Telemetry plane — degraded
Azure Monitor returns no per-`CollectionName` values for `mails` on `DocumentCount`, `DataUsage`, or `PhysicalPartitionSizeInfo`, while every other collection in the same database reports normally. `PhysicalPartitionSizeInfo` filtered to `CollectionName='mails'` returns "Physical partitions found: 0".
4. Sibling collections — fine
All other collections respond normally on both management and data plane.
ActivityIds
- Management-plane GET (503/21005): `5671c1bb-7182-4ae4-ba21-3d3ca6e962db`
- Outer correlation: `24e9898a-4852-11f1-9358-2615164c6ab5`
Per Azure Monitor `MongoRequests`, the last three successful operations on `mails` were on 2026-04-30 (UTC) — that's the before/after window for whatever changed on the partition.
Collection shape (what I know without being able to read it)
- Default `_id` shard key — no custom shard key was ever set by our code.
- Volume is tiny: 6 ops in the 7 days before the incident (3 successful upserts on 2026-04-30, then nothing until 3 timeouts on 2026-05-04). Documents are ~1 KB each (one MJML email body + metadata).
- For comparison the largest collection on the account is 8.66 MB / 461 documents. Nothing here is partition-pressure-bound.
Timeline (UTC)
| Date | Event |
| --- | --- |
| 2026-02-06 | Account created |
| 2026-04-30 | `mails` upserts succeeded, 3/3 in MongoRequests |
| 2026-05-01 → 05-03 | No `mails` traffic |
| 2026-05-04 07:00 | First MongoRequests ErrorCode 50 on `mails` |
| 2026-05-04 10:00 | Second |
| 2026-05-04 13:00 | Third (matched application alert at 13:51 UTC) |
| 2026-05-04 → | 100% failure on `mails`; management-plane GET also fails |
Account-level health (green)
- `TotalRequests` StatusCode=429: 0 across the full day
- `NormalizedRUConsumption` peak: 0%
- All other collections: management reads <1s, data-plane writes succeed.
Hypothesis
Two readings of the telemetry both point to a fault scoped to the single collection:
- Degraded physical partition hosting `mails` — backend returns 410 to the gateway, surfacing as 503/SubStatus 21005 on the control plane and as MaxTimeMSExpired on the data plane.
- Orphaned/corrupted collection metadata — consistent with `PhysicalPartitionSizeInfo` reporting zero partitions for `mails` while siblings report normally; nothing healthy to route to.
Either way the blast radius is one collection on one account. Staging and dev (different subscriptions, different accounts) are unaffected.
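While the support case is open I'm restricting myself to a read-only liveness probe, so nothing mutates state or interferes with a possible platform-side recovery. A sketch of the polling loop, with the actual probe injected as a callable — in our case it wraps a read-only `estimated_document_count(maxTimeMS=2000)` on the collection; the helper itself is written for this post, not production code:

```python
import asyncio

async def probe_until_healthy(probe, interval_s: float = 300.0, max_attempts: int = 12) -> bool:
    """Poll a read-only probe until it succeeds or attempts run out.

    `probe` is any zero-argument async callable that raises on failure --
    here it wraps collection.estimated_document_count(maxTimeMS=2000),
    which reads but never writes, so it cannot disturb recovery.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            await probe()
            return True  # the collection answered: it is back
        except Exception:
            if attempt < max_attempts:
                await asyncio.sleep(interval_s)
    return False
```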
Questions
- Has anyone seen this signature — 503/SubStatus 21005 + backend 410 + zero physical partitions in `PhysicalPartitionSizeInfo` for one collection only — on Cosmos DB for MongoDB serverless? Is there a known platform-side cause (partition split, internal migration, region-specific incident) that produces this state?
- Anything I can try client-side that won't make recovery harder? My instinct is not to drop or recreate the collection because (a) it might break in-flight platform recovery and (b) the periodic backup is the only path back to the data.
- For a Periodic-backup account (no PITR), what's the realistic restore granularity? My read of the docs is that the smallest restore unit is the whole account into a new account, not a single collection — can anyone confirm whether collection-level restore exists on Periodic in 2026, or whether the account-into-new-account path is still the only option?