
Single Cosmos DB for MongoDB collection unreachable — 503/SubStatus 21005 on management plane, MongoDB error 50 (MaxTimeMSExpired) on data plane, sibling collections healthy

Grote, Marcel 0 Reputation points
2026-05-05T11:08:31.36+00:00


TL;DR

On a serverless Azure Cosmos DB for MongoDB account in West Europe, exactly one collection is unreachable from every surface (portal, ARM, az CLI, pymongo), while every other collection in the same database responds normally. Account-level health is green (0 throttles, 0% normalized RU), so the fault looks scoped to that one collection's partition or metadata. I have a support request open in parallel; posting here in case anyone has seen this signature or can sanity-check the diagnosis.

Environment

  • API: Azure Cosmos DB for MongoDB
  • Capacity mode: Serverless (EnableMongo + EnableServerless)
  • Region: West Europe, single region, no multi-write
  • Backup policy: Periodic, 4h interval, 8h retention, Geo-redundant (no PITR)
  • Affected collection: mails
  • Driver: pymongo 4.13.1, AsyncMongoClient
  • Connection settings: serverSelectionTimeoutMS=60000, connectTimeoutMS=60000, socketTimeoutMS=120000, maxIdleTimeMS=120000, retryWrites=False, minPoolSize=1
  • TLS: default (TLS 1.2)
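For completeness, this is roughly how the client is configured. Account name and key are placeholders; the option values are exactly those listed above, spelled out as a connection string with stdlib urlencode so the settings are explicit. (pymongo client construction is lazy, so nothing here opens a connection.)

```python
from urllib.parse import urlencode

# Placeholder: substitute the real account name; <key> stays a placeholder.
ACCOUNT = "myaccount"

# The exact option values from the environment list above.
options = {
    "serverSelectionTimeoutMS": 60000,
    "connectTimeoutMS": 60000,
    "socketTimeoutMS": 120000,
    "maxIdleTimeMS": 120000,
    "retryWrites": "false",
    "minPoolSize": 1,
    "tls": "true",  # Cosmos DB requires TLS; default negotiates TLS 1.2
}

uri = (
    f"mongodb://{ACCOUNT}:<key>@{ACCOUNT}.mongo.cosmos.azure.com:10255/"
    f"?{urlencode(options)}"
)

# With pymongo installed, the client is then built as:
#   from pymongo import AsyncMongoClient
#   client = AsyncMongoClient(uri)
```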

Symptoms

1. Management plane — fails

GET /dbs/<db-name>/colls/mails:

HTTP 503 ServiceUnavailable, SubStatus 21005
backend: "Server returned a 410 response."
36.6s total across 3 retries (515ms / 5008ms cancelled / 30065ms timeout)

az cosmosdb mongodb collection throughput show -n mails:

HTTP 504 GatewayTimeout
"Gateway did not receive a response from Microsoft.DocumentDB
 within the specified time period."

Azure portal cannot open the collection (same backend error).

2. Data plane — fails

find_one_and_update upserts on mails time out server-side with MongoDB error code 50 (MaxTimeMSExpired). pymongo surfaces it as NetworkTimeout after the 60s deadline:

<account>-westeurope.mongo.cosmos.azure.com:10255: timed out
(configured timeoutMS: 60000.0ms, connectTimeoutMS: 60000.0ms)
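The distinction here matters for triage: error code 50 means the backend accepted the query and killed it on its own server-side deadline, while a bare socket timeout means the client gave up with no server verdict at all. A minimal sketch of that triage, using a stand-in exception class so it runs offline (pymongo's real OperationFailure carries the server code in its .code attribute):

```python
def classify_failure(exc):
    """Coarse triage of a failed Mongo operation.

    Error code 50 (MaxTimeMSExpired) means the server reached the data
    and killed the query on its own deadline; an exception carrying no
    server code means the client timed out with no server verdict.
    """
    code = getattr(exc, "code", None)
    if code == 50:
        return "server-side MaxTimeMSExpired"
    if code is not None:
        return f"server error {code}"
    return "client-side timeout, no server verdict"


# Stand-in for pymongo.errors.OperationFailure so the sketch runs offline.
class FakeOperationFailure(Exception):
    def __init__(self, code):
        super().__init__(f"error {code}")
        self.code = code


print(classify_failure(FakeOperationFailure(50)))   # server-side MaxTimeMSExpired
print(classify_failure(TimeoutError("timed out")))  # client-side timeout, no server verdict
```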

3. Telemetry plane — degraded

Azure Monitor returns no per-CollectionName values for mails on DocumentCount, DataUsage, or PhysicalPartitionSizeInfo, while every other collection in the same database reports normally. PhysicalPartitionSizeInfo filtered to CollectionName='mails' returns "Physical partitions found: 0".

4. Sibling collections — fine

All other collections respond normally on both management and data plane.

ActivityIds

  • Management-plane GET (503/21005): 5671c1bb-7182-4ae4-ba21-3d3ca6e962db
  • Outer correlation: 24e9898a-4852-11f1-9358-2615164c6ab5

Per Azure Monitor MongoRequests, the last three successful operations on mails were on 2026-04-30 (UTC) — that's the before/after window for whatever changed on the partition.

Collection shape (what I know without being able to read it)

  • Default _id shard key — no custom shard key was ever set by our code.
  • Volume is tiny: 6 ops in the 7 days before the incident (3 successful upserts on 2026-04-30, then nothing until 3 timeouts on 2026-05-04). Documents are ~1 KB each (one MJML email body + metadata).
  • For comparison the largest collection on the account is 8.66 MB / 461 documents. Nothing here is partition-pressure-bound.

Timeline (UTC)

  • 2026-02-06: account created
  • 2026-04-30: mails upserts succeeded, 3/3 in MongoRequests
  • 2026-05-01 → 05-03: no mails traffic
  • 2026-05-04 07:00: first MongoRequests ErrorCode 50 on mails
  • 2026-05-04 10:00: second
  • 2026-05-04 13:00: third (matched application alert at 13:51 UTC)
  • 2026-05-04 onward: 100% failure on mails; management-plane GET also fails

Account-level health (green)

  • TotalRequests StatusCode=429: 0 across the full day
  • NormalizedRUConsumption peak: 0%
  • All other collections: management reads <1s, data-plane writes succeed.

Hypothesis

Two readings of the telemetry both point to a fault scoped to the single collection:

  1. Degraded physical partition hosting mails — backend returns 410 to the gateway, surfacing as 503/SubStatus 21005 on the control plane and as MaxTimeMSExpired on the data plane.
  2. Orphaned/corrupted collection metadata — consistent with PhysicalPartitionSizeInfo reporting zero partitions for mails while siblings report normally; nothing healthy to route to.

Either way the blast radius is one collection on one account. Staging and dev (different subscriptions, different accounts) are unaffected.

Questions

  1. Has anyone seen this signature — 503/SubStatus 21005 + backend 410 + zero physical partitions in PhysicalPartitionSizeInfo for one collection only — on Cosmos DB for MongoDB serverless? Is there a known platform-side cause (partition split, internal migration, region-specific incident) that produces this state?
  2. Anything I can try client-side that won't make recovery harder? My instinct is to not drop or recreate the collection because (a) it might break in-flight platform recovery and (b) the periodic backup is the only path back to the data.
  3. For a Periodic-backup account (no PITR), what's the realistic restore granularity? My read of the docs is that the smallest restore unit is the whole account into a new account, not a single collection — can anyone confirm whether collection-level restore exists on Periodic in 2026, or whether the account-into-new-account path is still the only option?
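On question 2, the only thing I'm prepared to run client-side is a read-only probe with a short server-side bound, so a hung partition fails fast instead of consuming the full 60s client deadline. A sketch with a stand-in collection object so it runs offline; the find_one / max_time_ms shape follows pymongo's Collection API, and the real handle would be the mails collection:

```python
def probe(collection, max_time_ms=2000):
    """Non-destructive probe: a bounded read, never a write.

    Returns ('ok', doc) or ('failed', exc). max_time_ms is enforced
    server-side, so a degraded partition surfaces quickly rather than
    hanging until the client deadline.
    """
    try:
        doc = collection.find_one({}, max_time_ms=max_time_ms)
        return ("ok", doc)
    except Exception as exc:
        return ("failed", exc)


# Stand-in so the sketch runs without a live cluster; with pymongo the
# real handle would be client["<db-name>"]["mails"].
class FakeCollection:
    def find_one(self, flt, max_time_ms=None):
        return {"_id": "probe"}


status, payload = probe(FakeCollection())
print(status)  # ok
```

Running the same probe against a healthy sibling collection gives a side-by-side confirmation that the fault is scoped to mails without mutating anything.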
Azure Cosmos DB

An Azure NoSQL database service for app development.
1 answer

  1. Q&A Assist
    2026-05-05T11:08:45.24+00:00
    AI-generated content. This answer was created with AI from Q&A Assist.
    1. The context does not describe this exact signature (single Cosmos DB for MongoDB collection unreachable with 503/SubStatus 21005, backend 410, and Mongo error 50 while sibling collections are healthy), so no platform-side root cause or known incident pattern can be confirmed from it.
    2. The context does describe high-level disaster recovery behavior for Azure Cosmos DB accounts by configuration. For a single-region account, the recommended actions during an outage are:
      • Wait for service restoration while monitoring Service Health and Resource Health.
      • If the outage exceeds the recovery time objective, request an account restore from backup to another region via Azure Support.
      • Plan for multi-region deployment to avoid future single-region outages.
      For multi-region accounts, the platform can route around regional outages automatically or perform a failover, but the issue described here is scoped to one collection, which the context does not cover. Since the context offers no guidance for a collection-level fault of this type, no supported client-side action can be recommended from it beyond opening and working with Azure Support and, for multi-region accounts, using the documented failover mechanisms when appropriate.
    3. Regarding restore granularity with periodic backup: the context only states that for any account configuration and for data corruption or accidental deletion, the recovery options are:
      • Point-in-time restore (continuous backup), or
      • Restore from periodic backup.
      The detailed ARM templates in the context show how to specify createMode and restoreParameters for MongoDB collections, including restoreSource and restoreTimestampInUtc. However, they do not state whether, under periodic backup, the supported operational granularity in 2026 is account-level only, or whether single-collection restore is exposed as a managed operation. The context therefore does not allow confirming or denying collection-level restore support for periodic backup in 2026.

    Because of these gaps, the only fully supported guidance from the context is to treat this as a data corruption/outage scenario and work with Azure Support to restore from backup or use the documented disaster recovery paths at the account level.


