
Cosmos: Route PKRange resolution through driver #4342

Open
simorenoh wants to merge 8 commits into release/azure_data_cosmos-previews from pkrange-cache-cutover

Conversation

@simorenoh
Member

Migrate read_feed_ranges() and feed_range_from_partition_key() in ContainerClient to use the driver's partition key range cache instead of the SDK-layer cache. This is the first step toward removing the SDK's redundant cache infrastructure.

Changes

  • cosmos_driver.rs: Add resolve_all_partition_key_ranges() and resolve_partition_key_ranges_for_key() public methods to CosmosDriver, exposing the driver's internal PKRange cache for SDK consumption.
  • container_client.rs: Rewrite read_feed_ranges() and feed_range_from_partition_key() to delegate to the driver instead of maintaining a separate SDK-layer cache with its own HTTP calls.
  • feed_range.rs: Remove SDK-specific adapter methods (from_partition_key_range, from_epk_range) that are no longer needed now that the driver returns ranges directly.
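As a rough sketch of the delegation described above (the types here are simplified stand-ins, not the actual crate APIs):

```rust
// Illustrative sketch only: simplified stand-ins for the driver and
// client types named above, showing the delegation shape after this PR.

#[derive(Clone, Debug, PartialEq)]
struct PartitionKeyRange {
    min_inclusive: String,
    max_exclusive: String,
}

struct CosmosDriver {
    // Stand-in for the driver's internal PKRange cache.
    cached_ranges: Vec<PartitionKeyRange>,
}

impl CosmosDriver {
    // Mirrors the new resolve_all_partition_key_ranges(): None signals
    // that the routing map could not be resolved.
    fn resolve_all_partition_key_ranges(&self) -> Option<Vec<PartitionKeyRange>> {
        if self.cached_ranges.is_empty() {
            None
        } else {
            Some(self.cached_ranges.clone())
        }
    }
}

struct ContainerClient {
    driver: CosmosDriver,
}

impl ContainerClient {
    // After this PR, read_feed_ranges() no longer consults an SDK-layer
    // cache with its own HTTP calls; it simply asks the driver.
    fn read_feed_ranges(&self) -> Option<Vec<PartitionKeyRange>> {
        self.driver.resolve_all_partition_key_ranges()
    }
}
```

The point of the shape is that the client holds no range state of its own, which is what makes the SDK-layer cache removable in the follow-up.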

Motivation

The SDK previously maintained its own PartitionKeyRangeCache that duplicated the driver's cache. After the PPAF/PPCB cutover (#4156), both caches existed in parallel. This PR routes all PKRange resolution through the driver, making the SDK cache unused and ready for removal in a follow-up PR.

Follow-up

A subsequent PR will remove the now-dead SDK-layer cache infrastructure (partition_key_range_cache.rs, collection_routing_map.rs, old pipeline, retry policies, etc.).

Migrate read_feed_ranges() and feed_range_from_partition_key() in
ContainerClient to use the driver's partition key range cache instead
of the SDK-layer cache. This is the first step toward removing the
SDK's redundant cache infrastructure.

Changes:
- Add resolve_all_partition_key_ranges() and
  resolve_partition_key_ranges_for_key() public methods to CosmosDriver
- Rewrite ContainerClient::read_feed_ranges() to call driver
- Rewrite ContainerClient::feed_range_from_partition_key() to call driver
- Remove SDK-specific FeedRange adapter code (from_partition_key_range,
  from_epk_range) that is no longer needed

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@simorenoh simorenoh added the Cosmos The azure_cosmos crate label May 4, 2026
@simorenoh simorenoh marked this pull request as ready for review May 4, 2026 22:07
@simorenoh simorenoh requested a review from a team as a code owner May 4, 2026 22:07
Copilot AI review requested due to automatic review settings May 4, 2026 22:07
Documents the PR split strategy, remaining deletions, and test
coverage gaps for the SDK-to-driver migration.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment


Pull request overview

Routes ContainerClient feed-range APIs through the azure_data_cosmos_driver partition key range cache, removing the SDK-layer routing-map adapters as a first step toward deleting redundant SDK caching infrastructure.

Changes:

  • Add CosmosDriver::{resolve_all_partition_key_ranges, resolve_partition_key_ranges_for_key} as public APIs to expose PK range resolution via the driver cache.
  • Update ContainerClient::{read_feed_ranges, feed_range_from_partition_key} to delegate PK range resolution to the driver.
  • Remove SDK-only FeedRange adapter helpers that were tied to the SDK routing map types.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

  • sdk/cosmos/azure_data_cosmos/src/feed_range.rs: Removes SDK-routing-map-specific conversion helpers/tests; keeps driver PartitionKeyRange conversion.
  • sdk/cosmos/azure_data_cosmos/src/clients/container_client.rs: Switches feed-range APIs to call into the driver for PK range resolution.
  • sdk/cosmos/azure_data_cosmos_driver/src/driver/cosmos_driver.rs: Adds public driver methods to resolve all PK ranges or PK ranges for a given partition key.

simorenoh and others added 4 commits May 4, 2026 18:24
Return an error instead of an empty Vec when the driver's PKRange
cache resolves to zero ranges after a forced refresh. A valid
container always has at least one partition key range, so an empty
result indicates a service/auth failure that should not be silently
swallowed.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address review findings:

1. Full-key point-lookup miss now returns Some(vec![]) instead of None,
   allowing the caller's empty-check retry to fire after partition splits.
   Previously, None short-circuited to an error before the retry logic.

2. Skip redundant retry when force_refresh was already true — avoids
   hitting the service twice with identical parameters.

3. Log a warning when EPK computation fails instead of silently
   discarding the error via .ok()?.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
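The shape of the fix in points 1 and 2 can be sketched like this (resolve() is a hypothetical stand-in for the driver lookup, not the actual API):

```rust
// Sketch of the empty-result retry described in points 1 and 2 above.
// resolve() is a hypothetical stand-in: here the cache looks stale until
// a forced refresh repopulates it.

fn resolve(force_refresh: bool) -> Vec<u32> {
    if force_refresh {
        vec![1, 2, 3]
    } else {
        vec![]
    }
}

fn resolve_with_retry(force_refresh: bool) -> Vec<u32> {
    let ranges = resolve(force_refresh);
    // Only retry when the first attempt was not already a forced refresh;
    // otherwise we would hit the service twice with identical parameters.
    if ranges.is_empty() && !force_refresh {
        return resolve(true);
    }
    ranges
}
```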
A valid container always has at least one partition key range. When the
driver's cache falls back to an empty routing map (e.g., service
unreachable or parse failure), return None instead of Some(vec![]) so
callers get a clear failure signal rather than a silently empty result.

This applies to both resolve_all_partition_key_ranges (empty ranges
slice) and the full-key branch of resolve_partition_key_ranges_for_key
(empty routing map before EPK lookup). The prefix-key branch already
returns None via resolve_overlapping_ranges when the underlying map
has no entries.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
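The failure-signal convention this commit settles on reduces to the following (try_lookup is a hypothetical stand-in for the driver cache lookup):

```rust
// Sketch of the None-vs-empty convention described above: a valid
// container always has at least one PK range, so an empty routing map
// is surfaced as None (a clear failure) rather than Some(vec![]).
// try_lookup is a hypothetical stand-in, not the actual driver API.

fn try_lookup(fetched: Vec<u32>) -> Option<Vec<u32>> {
    if fetched.is_empty() {
        None // e.g. service unreachable or parse failure
    } else {
        Some(fetched)
    }
}
```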
Add tests covering the key scenarios for resolve_all/resolve_for_key:
- Full key point lookup returns single range
- Prefix key (MultiHash) returns overlapping ranges
- Empty routing map (service failure) returns empty vec
- Fetch failure returns None from try_lookup
- Force refresh repopulates after transient empty cache
- All ranges returned from try_lookup in correct order

These tests validate the cache behavior exercised by
CosmosDriver::resolve_all_partition_key_ranges() and
resolve_partition_key_ranges_for_key().

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@simorenoh simorenoh force-pushed the pkrange-cache-cutover branch from c771c30 to 98b9250 Compare May 5, 2026 00:53
@simorenoh
Member Author

/azp run rust - cosmos - weekly

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Comment thread sdk/cosmos/azure_data_cosmos/src/clients/container_client.rs Outdated
Comment on lines 34 to 37
#[allow(dead_code)]
pub(crate) global_endpoint_manager: Arc<GlobalEndpointManager>,
#[allow(dead_code)]
pub(crate) global_partition_endpoint_manager: Arc<GlobalPartitionEndpointManager>,
Member


Same here, let's yeet the dead code now, rather than waiting.

/// Uses the driver's internal `PartitionKeyRangeCache`. When `force_refresh`
/// is `true`, the cached routing map is refreshed from the service before
/// returning results. Returns `None` if the routing map cannot be resolved.
pub async fn resolve_all_partition_key_ranges(
Member


We should consider using FeedRange terminology here. I'm going to be moving FeedRange to the driver as part of the Feed Operations, since it needs to be accessible there. I'd argue that at this point, the driver is the only component that needs to see the actual physical partition metadata (PK Range ID) and everything above it can and should work on FeedRanges.

Member


+1

Member Author


will be tracked in a follow-up

.ordered_partition_key_ranges()

if ranges.is_empty() && !options.force_refresh() {
// A valid container always has at least one partition key range.
Member


Would we even allow an empty range in the cache? Why?

Member Author

@simorenoh simorenoh May 6, 2026


This is part of the existing logic on the pk range cache that was already merged to the driver; I was just accommodating it here - but you're right, this shouldn't happen. Will be tracked in a follow-up.

///
/// Full keys return a single-element `Vec`. Prefix keys on MultiHash
/// containers return one or more feed ranges.
pub async fn feed_range_from_partition_key(
Member

@FabianMeiswinkel FabianMeiswinkel May 5, 2026


This name is terrible IMO - maybe get_overlapping_feed_range? Why do we even need this API? We will need one API to produce a FeedRange for a PartitionKey - but just the EPK of the PK - then a helper function that checks whether FeedRanges are overlapping - maybe whether one feed range is fully contained in another - but that is all you need. This API gives the wrong impression of being able to identify the "physical partition" for a PK - and that is nothing that ever works reliably - so we should also not give that impression. IMO this API should just be removed - if customers use it, replace it with the scalable helper functions we have in Java and .NET as well.

Member Author

@simorenoh simorenoh May 6, 2026


yep absolutely - and no, we don't need it like this. already discussed offline but this should be what you're describing, not what it is currently. will be tracked in a follow-up.

Member


Mostly just summarizing offline conversations here for posterity

I think we can land on FeedRange::for_partition_key (this would basically be the only way to create a FeedRange as a user, but it makes sense for it to live on FeedRange since it doesn't involve any I/O, just hashing). That also aligns with the Python behavior (whereas this Rust code is actually incorrect because it does indeed find the physical partition that currently contains that PK; which isn't what we want)
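A sketch of the shape being proposed here (everything below is illustrative: the hash is a placeholder via DefaultHasher, not the real Cosmos effective-partition-key hash, and the helper names are assumptions):

```rust
// Illustrative sketch of the proposed FeedRange::for_partition_key: a
// pure, I/O-free constructor that derives a range by hashing the key.
// DefaultHasher is a placeholder, NOT the real Cosmos EPK hash.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Clone, Debug, PartialEq)]
struct FeedRange {
    min_inclusive: String,
    max_exclusive: String,
}

impl FeedRange {
    fn for_partition_key(pk: &str) -> FeedRange {
        let mut hasher = DefaultHasher::new();
        pk.hash(&mut hasher);
        let epk = format!("{:016X}", hasher.finish());
        // A single key maps to a minimal [epk, epk + "FF") range.
        let max_exclusive = format!("{epk}FF");
        FeedRange { min_inclusive: epk, max_exclusive }
    }

    // The kind of overlap helper suggested in the review: two half-open
    // ranges overlap iff each starts before the other ends.
    fn overlaps(&self, other: &FeedRange) -> bool {
        self.min_inclusive < other.max_exclusive
            && other.min_inclusive < self.max_exclusive
    }
}
```

Because construction is pure hashing, it never identifies a physical partition, which is exactly the impression the current API is being faulted for giving.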

.collect())
if ranges.is_empty() && !options.force_refresh() {
// Empty result may indicate a stale cache — retry with refresh.
let ranges = self
Member


Like above - when would this ever happen? We should prevent it instead of trying to work around it.

Member Author


ditto, will be tracked in a follow-up

@github-project-automation github-project-automation Bot moved this from Todo to Changes Requested in CosmosDB Go/Rust Crew May 5, 2026
simorenoh and others added 2 commits May 6, 2026 10:38
Remove global_endpoint_manager and global_partition_endpoint_manager
fields from ClientContext — they were stored but never read through
this struct. The GatewayPipeline holds its own Arc references.

Remove _database_link parameter from ContainerClient::new() since
the driver resolves container metadata by ID directly.

Remove DatabaseClient.link field (ResourceLink) which was only used
to pass to ContainerClient::new().

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The cutover-remaining-work.md is a planning artifact that should
stay local rather than in the repository.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member

@analogrelay analogrelay left a comment


LGTM given the discussed follow-ups


Labels

Cosmos (The azure_cosmos crate), Created by copilot

Projects

Status: Changes Requested

Development

Successfully merging this pull request may close these issues.

Cosmos Driver: Switch read_feed_ranges / feed_range_from_partition_key to driver routing map

4 participants