Cosmos: Route PKRange resolution through driver#4342
Cosmos: Route PKRange resolution through driver#4342simorenoh wants to merge 8 commits intorelease/azure_data_cosmos-previewsfrom
Conversation
Migrate read_feed_ranges() and feed_range_from_partition_key() in ContainerClient to use the driver's partition key range cache instead of the SDK-layer cache. This is the first step toward removing the SDK's redundant cache infrastructure. Changes: - Add resolve_all_partition_key_ranges() and resolve_partition_key_ranges_for_key() public methods to CosmosDriver - Rewrite ContainerClient::read_feed_ranges() to call driver - Rewrite ContainerClient::feed_range_from_partition_key() to call driver - Remove SDK-specific FeedRange adapter code (from_partition_key_range, from_epk_range) that are no longer needed Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Documents the PR split strategy, remaining deletions, and test coverage gaps for the SDK-to-driver migration. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Routes ContainerClient feed-range APIs through the azure_data_cosmos_driver partition key range cache, removing the SDK-layer routing-map adapters as a first step toward deleting redundant SDK caching infrastructure.
Changes:
- Add
CosmosDriver::{resolve_all_partition_key_ranges, resolve_partition_key_ranges_for_key}as public APIs to expose PK range resolution via the driver cache. - Update
ContainerClient::{read_feed_ranges, feed_range_from_partition_key}to delegate PK range resolution to the driver. - Remove SDK-only
FeedRangeadapter helpers that were tied to the SDK routing map types.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| sdk/cosmos/azure_data_cosmos/src/feed_range.rs | Removes SDK-routing-map-specific conversion helpers/tests; keeps driver PartitionKeyRange conversion. |
| sdk/cosmos/azure_data_cosmos/src/clients/container_client.rs | Switches feed-range APIs to call into the driver for PK range resolution. |
| sdk/cosmos/azure_data_cosmos_driver/src/driver/cosmos_driver.rs | Adds public driver methods to resolve all PK ranges or PK ranges for a given partition key. |
Return an error instead of an empty Vec when the driver's PKRange cache resolves to zero ranges after a forced refresh. A valid container always has at least one partition key range, so an empty result indicates a service/auth failure that should not be silently swallowed. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address review findings: 1. Full-key point-lookup miss now returns Some(vec![]) instead of None, allowing the caller's empty-check retry to fire after partition splits. Previously, None short-circuited to an error before the retry logic. 2. Skip redundant retry when force_refresh was already true — avoids hitting the service twice with identical parameters. 3. Log a warning when EPK computation fails instead of silently discarding the error via .ok()?. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
A valid container always has at least one partition key range. When the driver's cache falls back to an empty routing map (e.g., service unreachable or parse failure), return None instead of Some(vec![]) so callers get a clear failure signal rather than a silently empty result. This applies to both resolve_all_partition_key_ranges (empty ranges slice) and the full-key branch of resolve_partition_key_ranges_for_key (empty routing map before EPK lookup). The prefix-key branch already returns None via resolve_overlapping_ranges when the underlying map has no entries. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add tests covering the key scenarios for resolve_all/resolve_for_key: - Full key point lookup returns single range - Prefix key (MultiHash) returns overlapping ranges - Empty routing map (service failure) returns empty vec - Fetch failure returns None from try_lookup - Force refresh repopulates after transient empty cache - All ranges returned from try_lookup in correct order These tests validate the cache behavior exercised by CosmosDriver::resolve_all_partition_key_ranges() and resolve_partition_key_ranges_for_key(). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
c771c30 to
98b9250
Compare
|
/azp run rust - cosmos - weekly |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| #[allow(dead_code)] | ||
| pub(crate) global_endpoint_manager: Arc<GlobalEndpointManager>, | ||
| #[allow(dead_code)] | ||
| pub(crate) global_partition_endpoint_manager: Arc<GlobalPartitionEndpointManager>, |
There was a problem hiding this comment.
Same here, let's yeet the dead code now, rather than waiting.
| /// Uses the driver's internal `PartitionKeyRangeCache`. When `force_refresh` | ||
| /// is `true`, the cached routing map is refreshed from the service before | ||
| /// returning results. Returns `None` if the routing map cannot be resolved. | ||
| pub async fn resolve_all_partition_key_ranges( |
There was a problem hiding this comment.
We should consider using FeedRange terminology here. I'm going to be moving FeedRange to the driver as part of the Feed Operations, since it needs to be accessible there. I'd argue that at this point, the driver is the only component that needs to see the actual physical partition metadata (PK Range ID) and everything above it can and should work on FeedRanges.
There was a problem hiding this comment.
will be tracked in a follow-up
| .ordered_partition_key_ranges() | ||
|
|
||
| if ranges.is_empty() && !options.force_refresh() { | ||
| // A valid container always has at least one partition key range. |
There was a problem hiding this comment.
Would we even allow an empty range in the cache? Why?
There was a problem hiding this comment.
This is part of the existing logic on the pk range cache that was already merged to the driver, was just accommodating to it here - but you're right this shouldn't happen. Will be tracked in a follow-up.
| /// | ||
| /// Full keys return a single-element `Vec`. Prefix keys on MultiHash | ||
| /// containers return one or more feed ranges. | ||
| pub async fn feed_range_from_partition_key( |
There was a problem hiding this comment.
This name is terrible IMO - maybe get_overlapping_feed_range? Why do we even need this API? We will need one API to produce a FeedRane for a PartitionKey - but just the EPK of teh PK - then a helper function that checks whether FeedRanges are overlapping - maybe whether one feedrange is fully conatined in another - but that is all you need - thsi API gives the wrong impression of being able to identify the "physical partition" for a PK - and that is nothing that ever works reliably - so, we should also nozt give that impression? IMO this APi should just be removed - if customers use it replace it by scalable helper functions we have in Java and .Net as well.
There was a problem hiding this comment.
yep absolutely - and no, we don't need it like this. already discussed offline but this should be what you're describing, not what it is currently. will be tracked in a follow-up.
There was a problem hiding this comment.
Mostly just summarizing offline conversations here for posterity
I think we can land on FeedRange::for_partition_key (this would basically be the only way to create a FeedRange as a user, but it makes sense for it to live on FeedRange since it doesn't involve any I/O, just hashing). That also aligns with the Python behavior (whereas this Rust code is actually incorrect because it does indeed find the physical partition that currently contains that PK; which isn't what we want)
| .collect()) | ||
| if ranges.is_empty() && !options.force_refresh() { | ||
| // Empty result may indicate a stale cache — retry with refresh. | ||
| let ranges = self |
There was a problem hiding this comment.
Like above - when would this ever happen? We should prevent it instead of trying to wroakround it.
There was a problem hiding this comment.
ditto, will be tracked in a follow up
Remove global_endpoint_manager and global_partition_endpoint_manager fields from ClientContext — they were stored but never read through this struct. The GatewayPipeline holds its own Arc references. Remove _database_link parameter from ContainerClient::new() since the driver resolves container metadata by ID directly. Remove DatabaseClient.link field (ResourceLink) which was only used to pass to ContainerClient::new(). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The cutover-remaining-work.md is a planning artifact that should stay local rather than in the repository. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
analogrelay
left a comment
There was a problem hiding this comment.
LGTM given the discussed follow-ups
| /// | ||
| /// Full keys return a single-element `Vec`. Prefix keys on MultiHash | ||
| /// containers return one or more feed ranges. | ||
| pub async fn feed_range_from_partition_key( |
There was a problem hiding this comment.
Mostly just summarizing offline conversations here for posterity
I think we can land on FeedRange::for_partition_key (this would basically be the only way to create a FeedRange as a user, but it makes sense for it to live on FeedRange since it doesn't involve any I/O, just hashing). That also aligns with the Python behavior (whereas this Rust code is actually incorrect because it does indeed find the physical partition that currently contains that PK; which isn't what we want)
Migrate
read_feed_ranges()andfeed_range_from_partition_key()inContainerClientto use the driver's partition key range cache instead of the SDK-layer cache. This is the first step toward removing the SDK's redundant cache infrastructure.Changes
cosmos_driver.rs: Addresolve_all_partition_key_ranges()andresolve_partition_key_ranges_for_key()public methods toCosmosDriver, exposing the driver's internal PKRange cache for SDK consumption.container_client.rs: Rewriteread_feed_ranges()andfeed_range_from_partition_key()to delegate to the driver instead of maintaining a separate SDK-layer cache with its own HTTP calls.feed_range.rs: Remove SDK-specific adapter methods (from_partition_key_range,from_epk_range) that are no longer needed now that the driver returns ranges directly.Motivation
The SDK previously maintained its own
PartitionKeyRangeCachethat duplicated the driver's cache. After the PPAF/PPCB cutover (#4156), both caches existed in parallel. This PR routes all PKRange resolution through the driver, making the SDK cache unused and ready for removal in a follow-up PR.Follow-up
A subsequent PR will remove the now-dead SDK-layer cache infrastructure (
partition_key_range_cache.rs,collection_routing_map.rs, old pipeline, retry policies, etc.).