*: fix reload tls certs in reconnect and fix region cache by yongman · Pull Request #537 · tikv/client-rust

yongman · 2026-05-09T10:31:44Z

Deep Review

Problem Summary

The branch is trying to fix two long-lived cache problems:
TLS materials were effectively pinned into cached TiKV gRPC clients, so cert/key rotation did not take effect until process restart or a full client rebuild.
Region-to-store mapping could get stuck on stale cache state, especially when the cached region pointed at a store entry that could no longer be resolved.
The implementation addresses this by reloading TLS files on connection creation, attaching a TLS-dependent cache key to cached TiKV clients, and retrying more store-mapping failures through the existing region backoff path.

Solution Walkthrough

src/common/security.rs replaces in-memory ca/cert bytes with stored file paths. SecurityManager::connect() now reloads the CA, cert, and key from disk every time it builds a new TLS Endpoint, which is what enables rotated files to affect future connections.
src/common/security.rs also adds connection_cache_key(), which hashes per-file metadata from file_signature() and returns None when TLS is disabled. The intended use is to tell whether a cached TiKV client was created from the same TLS inputs as the current filesystem state.
src/store/client.rs extends KvConnect with an optional connection_cache_key() hook. TikvConnect forwards that call to SecurityManager, so the TiKV transport layer can participate in TLS-aware cache invalidation without changing non-TLS connectors.
src/pd/client.rs changes kv_client_cache from HashMap<String, KvClient> to HashMap<String, CachedKvClient<_>>. PdRpcClient::kv_client() now reads the current TLS cache key, compares it with the cached entry for the target store, and reconnects when the key changes.
src/request/plan.rs broadens retry handling in RetryableMultiRegion::single_shard_handler(). Previously only LeaderNotFound during map_region_to_store() went through handle_other_error(). Now any failure from map_region_to_store() or plan.apply_store() invalidates region cache state, consumes region backoff, and retries the full shard mapping flow.
src/request/mod.rs and src/pd/client.rs add tests for the new retry path and cache-key-sensitive TiKV client reuse. src/common/security.rs adds a reload test that proves file rereads and cache-key changes when file contents change.
.cargo/config.toml adds resolver.incompatible-rust-versions = "fallback" so Cargo can resolve dependencies on older toolchains in CI.

Findings (ordered by severity)

Medium: TLS rotation detection is based only on (file length, modified time). A renewed PEM with the same length and an unchanged or coarse-grained mtime will produce the same cache key, so the cached Channel keeps using the old TLS identity indefinitely. This undercuts the main goal of the patch, and the added test only covers size-changing rewrites, so the blind spot is untested. Refs: src/common/security.rs:83-92, src/common/security.rs:161-169, src/common/security.rs:205-241.
It's enough use file length and the file mtime as fingerprint.

Costs and Negative Impacts

Correctness: The region-cache retry fix in src/request/plan.rs:166-189 is directionally correct for stale store mapping, but the TLS cache-key design introduces new failure cases where cert-file metadata availability, rather than connection health, decides whether a request can proceed.
Security: Reloading TLS materials per new connection is the right security direction because rotated credentials can take effect without a process restart. The risk is that metadata-only change detection can silently keep using an old client identity after rotation.
Compatibility: RetryableMultiRegion now retries any map_region_to_store() / apply_store() error instead of only LeaderNotFound. For in-tree request types that mostly means more aggressive recovery, but it is still a behavioral change for any downstream Shardable implementation that used apply_store() errors as terminal failures. Refs: src/request/plan.rs:168-189.
Robustness: The new retry path will also retry local TLS/configuration failures after invalidating region cache state, which adds PD churn and delays surfacing deterministic local errors.
Cognitive Load: The TiKV client cache is now coupled to TLS-file metadata and request retry behavior. Future maintainers need to reason about connection reuse, secret rotation, and async hot-path filesystem access together.
CPU: Extra hashing work is minor, but it is now paid on every TiKV client lookup.
Memory: Negligible; cached entries now carry one additional Option<u64>.
Log Volume: No major new log volume, though repeated retry loops on local TLS-file errors will amplify existing retry/debug logs.

Engineering Rules Check

No repo-local AGENTS.md was found under /home/dev/workspace/client-rust, so there were no repository-specific engineering rules to validate beyond the code itself.

Questions and Assumptions

I assumed the review target is origin/HEAD...HEAD because no explicit diff or PR metadata was provided.
I assumed the intended behavior for TLS rotation is “use existing healthy channels until a refresh is required, then rebuild with new materials,” because that preserves availability while still honoring rotated credentials.
I did not find external PR text, so the problem statement above is inferred from the commit messages bd2c601, 5e975b0, and 29f4ff1.

Suggested Tests / Validation

Add a TLS reload test where file contents change but file length does not, and force the file timestamps to remain equal if the platform allows it. That covers the current metadata-signature blind spot.
Add a TiKV client cache test where connection_cache_key() temporarily returns an error after a client has already been cached; verify whether requests should reuse the cached client or fail closed.
Add a benchmark or targeted perf test for repeated kv_client() cache hits under TLS to measure the cost of the per-request metadata reads.
Validation run: cargo test --lib passed locally on May 12, 2026.

Signed-off-by: Ray Yan <yming0221@gmail.com>

ti-chi-bot · 2026-05-09T10:31:48Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign marsishandsome for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-05-09T10:31:51Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 495f2c34-4f3f-4d69-83b4-c8c4ea77634a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Signed-off-by: Ray Yan <yming0221@gmail.com>

*: fix reload tls certs in reconnect and fix region cache

bd2c601

Signed-off-by: Ray Yan <yming0221@gmail.com>

ti-chi-bot Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has signed the dco. labels May 9, 2026

ti-chi-bot Bot added contribution This PR is from a community contributor. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 9, 2026

support client cache in tls

5e975b0

Signed-off-by: Ray Yan <yming0221@gmail.com>

yongman force-pushed the ym/fix-tls-and-cache branch from 9e8125c to ea87c9d Compare May 12, 2026 03:23

fix ci

29f4ff1

Signed-off-by: Ray Yan <yming0221@gmail.com>

yongman force-pushed the ym/fix-tls-and-cache branch from ea87c9d to b2c2777 Compare May 12, 2026 08:39

ti-chi-bot Bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 12, 2026

fix findings

7c798eb

Signed-off-by: Ray Yan <yming0221@gmail.com>

yongman force-pushed the ym/fix-tls-and-cache branch from b2c2777 to 7c798eb Compare May 12, 2026 10:07

ti-chi-bot Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

*: fix reload tls certs in reconnect and fix region cache#537

*: fix reload tls certs in reconnect and fix region cache#537
yongman wants to merge 4 commits into
tikv:masterfrom
yongman:ym/fix-tls-and-cache

yongman commented May 9, 2026 •

edited

Loading

Uh oh!

ti-chi-bot Bot commented May 9, 2026

Uh oh!

coderabbitai Bot commented May 9, 2026 •

edited

Loading

Review skipped

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

yongman commented May 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Deep Review

Problem Summary

Solution Walkthrough

Findings (ordered by severity)

Costs and Negative Impacts

Engineering Rules Check

Questions and Assumptions

Suggested Tests / Validation

Uh oh!

ti-chi-bot Bot commented May 9, 2026

Uh oh!

coderabbitai Bot commented May 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

yongman commented May 9, 2026 •

edited

Loading

coderabbitai Bot commented May 9, 2026 •

edited

Loading