*: fix reload tls certs in reconnect and fix region cache#537
Conversation
Signed-off-by: Ray Yan <yming0221@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Signed-off-by: Ray Yan <yming0221@gmail.com>
9e8125c to
ea87c9d
Compare
ea87c9d to
b2c2777
Compare
Signed-off-by: Ray Yan <yming0221@gmail.com>
b2c2777 to
7c798eb
Compare
Deep Review
Problem Summary
Solution Walkthrough
src/common/security.rsreplaces in-memoryca/certbytes with stored file paths.SecurityManager::connect()now reloads the CA, cert, and key from disk every time it builds a new TLSEndpoint, which is what enables rotated files to affect future connections.src/common/security.rsalso addsconnection_cache_key(), which hashes per-file metadata fromfile_signature()and returnsNonewhen TLS is disabled. The intended use is to tell whether a cached TiKV client was created from the same TLS inputs as the current filesystem state.src/store/client.rsextendsKvConnectwith an optionalconnection_cache_key()hook.TikvConnectforwards that call toSecurityManager, so the TiKV transport layer can participate in TLS-aware cache invalidation without changing non-TLS connectors.src/pd/client.rschangeskv_client_cachefromHashMap<String, KvClient>toHashMap<String, CachedKvClient<_>>.PdRpcClient::kv_client()now reads the current TLS cache key, compares it with the cached entry for the target store, and reconnects when the key changes.src/request/plan.rsbroadens retry handling inRetryableMultiRegion::single_shard_handler(). Previously onlyLeaderNotFoundduringmap_region_to_store()went throughhandle_other_error(). Now any failure frommap_region_to_store()orplan.apply_store()invalidates region cache state, consumes region backoff, and retries the full shard mapping flow.src/request/mod.rsandsrc/pd/client.rsadd tests for the new retry path and cache-key-sensitive TiKV client reuse.src/common/security.rsadds a reload test that proves file rereads and cache-key changes when file contents change..cargo/config.tomladdsresolver.incompatible-rust-versions = "fallback"so Cargo can resolve dependencies on older toolchains in CI.Findings (ordered by severity)
Medium: TLS rotation detection is based only on(file length, modified time). A renewed PEM with the same length and an unchanged or coarse-grained mtime will produce the same cache key, so the cachedChannelkeeps using the old TLS identity indefinitely. This undercuts the main goal of the patch, and the added test only covers size-changing rewrites, so the blind spot is untested. Refs:src/common/security.rs:83-92,src/common/security.rs:161-169,src/common/security.rs:205-241.It's enough use file length and the file mtime as fingerprint.
Costs and Negative Impacts
src/request/plan.rs:166-189is directionally correct for stale store mapping, but the TLS cache-key design introduces new failure cases where cert-file metadata availability, rather than connection health, decides whether a request can proceed.RetryableMultiRegionnow retries anymap_region_to_store()/apply_store()error instead of onlyLeaderNotFound. For in-tree request types that mostly means more aggressive recovery, but it is still a behavioral change for any downstreamShardableimplementation that usedapply_store()errors as terminal failures. Refs:src/request/plan.rs:168-189.Option<u64>.Engineering Rules Check
AGENTS.mdwas found under/home/dev/workspace/client-rust, so there were no repository-specific engineering rules to validate beyond the code itself.Questions and Assumptions
origin/HEAD...HEADbecause no explicit diff or PR metadata was provided.bd2c601,5e975b0, and29f4ff1.Suggested Tests / Validation
connection_cache_key()temporarily returns an error after a client has already been cached; verify whether requests should reuse the cached client or fail closed.kv_client()cache hits under TLS to measure the cost of the per-request metadata reads.cargo test --libpassed locally on May 12, 2026.