diff --git a/CODE_OF_CONDUCT.md b/CODE_OF_CONDUCT.md new file mode 100644 index 0000000..84c5f77 --- /dev/null +++ b/CODE_OF_CONDUCT.md @@ -0,0 +1,208 @@ +# Code of Conduct + +## 1. Purpose + +Telemt exists to solve technical problems. + +Telemt is open to contributors who want to learn, improve and build meaningful systems together. + +It is a place for building, testing, reasoning, documenting, and improving systems. + +Discussions that advance this work are in scope. Discussions that divert it are not. + +Technology has consequences. Responsibility is inherent. + +> **Zweck bestimmt die Form.** + +> Purpose defines form. + +--- + +## 2. Principles + +* **Technical over emotional** + Arguments are grounded in data, logs, reproducible cases, or clear reasoning. + +* **Clarity over noise** + Communication is structured, concise, and relevant. + +* **Openness with standards** + Participation is open. The work remains disciplined. + +* **Independence of judgment** + Claims are evaluated on technical merit, not affiliation or posture. + +* **Responsibility over capability** + Capability does not justify careless use. + +* **Cooperation over friction** + Progress depends on coordination, mutual support, and honest review. + +* **Good intent, rigorous method** + Assume good intent, but require rigor. + +> **Aussagen gelten nach ihrer Begründung.** + +> Claims are weighed by evidence. + +--- + +## 3. Expected Behavior + +Participants are expected to: + +* Communicate directly and respectfully +* Support claims with evidence +* Stay within technical scope +* Accept critique and provide it constructively +* Reduce noise, duplication, and ambiguity +* Help others reach correct and reproducible outcomes +* Act in a way that improves the system as a whole + +Precision is learned. + +New contributors are welcome. They are expected to grow into these standards. Existing contributors are expected to make that growth possible. 
+ +> **Wer behauptet, belegt.** + +> Whoever claims, proves. + +--- + +## 4. Unacceptable Behavior + +The following is not allowed: + +* Personal attacks, insults, harassment, or intimidation +* Repeatedly derailing discussion away from Telemt’s purpose +* Spam, flooding, or repeated low-quality input +* Misinformation presented as fact +* Attempts to degrade, destabilize, or exhaust Telemt or its participants +* Use of Telemt or its spaces to enable harm + +Telemt is not a venue for disputes that displace technical work. +Such discussions may be closed, removed, or redirected. + +> **Störung ist kein Beitrag.** + +> Disruption is not contribution. + +--- + +## 5. Security and Misuse + +Telemt is intended for responsible use. + +* Do not use it to plan, coordinate, or execute harm +* Do not publish vulnerabilities without responsible disclosure +* Report security issues privately where possible + +Security is both technical and behavioral. + +> **Verantwortung endet nicht am Code.** + +> Responsibility does not end at the code. + +--- + +## 6. Openness + +Telemt is open to contributors of different backgrounds, experience levels, and working styles. + +Standards are public, legible, and applied to the work itself. + +Questions are welcome. Careful disagreement is welcome. Honest correction is welcome. + +Gatekeeping by obscurity, status signaling, or hostility is not. + +--- + +## 7. Scope + +This Code of Conduct applies to all official spaces: + +* Source repositories (issues, pull requests, discussions) +* Documentation +* Communication channels associated with Telemt + +--- + +## 8. Maintainer Stewardship + +Maintainers are responsible for final decisions in matters of conduct, scope, and direction. + +This responsibility is stewardship: preserving continuity, protecting signal, maintaining standards, and keeping Telemt workable for others. + +Judgment should be exercised with restraint, consistency, and institutional responsibility. 
+ +Not every decision requires extended debate. +Not every intervention requires public explanation. + +All decisions are expected to serve the durability, clarity, and integrity of Telemt. + +> **Ordnung ist Voraussetzung der Funktion.** + +> Order is the precondition of function. + +--- + +## 9. Enforcement + +Maintainers may act to preserve the integrity of Telemt, including by: + +* Removing content +* Locking discussions +* Rejecting contributions +* Restricting or banning participants + +Actions are taken to maintain function, continuity, and signal quality. + +Where possible, correction is preferred to exclusion. + +Where necessary, exclusion is preferred to decay. + +--- + +## 10. Final + +Telemt is built on discipline, structure, and shared intent. + +Signal over noise. +Facts over opinion. +Systems over rhetoric. + +Work is collective. +Outcomes are shared. +Responsibility is distributed. + +Precision is learned. +Rigor is expected. +Help is part of the work. + +> **Ordnung ist Voraussetzung der Freiheit.** + +If you contribute — contribute with care. +If you speak — speak with substance. +If you engage — engage constructively. + +--- + +## 11. After All + +Systems outlive intentions. + +What is built will be used. +What is released will propagate. +What is maintained will define the future state. + +There is no neutral infrastructure, only infrastructure shaped well or poorly. + +> **Jedes System trägt Verantwortung.** + +> Every system carries responsibility. + +Stability requires discipline. +Freedom requires structure. +Trust requires honesty. + +In the end, the system reflects its contributors. 
diff --git a/Cargo.toml b/Cargo.toml index a47a4e5..5b4e32f 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -1,6 +1,6 @@ [package] name = "telemt" -version = "3.3.20" +version = "3.3.23" edition = "2024" [dependencies] diff --git a/docs/CONFIG_PARAMS.en.md b/docs/CONFIG_PARAMS.en.md new file mode 100644 index 0000000..90da08a --- /dev/null +++ b/docs/CONFIG_PARAMS.en.md @@ -0,0 +1,294 @@ +# Telemt Config Parameters Reference + +This document lists all configuration keys accepted by `config.toml`. + +> [!WARNING] +> +> The configuration parameters detailed in this document are intended for advanced users and fine-tuning purposes. Modifying these settings without a clear understanding of their function may lead to application instability or other unexpected behavior. Please proceed with caution and at your own risk. + +## Top-level keys + +| Parameter | Type | Default | Constraints / validation | Description | +|---|---|---|---|---| +| include | `String` (special directive) | `null` | — | Includes another TOML file with `include = "relative/or/absolute/path.toml"`; includes are processed recursively before parsing. | +| show_link | `"*" \| String[]` | `[]` (`ShowLink::None`) | — | Legacy top-level link visibility selector (`"*"` for all users or explicit usernames list). | +| dc_overrides | `Map` | `{}` | — | Overrides DC endpoints for non-standard DCs; key is DC id string, value is `ip:port` list. | +| default_dc | `u8 \| null` | `null` (effective fallback: `2` in ME routing) | — | Default DC index used for unmapped non-standard DCs. | + +## [general] + +| Parameter | Type | Default | Constraints / validation | Description | +|---|---|---|---|---| +| data_path | `String \| null` | `null` | — | Optional runtime data directory path. | +| prefer_ipv6 | `bool` | `false` | — | Prefer IPv6 where applicable in runtime logic. | +| fast_mode | `bool` | `true` | — | Enables fast-path optimizations for traffic processing. 
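The top-level directives and basic `[general]` switches documented above can be sketched in a minimal `config.toml` fragment. This is a sketch only: the `conf.d/extra.toml` path is hypothetical and all values are illustrative.

```toml
# Top-level include directive: the referenced TOML file is merged
# recursively before parsing.
include = "conf.d/extra.toml"   # hypothetical path

[general]
data_path = "data"    # optional runtime data directory
prefer_ipv6 = false   # prefer IPv6 where applicable
fast_mode = true      # fast-path optimizations for traffic processing
```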
| +| use_middle_proxy | `bool` | `true` | none | Enables ME transport mode; if `false`, runtime falls back to direct DC routing. | +| proxy_secret_path | `String \| null` | `"proxy-secret"` | Path may be `null`. | Path to Telegram infrastructure proxy-secret file used by ME handshake logic. | +| proxy_config_v4_cache_path | `String \| null` | `"cache/proxy-config-v4.txt"` | — | Optional cache path for raw `getProxyConfig` (IPv4) snapshot. | +| proxy_config_v6_cache_path | `String \| null` | `"cache/proxy-config-v6.txt"` | — | Optional cache path for raw `getProxyConfigV6` (IPv6) snapshot. | +| ad_tag | `String \| null` | `null` | — | Global fallback ad tag (32 hex characters). | +| middle_proxy_nat_ip | `IpAddr \| null` | `null` | Must be a valid IP when set. | Manual public NAT IP override used as ME address material when set. | +| middle_proxy_nat_probe | `bool` | `true` | Auto-forced to `true` when `use_middle_proxy = true`. | Enables ME NAT probing; runtime may force it on when ME mode is active. | +| middle_proxy_nat_stun | `String \| null` | `null` | Deprecated. Use `network.stun_servers`. | Deprecated legacy single STUN server for NAT probing. | +| middle_proxy_nat_stun_servers | `String[]` | `[]` | Deprecated. Use `network.stun_servers`. | Deprecated legacy STUN list for NAT probing fallback. | +| stun_nat_probe_concurrency | `usize` | `8` | Must be `> 0`. | Maximum number of parallel STUN probes during NAT/public endpoint discovery. | +| middle_proxy_pool_size | `usize` | `8` | none | Target size of active ME writer pool. | +| middle_proxy_warm_standby | `usize` | `16` | none | Reserved compatibility field in current runtime revision. | +| me_init_retry_attempts | `u32` | `0` | `0..=1_000_000`. | Startup retries for ME pool initialization (`0` means unlimited). | +| me2dc_fallback | `bool` | `true` | — | Allows fallback from ME mode to direct DC when ME startup fails. 
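The middle-proxy (ME) transport keys above combine as follows. A sketch: the `ad_tag` value is a 32-hex placeholder, not a real tag.

```toml
[general]
use_middle_proxy = true              # false falls back to direct DC routing
proxy_secret_path = "proxy-secret"
ad_tag = "00000000000000000000000000000000"  # placeholder 32-hex tag
middle_proxy_pool_size = 8           # target size of the active ME writer pool
me_init_retry_attempts = 0           # 0 = unlimited startup retries
me2dc_fallback = true                # allow fallback to direct DC on ME startup failure
```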
| +| me_keepalive_enabled | `bool` | `true` | none | Enables periodic ME keepalive/ping traffic. | +| me_keepalive_interval_secs | `u64` | `8` | none | Base ME keepalive interval in seconds. | +| me_keepalive_jitter_secs | `u64` | `2` | none | Keepalive jitter in seconds to reduce synchronized bursts. | +| me_keepalive_payload_random | `bool` | `true` | none | Randomizes keepalive payload bytes instead of fixed zero payload. | +| rpc_proxy_req_every | `u64` | `0` | `0` or `10..=300`. | Interval for service `RPC_PROXY_REQ` activity signals (`0` disables). | +| me_writer_cmd_channel_capacity | `usize` | `4096` | Must be `> 0`. | Capacity of per-writer command channel. | +| me_route_channel_capacity | `usize` | `768` | Must be `> 0`. | Capacity of per-connection ME response route channel. | +| me_c2me_channel_capacity | `usize` | `1024` | Must be `> 0`. | Capacity of per-client command queue (client reader -> ME sender). | +| me_reader_route_data_wait_ms | `u64` | `2` | `0..=20`. | Bounded wait for routing ME DATA to per-connection queue (`0` = no wait). | +| me_d2c_flush_batch_max_frames | `usize` | `32` | `1..=512`. | Max ME->client frames coalesced before flush. | +| me_d2c_flush_batch_max_bytes | `usize` | `131072` | `4096..=2_097_152`. | Max ME->client payload bytes coalesced before flush. | +| me_d2c_flush_batch_max_delay_us | `u64` | `500` | `0..=5000`. | Max microsecond wait for coalescing more ME->client frames (`0` disables timed coalescing). | +| me_d2c_ack_flush_immediate | `bool` | `true` | — | Flushes client writer immediately after quick-ack write. | +| direct_relay_copy_buf_c2s_bytes | `usize` | `65536` | `4096..=1_048_576`. | Copy buffer size for client->DC direction in direct relay. | +| direct_relay_copy_buf_s2c_bytes | `usize` | `262144` | `8192..=2_097_152`. | Copy buffer size for DC->client direction in direct relay. | +| crypto_pending_buffer | `usize` | `262144` | — | Max pending ciphertext buffer per client writer (bytes). 
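As an illustration of the keepalive and ME->client batching knobs above, the following fragment uses the documented defaults, with the stated constraints in comments:

```toml
[general]
# ME keepalive: base interval plus jitter to avoid synchronized bursts.
me_keepalive_enabled = true
me_keepalive_interval_secs = 8
me_keepalive_jitter_secs = 2

# ME -> client flush batching: coalesce frames until one limit is hit.
me_d2c_flush_batch_max_frames = 32      # constraint: 1..=512
me_d2c_flush_batch_max_bytes = 131072   # constraint: 4096..=2_097_152
me_d2c_flush_batch_max_delay_us = 500   # constraint: 0..=5000; 0 disables timed coalescing
```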
| +| max_client_frame | `usize` | `16777216` | — | Maximum allowed client MTProto frame size (bytes). | +| desync_all_full | `bool` | `false` | — | Emits full crypto-desync forensic logs for every event. | +| beobachten | `bool` | `true` | — | Enables per-IP forensic observation buckets. | +| beobachten_minutes | `u64` | `10` | Must be `> 0`. | Retention window (minutes) for per-IP observation buckets. | +| beobachten_flush_secs | `u64` | `15` | Must be `> 0`. | Snapshot flush interval (seconds) for observation output file. | +| beobachten_file | `String` | `"cache/beobachten.txt"` | — | Observation snapshot output file path. | +| hardswap | `bool` | `true` | none | Enables generation-based ME hardswap strategy. | +| me_warmup_stagger_enabled | `bool` | `true` | none | Staggers extra ME warmup dials to avoid connection spikes. | +| me_warmup_step_delay_ms | `u64` | `500` | none | Base delay in milliseconds between warmup dial steps. | +| me_warmup_step_jitter_ms | `u64` | `300` | none | Additional random delay in milliseconds for warmup steps. | +| me_reconnect_max_concurrent_per_dc | `u32` | `8` | none | Limits concurrent reconnect workers per DC during health recovery. | +| me_reconnect_backoff_base_ms | `u64` | `500` | none | Initial reconnect backoff in milliseconds. | +| me_reconnect_backoff_cap_ms | `u64` | `30000` | none | Maximum reconnect backoff cap in milliseconds. | +| me_reconnect_fast_retry_count | `u32` | `16` | none | Immediate retry budget before long backoff behavior applies. | +| me_single_endpoint_shadow_writers | `u8` | `2` | `0..=32`. | Additional reserve writers for one-endpoint DC groups. | +| me_single_endpoint_outage_mode_enabled | `bool` | `true` | — | Enables aggressive outage recovery for one-endpoint DC groups. | +| me_single_endpoint_outage_disable_quarantine | `bool` | `true` | — | Ignores endpoint quarantine in one-endpoint outage mode. 
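The single-endpoint outage parameters above fit together like this (a sketch using the documented defaults):

```toml
[general]
# Aggressive recovery for DC groups that expose only one ME endpoint.
me_single_endpoint_shadow_writers = 2            # constraint: 0..=32
me_single_endpoint_outage_mode_enabled = true
me_single_endpoint_outage_backoff_min_ms = 250   # must be <= the max below
me_single_endpoint_outage_backoff_max_ms = 3000
```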
| +| me_single_endpoint_outage_backoff_min_ms | `u64` | `250` | Must be `> 0`; also `<= me_single_endpoint_outage_backoff_max_ms`. | Minimum reconnect backoff in outage mode (ms). | +| me_single_endpoint_outage_backoff_max_ms | `u64` | `3000` | Must be `> 0`; also `>= me_single_endpoint_outage_backoff_min_ms`. | Maximum reconnect backoff in outage mode (ms). | +| me_single_endpoint_shadow_rotate_every_secs | `u64` | `900` | — | Periodic shadow writer rotation interval (`0` disables). | +| me_floor_mode | `"static" \| "adaptive"` | `"adaptive"` | — | Writer floor policy mode. | +| me_adaptive_floor_idle_secs | `u64` | `90` | — | Idle time before adaptive floor may reduce one-endpoint target. | +| me_adaptive_floor_min_writers_single_endpoint | `u8` | `1` | `1..=32`. | Minimum adaptive writer target for one-endpoint DC groups. | +| me_adaptive_floor_min_writers_multi_endpoint | `u8` | `1` | `1..=32`. | Minimum adaptive writer target for multi-endpoint DC groups. | +| me_adaptive_floor_recover_grace_secs | `u64` | `180` | — | Grace period to hold static floor after activity. | +| me_adaptive_floor_writers_per_core_total | `u16` | `48` | Must be `> 0`. | Global writer budget per logical CPU core in adaptive mode. | +| me_adaptive_floor_cpu_cores_override | `u16` | `0` | — | Manual CPU core count override (`0` uses auto-detection). | +| me_adaptive_floor_max_extra_writers_single_per_core | `u16` | `1` | — | Per-core max extra writers above base floor for one-endpoint DCs. | +| me_adaptive_floor_max_extra_writers_multi_per_core | `u16` | `2` | — | Per-core max extra writers above base floor for multi-endpoint DCs. | +| me_adaptive_floor_max_active_writers_per_core | `u16` | `64` | Must be `> 0`. | Hard cap for active ME writers per logical CPU core. | +| me_adaptive_floor_max_warm_writers_per_core | `u16` | `64` | Must be `> 0`. | Hard cap for warm ME writers per logical CPU core. | +| me_adaptive_floor_max_active_writers_global | `u32` | `256` | Must be `> 0`. 
| Hard global cap for active ME writers. | +| me_adaptive_floor_max_warm_writers_global | `u32` | `256` | Must be `> 0`. | Hard global cap for warm ME writers. | +| upstream_connect_retry_attempts | `u32` | `2` | Must be `> 0`. | Connect attempts for selected upstream before error/fallback. | +| upstream_connect_retry_backoff_ms | `u64` | `100` | — | Delay between upstream connect attempts (ms). | +| upstream_connect_budget_ms | `u64` | `3000` | Must be `> 0`. | Total wall-clock budget for one upstream connect request (ms). | +| upstream_unhealthy_fail_threshold | `u32` | `5` | Must be `> 0`. | Consecutive failed requests before upstream is marked unhealthy. | +| upstream_connect_failfast_hard_errors | `bool` | `false` | — | Skips additional retries for hard non-transient connect errors. | +| stun_iface_mismatch_ignore | `bool` | `false` | none | Reserved compatibility flag in current runtime revision. | +| unknown_dc_log_path | `String \| null` | `"unknown-dc.txt"` | — | File path for unknown-DC request logging (`null` disables file path). | +| unknown_dc_file_log_enabled | `bool` | `false` | — | Enables unknown-DC file logging. | +| log_level | `"debug" \| "verbose" \| "normal" \| "silent"` | `"normal"` | — | Runtime logging verbosity. | +| disable_colors | `bool` | `false` | — | Disables ANSI colors in logs. | +| me_socks_kdf_policy | `"strict" \| "compat"` | `"strict"` | — | SOCKS-bound KDF fallback policy for ME handshake. | +| me_route_backpressure_base_timeout_ms | `u64` | `25` | Must be `> 0`. | Base backpressure timeout for route-channel send (ms). | +| me_route_backpressure_high_timeout_ms | `u64` | `120` | Must be `>= me_route_backpressure_base_timeout_ms`. | High backpressure timeout when queue occupancy exceeds watermark (ms). | +| me_route_backpressure_high_watermark_pct | `u8` | `80` | `1..=100`. | Queue occupancy threshold (%) for high timeout mode. | +| me_health_interval_ms_unhealthy | `u64` | `1000` | Must be `> 0`. 
| Health monitor interval while writer coverage is degraded (ms). | +| me_health_interval_ms_healthy | `u64` | `3000` | Must be `> 0`. | Health monitor interval while writer coverage is healthy (ms). | +| me_admission_poll_ms | `u64` | `1000` | Must be `> 0`. | Poll interval for conditional-admission checks (ms). | +| me_warn_rate_limit_ms | `u64` | `5000` | Must be `> 0`. | Cooldown for repetitive ME warning logs (ms). | +| me_route_no_writer_mode | `"async_recovery_failfast" \| "inline_recovery_legacy" \| "hybrid_async_persistent"` | `"hybrid_async_persistent"` | — | Route behavior when no writer is immediately available. | +| me_route_no_writer_wait_ms | `u64` | `250` | `10..=5000`. | Max wait in async-recovery failfast mode (ms). | +| me_route_inline_recovery_attempts | `u32` | `3` | Must be `> 0`. | Inline recovery attempts in legacy mode. | +| me_route_inline_recovery_wait_ms | `u64` | `3000` | `10..=30000`. | Max inline recovery wait in legacy mode (ms). | +| fast_mode_min_tls_record | `usize` | `0` | — | Minimum TLS record size when fast-mode coalescing is enabled (`0` disables). | +| update_every | `u64 \| null` | `300` | If set: must be `> 0`; if `null`: legacy fallback path is used. | Unified refresh interval for ME config and proxy-secret updater tasks. | +| me_reinit_every_secs | `u64` | `900` | Must be `> 0`. | Periodic interval for zero-downtime ME reinit cycle. | +| me_hardswap_warmup_delay_min_ms | `u64` | `1000` | Must be `<= me_hardswap_warmup_delay_max_ms`. | Lower bound for hardswap warmup dial spacing. | +| me_hardswap_warmup_delay_max_ms | `u64` | `2000` | Must be `> 0`. | Upper bound for hardswap warmup dial spacing. | +| me_hardswap_warmup_extra_passes | `u8` | `3` | Must be within `[0, 10]`. | Additional warmup passes after the base pass in one hardswap cycle. | +| me_hardswap_warmup_pass_backoff_base_ms | `u64` | `500` | Must be `> 0`. | Base backoff between extra hardswap warmup passes. 
| +| me_config_stable_snapshots | `u8` | `2` | Must be `> 0`. | Number of identical ME config snapshots required before apply. | +| me_config_apply_cooldown_secs | `u64` | `300` | none | Cooldown between applied ME endpoint-map updates. | +| me_snapshot_require_http_2xx | `bool` | `true` | — | Requires 2xx HTTP responses for applying config snapshots. | +| me_snapshot_reject_empty_map | `bool` | `true` | — | Rejects empty config snapshots. | +| me_snapshot_min_proxy_for_lines | `u32` | `1` | Must be `> 0`. | Minimum parsed `proxy_for` rows required to accept snapshot. | +| proxy_secret_stable_snapshots | `u8` | `2` | Must be `> 0`. | Number of identical proxy-secret snapshots required before rotation. | +| proxy_secret_rotate_runtime | `bool` | `true` | none | Enables runtime proxy-secret rotation from updater snapshots. | +| me_secret_atomic_snapshot | `bool` | `true` | — | Keeps selector and secret bytes from the same snapshot atomically. | +| proxy_secret_len_max | `usize` | `256` | Must be within `[32, 4096]`. | Upper length limit for accepted proxy-secret bytes. | +| me_pool_drain_ttl_secs | `u64` | `90` | none | Time window where stale writers remain fallback-eligible after map change. | +| me_pool_drain_threshold | `u64` | `128` | — | Max draining stale writers before batch force-close (`0` disables threshold cleanup). | +| me_pool_drain_soft_evict_enabled | `bool` | `true` | — | Enables gradual soft-eviction of stale writers during drain/reinit instead of immediate hard close. | +| me_pool_drain_soft_evict_grace_secs | `u64` | `30` | `0..=3600`. | Grace period before stale writers become soft-evict candidates. | +| me_pool_drain_soft_evict_per_writer | `u8` | `1` | `1..=16`. | Maximum stale routes soft-evicted per writer in one eviction pass. | +| me_pool_drain_soft_evict_budget_per_core | `u16` | `8` | `1..=64`. | Per-core budget limiting aggregate soft-eviction work per pass. | +| me_pool_drain_soft_evict_cooldown_ms | `u64` | `5000` | Must be `> 0`. 
| Cooldown between consecutive soft-eviction passes (ms). | +| me_bind_stale_mode | `"never" \| "ttl" \| "always"` | `"ttl"` | — | Policy for new binds on stale draining writers. | +| me_bind_stale_ttl_secs | `u64` | `90` | — | TTL for stale bind allowance when stale mode is `ttl`. | +| me_pool_min_fresh_ratio | `f32` | `0.8` | Must be within `[0.0, 1.0]`. | Minimum fresh desired-DC coverage ratio before stale writers are drained. | +| me_reinit_drain_timeout_secs | `u64` | `120` | `0` disables force-close; if `> 0` and `< me_pool_drain_ttl_secs`, runtime bumps it to TTL. | Force-close timeout for draining stale writers (`0` keeps indefinite draining). | +| proxy_secret_auto_reload_secs | `u64` | `3600` | Deprecated. Use `general.update_every`. | Deprecated legacy secret reload interval (fallback when `update_every` is not set). | +| proxy_config_auto_reload_secs | `u64` | `3600` | Deprecated. Use `general.update_every`. | Deprecated legacy config reload interval (fallback when `update_every` is not set). | +| me_reinit_singleflight | `bool` | `true` | — | Serializes ME reinit cycles across trigger sources. | +| me_reinit_trigger_channel | `usize` | `64` | Must be `> 0`. | Trigger queue capacity for reinit scheduler. | +| me_reinit_coalesce_window_ms | `u64` | `200` | — | Trigger coalescing window before starting reinit (ms). | +| me_deterministic_writer_sort | `bool` | `true` | — | Enables deterministic candidate sort for writer binding path. | +| me_writer_pick_mode | `"sorted_rr" \| "p2c"` | `"p2c"` | — | Writer selection mode for route bind path. | +| me_writer_pick_sample_size | `u8` | `3` | `2..=4`. | Number of candidates sampled by picker in `p2c` mode. | +| ntp_check | `bool` | `true` | — | Enables NTP drift check at startup. | +| ntp_servers | `String[]` | `["pool.ntp.org"]` | — | NTP servers used for drift check. | +| auto_degradation_enabled | `bool` | `true` | none | Reserved compatibility flag in current runtime revision. 
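Two of the groups above, writer selection and the startup NTP check, can be sketched as:

```toml
[general]
# Writer selection: "p2c" samples a few candidates and binds the better one.
me_writer_pick_mode = "p2c"
me_writer_pick_sample_size = 3   # constraint: 2..=4

# Startup clock-drift check.
ntp_check = true
ntp_servers = ["pool.ntp.org"]
```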
| +| degradation_min_unavailable_dc_groups | `u8` | `2` | none | Reserved compatibility threshold in current runtime revision. | + +## [general.modes] + +| Parameter | Type | Default | Constraints / validation | Description | +|---|---|---|---|---| +| classic | `bool` | `false` | — | Enables classic MTProxy mode. | +| secure | `bool` | `false` | — | Enables secure mode. | +| tls | `bool` | `true` | — | Enables TLS mode. | + +## [general.links] + +| Parameter | Type | Default | Constraints / validation | Description | +|---|---|---|---|---| +| show | `"*" \| String[]` | `"*"` | — | Selects users whose tg:// links are shown at startup. | +| public_host | `String \| null` | `null` | — | Public hostname/IP override for generated tg:// links. | +| public_port | `u16 \| null` | `null` | — | Public port override for generated tg:// links. | + +## [general.telemetry] + +| Parameter | Type | Default | Constraints / validation | Description | +|---|---|---|---|---| +| core_enabled | `bool` | `true` | — | Enables core hot-path telemetry counters. | +| user_enabled | `bool` | `true` | — | Enables per-user telemetry counters. | +| me_level | `"silent" \| "normal" \| "debug"` | `"normal"` | — | ME (middle-proxy) telemetry verbosity level. | + +## [network] + +| Parameter | Type | Default | Constraints / validation | Description | +|---|---|---|---|---| +| ipv4 | `bool` | `true` | — | Enables IPv4 networking. | +| ipv6 | `bool` | `false` | — | Enables IPv6 networking. | +| prefer | `u8` | `4` | Must be `4` or `6`. | Preferred IP family for selection (`4` or `6`). | +| multipath | `bool` | `false` | — | Enables multipath behavior where supported. | +| stun_use | `bool` | `true` | none | Global STUN switch; when `false`, STUN probing path is disabled. | +| stun_servers | `String[]` | Built-in STUN list (13 hosts) | Deduplicated; empty values are removed. | Primary STUN server list for NAT/public endpoint discovery. 
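A matching `[network]` fragment, as a sketch; the STUN server shown is an illustrative public host, not the built-in list:

```toml
[network]
ipv4 = true
ipv6 = false
prefer = 4    # must be 4 or 6

# STUN-based public endpoint discovery; the list is deduplicated
# and empty entries are removed.
stun_use = true
stun_servers = ["stun.l.google.com:19302"]   # illustrative entry
```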
| +| stun_tcp_fallback | `bool` | `true` | none | Enables TCP fallback for STUN when UDP path is blocked. | +| http_ip_detect_urls | `String[]` | `["https://ifconfig.me/ip", "https://api.ipify.org"]` | none | HTTP fallback endpoints for public IP detection when STUN is unavailable. | +| cache_public_ip_path | `String` | `"cache/public_ip.txt"` | — | File path for caching detected public IP. | +| dns_overrides | `String[]` | `[]` | Must match `host:port:ip`; IPv6 must be bracketed. | Runtime DNS overrides in `host:port:ip` format. | + +## [server] + +| Parameter | Type | Default | Constraints / validation | Description | +|---|---|---|---|---| +| port | `u16` | `443` | — | Main proxy listen port. | +| listen_addr_ipv4 | `String \| null` | `"0.0.0.0"` | — | IPv4 bind address for TCP listener. | +| listen_addr_ipv6 | `String \| null` | `"::"` | — | IPv6 bind address for TCP listener. | +| listen_unix_sock | `String \| null` | `null` | — | Unix socket path for listener. | +| listen_unix_sock_perm | `String \| null` | `null` | — | Unix socket permissions in octal string (e.g., `"0666"`). | +| listen_tcp | `bool \| null` | `null` (auto) | — | Explicit TCP listener enable/disable override. | +| proxy_protocol | `bool` | `false` | — | Enables HAProxy PROXY protocol parsing on incoming client connections. | +| proxy_protocol_header_timeout_ms | `u64` | `500` | Must be `> 0`. | Timeout for PROXY protocol header read/parse (ms). | +| metrics_port | `u16 \| null` | `null` | — | Metrics endpoint port (enables metrics listener). | +| metrics_listen | `String \| null` | `null` | — | Full metrics bind address (`IP:PORT`), overrides `metrics_port`. | +| metrics_whitelist | `IpNetwork[]` | `["127.0.0.1/32", "::1/128"]` | — | CIDR whitelist for metrics endpoint access. | +| max_connections | `u32` | `10000` | — | Max concurrent client connections (`0` = unlimited). 
| + +## [server.api] + +| Parameter | Type | Default | Constraints / validation | Description | +|---|---|---|---|---| +| enabled | `bool` | `true` | — | Enables control-plane REST API. | +| listen | `String` | `"0.0.0.0:9091"` | Must be valid `IP:PORT`. | API bind address in `IP:PORT` format. | +| whitelist | `IpNetwork[]` | `["127.0.0.0/8"]` | — | CIDR whitelist allowed to access API. | +| auth_header | `String` | `""` | — | Exact expected `Authorization` header value (empty = disabled). | +| request_body_limit_bytes | `usize` | `65536` | Must be `> 0`. | Maximum accepted HTTP request body size. | +| minimal_runtime_enabled | `bool` | `true` | — | Enables minimal runtime snapshots endpoint logic. | +| minimal_runtime_cache_ttl_ms | `u64` | `1000` | `0..=60000`. | Cache TTL for minimal runtime snapshots (ms; `0` disables cache). | +| runtime_edge_enabled | `bool` | `false` | — | Enables runtime edge endpoints. | +| runtime_edge_cache_ttl_ms | `u64` | `1000` | `0..=60000`. | Cache TTL for runtime edge aggregation payloads (ms). | +| runtime_edge_top_n | `usize` | `10` | `1..=1000`. | Top-N size for edge connection leaderboard. | +| runtime_edge_events_capacity | `usize` | `256` | `16..=4096`. | Ring-buffer capacity for runtime edge events. | +| read_only | `bool` | `false` | — | Rejects mutating API endpoints when enabled. | + +## [[server.listeners]] + +| Parameter | Type | Default | Constraints / validation | Description | +|---|---|---|---|---| +| ip | `IpAddr` | — | — | Listener bind IP. | +| announce | `String \| null` | — | — | Public IP/domain announced in proxy links (priority over `announce_ip`). | +| announce_ip | `IpAddr \| null` | — | — | Deprecated legacy announce IP (migrated to `announce` if needed). | +| proxy_protocol | `bool \| null` | `null` | — | Per-listener override for PROXY protocol enable flag. | +| reuse_allow | `bool` | `false` | — | Enables `SO_REUSEPORT` for multi-instance bind sharing. 
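The `[server.api]` and `[[server.listeners]]` shapes above can be sketched as follows; hostnames are illustrative:

```toml
[server]
port = 443

[server.api]
enabled = true
listen = "127.0.0.1:9091"    # IP:PORT; table default is "0.0.0.0:9091"
whitelist = ["127.0.0.0/8"]

# One [[server.listeners]] entry per bind address.
[[server.listeners]]
ip = "0.0.0.0"
announce = "proxy.example.com"   # illustrative public name used in links
reuse_allow = false
```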
| + +## [timeouts] + +| Parameter | Type | Default | Constraints / validation | Description | +|---|---|---|---|---| +| client_handshake | `u64` | `30` | — | Client handshake timeout. | +| tg_connect | `u64` | `10` | — | Upstream Telegram connect timeout. | +| client_keepalive | `u64` | `15` | — | Client keepalive timeout. | +| client_ack | `u64` | `90` | — | Client ACK timeout. | +| me_one_retry | `u8` | `12` | none | Fast reconnect attempts budget for single-endpoint DC scenarios. | +| me_one_timeout_ms | `u64` | `1200` | none | Timeout in milliseconds for each quick single-endpoint reconnect attempt. | + +## [censorship] + +| Parameter | Type | Default | Constraints / validation | Description | +|---|---|---|---|---| +| tls_domain | `String` | `"petrovich.ru"` | — | Primary TLS domain used in fake TLS handshake profile. | +| tls_domains | `String[]` | `[]` | — | Additional TLS domains for generating multiple links. | +| mask | `bool` | `true` | — | Enables masking/fronting relay mode. | +| mask_host | `String \| null` | `null` | — | Upstream mask host for TLS fronting relay. | +| mask_port | `u16` | `443` | — | Upstream mask port for TLS fronting relay. | +| mask_unix_sock | `String \| null` | `null` | — | Unix socket path for mask backend instead of TCP host/port. | +| fake_cert_len | `usize` | `2048` | — | Length of synthetic certificate payload when emulation data is unavailable. | +| tls_emulation | `bool` | `true` | — | Enables certificate/TLS behavior emulation from cached real fronts. | +| tls_front_dir | `String` | `"tlsfront"` | — | Directory path for TLS front cache storage. | +| server_hello_delay_min_ms | `u64` | `0` | — | Minimum server_hello delay for anti-fingerprint behavior (ms). | +| server_hello_delay_max_ms | `u64` | `0` | — | Maximum server_hello delay for anti-fingerprint behavior (ms). | +| tls_new_session_tickets | `u8` | `0` | — | Number of `NewSessionTicket` messages to emit after handshake. 
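A hedged `[censorship]` fragment built from the keys above; the mask host is illustrative, and `tls_domain` shows the documented default:

```toml
[censorship]
tls_domain = "petrovich.ru"   # documented default; normally set to your own front domain
mask = true
mask_host = "www.example.com" # illustrative upstream for the TLS fronting relay
mask_port = 443

# Anti-fingerprint server_hello delay window (ms); 0/0 disables the delay.
server_hello_delay_min_ms = 0
server_hello_delay_max_ms = 0
```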
| +| tls_full_cert_ttl_secs | `u64` | `90` | — | TTL for sending full cert payload per (domain, client IP) tuple. | +| alpn_enforce | `bool` | `true` | — | Enforces ALPN echo behavior based on client preference. | +| mask_proxy_protocol | `u8` | `0` | — | PROXY protocol mode for mask backend (`0` disabled, `1` v1, `2` v2). | + +## [access] + +| Parameter | Type | Default | Constraints / validation | TOML shape example | Description | +|---|---|---|---|---|---| +| users | `Map` | `{"default": "000…000"}` | Secret must be 32 hex characters. | `[access.users]`
`user = "32-hex secret"`
`user2 = "32-hex secret"` | User credentials map used for client authentication. | +| user_ad_tags | `Map` | `{}` | Every value must be exactly 32 hex characters. | `[access.user_ad_tags]`
`user = "32-hex ad_tag"` | Per-user ad tags used as override over `general.ad_tag`. | +| user_max_tcp_conns | `Map` | `{}` | — | `[access.user_max_tcp_conns]`
`user = 500`&#10; | Per-user maximum concurrent TCP connections. | +| user_expirations | `Map` | `{}` | Timestamp must be valid RFC3339/ISO-8601 datetime. | `[access.user_expirations]`&#10;
`user = "2026-12-31T23:59:59Z"` | Per-user account expiration timestamps. | +| user_data_quota | `Map` | `{}` | — | `[access.user_data_quota]`
`user = 1073741824` | Per-user traffic quota in bytes. | +| user_max_unique_ips | `Map` | `{}` | — | `[access.user_max_unique_ips]`
`user = 16` | Per-user unique source IP limits. | +| user_max_unique_ips_global_each | `usize` | `0` | — | `user_max_unique_ips_global_each = 0` | Global fallback used when `[access.user_max_unique_ips]` has no per-user override. | +| user_max_unique_ips_mode | `"active_window" \| "time_window" \| "combined"` | `"active_window"` | — | `user_max_unique_ips_mode = "active_window"` | Unique source IP limit accounting mode. | +| user_max_unique_ips_window_secs | `u64` | `30` | Must be `> 0`. | `user_max_unique_ips_window_secs = 30` | Window size (seconds) used by unique-IP accounting modes that use time windows. | +| replay_check_len | `usize` | `65536` | — | `replay_check_len = 65536` | Replay-protection storage length. | +| replay_window_secs | `u64` | `1800` | — | `replay_window_secs = 1800` | Replay-protection window in seconds. | +| ignore_time_skew | `bool` | `false` | — | `ignore_time_skew = false` | Disables client/server timestamp skew checks in replay validation when enabled. | + +## [[upstreams]] + +| Parameter | Type | Default | Constraints / validation | Description | +|---|---|---|---|---| +| type | `"direct" \| "socks4" \| "socks5"` | — | Required field. | Upstream transport type selector. | +| weight | `u16` | `1` | none | Base weight used by weighted-random upstream selection. | +| enabled | `bool` | `true` | none | Disabled entries are excluded from upstream selection at runtime. | +| scopes | `String` | `""` | none | Comma-separated scope tags used for request-level upstream filtering. | +| interface | `String \| null` | `null` | Optional; type-specific runtime rules apply. | Optional outbound interface/local bind hint (supported with type-specific rules). | +| bind_addresses | `String[] \| null` | `null` | Applies to `type = "direct"`. | Optional explicit local source bind addresses for `type = "direct"`. | +| address | `String` | — | Required for `type = "socks4"` and `type = "socks5"`. 
| SOCKS server endpoint (`host:port` or `ip:port`) for SOCKS upstream types. | +| user_id | `String \| null` | `null` | Only for `type = "socks4"`. | SOCKS4 CONNECT user ID (`type = "socks4"` only). | +| username | `String \| null` | `null` | Only for `type = "socks5"`. | SOCKS5 username (`type = "socks5"` only). | +| password | `String \| null` | `null` | Only for `type = "socks5"`. | SOCKS5 password (`type = "socks5"` only). | diff --git a/install.sh b/install.sh index 2dd207b..330bc3e 100644 --- a/install.sh +++ b/install.sh @@ -1,115 +1,525 @@ #!/bin/sh set -eu +# --- Global Configurations --- REPO="${REPO:-telemt/telemt}" BIN_NAME="${BIN_NAME:-telemt}" -VERSION="${1:-${VERSION:-latest}}" -INSTALL_DIR="${INSTALL_DIR:-/usr/local/bin}" +INSTALL_DIR="${INSTALL_DIR:-/bin}" +CONFIG_DIR="${CONFIG_DIR:-/etc/telemt}" +CONFIG_FILE="${CONFIG_FILE:-${CONFIG_DIR}/telemt.toml}" +WORK_DIR="${WORK_DIR:-/opt/telemt}" +SERVICE_NAME="telemt" +TEMP_DIR="" +SUDO="" -say() { - printf '%s\n' "$*" -} +# --- Argument Parsing --- +ACTION="install" +TARGET_VERSION="${VERSION:-latest}" -die() { - printf 'Error: %s\n' "$*" >&2 - exit 1 -} - -need_cmd() { - command -v "$1" >/dev/null 2>&1 || die "required command not found: $1" -} - -detect_os() { - os="$(uname -s)" - case "$os" in - Linux) printf 'linux\n' ;; - OpenBSD) printf 'openbsd\n' ;; - *) printf '%s\n' "$os" ;; +while [ $# -gt 0 ]; do + case "$1" in + -h|--help) + ACTION="help" + shift + ;; + uninstall|--uninstall) + [ "$ACTION" != "purge" ] && ACTION="uninstall" + shift + ;; + --purge) + ACTION="purge" + shift + ;; + install|--install) + ACTION="install" + shift + ;; + -*) + printf '[ERROR] Unknown option: %s\n' "$1" >&2 + exit 1 + ;; + *) + if [ "$ACTION" = "install" ]; then + TARGET_VERSION="$1" + fi + shift + ;; esac +done + +# --- Core Functions --- +say() { printf '[INFO] %s\n' "$*"; } +die() { printf '[ERROR] %s\n' "$*" >&2; exit 1; } + +cleanup() { + if [ -n "${TEMP_DIR:-}" ] && [ -d "$TEMP_DIR" ]; then + rm -rf -- 
"$TEMP_DIR" + fi +} + +trap cleanup EXIT INT TERM + +show_help() { + say "Usage: $0 [version | install | uninstall | --purge | --help]" + say " version Install specific version (e.g. 1.0.0, default: latest)" + say " uninstall Remove the binary and service (keeps config)" + say " --purge Remove everything including configuration" + exit 0 +} + +user_exists() { + if command -v getent >/dev/null 2>&1; then + getent passwd "$1" >/dev/null 2>&1 + else + grep -q "^${1}:" /etc/passwd 2>/dev/null + fi +} + +group_exists() { + if command -v getent >/dev/null 2>&1; then + getent group "$1" >/dev/null 2>&1 + else + grep -q "^${1}:" /etc/group 2>/dev/null + fi +} + +verify_common() { + [ -z "$BIN_NAME" ] && die "BIN_NAME cannot be empty." + [ -z "$INSTALL_DIR" ] && die "INSTALL_DIR cannot be empty." + [ -z "$CONFIG_DIR" ] && die "CONFIG_DIR cannot be empty." + + if [ "$(id -u)" -eq 0 ]; then + SUDO="" + else + if ! command -v sudo >/dev/null 2>&1; then + die "This script requires root or sudo. Neither found." + fi + SUDO="sudo" + say "sudo is available. Caching credentials..." + if ! sudo -v; then + die "Failed to cache sudo credentials" + fi + fi + + case "${INSTALL_DIR}${CONFIG_DIR}${WORK_DIR}" in + *[!a-zA-Z0-9_./-]*) + die "Invalid characters in path variables. Only alphanumeric, _, ., -, and / are allowed." + ;; + esac + + case "$BIN_NAME" in + *[!a-zA-Z0-9_-]*) die "Invalid characters in BIN_NAME: $BIN_NAME" ;; + esac + + for path in "$CONFIG_DIR" "$WORK_DIR"; do + check_path="$path" + + while [ "$check_path" != "/" ] && [ "${check_path%"/"}" != "$check_path" ]; do + check_path="${check_path%"/"}" + done + [ -z "$check_path" ] && check_path="/" + + case "$check_path" in + /|/bin|/sbin|/usr|/usr/bin|/usr/local|/etc|/opt|/var|/home|/root|/tmp) + die "Safety check failed: '$path' is a critical system directory." 
+            ;;
+        esac
+    done
+
+    for cmd in uname grep find rm chown chmod mv head mktemp; do
+        command -v "$cmd" >/dev/null 2>&1 || die "Required command not found: $cmd"
+    done
+}
+
+verify_install_deps() {
+    if ! command -v curl >/dev/null 2>&1 && ! command -v wget >/dev/null 2>&1; then
+        die "Neither curl nor wget is installed."
+    fi
+    command -v tar >/dev/null 2>&1 || die "Required command not found: tar"
+    command -v gzip >/dev/null 2>&1 || die "Required command not found: gzip"
+    command -v cp >/dev/null 2>&1 || command -v install >/dev/null 2>&1 || die "Need cp or install"
+
+    if ! command -v setcap >/dev/null 2>&1; then
+        say "setcap is missing. Installing required capability tools..."
+        if command -v apk >/dev/null 2>&1; then
+            $SUDO apk add --no-cache libcap || die "Failed to install libcap"
+        elif command -v apt-get >/dev/null 2>&1; then
+            $SUDO apt-get update -qq && $SUDO apt-get install -y -qq libcap2-bin || die "Failed to install libcap2-bin"
+        elif command -v dnf >/dev/null 2>&1; then
+            $SUDO dnf install -y -q libcap || die "Failed to install libcap"
+        elif command -v yum >/dev/null 2>&1; then
+            $SUDO yum install -y -q libcap || die "Failed to install libcap"
+        else
+            die "Cannot install 'setcap'. Package manager not found. Please install libcap manually."
+ fi + fi } detect_arch() { - arch="$(uname -m)" - case "$arch" in - x86_64|amd64) printf 'x86_64\n' ;; - aarch64|arm64) printf 'aarch64\n' ;; - *) die "unsupported architecture: $arch" ;; + sys_arch="$(uname -m)" + case "$sys_arch" in + x86_64|amd64) echo "x86_64" ;; + aarch64|arm64) echo "aarch64" ;; + *) die "Unsupported architecture: $sys_arch" ;; esac } detect_libc() { - case "$(ldd --version 2>&1 || true)" in - *musl*) printf 'musl\n' ;; - *) printf 'gnu\n' ;; - esac + if command -v ldd >/dev/null 2>&1 && ldd --version 2>&1 | grep -qi musl; then + echo "musl"; return 0 + fi + + if grep -q '^ID=alpine' /etc/os-release 2>/dev/null || grep -q '^ID="alpine"' /etc/os-release 2>/dev/null; then + echo "musl"; return 0 + fi + for f in /lib/ld-musl-*.so.* /lib64/ld-musl-*.so.*; do + if [ -e "$f" ]; then + echo "musl"; return 0 + fi + done + echo "gnu" } -fetch_to_stdout() { - url="$1" +fetch_file() { + fetch_url="$1" + fetch_out="$2" + if command -v curl >/dev/null 2>&1; then - curl -fsSL "$url" + curl -fsSL "$fetch_url" -o "$fetch_out" || return 1 elif command -v wget >/dev/null 2>&1; then - wget -qO- "$url" + wget -qO "$fetch_out" "$fetch_url" || return 1 else - die "neither curl nor wget is installed" + die "curl or wget required" + fi +} + +ensure_user_group() { + nologin_bin="/bin/false" + + cmd_nologin="$(command -v nologin 2>/dev/null || true)" + if [ -n "$cmd_nologin" ] && [ -x "$cmd_nologin" ]; then + nologin_bin="$cmd_nologin" + else + for bin in /sbin/nologin /usr/sbin/nologin; do + if [ -x "$bin" ]; then + nologin_bin="$bin" + break + fi + done + fi + + if ! group_exists telemt; then + if command -v groupadd >/dev/null 2>&1; then + $SUDO groupadd -r telemt || die "Failed to create group via groupadd" + elif command -v addgroup >/dev/null 2>&1; then + $SUDO addgroup -S telemt || die "Failed to create group via addgroup" + else + die "Cannot create group: neither groupadd nor addgroup found" + fi + fi + + if ! 
user_exists telemt; then + if command -v useradd >/dev/null 2>&1; then + $SUDO useradd -r -g telemt -d "$WORK_DIR" -s "$nologin_bin" -c "Telemt Proxy" telemt || die "Failed to create user via useradd" + elif command -v adduser >/dev/null 2>&1; then + $SUDO adduser -S -D -H -h "$WORK_DIR" -s "$nologin_bin" -G telemt telemt || die "Failed to create user via adduser" + else + die "Cannot create user: neither useradd nor adduser found" + fi + fi +} + +setup_dirs() { + say "Setting up directories..." + $SUDO mkdir -p "$WORK_DIR" "$CONFIG_DIR" || die "Failed to create directories" + $SUDO chown telemt:telemt "$WORK_DIR" || die "Failed to set owner on WORK_DIR" + $SUDO chmod 750 "$WORK_DIR" || die "Failed to set permissions on WORK_DIR" +} + +stop_service() { + say "Stopping service if running..." + if command -v systemctl >/dev/null 2>&1 && [ -d /run/systemd/system ]; then + $SUDO systemctl stop "$SERVICE_NAME" 2>/dev/null || true + elif command -v rc-service >/dev/null 2>&1; then + $SUDO rc-service "$SERVICE_NAME" stop 2>/dev/null || true fi } install_binary() { - src="$1" - dst="$2" + bin_src="$1" + bin_dst="$2" - if [ -w "$INSTALL_DIR" ] || { [ ! -e "$INSTALL_DIR" ] && [ -w "$(dirname "$INSTALL_DIR")" ]; }; then - mkdir -p "$INSTALL_DIR" - install -m 0755 "$src" "$dst" - elif command -v sudo >/dev/null 2>&1; then - sudo mkdir -p "$INSTALL_DIR" - sudo install -m 0755 "$src" "$dst" + $SUDO mkdir -p "$INSTALL_DIR" || die "Failed to create install directory" + if command -v install >/dev/null 2>&1; then + $SUDO install -m 0755 "$bin_src" "$bin_dst" || die "Failed to install binary" else - die "cannot write to $INSTALL_DIR and sudo is not available" + $SUDO rm -f "$bin_dst" + $SUDO cp "$bin_src" "$bin_dst" || die "Failed to copy binary" + $SUDO chmod 0755 "$bin_dst" || die "Failed to set permissions" + fi + + if [ ! -x "$bin_dst" ]; then + die "Failed to install binary or it is not executable: $bin_dst" + fi + + say "Granting network bind capabilities to bind port 443..." 
+ if ! $SUDO setcap cap_net_bind_service=+ep "$bin_dst" 2>/dev/null; then + say "[WARNING] Failed to apply setcap. The service will NOT be able to open port 443!" + say "[WARNING] This usually happens inside unprivileged Docker/LXC containers." fi } -need_cmd uname -need_cmd tar -need_cmd mktemp -need_cmd grep -need_cmd install +generate_secret() { + if command -v openssl >/dev/null 2>&1; then + secret="$(openssl rand -hex 16 2>/dev/null)" && [ -n "$secret" ] && { echo "$secret"; return 0; } + fi + if command -v xxd >/dev/null 2>&1; then + secret="$(dd if=/dev/urandom bs=1 count=16 2>/dev/null | xxd -p | tr -d '\n')" && [ -n "$secret" ] && { echo "$secret"; return 0; } + fi + secret="$(dd if=/dev/urandom bs=1 count=16 2>/dev/null | od -An -tx1 | tr -d ' \n')" && [ -n "$secret" ] && { echo "$secret"; return 0; } + return 1 +} -ARCH="$(detect_arch)" -OS="$(detect_os)" +generate_config_content() { + cat </dev/null && config_exists=1 || true + else + [ -f "$CONFIG_FILE" ] && config_exists=1 || true + fi + + if [ "$config_exists" -eq 1 ]; then + say "Config already exists, skipping generation." + return 0 + fi + + toml_secret="$(generate_secret)" || die "Failed to generate secret" + say "Creating config at $CONFIG_FILE..." + + tmp_conf="$(mktemp "${TEMP_DIR:-/tmp}/telemt_conf.XXXXXX")" || die "Failed to create temp config" + generate_config_content "$toml_secret" > "$tmp_conf" || die "Failed to write temp config" + + $SUDO mv "$tmp_conf" "$CONFIG_FILE" || die "Failed to install config file" + $SUDO chown root:telemt "$CONFIG_FILE" || die "Failed to set owner" + $SUDO chmod 640 "$CONFIG_FILE" || die "Failed to set config permissions" + + say "Secret for user 'hello': $toml_secret" +} + +generate_systemd_content() { + cat </dev/null 2>&1 && [ -d /run/systemd/system ]; then + say "Installing systemd service..." 
+ tmp_svc="$(mktemp "${TEMP_DIR:-/tmp}/${SERVICE_NAME}.service.XXXXXX")" || die "Failed to create temp service" + generate_systemd_content > "$tmp_svc" || die "Failed to generate service content" + + $SUDO mv "$tmp_svc" "/etc/systemd/system/${SERVICE_NAME}.service" || die "Failed to move service file" + $SUDO chown root:root "/etc/systemd/system/${SERVICE_NAME}.service" + $SUDO chmod 644 "/etc/systemd/system/${SERVICE_NAME}.service" + + $SUDO systemctl daemon-reload || die "Failed to reload systemd" + $SUDO systemctl enable "$SERVICE_NAME" || die "Failed to enable service" + $SUDO systemctl start "$SERVICE_NAME" || die "Failed to start service" + + elif command -v rc-update >/dev/null 2>&1; then + say "Installing OpenRC service..." + tmp_svc="$(mktemp "${TEMP_DIR:-/tmp}/${SERVICE_NAME}.init.XXXXXX")" || die "Failed to create temp file" + generate_openrc_content > "$tmp_svc" || die "Failed to generate init content" + + $SUDO mv "$tmp_svc" "/etc/init.d/${SERVICE_NAME}" || die "Failed to move service file" + $SUDO chown root:root "/etc/init.d/${SERVICE_NAME}" + $SUDO chmod 0755 "/etc/init.d/${SERVICE_NAME}" + + $SUDO rc-update add "$SERVICE_NAME" default 2>/dev/null || die "Failed to register service" + $SUDO rc-service "$SERVICE_NAME" start 2>/dev/null || die "Failed to start OpenRC service" + else + say "No service manager found. You can start it manually with:" + if [ -n "$SUDO" ]; then + say " sudo -u telemt ${INSTALL_DIR}/${BIN_NAME} ${CONFIG_FILE}" + else + say " su -s /bin/sh telemt -c '${INSTALL_DIR}/${BIN_NAME} ${CONFIG_FILE}'" + fi + fi +} + +kill_user_procs() { + say "Ensuring $BIN_NAME processes are killed..." 
+ + if pkill_cmd="$(command -v pkill 2>/dev/null)"; then + $SUDO "$pkill_cmd" -u telemt "$BIN_NAME" 2>/dev/null || true + sleep 1 + $SUDO "$pkill_cmd" -9 -u telemt "$BIN_NAME" 2>/dev/null || true + elif killall_cmd="$(command -v killall 2>/dev/null)"; then + $SUDO "$killall_cmd" "$BIN_NAME" 2>/dev/null || true + sleep 1 + $SUDO "$killall_cmd" -9 "$BIN_NAME" 2>/dev/null || true + fi +} + +uninstall() { + purge_data=0 + [ "$ACTION" = "purge" ] && purge_data=1 + + say "Uninstalling $BIN_NAME..." + stop_service + + if command -v systemctl >/dev/null 2>&1 && [ -d /run/systemd/system ]; then + $SUDO systemctl disable "$SERVICE_NAME" 2>/dev/null || true + $SUDO rm -f "/etc/systemd/system/${SERVICE_NAME}.service" + $SUDO systemctl daemon-reload || true + elif command -v rc-update >/dev/null 2>&1; then + $SUDO rc-update del "$SERVICE_NAME" 2>/dev/null || true + $SUDO rm -f "/etc/init.d/${SERVICE_NAME}" + fi + + kill_user_procs + + $SUDO rm -f "${INSTALL_DIR}/${BIN_NAME}" + + $SUDO userdel telemt 2>/dev/null || $SUDO deluser telemt 2>/dev/null || true + $SUDO groupdel telemt 2>/dev/null || $SUDO delgroup telemt 2>/dev/null || true + + if [ "$purge_data" -eq 1 ]; then + say "Purging configuration and data..." + $SUDO rm -rf "$CONFIG_DIR" "$WORK_DIR" + else + say "Note: Configuration in $CONFIG_DIR was kept. Run with '--purge' to remove it." + fi + + say "Uninstallation complete." + exit 0 +} + +# ============================================================================ +# Main Entry Point +# ============================================================================ + +case "$ACTION" in + help) + show_help ;; - *) - URL="https://github.com/$REPO/releases/download/${VERSION}/${BIN_NAME}-${ARCH}-linux-${LIBC}.tar.gz" + uninstall|purge) + verify_common + uninstall + ;; + install) + say "Starting installation..." 
+ verify_common + verify_install_deps + + ARCH="$(detect_arch)" + LIBC="$(detect_libc)" + say "Detected system: $ARCH-linux-$LIBC" + + FILE_NAME="${BIN_NAME}-${ARCH}-linux-${LIBC}.tar.gz" + FILE_NAME="$(printf '%s' "$FILE_NAME" | tr -d ' \t\n\r')" + + if [ "$TARGET_VERSION" = "latest" ]; then + DL_URL="https://github.com/${REPO}/releases/latest/download/${FILE_NAME}" + else + DL_URL="https://github.com/${REPO}/releases/download/${TARGET_VERSION}/${FILE_NAME}" + fi + + TEMP_DIR="$(mktemp -d)" || die "Failed to create temp directory" + if [ -z "$TEMP_DIR" ] || [ ! -d "$TEMP_DIR" ]; then + die "Temp directory creation failed" + fi + + say "Downloading from $DL_URL..." + fetch_file "$DL_URL" "${TEMP_DIR}/archive.tar.gz" || die "Download failed (check version or network)" + + gzip -dc "${TEMP_DIR}/archive.tar.gz" | tar -xf - -C "$TEMP_DIR" || die "Extraction failed" + + EXTRACTED_BIN="$(find "$TEMP_DIR" -type f -name "$BIN_NAME" -print 2>/dev/null | head -n 1)" + [ -z "$EXTRACTED_BIN" ] && die "Binary '$BIN_NAME' not found in archive" + + ensure_user_group + setup_dirs + stop_service + + say "Installing binary..." + install_binary "$EXTRACTED_BIN" "${INSTALL_DIR}/${BIN_NAME}" + + install_config + install_service + + say "" + say "=============================================" + say "Installation complete!" 
+ say "=============================================" + if command -v systemctl >/dev/null 2>&1 && [ -d /run/systemd/system ]; then + say "To check the logs, run:" + say " journalctl -u $SERVICE_NAME -f" + say "" + fi + say "To get user connection links, run:" + if command -v jq >/dev/null 2>&1; then + say " curl -s http://127.0.0.1:9091/v1/users | jq -r '.data[] | \"User: \\(.username)\\n\\(.links.tls[0] // empty)\"'" + else + say " curl -s http://127.0.0.1:9091/v1/users" + say " (Note: Install 'jq' package to see the links nicely formatted)" + fi ;; esac - -TMPDIR="$(mktemp -d)" -trap 'rm -rf "$TMPDIR"' EXIT INT TERM - -say "Installing $BIN_NAME ($VERSION) for $ARCH-linux-$LIBC..." -fetch_to_stdout "$URL" | tar -xzf - -C "$TMPDIR" - -[ -f "$TMPDIR/$BIN_NAME" ] || die "archive did not contain $BIN_NAME" - -install_binary "$TMPDIR/$BIN_NAME" "$INSTALL_DIR/$BIN_NAME" - -say "Installed: $INSTALL_DIR/$BIN_NAME" -"$INSTALL_DIR/$BIN_NAME" --version 2>/dev/null || true diff --git a/src/api/model.rs b/src/api/model.rs index 31233d7..ac4e297 100644 --- a/src/api/model.rs +++ b/src/api/model.rs @@ -195,6 +195,8 @@ pub(super) struct ZeroPoolData { pub(super) pool_swap_total: u64, pub(super) pool_drain_active: u64, pub(super) pool_force_close_total: u64, + pub(super) pool_drain_soft_evict_total: u64, + pub(super) pool_drain_soft_evict_writer_total: u64, pub(super) pool_stale_pick_total: u64, pub(super) writer_removed_total: u64, pub(super) writer_removed_unexpected_total: u64, @@ -235,6 +237,7 @@ pub(super) struct MeWritersSummary { pub(super) available_pct: f64, pub(super) required_writers: usize, pub(super) alive_writers: usize, + pub(super) coverage_ratio: f64, pub(super) coverage_pct: f64, pub(super) fresh_alive_writers: usize, pub(super) fresh_coverage_pct: f64, @@ -283,6 +286,7 @@ pub(super) struct DcStatus { pub(super) floor_max: usize, pub(super) floor_capped: bool, pub(super) alive_writers: usize, + pub(super) coverage_ratio: f64, pub(super) coverage_pct: f64, 
pub(super) fresh_alive_writers: usize, pub(super) fresh_coverage_pct: f64, @@ -360,6 +364,11 @@ pub(super) struct MinimalMeRuntimeData { pub(super) me_reconnect_backoff_cap_ms: u64, pub(super) me_reconnect_fast_retry_count: u32, pub(super) me_pool_drain_ttl_secs: u64, + pub(super) me_pool_drain_soft_evict_enabled: bool, + pub(super) me_pool_drain_soft_evict_grace_secs: u64, + pub(super) me_pool_drain_soft_evict_per_writer: u8, + pub(super) me_pool_drain_soft_evict_budget_per_core: u16, + pub(super) me_pool_drain_soft_evict_cooldown_ms: u64, pub(super) me_pool_force_close_secs: u64, pub(super) me_pool_min_fresh_ratio: f32, pub(super) me_bind_stale_mode: &'static str, diff --git a/src/api/runtime_min.rs b/src/api/runtime_min.rs index d3066a3..f334dd0 100644 --- a/src/api/runtime_min.rs +++ b/src/api/runtime_min.rs @@ -113,6 +113,7 @@ pub(super) struct RuntimeMeQualityDcRttData { pub(super) rtt_ema_ms: Option, pub(super) alive_writers: usize, pub(super) required_writers: usize, + pub(super) coverage_ratio: f64, pub(super) coverage_pct: f64, } @@ -388,6 +389,7 @@ pub(super) async fn build_runtime_me_quality_data(shared: &ApiShared) -> Runtime rtt_ema_ms: dc.rtt_ms, alive_writers: dc.alive_writers, required_writers: dc.required_writers, + coverage_ratio: dc.coverage_ratio, coverage_pct: dc.coverage_pct, }) .collect(), diff --git a/src/api/runtime_stats.rs b/src/api/runtime_stats.rs index 9260c40..f8948d1 100644 --- a/src/api/runtime_stats.rs +++ b/src/api/runtime_stats.rs @@ -96,6 +96,8 @@ pub(super) fn build_zero_all_data(stats: &Stats, configured_users: usize) -> Zer pool_swap_total: stats.get_pool_swap_total(), pool_drain_active: stats.get_pool_drain_active(), pool_force_close_total: stats.get_pool_force_close_total(), + pool_drain_soft_evict_total: stats.get_pool_drain_soft_evict_total(), + pool_drain_soft_evict_writer_total: stats.get_pool_drain_soft_evict_writer_total(), pool_stale_pick_total: stats.get_pool_stale_pick_total(), writer_removed_total: 
stats.get_me_writer_removed_total(), writer_removed_unexpected_total: stats.get_me_writer_removed_unexpected_total(), @@ -313,6 +315,7 @@ async fn get_minimal_payload_cached( available_pct: status.available_pct, required_writers: status.required_writers, alive_writers: status.alive_writers, + coverage_ratio: status.coverage_ratio, coverage_pct: status.coverage_pct, fresh_alive_writers: status.fresh_alive_writers, fresh_coverage_pct: status.fresh_coverage_pct, @@ -370,6 +373,7 @@ async fn get_minimal_payload_cached( floor_max: entry.floor_max, floor_capped: entry.floor_capped, alive_writers: entry.alive_writers, + coverage_ratio: entry.coverage_ratio, coverage_pct: entry.coverage_pct, fresh_alive_writers: entry.fresh_alive_writers, fresh_coverage_pct: entry.fresh_coverage_pct, @@ -427,6 +431,11 @@ async fn get_minimal_payload_cached( me_reconnect_backoff_cap_ms: runtime.me_reconnect_backoff_cap_ms, me_reconnect_fast_retry_count: runtime.me_reconnect_fast_retry_count, me_pool_drain_ttl_secs: runtime.me_pool_drain_ttl_secs, + me_pool_drain_soft_evict_enabled: runtime.me_pool_drain_soft_evict_enabled, + me_pool_drain_soft_evict_grace_secs: runtime.me_pool_drain_soft_evict_grace_secs, + me_pool_drain_soft_evict_per_writer: runtime.me_pool_drain_soft_evict_per_writer, + me_pool_drain_soft_evict_budget_per_core: runtime.me_pool_drain_soft_evict_budget_per_core, + me_pool_drain_soft_evict_cooldown_ms: runtime.me_pool_drain_soft_evict_cooldown_ms, me_pool_force_close_secs: runtime.me_pool_force_close_secs, me_pool_min_fresh_ratio: runtime.me_pool_min_fresh_ratio, me_bind_stale_mode: runtime.me_bind_stale_mode, @@ -495,6 +504,7 @@ fn disabled_me_writers(now_epoch_secs: u64, reason: &'static str) -> MeWritersDa available_pct: 0.0, required_writers: 0, alive_writers: 0, + coverage_ratio: 0.0, coverage_pct: 0.0, fresh_alive_writers: 0, fresh_coverage_pct: 0.0, diff --git a/src/config/defaults.rs b/src/config/defaults.rs index 73b12d8..b36856c 100644 --- a/src/config/defaults.rs 
+++ b/src/config/defaults.rs @@ -27,8 +27,8 @@ const DEFAULT_ME_C2ME_CHANNEL_CAPACITY: usize = 1024; const DEFAULT_ME_READER_ROUTE_DATA_WAIT_MS: u64 = 2; const DEFAULT_ME_D2C_FLUSH_BATCH_MAX_FRAMES: usize = 32; const DEFAULT_ME_D2C_FLUSH_BATCH_MAX_BYTES: usize = 128 * 1024; -const DEFAULT_ME_D2C_FLUSH_BATCH_MAX_DELAY_US: u64 = 1500; -const DEFAULT_ME_D2C_ACK_FLUSH_IMMEDIATE: bool = false; +const DEFAULT_ME_D2C_FLUSH_BATCH_MAX_DELAY_US: u64 = 500; +const DEFAULT_ME_D2C_ACK_FLUSH_IMMEDIATE: bool = true; const DEFAULT_DIRECT_RELAY_COPY_BUF_C2S_BYTES: usize = 64 * 1024; const DEFAULT_DIRECT_RELAY_COPY_BUF_S2C_BYTES: usize = 256 * 1024; const DEFAULT_ME_WRITER_PICK_SAMPLE_SIZE: u8 = 3; @@ -36,7 +36,16 @@ const DEFAULT_ME_HEALTH_INTERVAL_MS_UNHEALTHY: u64 = 1000; const DEFAULT_ME_HEALTH_INTERVAL_MS_HEALTHY: u64 = 3000; const DEFAULT_ME_ADMISSION_POLL_MS: u64 = 1000; const DEFAULT_ME_WARN_RATE_LIMIT_MS: u64 = 5000; +const DEFAULT_ME_ROUTE_HYBRID_MAX_WAIT_MS: u64 = 3000; +const DEFAULT_ME_ROUTE_BLOCKING_SEND_TIMEOUT_MS: u64 = 250; +const DEFAULT_ME_C2ME_SEND_TIMEOUT_MS: u64 = 4000; +const DEFAULT_ME_POOL_DRAIN_SOFT_EVICT_ENABLED: bool = true; +const DEFAULT_ME_POOL_DRAIN_SOFT_EVICT_GRACE_SECS: u64 = 30; +const DEFAULT_ME_POOL_DRAIN_SOFT_EVICT_PER_WRITER: u8 = 1; +const DEFAULT_ME_POOL_DRAIN_SOFT_EVICT_BUDGET_PER_CORE: u16 = 8; +const DEFAULT_ME_POOL_DRAIN_SOFT_EVICT_COOLDOWN_MS: u64 = 5000; const DEFAULT_USER_MAX_UNIQUE_IPS_WINDOW_SECS: u64 = 30; +const DEFAULT_ACCEPT_PERMIT_TIMEOUT_MS: u64 = 250; const DEFAULT_UPSTREAM_CONNECT_RETRY_ATTEMPTS: u32 = 2; const DEFAULT_UPSTREAM_UNHEALTHY_FAIL_THRESHOLD: u32 = 5; const DEFAULT_UPSTREAM_CONNECT_BUDGET_MS: u64 = 3000; @@ -87,11 +96,11 @@ pub(crate) fn default_connect_timeout() -> u64 { } pub(crate) fn default_keepalive() -> u64 { - 60 + 15 } pub(crate) fn default_ack_timeout() -> u64 { - 300 + 90 } pub(crate) fn default_me_one_retry() -> u8 { 12 @@ -153,6 +162,10 @@ pub(crate) fn default_server_max_connections() -> u32 { 10_000 
} +pub(crate) fn default_accept_permit_timeout_ms() -> u64 { + DEFAULT_ACCEPT_PERMIT_TIMEOUT_MS +} + pub(crate) fn default_prefer_4() -> u8 { 4 } @@ -377,6 +390,18 @@ pub(crate) fn default_me_warn_rate_limit_ms() -> u64 { DEFAULT_ME_WARN_RATE_LIMIT_MS } +pub(crate) fn default_me_route_hybrid_max_wait_ms() -> u64 { + DEFAULT_ME_ROUTE_HYBRID_MAX_WAIT_MS +} + +pub(crate) fn default_me_route_blocking_send_timeout_ms() -> u64 { + DEFAULT_ME_ROUTE_BLOCKING_SEND_TIMEOUT_MS +} + +pub(crate) fn default_me_c2me_send_timeout_ms() -> u64 { + DEFAULT_ME_C2ME_SEND_TIMEOUT_MS +} + pub(crate) fn default_upstream_connect_retry_attempts() -> u32 { DEFAULT_UPSTREAM_CONNECT_RETRY_ATTEMPTS } @@ -594,6 +619,26 @@ pub(crate) fn default_me_pool_drain_threshold() -> u64 { 128 } +pub(crate) fn default_me_pool_drain_soft_evict_enabled() -> bool { + DEFAULT_ME_POOL_DRAIN_SOFT_EVICT_ENABLED +} + +pub(crate) fn default_me_pool_drain_soft_evict_grace_secs() -> u64 { + DEFAULT_ME_POOL_DRAIN_SOFT_EVICT_GRACE_SECS +} + +pub(crate) fn default_me_pool_drain_soft_evict_per_writer() -> u8 { + DEFAULT_ME_POOL_DRAIN_SOFT_EVICT_PER_WRITER +} + +pub(crate) fn default_me_pool_drain_soft_evict_budget_per_core() -> u16 { + DEFAULT_ME_POOL_DRAIN_SOFT_EVICT_BUDGET_PER_CORE +} + +pub(crate) fn default_me_pool_drain_soft_evict_cooldown_ms() -> u64 { + DEFAULT_ME_POOL_DRAIN_SOFT_EVICT_COOLDOWN_MS +} + pub(crate) fn default_me_bind_stale_ttl_secs() -> u64 { default_me_pool_drain_ttl_secs() } diff --git a/src/config/hot_reload.rs b/src/config/hot_reload.rs index d781f67..7b94999 100644 --- a/src/config/hot_reload.rs +++ b/src/config/hot_reload.rs @@ -37,7 +37,9 @@ use crate::config::{ }; use super::load::{LoadedConfig, ProxyConfig}; +const HOT_RELOAD_STABLE_SNAPSHOTS: u8 = 2; const HOT_RELOAD_DEBOUNCE: Duration = Duration::from_millis(50); +const HOT_RELOAD_STABLE_RECHECK: Duration = Duration::from_millis(75); // ── Hot fields ──────────────────────────────────────────────────────────────── @@ -55,6 +57,11 @@ pub 
struct HotFields { pub hardswap: bool, pub me_pool_drain_ttl_secs: u64, pub me_pool_drain_threshold: u64, + pub me_pool_drain_soft_evict_enabled: bool, + pub me_pool_drain_soft_evict_grace_secs: u64, + pub me_pool_drain_soft_evict_per_writer: u8, + pub me_pool_drain_soft_evict_budget_per_core: u16, + pub me_pool_drain_soft_evict_cooldown_ms: u64, pub me_pool_min_fresh_ratio: f32, pub me_reinit_drain_timeout_secs: u64, pub me_hardswap_warmup_delay_min_ms: u64, @@ -137,6 +144,15 @@ impl HotFields { hardswap: cfg.general.hardswap, me_pool_drain_ttl_secs: cfg.general.me_pool_drain_ttl_secs, me_pool_drain_threshold: cfg.general.me_pool_drain_threshold, + me_pool_drain_soft_evict_enabled: cfg.general.me_pool_drain_soft_evict_enabled, + me_pool_drain_soft_evict_grace_secs: cfg.general.me_pool_drain_soft_evict_grace_secs, + me_pool_drain_soft_evict_per_writer: cfg.general.me_pool_drain_soft_evict_per_writer, + me_pool_drain_soft_evict_budget_per_core: cfg + .general + .me_pool_drain_soft_evict_budget_per_core, + me_pool_drain_soft_evict_cooldown_ms: cfg + .general + .me_pool_drain_soft_evict_cooldown_ms, me_pool_min_fresh_ratio: cfg.general.me_pool_min_fresh_ratio, me_reinit_drain_timeout_secs: cfg.general.me_reinit_drain_timeout_secs, me_hardswap_warmup_delay_min_ms: cfg.general.me_hardswap_warmup_delay_min_ms, @@ -328,19 +344,49 @@ impl WatchManifest { #[derive(Debug, Default)] struct ReloadState { applied_snapshot_hash: Option, + candidate_snapshot_hash: Option, + candidate_hits: u8, } impl ReloadState { fn new(applied_snapshot_hash: Option) -> Self { - Self { applied_snapshot_hash } + Self { + applied_snapshot_hash, + candidate_snapshot_hash: None, + candidate_hits: 0, + } } fn is_applied(&self, hash: u64) -> bool { self.applied_snapshot_hash == Some(hash) } + fn observe_candidate(&mut self, hash: u64) -> u8 { + if self.candidate_snapshot_hash == Some(hash) { + self.candidate_hits = self.candidate_hits.saturating_add(1); + } else { + self.candidate_snapshot_hash = 
Some(hash); + self.candidate_hits = 1; + } + self.candidate_hits + } + + fn reset_candidate(&mut self) { + self.candidate_snapshot_hash = None; + self.candidate_hits = 0; + } + fn mark_applied(&mut self, hash: u64) { self.applied_snapshot_hash = Some(hash); + self.reset_candidate(); + } + + fn pending_candidate(&self) -> Option<(u64, u8)> { + let hash = self.candidate_snapshot_hash?; + if self.candidate_hits < HOT_RELOAD_STABLE_SNAPSHOTS { + return Some((hash, self.candidate_hits)); + } + None } } @@ -432,6 +478,15 @@ fn overlay_hot_fields(old: &ProxyConfig, new: &ProxyConfig) -> ProxyConfig { cfg.general.hardswap = new.general.hardswap; cfg.general.me_pool_drain_ttl_secs = new.general.me_pool_drain_ttl_secs; cfg.general.me_pool_drain_threshold = new.general.me_pool_drain_threshold; + cfg.general.me_pool_drain_soft_evict_enabled = new.general.me_pool_drain_soft_evict_enabled; + cfg.general.me_pool_drain_soft_evict_grace_secs = + new.general.me_pool_drain_soft_evict_grace_secs; + cfg.general.me_pool_drain_soft_evict_per_writer = + new.general.me_pool_drain_soft_evict_per_writer; + cfg.general.me_pool_drain_soft_evict_budget_per_core = + new.general.me_pool_drain_soft_evict_budget_per_core; + cfg.general.me_pool_drain_soft_evict_cooldown_ms = + new.general.me_pool_drain_soft_evict_cooldown_ms; cfg.general.me_pool_min_fresh_ratio = new.general.me_pool_min_fresh_ratio; cfg.general.me_reinit_drain_timeout_secs = new.general.me_reinit_drain_timeout_secs; cfg.general.me_hardswap_warmup_delay_min_ms = new.general.me_hardswap_warmup_delay_min_ms; @@ -557,6 +612,8 @@ fn warn_non_hot_changes(old: &ProxyConfig, new: &ProxyConfig, non_hot_changed: b || old.server.listen_tcp != new.server.listen_tcp || old.server.listen_unix_sock != new.server.listen_unix_sock || old.server.listen_unix_sock_perm != new.server.listen_unix_sock_perm + || old.server.max_connections != new.server.max_connections + || old.server.accept_permit_timeout_ms != new.server.accept_permit_timeout_ms { warned 
= true; warn!("config reload: server listener settings changed; restart required"); @@ -616,6 +673,9 @@ fn warn_non_hot_changes(old: &ProxyConfig, new: &ProxyConfig, non_hot_changed: b } if old.general.me_route_no_writer_mode != new.general.me_route_no_writer_mode || old.general.me_route_no_writer_wait_ms != new.general.me_route_no_writer_wait_ms + || old.general.me_route_hybrid_max_wait_ms != new.general.me_route_hybrid_max_wait_ms + || old.general.me_route_blocking_send_timeout_ms + != new.general.me_route_blocking_send_timeout_ms || old.general.me_route_inline_recovery_attempts != new.general.me_route_inline_recovery_attempts || old.general.me_route_inline_recovery_wait_ms @@ -624,6 +684,10 @@ fn warn_non_hot_changes(old: &ProxyConfig, new: &ProxyConfig, non_hot_changed: b warned = true; warn!("config reload: general.me_route_no_writer_* changed; restart required"); } + if old.general.me_c2me_send_timeout_ms != new.general.me_c2me_send_timeout_ms { + warned = true; + warn!("config reload: general.me_c2me_send_timeout_ms changed; restart required"); + } if old.general.unknown_dc_log_path != new.general.unknown_dc_log_path || old.general.unknown_dc_file_log_enabled != new.general.unknown_dc_file_log_enabled { @@ -812,6 +876,25 @@ fn log_changes( old_hot.me_pool_drain_threshold, new_hot.me_pool_drain_threshold, ); } + if old_hot.me_pool_drain_soft_evict_enabled != new_hot.me_pool_drain_soft_evict_enabled + || old_hot.me_pool_drain_soft_evict_grace_secs + != new_hot.me_pool_drain_soft_evict_grace_secs + || old_hot.me_pool_drain_soft_evict_per_writer + != new_hot.me_pool_drain_soft_evict_per_writer + || old_hot.me_pool_drain_soft_evict_budget_per_core + != new_hot.me_pool_drain_soft_evict_budget_per_core + || old_hot.me_pool_drain_soft_evict_cooldown_ms + != new_hot.me_pool_drain_soft_evict_cooldown_ms + { + info!( + "config reload: me_pool_drain_soft_evict: enabled={} grace={}s per_writer={} budget_per_core={} cooldown={}ms", + 
new_hot.me_pool_drain_soft_evict_enabled, + new_hot.me_pool_drain_soft_evict_grace_secs, + new_hot.me_pool_drain_soft_evict_per_writer, + new_hot.me_pool_drain_soft_evict_budget_per_core, + new_hot.me_pool_drain_soft_evict_cooldown_ms + ); + } if (old_hot.me_pool_min_fresh_ratio - new_hot.me_pool_min_fresh_ratio).abs() > f32::EPSILON { info!( @@ -1115,6 +1198,7 @@ fn reload_config( let loaded = match ProxyConfig::load_with_metadata(config_path) { Ok(loaded) => loaded, Err(e) => { + reload_state.reset_candidate(); error!("config reload: failed to parse {:?}: {}", config_path, e); return None; } @@ -1127,6 +1211,7 @@ fn reload_config( let next_manifest = WatchManifest::from_source_files(&source_files); if let Err(e) = new_cfg.validate() { + reload_state.reset_candidate(); error!("config reload: validation failed: {}; keeping old config", e); return Some(next_manifest); } @@ -1135,6 +1220,17 @@ fn reload_config( return Some(next_manifest); } + let candidate_hits = reload_state.observe_candidate(rendered_hash); + if candidate_hits < HOT_RELOAD_STABLE_SNAPSHOTS { + info!( + snapshot_hash = rendered_hash, + candidate_hits, + required_hits = HOT_RELOAD_STABLE_SNAPSHOTS, + "config reload: candidate snapshot observed but not stable yet" + ); + return Some(next_manifest); + } + let old_cfg = config_tx.borrow().clone(); let applied_cfg = overlay_hot_fields(&old_cfg, &new_cfg); let old_hot = HotFields::from_config(&old_cfg); @@ -1154,6 +1250,7 @@ fn reload_config( if old_hot.dns_overrides != applied_hot.dns_overrides && let Err(e) = crate::network::dns_overrides::install_entries(&applied_hot.dns_overrides) { + reload_state.reset_candidate(); error!( "config reload: invalid network.dns_overrides: {}; keeping old config", e @@ -1174,6 +1271,73 @@ fn reload_config( Some(next_manifest) } +async fn reload_with_internal_stable_rechecks( + config_path: &PathBuf, + config_tx: &watch::Sender>, + log_tx: &watch::Sender, + detected_ip_v4: Option, + detected_ip_v6: Option, + reload_state: 
&mut ReloadState, +) -> Option { + let mut next_manifest = reload_config( + config_path, + config_tx, + log_tx, + detected_ip_v4, + detected_ip_v6, + reload_state, + ); + let mut rechecks_left = HOT_RELOAD_STABLE_SNAPSHOTS.saturating_sub(1); + + while rechecks_left > 0 { + let Some((snapshot_hash, candidate_hits)) = reload_state.pending_candidate() else { + break; + }; + + info!( + snapshot_hash, + candidate_hits, + required_hits = HOT_RELOAD_STABLE_SNAPSHOTS, + rechecks_left, + recheck_delay_ms = HOT_RELOAD_STABLE_RECHECK.as_millis(), + "config reload: scheduling internal stable recheck" + ); + tokio::time::sleep(HOT_RELOAD_STABLE_RECHECK).await; + + let recheck_manifest = reload_config( + config_path, + config_tx, + log_tx, + detected_ip_v4, + detected_ip_v6, + reload_state, + ); + if recheck_manifest.is_some() { + next_manifest = recheck_manifest; + } + + if reload_state.is_applied(snapshot_hash) { + info!( + snapshot_hash, + "config reload: applied after internal stable recheck" + ); + break; + } + + if reload_state.pending_candidate().is_none() { + info!( + snapshot_hash, + "config reload: internal stable recheck aborted" + ); + break; + } + + rechecks_left = rechecks_left.saturating_sub(1); + } + + next_manifest +} + // ── Public API ──────────────────────────────────────────────────────────────── /// Spawn the hot-reload watcher task. 
@@ -1297,28 +1461,16 @@ pub fn spawn_config_watcher( tokio::time::sleep(HOT_RELOAD_DEBOUNCE).await; while notify_rx.try_recv().is_ok() {} - let mut next_manifest = reload_config( + if let Some(next_manifest) = reload_with_internal_stable_rechecks( &config_path, &config_tx, &log_tx, detected_ip_v4, detected_ip_v6, &mut reload_state, - ); - if next_manifest.is_none() { - tokio::time::sleep(HOT_RELOAD_DEBOUNCE).await; - while notify_rx.try_recv().is_ok() {} - next_manifest = reload_config( - &config_path, - &config_tx, - &log_tx, - detected_ip_v4, - detected_ip_v6, - &mut reload_state, - ); - } - - if let Some(next_manifest) = next_manifest { + ) + .await + { apply_watch_manifest( inotify_watcher.as_mut(), poll_watcher.as_mut(), @@ -1443,7 +1595,7 @@ mod tests { } #[test] - fn reload_applies_hot_change_on_first_observed_snapshot() { + fn reload_requires_stable_snapshot_before_hot_apply() { let initial_tag = "11111111111111111111111111111111"; let final_tag = "22222222222222222222222222222222"; let path = temp_config_path("telemt_hot_reload_stable"); @@ -1455,13 +1607,55 @@ mod tests { let (log_tx, _log_rx) = watch::channel(initial_cfg.general.log_level.clone()); let mut reload_state = ReloadState::new(Some(initial_hash)); + write_reload_config(&path, None, None); + reload_config(&path, &config_tx, &log_tx, None, None, &mut reload_state).unwrap(); + assert_eq!( + config_tx.borrow().general.ad_tag.as_deref(), + Some(initial_tag) + ); + write_reload_config(&path, Some(final_tag), None); + reload_config(&path, &config_tx, &log_tx, None, None, &mut reload_state).unwrap(); + assert_eq!( + config_tx.borrow().general.ad_tag.as_deref(), + Some(initial_tag) + ); + reload_config(&path, &config_tx, &log_tx, None, None, &mut reload_state).unwrap(); assert_eq!(config_tx.borrow().general.ad_tag.as_deref(), Some(final_tag)); let _ = std::fs::remove_file(path); } + #[tokio::test] + async fn reload_cycle_applies_after_single_external_event() { + let initial_tag = 
"10101010101010101010101010101010"; + let final_tag = "20202020202020202020202020202020"; + let path = temp_config_path("telemt_hot_reload_single_event"); + + write_reload_config(&path, Some(initial_tag), None); + let initial_cfg = Arc::new(ProxyConfig::load(&path).unwrap()); + let initial_hash = ProxyConfig::load_with_metadata(&path).unwrap().rendered_hash; + let (config_tx, _config_rx) = watch::channel(initial_cfg.clone()); + let (log_tx, _log_rx) = watch::channel(initial_cfg.general.log_level.clone()); + let mut reload_state = ReloadState::new(Some(initial_hash)); + + write_reload_config(&path, Some(final_tag), None); + reload_with_internal_stable_rechecks( + &path, + &config_tx, + &log_tx, + None, + None, + &mut reload_state, + ) + .await + .unwrap(); + + assert_eq!(config_tx.borrow().general.ad_tag.as_deref(), Some(final_tag)); + let _ = std::fs::remove_file(path); + } + #[test] fn reload_keeps_hot_apply_when_non_hot_fields_change() { let initial_tag = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"; @@ -1477,6 +1671,7 @@ mod tests { write_reload_config(&path, Some(final_tag), Some(initial_cfg.server.port + 1)); reload_config(&path, &config_tx, &log_tx, None, None, &mut reload_state).unwrap(); + reload_config(&path, &config_tx, &log_tx, None, None, &mut reload_state).unwrap(); let applied = config_tx.borrow().clone(); assert_eq!(applied.general.ad_tag.as_deref(), Some(final_tag)); @@ -1484,31 +1679,4 @@ mod tests { let _ = std::fs::remove_file(path); } - - #[test] - fn reload_recovers_after_parse_error_on_next_attempt() { - let initial_tag = "cccccccccccccccccccccccccccccccc"; - let final_tag = "dddddddddddddddddddddddddddddddd"; - let path = temp_config_path("telemt_hot_reload_parse_recovery"); - - write_reload_config(&path, Some(initial_tag), None); - let initial_cfg = Arc::new(ProxyConfig::load(&path).unwrap()); - let initial_hash = ProxyConfig::load_with_metadata(&path).unwrap().rendered_hash; - let (config_tx, _config_rx) = watch::channel(initial_cfg.clone()); - let 
(log_tx, _log_rx) = watch::channel(initial_cfg.general.log_level.clone()); - let mut reload_state = ReloadState::new(Some(initial_hash)); - - std::fs::write(&path, "[access.users\nuser = \"broken\"\n").unwrap(); - assert!(reload_config(&path, &config_tx, &log_tx, None, None, &mut reload_state).is_none()); - assert_eq!( - config_tx.borrow().general.ad_tag.as_deref(), - Some(initial_tag) - ); - - write_reload_config(&path, Some(final_tag), None); - reload_config(&path, &config_tx, &log_tx, None, None, &mut reload_state).unwrap(); - assert_eq!(config_tx.borrow().general.ad_tag.as_deref(), Some(final_tag)); - - let _ = std::fs::remove_file(path); - } } diff --git a/src/config/load.rs b/src/config/load.rs index ed3e303..0635f80 100644 --- a/src/config/load.rs +++ b/src/config/load.rs @@ -346,6 +346,12 @@ impl ProxyConfig { )); } + if config.general.me_c2me_send_timeout_ms > 60_000 { + return Err(ProxyError::Config( + "general.me_c2me_send_timeout_ms must be within [0, 60000]".to_string(), + )); + } + if config.general.me_reader_route_data_wait_ms > 20 { return Err(ProxyError::Config( "general.me_reader_route_data_wait_ms must be within [0, 20]".to_string(), @@ -406,6 +412,35 @@ impl ProxyConfig { )); } + if config.general.me_pool_drain_soft_evict_grace_secs > 3600 { + return Err(ProxyError::Config( + "general.me_pool_drain_soft_evict_grace_secs must be within [0, 3600]".to_string(), + )); + } + + if config.general.me_pool_drain_soft_evict_per_writer == 0 + || config.general.me_pool_drain_soft_evict_per_writer > 16 + { + return Err(ProxyError::Config( + "general.me_pool_drain_soft_evict_per_writer must be within [1, 16]".to_string(), + )); + } + + if config.general.me_pool_drain_soft_evict_budget_per_core == 0 + || config.general.me_pool_drain_soft_evict_budget_per_core > 64 + { + return Err(ProxyError::Config( + "general.me_pool_drain_soft_evict_budget_per_core must be within [1, 64]" + .to_string(), + )); + } + + if config.general.me_pool_drain_soft_evict_cooldown_ms 
== 0 { + return Err(ProxyError::Config( + "general.me_pool_drain_soft_evict_cooldown_ms must be > 0".to_string(), + )); + } + if config.access.user_max_unique_ips_window_secs == 0 { return Err(ProxyError::Config( "access.user_max_unique_ips_window_secs must be > 0".to_string(), @@ -598,6 +633,18 @@ impl ProxyConfig { )); } + if !(50..=60_000).contains(&config.general.me_route_hybrid_max_wait_ms) { + return Err(ProxyError::Config( + "general.me_route_hybrid_max_wait_ms must be within [50, 60000]".to_string(), + )); + } + + if config.general.me_route_blocking_send_timeout_ms > 5000 { + return Err(ProxyError::Config( + "general.me_route_blocking_send_timeout_ms must be within [0, 5000]".to_string(), + )); + } + if !(2..=4).contains(&config.general.me_writer_pick_sample_size) { return Err(ProxyError::Config( "general.me_writer_pick_sample_size must be within [2, 4]".to_string(), @@ -658,6 +705,12 @@ impl ProxyConfig { )); } + if config.server.accept_permit_timeout_ms > 60_000 { + return Err(ProxyError::Config( + "server.accept_permit_timeout_ms must be within [0, 60000]".to_string(), + )); + } + if config.general.effective_me_pool_force_close_secs() > 0 && config.general.effective_me_pool_force_close_secs() < config.general.me_pool_drain_ttl_secs diff --git a/src/config/types.rs b/src/config/types.rs index 0c5f09b..1f6078f 100644 --- a/src/config/types.rs +++ b/src/config/types.rs @@ -462,6 +462,11 @@ pub struct GeneralConfig { #[serde(default = "default_me_c2me_channel_capacity")] pub me_c2me_channel_capacity: usize, + /// Maximum wait in milliseconds for enqueueing C2ME commands when the queue is full. + /// `0` keeps legacy unbounded wait behavior. + #[serde(default = "default_me_c2me_send_timeout_ms")] + pub me_c2me_send_timeout_ms: u64, + /// Bounded wait in milliseconds for routing ME DATA to per-connection queue. /// `0` keeps legacy no-wait behavior. 
#[serde(default = "default_me_reader_route_data_wait_ms")] @@ -716,6 +721,15 @@ pub struct GeneralConfig { #[serde(default = "default_me_route_no_writer_wait_ms")] pub me_route_no_writer_wait_ms: u64, + /// Maximum cumulative wait in milliseconds for hybrid no-writer mode before failfast. + #[serde(default = "default_me_route_hybrid_max_wait_ms")] + pub me_route_hybrid_max_wait_ms: u64, + + /// Maximum wait in milliseconds for blocking ME writer channel send fallback. + /// `0` keeps legacy unbounded wait behavior. + #[serde(default = "default_me_route_blocking_send_timeout_ms")] + pub me_route_blocking_send_timeout_ms: u64, + /// Number of inline recovery attempts in legacy mode. #[serde(default = "default_me_route_inline_recovery_attempts")] pub me_route_inline_recovery_attempts: u32, @@ -803,6 +817,26 @@ pub struct GeneralConfig { #[serde(default = "default_me_pool_drain_threshold")] pub me_pool_drain_threshold: u64, + /// Enable staged client eviction for draining ME writers that remain non-empty past TTL. + #[serde(default = "default_me_pool_drain_soft_evict_enabled")] + pub me_pool_drain_soft_evict_enabled: bool, + + /// Extra grace in seconds after drain TTL before soft-eviction stage starts. + #[serde(default = "default_me_pool_drain_soft_evict_grace_secs")] + pub me_pool_drain_soft_evict_grace_secs: u64, + + /// Maximum number of client sessions to evict from one draining writer per health tick. + #[serde(default = "default_me_pool_drain_soft_evict_per_writer")] + pub me_pool_drain_soft_evict_per_writer: u8, + + /// Soft-eviction budget per CPU core for one health tick. + #[serde(default = "default_me_pool_drain_soft_evict_budget_per_core")] + pub me_pool_drain_soft_evict_budget_per_core: u16, + + /// Cooldown for repetitive soft-eviction on the same writer in milliseconds. + #[serde(default = "default_me_pool_drain_soft_evict_cooldown_ms")] + pub me_pool_drain_soft_evict_cooldown_ms: u64, + /// Policy for new binds on stale draining writers. 
#[serde(default)] pub me_bind_stale_mode: MeBindStaleMode, @@ -901,6 +935,7 @@ impl Default for GeneralConfig { me_writer_cmd_channel_capacity: default_me_writer_cmd_channel_capacity(), me_route_channel_capacity: default_me_route_channel_capacity(), me_c2me_channel_capacity: default_me_c2me_channel_capacity(), + me_c2me_send_timeout_ms: default_me_c2me_send_timeout_ms(), me_reader_route_data_wait_ms: default_me_reader_route_data_wait_ms(), me_d2c_flush_batch_max_frames: default_me_d2c_flush_batch_max_frames(), me_d2c_flush_batch_max_bytes: default_me_d2c_flush_batch_max_bytes(), @@ -955,6 +990,8 @@ impl Default for GeneralConfig { me_warn_rate_limit_ms: default_me_warn_rate_limit_ms(), me_route_no_writer_mode: MeRouteNoWriterMode::default(), me_route_no_writer_wait_ms: default_me_route_no_writer_wait_ms(), + me_route_hybrid_max_wait_ms: default_me_route_hybrid_max_wait_ms(), + me_route_blocking_send_timeout_ms: default_me_route_blocking_send_timeout_ms(), me_route_inline_recovery_attempts: default_me_route_inline_recovery_attempts(), me_route_inline_recovery_wait_ms: default_me_route_inline_recovery_wait_ms(), links: LinksConfig::default(), @@ -984,6 +1021,13 @@ impl Default for GeneralConfig { proxy_secret_len_max: default_proxy_secret_len_max(), me_pool_drain_ttl_secs: default_me_pool_drain_ttl_secs(), me_pool_drain_threshold: default_me_pool_drain_threshold(), + me_pool_drain_soft_evict_enabled: default_me_pool_drain_soft_evict_enabled(), + me_pool_drain_soft_evict_grace_secs: default_me_pool_drain_soft_evict_grace_secs(), + me_pool_drain_soft_evict_per_writer: default_me_pool_drain_soft_evict_per_writer(), + me_pool_drain_soft_evict_budget_per_core: + default_me_pool_drain_soft_evict_budget_per_core(), + me_pool_drain_soft_evict_cooldown_ms: + default_me_pool_drain_soft_evict_cooldown_ms(), me_bind_stale_mode: MeBindStaleMode::default(), me_bind_stale_ttl_secs: default_me_bind_stale_ttl_secs(), me_pool_min_fresh_ratio: default_me_pool_min_fresh_ratio(), @@ 
-1187,6 +1231,11 @@ pub struct ServerConfig { /// 0 means unlimited. #[serde(default = "default_server_max_connections")] pub max_connections: u32, + + /// Maximum wait in milliseconds while acquiring a connection slot permit. + /// `0` keeps legacy unbounded wait behavior. + #[serde(default = "default_accept_permit_timeout_ms")] + pub accept_permit_timeout_ms: u64, } impl Default for ServerConfig { @@ -1207,6 +1256,7 @@ impl Default for ServerConfig { api: ApiConfig::default(), listeners: Vec::new(), max_connections: default_server_max_connections(), + accept_permit_timeout_ms: default_accept_permit_timeout_ms(), } } } diff --git a/src/maestro/helpers.rs b/src/maestro/helpers.rs index 029d0ee..f916633 100644 --- a/src/maestro/helpers.rs +++ b/src/maestro/helpers.rs @@ -253,6 +253,7 @@ pub(crate) fn format_uptime(total_secs: u64) -> String { format!("{} / {} seconds", parts.join(", "), total_secs) } +#[allow(dead_code)] pub(crate) async fn wait_until_admission_open(admission_rx: &mut watch::Receiver) -> bool { loop { if *admission_rx.borrow() { diff --git a/src/maestro/listeners.rs b/src/maestro/listeners.rs index 6296fd7..fe041d9 100644 --- a/src/maestro/listeners.rs +++ b/src/maestro/listeners.rs @@ -24,7 +24,7 @@ use crate::transport::{ ListenOptions, UpstreamManager, create_listener, find_listener_processes, }; -use super::helpers::{is_expected_handshake_eof, print_proxy_links, wait_until_admission_open}; +use super::helpers::{is_expected_handshake_eof, print_proxy_links}; pub(crate) struct BoundListeners { pub(crate) listeners: Vec<(TcpListener, bool)>, @@ -195,7 +195,7 @@ pub(crate) async fn bind_listeners( has_unix_listener = true; let mut config_rx_unix: watch::Receiver> = config_rx.clone(); - let mut admission_rx_unix = admission_rx.clone(); + let admission_rx_unix = admission_rx.clone(); let stats = stats.clone(); let upstream_manager = upstream_manager.clone(); let replay_checker = replay_checker.clone(); @@ -212,17 +212,44 @@ pub(crate) async fn 
bind_listeners( let unix_conn_counter = Arc::new(std::sync::atomic::AtomicU64::new(1)); loop { - if !wait_until_admission_open(&mut admission_rx_unix).await { - warn!("Conditional-admission gate channel closed for unix listener"); - break; - } match unix_listener.accept().await { Ok((stream, _)) => { - let permit = match max_connections_unix.clone().acquire_owned().await { - Ok(permit) => permit, - Err(_) => { - error!("Connection limiter is closed"); - break; + if !*admission_rx_unix.borrow() { + drop(stream); + continue; + } + let accept_permit_timeout_ms = config_rx_unix + .borrow() + .server + .accept_permit_timeout_ms; + let permit = if accept_permit_timeout_ms == 0 { + match max_connections_unix.clone().acquire_owned().await { + Ok(permit) => permit, + Err(_) => { + error!("Connection limiter is closed"); + break; + } + } + } else { + match tokio::time::timeout( + Duration::from_millis(accept_permit_timeout_ms), + max_connections_unix.clone().acquire_owned(), + ) + .await + { + Ok(Ok(permit)) => permit, + Ok(Err(_)) => { + error!("Connection limiter is closed"); + break; + } + Err(_) => { + debug!( + timeout_ms = accept_permit_timeout_ms, + "Dropping accepted unix connection: permit wait timeout" + ); + drop(stream); + continue; + } } }; let conn_id = @@ -312,7 +339,7 @@ pub(crate) fn spawn_tcp_accept_loops( ) { for (listener, listener_proxy_protocol) in listeners { let mut config_rx: watch::Receiver> = config_rx.clone(); - let mut admission_rx_tcp = admission_rx.clone(); + let admission_rx_tcp = admission_rx.clone(); let stats = stats.clone(); let upstream_manager = upstream_manager.clone(); let replay_checker = replay_checker.clone(); @@ -327,17 +354,46 @@ pub(crate) fn spawn_tcp_accept_loops( tokio::spawn(async move { loop { - if !wait_until_admission_open(&mut admission_rx_tcp).await { - warn!("Conditional-admission gate channel closed for tcp listener"); - break; - } match listener.accept().await { Ok((stream, peer_addr)) => { - let permit = match 
max_connections_tcp.clone().acquire_owned().await { - Ok(permit) => permit, - Err(_) => { - error!("Connection limiter is closed"); - break; + if !*admission_rx_tcp.borrow() { + debug!(peer = %peer_addr, "Admission gate closed, dropping connection"); + drop(stream); + continue; + } + let accept_permit_timeout_ms = config_rx + .borrow() + .server + .accept_permit_timeout_ms; + let permit = if accept_permit_timeout_ms == 0 { + match max_connections_tcp.clone().acquire_owned().await { + Ok(permit) => permit, + Err(_) => { + error!("Connection limiter is closed"); + break; + } + } + } else { + match tokio::time::timeout( + Duration::from_millis(accept_permit_timeout_ms), + max_connections_tcp.clone().acquire_owned(), + ) + .await + { + Ok(Ok(permit)) => permit, + Ok(Err(_)) => { + error!("Connection limiter is closed"); + break; + } + Err(_) => { + debug!( + peer = %peer_addr, + timeout_ms = accept_permit_timeout_ms, + "Dropping accepted connection: permit wait timeout" + ); + drop(stream); + continue; + } } }; let config = config_rx.borrow_and_update().clone(); diff --git a/src/maestro/me_startup.rs b/src/maestro/me_startup.rs index 245c7a9..827b00c 100644 --- a/src/maestro/me_startup.rs +++ b/src/maestro/me_startup.rs @@ -238,6 +238,11 @@ pub(crate) async fn initialize_me_pool( config.general.hardswap, config.general.me_pool_drain_ttl_secs, config.general.me_pool_drain_threshold, + config.general.me_pool_drain_soft_evict_enabled, + config.general.me_pool_drain_soft_evict_grace_secs, + config.general.me_pool_drain_soft_evict_per_writer, + config.general.me_pool_drain_soft_evict_budget_per_core, + config.general.me_pool_drain_soft_evict_cooldown_ms, config.general.effective_me_pool_force_close_secs(), config.general.me_pool_min_fresh_ratio, config.general.me_hardswap_warmup_delay_min_ms, @@ -262,6 +267,8 @@ pub(crate) async fn initialize_me_pool( config.general.me_warn_rate_limit_ms, config.general.me_route_no_writer_mode, config.general.me_route_no_writer_wait_ms, + 
config.general.me_route_hybrid_max_wait_ms, + config.general.me_route_blocking_send_timeout_ms, config.general.me_route_inline_recovery_attempts, config.general.me_route_inline_recovery_wait_ms, ); diff --git a/src/maestro/mod.rs b/src/maestro/mod.rs index d4ce2e0..7ba7b39 100644 --- a/src/maestro/mod.rs +++ b/src/maestro/mod.rs @@ -484,7 +484,7 @@ pub async fn run() -> std::result::Result<(), Box> { Duration::from_secs(config.access.replay_window_secs), )); - let buffer_pool = Arc::new(BufferPool::with_config(16 * 1024, 4096)); + let buffer_pool = Arc::new(BufferPool::with_config(64 * 1024, 4096)); connectivity::run_startup_connectivity( &config, diff --git a/src/metrics.rs b/src/metrics.rs index f4f8a2e..3de9896 100644 --- a/src/metrics.rs +++ b/src/metrics.rs @@ -292,6 +292,109 @@ async fn render_metrics(stats: &Stats, config: &ProxyConfig, ip_tracker: &UserIp "telemt_connections_bad_total {}", if core_enabled { stats.get_connects_bad() } else { 0 } ); + let _ = writeln!(out, "# HELP telemt_connections_current Current active connections"); + let _ = writeln!(out, "# TYPE telemt_connections_current gauge"); + let _ = writeln!( + out, + "telemt_connections_current {}", + if core_enabled { + stats.get_current_connections_total() + } else { + 0 + } + ); + let _ = writeln!(out, "# HELP telemt_connections_direct_current Current active direct connections"); + let _ = writeln!(out, "# TYPE telemt_connections_direct_current gauge"); + let _ = writeln!( + out, + "telemt_connections_direct_current {}", + if core_enabled { + stats.get_current_connections_direct() + } else { + 0 + } + ); + let _ = writeln!(out, "# HELP telemt_connections_me_current Current active middle-end connections"); + let _ = writeln!(out, "# TYPE telemt_connections_me_current gauge"); + let _ = writeln!( + out, + "telemt_connections_me_current {}", + if core_enabled { + stats.get_current_connections_me() + } else { + 0 + } + ); + let _ = writeln!( + out, + "# HELP 
telemt_relay_adaptive_promotions_total Adaptive relay tier promotions" + ); + let _ = writeln!(out, "# TYPE telemt_relay_adaptive_promotions_total counter"); + let _ = writeln!( + out, + "telemt_relay_adaptive_promotions_total {}", + if core_enabled { + stats.get_relay_adaptive_promotions_total() + } else { + 0 + } + ); + let _ = writeln!( + out, + "# HELP telemt_relay_adaptive_demotions_total Adaptive relay tier demotions" + ); + let _ = writeln!(out, "# TYPE telemt_relay_adaptive_demotions_total counter"); + let _ = writeln!( + out, + "telemt_relay_adaptive_demotions_total {}", + if core_enabled { + stats.get_relay_adaptive_demotions_total() + } else { + 0 + } + ); + let _ = writeln!( + out, + "# HELP telemt_relay_adaptive_hard_promotions_total Adaptive relay hard promotions triggered by write pressure" + ); + let _ = writeln!( + out, + "# TYPE telemt_relay_adaptive_hard_promotions_total counter" + ); + let _ = writeln!( + out, + "telemt_relay_adaptive_hard_promotions_total {}", + if core_enabled { + stats.get_relay_adaptive_hard_promotions_total() + } else { + 0 + } + ); + let _ = writeln!(out, "# HELP telemt_reconnect_evict_total Reconnect-driven session evictions"); + let _ = writeln!(out, "# TYPE telemt_reconnect_evict_total counter"); + let _ = writeln!( + out, + "telemt_reconnect_evict_total {}", + if core_enabled { + stats.get_reconnect_evict_total() + } else { + 0 + } + ); + let _ = writeln!( + out, + "# HELP telemt_reconnect_stale_close_total Sessions closed because they became stale after reconnect" + ); + let _ = writeln!(out, "# TYPE telemt_reconnect_stale_close_total counter"); + let _ = writeln!( + out, + "telemt_reconnect_stale_close_total {}", + if core_enabled { + stats.get_reconnect_stale_close_total() + } else { + 0 + } + ); let _ = writeln!(out, "# HELP telemt_handshake_timeouts_total Handshake timeouts"); let _ = writeln!(out, "# TYPE telemt_handshake_timeouts_total counter"); @@ -1547,6 +1650,36 @@ async fn render_metrics(stats: &Stats, 
config: &ProxyConfig, ip_tracker: &UserIp } ); + let _ = writeln!( + out, + "# HELP telemt_pool_drain_soft_evict_total Soft-evicted client sessions on stuck draining writers" + ); + let _ = writeln!(out, "# TYPE telemt_pool_drain_soft_evict_total counter"); + let _ = writeln!( + out, + "telemt_pool_drain_soft_evict_total {}", + if me_allows_normal { + stats.get_pool_drain_soft_evict_total() + } else { + 0 + } + ); + + let _ = writeln!( + out, + "# HELP telemt_pool_drain_soft_evict_writer_total Draining writers with at least one soft eviction" + ); + let _ = writeln!(out, "# TYPE telemt_pool_drain_soft_evict_writer_total counter"); + let _ = writeln!( + out, + "telemt_pool_drain_soft_evict_writer_total {}", + if me_allows_normal { + stats.get_pool_drain_soft_evict_writer_total() + } else { + 0 + } + ); + let _ = writeln!(out, "# HELP telemt_pool_stale_pick_total Stale writer fallback picks for new binds"); let _ = writeln!(out, "# TYPE telemt_pool_stale_pick_total counter"); let _ = writeln!( @@ -1864,6 +1997,8 @@ mod tests { stats.increment_connects_all(); stats.increment_connects_all(); stats.increment_connects_bad(); + stats.increment_current_connections_direct(); + stats.increment_current_connections_me(); stats.increment_handshake_timeouts(); stats.increment_upstream_connect_attempt_total(); stats.increment_upstream_connect_attempt_total(); @@ -1895,6 +2030,9 @@ mod tests { assert!(output.contains("telemt_connections_total 2")); assert!(output.contains("telemt_connections_bad_total 1")); + assert!(output.contains("telemt_connections_current 2")); + assert!(output.contains("telemt_connections_direct_current 1")); + assert!(output.contains("telemt_connections_me_current 1")); assert!(output.contains("telemt_handshake_timeouts_total 1")); assert!(output.contains("telemt_upstream_connect_attempt_total 2")); assert!(output.contains("telemt_upstream_connect_success_total 1")); @@ -1937,6 +2075,9 @@ mod tests { let output = render_metrics(&stats, &config, 
&tracker).await; assert!(output.contains("telemt_connections_total 0")); assert!(output.contains("telemt_connections_bad_total 0")); + assert!(output.contains("telemt_connections_current 0")); + assert!(output.contains("telemt_connections_direct_current 0")); + assert!(output.contains("telemt_connections_me_current 0")); assert!(output.contains("telemt_handshake_timeouts_total 0")); assert!(output.contains("telemt_user_unique_ips_current{user=")); assert!(output.contains("telemt_user_unique_ips_recent_window{user=")); @@ -1970,11 +2111,21 @@ mod tests { assert!(output.contains("# TYPE telemt_uptime_seconds gauge")); assert!(output.contains("# TYPE telemt_connections_total counter")); assert!(output.contains("# TYPE telemt_connections_bad_total counter")); + assert!(output.contains("# TYPE telemt_connections_current gauge")); + assert!(output.contains("# TYPE telemt_connections_direct_current gauge")); + assert!(output.contains("# TYPE telemt_connections_me_current gauge")); + assert!(output.contains("# TYPE telemt_relay_adaptive_promotions_total counter")); + assert!(output.contains("# TYPE telemt_relay_adaptive_demotions_total counter")); + assert!(output.contains("# TYPE telemt_relay_adaptive_hard_promotions_total counter")); + assert!(output.contains("# TYPE telemt_reconnect_evict_total counter")); + assert!(output.contains("# TYPE telemt_reconnect_stale_close_total counter")); assert!(output.contains("# TYPE telemt_handshake_timeouts_total counter")); assert!(output.contains("# TYPE telemt_upstream_connect_attempt_total counter")); assert!(output.contains("# TYPE telemt_me_rpc_proxy_req_signal_sent_total counter")); assert!(output.contains("# TYPE telemt_me_idle_close_by_peer_total counter")); assert!(output.contains("# TYPE telemt_me_writer_removed_total counter")); + assert!(output.contains("# TYPE telemt_pool_drain_soft_evict_total counter")); + assert!(output.contains("# TYPE telemt_pool_drain_soft_evict_writer_total counter")); assert!(output.contains( "# 
TYPE telemt_me_writer_removed_unexpected_minus_restored_total gauge" )); diff --git a/src/proxy/adaptive_buffers.rs b/src/proxy/adaptive_buffers.rs new file mode 100644 index 0000000..3b1bce9 --- /dev/null +++ b/src/proxy/adaptive_buffers.rs @@ -0,0 +1,383 @@ +use dashmap::DashMap; +use std::cmp::max; +use std::sync::OnceLock; +use std::time::{Duration, Instant}; + +const EMA_ALPHA: f64 = 0.2; +const PROFILE_TTL: Duration = Duration::from_secs(300); +const THROUGHPUT_UP_BPS: f64 = 8_000_000.0; +const THROUGHPUT_DOWN_BPS: f64 = 2_000_000.0; +const RATIO_CONFIRM_THRESHOLD: f64 = 1.12; +const TIER1_HOLD_TICKS: u32 = 8; +const TIER2_HOLD_TICKS: u32 = 4; +const QUIET_DEMOTE_TICKS: u32 = 480; +const HARD_COOLDOWN_TICKS: u32 = 20; +const HARD_PENDING_THRESHOLD: u32 = 3; +const HARD_PARTIAL_RATIO_THRESHOLD: f64 = 0.25; +const DIRECT_C2S_CAP_BYTES: usize = 128 * 1024; +const DIRECT_S2C_CAP_BYTES: usize = 512 * 1024; +const ME_FRAMES_CAP: usize = 96; +const ME_BYTES_CAP: usize = 384 * 1024; +const ME_DELAY_MIN_US: u64 = 150; + +#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)] +pub enum AdaptiveTier { + Base = 0, + Tier1 = 1, + Tier2 = 2, + Tier3 = 3, +} + +impl AdaptiveTier { + pub fn promote(self) -> Self { + match self { + Self::Base => Self::Tier1, + Self::Tier1 => Self::Tier2, + Self::Tier2 => Self::Tier3, + Self::Tier3 => Self::Tier3, + } + } + + pub fn demote(self) -> Self { + match self { + Self::Base => Self::Base, + Self::Tier1 => Self::Base, + Self::Tier2 => Self::Tier1, + Self::Tier3 => Self::Tier2, + } + } + + fn ratio(self) -> (usize, usize) { + match self { + Self::Base => (1, 1), + Self::Tier1 => (5, 4), + Self::Tier2 => (3, 2), + Self::Tier3 => (2, 1), + } + } + + pub fn as_u8(self) -> u8 { + self as u8 + } +} + +#[derive(Debug, Clone, Copy, PartialEq, Eq)] +pub enum TierTransitionReason { + SoftConfirmed, + HardPressure, + QuietDemotion, +} + +#[derive(Debug, Clone, Copy, PartialEq, Eq)] +pub struct TierTransition { + pub from: AdaptiveTier, + 
pub to: AdaptiveTier, + pub reason: TierTransitionReason, +} + +#[derive(Debug, Clone, Copy, Default)] +pub struct RelaySignalSample { + pub c2s_bytes: u64, + pub s2c_requested_bytes: u64, + pub s2c_written_bytes: u64, + pub s2c_write_ops: u64, + pub s2c_partial_writes: u64, + pub s2c_consecutive_pending_writes: u32, +} + +#[derive(Debug, Clone, Copy)] +pub struct SessionAdaptiveController { + tier: AdaptiveTier, + max_tier_seen: AdaptiveTier, + throughput_ema_bps: f64, + incoming_ema_bps: f64, + outgoing_ema_bps: f64, + tier1_hold_ticks: u32, + tier2_hold_ticks: u32, + quiet_ticks: u32, + hard_cooldown_ticks: u32, +} + +impl SessionAdaptiveController { + pub fn new(initial_tier: AdaptiveTier) -> Self { + Self { + tier: initial_tier, + max_tier_seen: initial_tier, + throughput_ema_bps: 0.0, + incoming_ema_bps: 0.0, + outgoing_ema_bps: 0.0, + tier1_hold_ticks: 0, + tier2_hold_ticks: 0, + quiet_ticks: 0, + hard_cooldown_ticks: 0, + } + } + + pub fn max_tier_seen(&self) -> AdaptiveTier { + self.max_tier_seen + } + + pub fn observe(&mut self, sample: RelaySignalSample, tick_secs: f64) -> Option { + if tick_secs <= f64::EPSILON { + return None; + } + + if self.hard_cooldown_ticks > 0 { + self.hard_cooldown_ticks -= 1; + } + + let c2s_bps = (sample.c2s_bytes as f64 * 8.0) / tick_secs; + let incoming_bps = (sample.s2c_requested_bytes as f64 * 8.0) / tick_secs; + let outgoing_bps = (sample.s2c_written_bytes as f64 * 8.0) / tick_secs; + let throughput = c2s_bps.max(outgoing_bps); + + self.throughput_ema_bps = ema(self.throughput_ema_bps, throughput); + self.incoming_ema_bps = ema(self.incoming_ema_bps, incoming_bps); + self.outgoing_ema_bps = ema(self.outgoing_ema_bps, outgoing_bps); + + let tier1_now = self.throughput_ema_bps >= THROUGHPUT_UP_BPS; + if tier1_now { + self.tier1_hold_ticks = self.tier1_hold_ticks.saturating_add(1); + } else { + self.tier1_hold_ticks = 0; + } + + let ratio = if self.outgoing_ema_bps <= f64::EPSILON { + 0.0 + } else { + self.incoming_ema_bps / 
self.outgoing_ema_bps + }; + let tier2_now = ratio >= RATIO_CONFIRM_THRESHOLD; + if tier2_now { + self.tier2_hold_ticks = self.tier2_hold_ticks.saturating_add(1); + } else { + self.tier2_hold_ticks = 0; + } + + let partial_ratio = if sample.s2c_write_ops == 0 { + 0.0 + } else { + sample.s2c_partial_writes as f64 / sample.s2c_write_ops as f64 + }; + let hard_now = sample.s2c_consecutive_pending_writes >= HARD_PENDING_THRESHOLD + || partial_ratio >= HARD_PARTIAL_RATIO_THRESHOLD; + + if hard_now && self.hard_cooldown_ticks == 0 { + return self.promote(TierTransitionReason::HardPressure, HARD_COOLDOWN_TICKS); + } + + if self.tier1_hold_ticks >= TIER1_HOLD_TICKS && self.tier2_hold_ticks >= TIER2_HOLD_TICKS { + return self.promote(TierTransitionReason::SoftConfirmed, 0); + } + + let demote_candidate = self.throughput_ema_bps < THROUGHPUT_DOWN_BPS && !tier2_now && !hard_now; + if demote_candidate { + self.quiet_ticks = self.quiet_ticks.saturating_add(1); + if self.quiet_ticks >= QUIET_DEMOTE_TICKS { + self.quiet_ticks = 0; + return self.demote(TierTransitionReason::QuietDemotion); + } + } else { + self.quiet_ticks = 0; + } + + None + } + + fn promote( + &mut self, + reason: TierTransitionReason, + hard_cooldown_ticks: u32, + ) -> Option<TierTransition> { + let from = self.tier; + let to = from.promote(); + if from == to { + return None; + } + self.tier = to; + self.max_tier_seen = max(self.max_tier_seen, to); + self.hard_cooldown_ticks = hard_cooldown_ticks; + self.tier1_hold_ticks = 0; + self.tier2_hold_ticks = 0; + self.quiet_ticks = 0; + Some(TierTransition { from, to, reason }) + } + + fn demote(&mut self, reason: TierTransitionReason) -> Option<TierTransition> { + let from = self.tier; + let to = from.demote(); + if from == to { + return None; + } + self.tier = to; + self.tier1_hold_ticks = 0; + self.tier2_hold_ticks = 0; + Some(TierTransition { from, to, reason }) + } +} + +#[derive(Debug, Clone, Copy)] +struct UserAdaptiveProfile { + tier: AdaptiveTier, + seen_at: Instant, +} + +fn profiles() -> 
&'static DashMap<String, UserAdaptiveProfile> { + static USER_PROFILES: OnceLock<DashMap<String, UserAdaptiveProfile>> = OnceLock::new(); + USER_PROFILES.get_or_init(DashMap::new) +} + +pub fn seed_tier_for_user(user: &str) -> AdaptiveTier { + let now = Instant::now(); + if let Some(entry) = profiles().get(user) { + let value = entry.value(); + if now.duration_since(value.seen_at) <= PROFILE_TTL { + return value.tier; + } + } + AdaptiveTier::Base +} + +pub fn record_user_tier(user: &str, tier: AdaptiveTier) { + let now = Instant::now(); + if let Some(mut entry) = profiles().get_mut(user) { + let existing = *entry; + let effective = if now.duration_since(existing.seen_at) > PROFILE_TTL { + tier + } else { + max(existing.tier, tier) + }; + *entry = UserAdaptiveProfile { + tier: effective, + seen_at: now, + }; + return; + } + profiles().insert( + user.to_string(), + UserAdaptiveProfile { tier, seen_at: now }, + ); +} + +pub fn direct_copy_buffers_for_tier( + tier: AdaptiveTier, + base_c2s: usize, + base_s2c: usize, +) -> (usize, usize) { + let (num, den) = tier.ratio(); + ( + scale(base_c2s, num, den, DIRECT_C2S_CAP_BYTES), + scale(base_s2c, num, den, DIRECT_S2C_CAP_BYTES), + ) +} + +pub fn me_flush_policy_for_tier( + tier: AdaptiveTier, + base_frames: usize, + base_bytes: usize, + base_delay: Duration, +) -> (usize, usize, Duration) { + let (num, den) = tier.ratio(); + let frames = scale(base_frames, num, den, ME_FRAMES_CAP).max(1); + let bytes = scale(base_bytes, num, den, ME_BYTES_CAP).max(4096); + let delay_us = base_delay.as_micros() as u64; + let adjusted_delay_us = match tier { + AdaptiveTier::Base => delay_us, + AdaptiveTier::Tier1 => (delay_us.saturating_mul(7)).saturating_div(10), + AdaptiveTier::Tier2 => delay_us.saturating_div(2), + AdaptiveTier::Tier3 => (delay_us.saturating_mul(3)).saturating_div(10), + } + .max(ME_DELAY_MIN_US) + .min(delay_us.max(ME_DELAY_MIN_US)); + (frames, bytes, Duration::from_micros(adjusted_delay_us)) +} + +fn ema(prev: f64, value: f64) -> f64 { + if prev <= f64::EPSILON { + value + } else { + 
(prev * (1.0 - EMA_ALPHA)) + (value * EMA_ALPHA) + } +} + +fn scale(base: usize, numerator: usize, denominator: usize, cap: usize) -> usize { + let scaled = base + .saturating_mul(numerator) + .saturating_div(denominator.max(1)); + scaled.min(cap).max(1) +} + +#[cfg(test)] +mod tests { + use super::*; + + fn sample( + c2s_bytes: u64, + s2c_requested_bytes: u64, + s2c_written_bytes: u64, + s2c_write_ops: u64, + s2c_partial_writes: u64, + s2c_consecutive_pending_writes: u32, + ) -> RelaySignalSample { + RelaySignalSample { + c2s_bytes, + s2c_requested_bytes, + s2c_written_bytes, + s2c_write_ops, + s2c_partial_writes, + s2c_consecutive_pending_writes, + } + } + + #[test] + fn test_soft_promotion_requires_tier1_and_tier2() { + let mut ctrl = SessionAdaptiveController::new(AdaptiveTier::Base); + let tick_secs = 0.25; + let mut promoted = None; + for _ in 0..8 { + promoted = ctrl.observe( + sample( + 300_000, // ~9.6 Mbps + 320_000, // incoming > outgoing to confirm tier2 + 250_000, + 10, + 0, + 0, + ), + tick_secs, + ); + } + + let transition = promoted.expect("expected soft promotion"); + assert_eq!(transition.from, AdaptiveTier::Base); + assert_eq!(transition.to, AdaptiveTier::Tier1); + assert_eq!(transition.reason, TierTransitionReason::SoftConfirmed); + } + + #[test] + fn test_hard_promotion_on_pending_pressure() { + let mut ctrl = SessionAdaptiveController::new(AdaptiveTier::Base); + let transition = ctrl + .observe( + sample(10_000, 20_000, 10_000, 4, 1, 3), + 0.25, + ) + .expect("expected hard promotion"); + assert_eq!(transition.reason, TierTransitionReason::HardPressure); + assert_eq!(transition.to, AdaptiveTier::Tier1); + } + + #[test] + fn test_quiet_demotion_is_slow_and_stepwise() { + let mut ctrl = SessionAdaptiveController::new(AdaptiveTier::Tier2); + let mut demotion = None; + for _ in 0..QUIET_DEMOTE_TICKS { + demotion = ctrl.observe(sample(1, 1, 1, 1, 0, 0), 0.25); + } + + let transition = demotion.expect("expected quiet demotion"); + 
assert_eq!(transition.from, AdaptiveTier::Tier2); + assert_eq!(transition.to, AdaptiveTier::Tier1); + assert_eq!(transition.reason, TierTransitionReason::QuietDemotion); + } +} diff --git a/src/proxy/client.rs b/src/proxy/client.rs index 8dad5da..25e6cf9 100644 --- a/src/proxy/client.rs +++ b/src/proxy/client.rs @@ -4,10 +4,7 @@ use std::future::Future; use std::net::{IpAddr, SocketAddr}; use std::pin::Pin; use std::sync::Arc; -use std::sync::OnceLock; -use std::sync::atomic::{AtomicBool, Ordering}; use std::time::Duration; -use ipnetwork::IpNetwork; use tokio::io::{AsyncRead, AsyncReadExt, AsyncWrite}; use tokio::net::TcpStream; use tokio::time::timeout; @@ -24,50 +21,9 @@ enum HandshakeOutcome { Handled, } -#[must_use = "UserConnectionReservation must be kept alive to retain user/IP reservation until release or drop"] -struct UserConnectionReservation { - stats: Arc<Stats>, - ip_tracker: Arc<UserIpTracker>, - user: String, - ip: IpAddr, - active: bool, -} - -impl UserConnectionReservation { - fn new(stats: Arc<Stats>, ip_tracker: Arc<UserIpTracker>, user: String, ip: IpAddr) -> Self { - Self { - stats, - ip_tracker, - user, - ip, - active: true, - } - } - - async fn release(mut self) { - if !self.active { - return; - } - self.ip_tracker.remove_ip(&self.user, self.ip).await; - self.active = false; - self.stats.decrement_user_curr_connects(&self.user); - } -} - -impl Drop for UserConnectionReservation { - fn drop(&mut self) { - if !self.active { - return; - } - self.active = false; - self.stats.decrement_user_curr_connects(&self.user); - self.ip_tracker.enqueue_cleanup(self.user.clone(), self.ip); - } -} - use crate::config::ProxyConfig; use crate::crypto::SecureRandom; -use crate::error::{HandshakeResult, ProxyError, Result, StreamError}; +use crate::error::{HandshakeResult, ProxyError, Result}; use crate::ip_tracker::UserIpTracker; use crate::protocol::constants::*; use crate::protocol::tls; @@ -84,21 +40,10 @@ use crate::proxy::handshake::{HandshakeSuccess, handle_mtproto_handshake, handle use 
crate::proxy::masking::handle_bad_client; use crate::proxy::middle_relay::handle_via_middle_proxy; use crate::proxy::route_mode::{RelayRouteMode, RouteRuntimeController}; +use crate::proxy::session_eviction::register_session; fn beobachten_ttl(config: &ProxyConfig) -> Duration { - let minutes = config.general.beobachten_minutes; - if minutes == 0 { - static BEOBACHTEN_ZERO_MINUTES_WARNED: OnceLock<AtomicBool> = OnceLock::new(); - let warned = BEOBACHTEN_ZERO_MINUTES_WARNED.get_or_init(|| AtomicBool::new(false)); - if !warned.swap(true, Ordering::Relaxed) { - warn!( - "general.beobachten_minutes=0 is insecure because entries expire immediately; forcing minimum TTL to 1 minute" - ); - } - return Duration::from_secs(60); - } - - Duration::from_secs(minutes.saturating_mul(60)) + Duration::from_secs(config.general.beobachten_minutes.saturating_mul(60)) } fn record_beobachten_class( @@ -119,34 +64,14 @@ fn record_handshake_failure_class( peer_ip: IpAddr, error: &ProxyError, ) { - let class = match error { - ProxyError::Io(err) if err.kind() == std::io::ErrorKind::UnexpectedEof => { - "expected_64_got_0" - } - ProxyError::Stream(StreamError::UnexpectedEof) => "expected_64_got_0", - _ => "other", + let class = if error.to_string().contains("expected 64 bytes, got 0") { + "expected_64_got_0" + } else { + "other" }; record_beobachten_class(beobachten, config, peer_ip, class); } -fn is_trusted_proxy_source(peer_ip: IpAddr, trusted: &[IpNetwork]) -> bool { - if trusted.is_empty() { - static EMPTY_PROXY_TRUST_WARNED: OnceLock<AtomicBool> = OnceLock::new(); - let warned = EMPTY_PROXY_TRUST_WARNED.get_or_init(|| AtomicBool::new(false)); - if !warned.swap(true, Ordering::Relaxed) { - warn!( - "PROXY protocol enabled but server.proxy_protocol_trusted_cidrs is empty; rejecting all PROXY headers by default" - ); - } - return false; - } - trusted.iter().any(|cidr| cidr.contains(peer_ip)) -} - -fn synthetic_local_addr(port: u16) -> SocketAddr { - SocketAddr::from(([0, 0, 0, 0], port)) -} - pub async fn 
handle_client_stream<S>( mut stream: S, peer: SocketAddr, @@ -170,7 +95,9 @@ where let mut real_peer = normalize_ip(peer); // For non-TCP streams, use a synthetic local address; may be overridden by PROXY protocol dst - let mut local_addr = synthetic_local_addr(config.server.port); + let mut local_addr: SocketAddr = format!("0.0.0.0:{}", config.server.port) + .parse() + .unwrap_or_else(|_| "0.0.0.0:443".parse().unwrap()); if proxy_protocol_enabled { let proxy_header_timeout = Duration::from_millis( @@ -178,17 +105,6 @@ ); match timeout(proxy_header_timeout, parse_proxy_protocol(&mut stream, peer)).await { Ok(Ok(info)) => { - if !is_trusted_proxy_source(peer.ip(), &config.server.proxy_protocol_trusted_cidrs) - { - stats.increment_connects_bad(); - warn!( - peer = %peer, - trusted = ?config.server.proxy_protocol_trusted_cidrs, - "Rejecting PROXY protocol header from untrusted source" - ); - record_beobachten_class(&beobachten, &config, peer.ip(), "other"); - return Err(ProxyError::InvalidProxyProtocol); - } debug!( peer = %peer, client = %info.src_addr, @@ -234,13 +150,8 @@ if is_tls { let tls_len = u16::from_be_bytes([first_bytes[3], first_bytes[4]]) as usize; -// RFC 8446 §5.1 mandates that TLSPlaintext records must not exceed 2^14 - // bytes (16_384). A client claiming a larger record is non-compliant and - // may be an active probe attempting to force large allocations. - // - // Also enforce a minimum record size to avoid trivial/garbage probes. 
- if !(512..=MAX_TLS_RECORD_SIZE).contains(&tls_len) { - debug!(peer = %real_peer, tls_len = tls_len, max_tls_len = MAX_TLS_RECORD_SIZE, "TLS handshake length out of bounds"); + if tls_len < 512 { + debug!(peer = %real_peer, tls_len = tls_len, "TLS handshake too short"); stats.increment_connects_bad(); let (reader, writer) = tokio::io::split(stream); handle_bad_client( @@ -294,19 +205,9 @@ where &config, &replay_checker, true, Some(tls_user.as_str()), ).await { HandshakeResult::Success(result) => result, - HandshakeResult::BadClient { reader, writer } => { + HandshakeResult::BadClient { reader: _, writer: _ } => { stats.increment_connects_bad(); debug!(peer = %peer, "Valid TLS but invalid MTProto handshake"); - handle_bad_client( - reader, - writer, - &handshake, - real_peer, - local_addr, - &config, - &beobachten, - ) - .await; return Ok(HandshakeOutcome::Handled); } HandshakeResult::Error(e) => return Err(e), @@ -481,6 +382,7 @@ impl RunningClientHandler { pub async fn run(self) -> Result<()> { self.stats.increment_connects_all(); let peer = self.peer; + let _ip_tracker = self.ip_tracker.clone(); debug!(peer = %peer, "New connection"); if let Err(e) = configure_client_socket( @@ -544,24 +446,6 @@ impl RunningClientHandler { .await { Ok(Ok(info)) => { - if !is_trusted_proxy_source( - self.peer.ip(), - &self.config.server.proxy_protocol_trusted_cidrs, - ) { - self.stats.increment_connects_bad(); - warn!( - peer = %self.peer, - trusted = ?self.config.server.proxy_protocol_trusted_cidrs, - "Rejecting PROXY protocol header from untrusted source" - ); - record_beobachten_class( - &self.beobachten, - &self.config, - self.peer.ip(), - "other", - ); - return Err(ProxyError::InvalidProxyProtocol); - } debug!( peer = %self.peer, client = %info.src_addr, @@ -611,6 +495,7 @@ impl RunningClientHandler { let is_tls = tls::is_tls_handshake(&first_bytes[..3]); let peer = self.peer; + let _ip_tracker = self.ip_tracker.clone(); debug!(peer = %peer, is_tls = is_tls, "Handshake type 
detected"); @@ -623,15 +508,14 @@ impl RunningClientHandler { async fn handle_tls_client(mut self, first_bytes: [u8; 5], local_addr: SocketAddr) -> Result<HandshakeOutcome> { let peer = self.peer; + let _ip_tracker = self.ip_tracker.clone(); let tls_len = u16::from_be_bytes([first_bytes[3], first_bytes[4]]) as usize; debug!(peer = %peer, tls_len = tls_len, "Reading TLS handshake"); - // See RFC 8446 §5.1: TLSPlaintext records must not exceed 16_384 bytes. - // Treat too-small or too-large lengths as active probes and mask them. - if !(512..=MAX_TLS_RECORD_SIZE).contains(&tls_len) { - debug!(peer = %peer, tls_len = tls_len, max_tls_len = MAX_TLS_RECORD_SIZE, "TLS handshake length out of bounds"); + if tls_len < 512 { + debug!(peer = %peer, tls_len = tls_len, "TLS handshake too short"); self.stats.increment_connects_bad(); let (reader, writer) = self.stream.into_split(); handle_bad_client( @@ -707,19 +591,12 @@ .await { HandshakeResult::Success(result) => result, - HandshakeResult::BadClient { reader, writer } => { + HandshakeResult::BadClient { + reader: _, + writer: _, + } => { stats.increment_connects_bad(); debug!(peer = %peer, "Valid TLS but invalid MTProto handshake"); - handle_bad_client( - reader, - writer, - &handshake, - peer, - local_addr, - &config, - &self.beobachten, - ) - .await; return Ok(HandshakeOutcome::Handled); } HandshakeResult::Error(e) => return Err(e), @@ -746,6 +623,7 @@ async fn handle_direct_client(mut self, first_bytes: [u8; 5], local_addr: SocketAddr) -> Result<HandshakeOutcome> { let peer = self.peer; + let _ip_tracker = self.ip_tracker.clone(); if !self.config.general.modes.classic && !self.config.general.modes.secure { debug!(peer = %peer, "Non-TLS modes disabled"); @@ -849,22 +727,21 @@ { let user = success.user.clone(); - let user_limit_reservation = - match Self::acquire_user_connection_reservation_static( - &user, - &config, - stats.clone(), - peer_addr, - ip_tracker, - ) - .await - 
{ - Ok(reservation) => reservation, - Err(e) => { - warn!(user = %user, error = %e, "User admission check failed"); - return Err(e); - } - }; + if let Err(e) = Self::check_user_limits_static(&user, &config, &stats, peer_addr, &ip_tracker).await { + warn!(user = %user, error = %e, "User limit exceeded"); + return Err(e); + } + + let registration = register_session(&user, success.dc_idx); + if registration.replaced_existing { + stats.increment_reconnect_evict_total(); + warn!( + user = %user, + dc = success.dc_idx, + "Reconnect detected: replacing active session for user+dc" + ); + } + let session_lease = registration.lease; let route_snapshot = route_runtime.snapshot(); let session_id = rng.u64(); @@ -877,7 +754,7 @@ client_writer, success, pool.clone(), - stats.clone(), + stats, config, buffer_pool, local_addr, @@ -885,6 +762,7 @@ route_runtime.subscribe(), route_snapshot, session_id, + session_lease.clone(), ) .await } else { @@ -894,13 +772,14 @@ client_writer, success, upstream_manager, - stats.clone(), + stats, config, buffer_pool, rng, route_runtime.subscribe(), route_snapshot, session_id, + session_lease.clone(), ) .await } @@ -911,78 +790,25 @@ client_writer, success, upstream_manager, - stats.clone(), + stats, config, buffer_pool, rng, route_runtime.subscribe(), route_snapshot, session_id, + session_lease.clone(), ) .await }; - user_limit_reservation.release().await; + + ip_tracker.remove_ip(&user, peer_addr.ip()).await; relay_result } - async fn acquire_user_connection_reservation_static( - user: &str, - config: &ProxyConfig, - stats: Arc<Stats>, - peer_addr: SocketAddr, - ip_tracker: Arc<UserIpTracker>, - ) -> Result<UserConnectionReservation> { - if let Some(expiration) = config.access.user_expirations.get(user) - && chrono::Utc::now() > *expiration - { - return Err(ProxyError::UserExpired { - user: user.to_string(), - }); - } - - if let Some(quota) = config.access.user_data_quota.get(user) - 
&& stats.get_user_total_octets(user) >= *quota - { - return Err(ProxyError::DataQuotaExceeded { - user: user.to_string(), - }); - } - - let limit = config.access.user_max_tcp_conns.get(user).map(|v| *v as u64); - if !stats.try_acquire_user_curr_connects(user, limit) { - return Err(ProxyError::ConnectionLimitExceeded { - user: user.to_string(), - }); - } - - match ip_tracker.check_and_add(user, peer_addr.ip()).await { - Ok(()) => {} - Err(reason) => { - stats.decrement_user_curr_connects(user); - warn!( - user = %user, - ip = %peer_addr.ip(), - reason = %reason, - "IP limit exceeded" - ); - return Err(ProxyError::ConnectionLimitExceeded { - user: user.to_string(), - }); - } - } - - Ok(UserConnectionReservation::new( - stats, - ip_tracker, - user.to_string(), - peer_addr.ip(), - )) - } - - #[cfg(test)] async fn check_user_limits_static( - user: &str, - config: &ProxyConfig, + user: &str, + config: &ProxyConfig, stats: &Stats, peer_addr: SocketAddr, ip_tracker: &UserIpTracker, @@ -995,32 +821,9 @@ impl RunningClientHandler { }); } - if let Some(quota) = config.access.user_data_quota.get(user) - && stats.get_user_total_octets(user) >= *quota - { - return Err(ProxyError::DataQuotaExceeded { - user: user.to_string(), - }); - } - - let limit = config - .access - .user_max_tcp_conns - .get(user) - .map(|v| *v as u64); - if !stats.try_acquire_user_curr_connects(user, limit) { - return Err(ProxyError::ConnectionLimitExceeded { - user: user.to_string(), - }); - } - - match ip_tracker.check_and_add(user, peer_addr.ip()).await { - Ok(()) => { - ip_tracker.remove_ip(user, peer_addr.ip()).await; - stats.decrement_user_curr_connects(user); - } + let ip_reserved = match ip_tracker.check_and_add(user, peer_addr.ip()).await { + Ok(()) => true, Err(reason) => { - stats.decrement_user_curr_connects(user); warn!( user = %user, ip = %peer_addr.ip(), @@ -1031,14 +834,33 @@ impl RunningClientHandler { user: user.to_string(), }); } + }; + // IP limit check + + if let Some(limit) = 
config.access.user_max_tcp_conns.get(user) + && stats.get_user_curr_connects(user) >= *limit as u64 + { + if ip_reserved { + ip_tracker.remove_ip(user, peer_addr.ip()).await; + stats.increment_ip_reservation_rollback_tcp_limit_total(); + } + return Err(ProxyError::ConnectionLimitExceeded { + user: user.to_string(), + }); + } + + if let Some(quota) = config.access.user_data_quota.get(user) + && stats.get_user_total_octets(user) >= *quota + { + if ip_reserved { + ip_tracker.remove_ip(user, peer_addr.ip()).await; + stats.increment_ip_reservation_rollback_quota_limit_total(); + } + return Err(ProxyError::DataQuotaExceeded { + user: user.to_string(), + }); } Ok(()) } } - -#[cfg(test)] -#[path = "client_security_tests.rs"] -mod security_tests; -#[path = "client_adversarial_tests.rs"] -mod adversarial_tests; diff --git a/src/proxy/direct_relay.rs b/src/proxy/direct_relay.rs index ede908e..ac656d4 100644 --- a/src/proxy/direct_relay.rs +++ b/src/proxy/direct_relay.rs @@ -1,11 +1,7 @@ -use std::ffi::OsString; use std::fs::OpenOptions; use std::io::Write; use std::net::SocketAddr; -use std::path::{Component, Path, PathBuf}; use std::sync::Arc; -use std::collections::HashSet; -use std::sync::{Mutex, OnceLock}; use tokio::io::{AsyncRead, AsyncWrite, AsyncWriteExt}; use tokio::net::TcpStream; @@ -22,155 +18,12 @@ use crate::proxy::route_mode::{ RelayRouteMode, RouteCutoverState, ROUTE_SWITCH_ERROR_MSG, affected_cutover_state, cutover_stagger_delay, }; +use crate::proxy::adaptive_buffers; +use crate::proxy::session_eviction::SessionLease; use crate::stats::Stats; use crate::stream::{BufferPool, CryptoReader, CryptoWriter}; use crate::transport::UpstreamManager; -#[cfg(unix)] -use std::os::unix::fs::OpenOptionsExt; - -const UNKNOWN_DC_LOG_DISTINCT_LIMIT: usize = 1024; -static LOGGED_UNKNOWN_DCS: OnceLock<Mutex<HashSet<i16>>> = OnceLock::new(); -const MAX_SCOPE_HINT_LEN: usize = 64; - -fn validated_scope_hint(user: &str) -> Option<&str> { - let scope = user.strip_prefix("scope_")?; - if 
scope.is_empty() || scope.len() > MAX_SCOPE_HINT_LEN { - return None; - } - if scope - .bytes() - .all(|b| b.is_ascii_alphanumeric() || b == b'-') - { - Some(scope) - } else { - None - } -} - -#[derive(Clone)] -struct SanitizedUnknownDcLogPath { - resolved_path: PathBuf, - allowed_parent: PathBuf, - file_name: OsString, -} - -// In tests, this function shares global mutable state. Callers that also use -// cache-reset helpers must hold `unknown_dc_test_lock()` to keep assertions -// deterministic under parallel execution. -fn should_log_unknown_dc(dc_idx: i16) -> bool { - let set = LOGGED_UNKNOWN_DCS.get_or_init(|| Mutex::new(HashSet::new())); - should_log_unknown_dc_with_set(set, dc_idx) -} - -fn should_log_unknown_dc_with_set(set: &Mutex<HashSet<i16>>, dc_idx: i16) -> bool { - match set.lock() { - Ok(mut guard) => { - if guard.contains(&dc_idx) { - return false; - } - if guard.len() >= UNKNOWN_DC_LOG_DISTINCT_LIMIT { - return false; - } - guard.insert(dc_idx) - } - // Fail closed on poisoned state to avoid unbounded blocking log writes. 
- Err(_) => false, - } -} - -fn sanitize_unknown_dc_log_path(path: &str) -> Option<SanitizedUnknownDcLogPath> { - let candidate = Path::new(path); - if candidate.as_os_str().is_empty() { - return None; - } - if candidate - .components() - .any(|component| matches!(component, Component::ParentDir)) - { - return None; - } - - let cwd = std::env::current_dir().ok()?; - let file_name = candidate.file_name()?; - let parent = candidate.parent().unwrap_or_else(|| Path::new(".")); - let parent_path = if parent.is_absolute() { - parent.to_path_buf() - } else { - cwd.join(parent) - }; - let canonical_parent = parent_path.canonicalize().ok()?; - if !canonical_parent.is_dir() { - return None; - } - - Some(SanitizedUnknownDcLogPath { - resolved_path: canonical_parent.join(file_name), - allowed_parent: canonical_parent, - file_name: file_name.to_os_string(), - }) -} - -fn unknown_dc_log_path_is_still_safe(path: &SanitizedUnknownDcLogPath) -> bool { - let Some(parent) = path.resolved_path.parent() else { - return false; - }; - let Ok(current_parent) = parent.canonicalize() else { - return false; - }; - if current_parent != path.allowed_parent { - return false; - } - - if let Ok(canonical_target) = path.resolved_path.canonicalize() { - let Some(target_parent) = canonical_target.parent() else { - return false; - }; - let Some(target_name) = canonical_target.file_name() else { - return false; - }; - if target_parent != path.allowed_parent || target_name != path.file_name { - return false; - } - } - - true -} - -fn open_unknown_dc_log_append(path: &Path) -> std::io::Result<std::fs::File> { - #[cfg(unix)] - { - OpenOptions::new() - .create(true) - .append(true) - .custom_flags(libc::O_NOFOLLOW) - .open(path) - } - #[cfg(not(unix))] - { - let _ = path; - Err(std::io::Error::new( - std::io::ErrorKind::PermissionDenied, - "unknown_dc_file_log_enabled requires unix O_NOFOLLOW support", - )) - } -} - -#[cfg(test)] -fn clear_unknown_dc_log_cache_for_testing() { - if let Some(set) = LOGGED_UNKNOWN_DCS.get() - && let Ok(mut guard) = 
set.lock() - { - guard.clear(); - } -} - -#[cfg(test)] -fn unknown_dc_test_lock() -> &'static Mutex<()> { - static TEST_LOCK: OnceLock<Mutex<()>> = OnceLock::new(); - TEST_LOCK.get_or_init(|| Mutex::new(())) -} - pub(crate) async fn handle_via_direct( client_reader: CryptoReader<R>, client_writer: CryptoWriter<W>, @@ -183,6 +36,7 @@ pub(crate) async fn handle_via_direct( mut route_rx: watch::Receiver<RelayRouteMode>, route_snapshot: RouteCutoverState, session_id: u64, + session_lease: SessionLease, ) -> Result<()> where R: AsyncRead + Unpin + Send + 'static, { let dc_addr = get_dc_addr_static(success.dc_idx, &config)?; debug!( peer = %success.peer, dc_idx = success.dc_idx, dc_addr = %dc_addr, "Connecting to Telegram DC" ); - let scope_hint = validated_scope_hint(user); - if user.starts_with("scope_") && scope_hint.is_none() { - warn!( - user = %user, - "Ignoring invalid scope hint and falling back to default upstream selection" - ); - } let tg_stream = upstream_manager - .connect(dc_addr, Some(success.dc_idx), scope_hint) + .connect(dc_addr, Some(success.dc_idx), user.strip_prefix("scope_").filter(|s| !s.is_empty())) .await?; debug!(peer = %success.peer, dc_addr = %dc_addr, "Connected, performing TG handshake"); @@ -220,19 +67,29 @@ debug!(peer = %success.peer, "TG handshake complete, starting relay"); stats.increment_user_connects(user); - let _direct_connection_lease = stats.acquire_direct_connection_lease(); + stats.increment_user_curr_connects(user); + stats.increment_current_connections_direct(); + + let seed_tier = adaptive_buffers::seed_tier_for_user(user); + let (c2s_copy_buf, s2c_copy_buf) = adaptive_buffers::direct_copy_buffers_for_tier( + seed_tier, + config.general.direct_relay_copy_buf_c2s_bytes, + config.general.direct_relay_copy_buf_s2c_bytes, + ); let relay_result = relay_bidirectional( client_reader, client_writer, tg_reader, tg_writer, - config.general.direct_relay_copy_buf_c2s_bytes, - config.general.direct_relay_copy_buf_s2c_bytes, + c2s_copy_buf, + s2c_copy_buf, user, + success.dc_idx, Arc::clone(&stats), - config.access.user_data_quota.get(user).copied(), buffer_pool, + 
session_lease, + seed_tier, ); tokio::pin!(relay_result); let relay_result = loop { @@ -264,6 +121,9 @@ } }; + stats.decrement_current_connections_direct(); + stats.decrement_user_curr_connects(user); + match &relay_result { Ok(()) => debug!(user = %user, "Direct relay completed"), Err(e) => debug!(user = %user, error = %e, "Direct relay ended with error"), @@ -315,19 +175,12 @@ fn get_dc_addr_static(dc_idx: i16, config: &ProxyConfig) -> Result<SocketAddr> { && let Some(path) = &config.general.unknown_dc_log_path && let Ok(handle) = tokio::runtime::Handle::try_current() { - if let Some(path) = sanitize_unknown_dc_log_path(path) { - if should_log_unknown_dc(dc_idx) { - handle.spawn_blocking(move || { - if unknown_dc_log_path_is_still_safe(&path) - && let Ok(mut file) = open_unknown_dc_log_append(&path.resolved_path) - { - let _ = writeln!(file, "dc_idx={dc_idx}"); - } - }); + let path = path.clone(); + handle.spawn_blocking(move || { + if let Ok(mut file) = OpenOptions::new().create(true).append(true).open(path) { + let _ = writeln!(file, "dc_idx={dc_idx}"); } - } - } else { - warn!(dc_idx = dc_idx, raw_path = %path, "Rejected unsafe unknown DC log path"); - } + }); } } @@ -335,7 +188,7 @@ fn get_dc_addr_static(dc_idx: i16, config: &ProxyConfig) -> Result<SocketAddr> { let fallback_idx = if default_dc >= 1 && default_dc <= num_dcs { default_dc - 1 } else { - 0 + 1 }; info!( @@ -388,7 +241,3 @@ async fn do_tg_handshake_static( CryptoWriter::new(write_half, tg_encryptor, max_pending), )) } - -#[cfg(test)] -#[path = "direct_relay_security_tests.rs"] -mod security_tests; diff --git a/src/proxy/middle_relay.rs b/src/proxy/middle_relay.rs index 7298cb4..102b06c 100644 --- a/src/proxy/middle_relay.rs +++ b/src/proxy/middle_relay.rs @@ -1,15 +1,14 @@ -use std::collections::hash_map::RandomState; -use std::hash::BuildHasher; +use std::collections::HashMap; +use std::collections::hash_map::DefaultHasher; use std::hash::{Hash, Hasher}; use std::net::{IpAddr, SocketAddr}; -use 
std::sync::atomic::{AtomicBool, AtomicU64, Ordering}; +use std::sync::atomic::{AtomicU64, Ordering}; use std::sync::{Arc, Mutex, OnceLock}; use std::time::{Duration, Instant}; -use dashmap::DashMap; +use bytes::Bytes; use tokio::io::{AsyncRead, AsyncReadExt, AsyncWrite, AsyncWriteExt}; -use tokio::sync::{mpsc, oneshot, watch, Mutex as AsyncMutex}; -use tokio::time::timeout; +use tokio::sync::{mpsc, oneshot, watch}; use tracing::{debug, trace, warn}; use crate::config::ProxyConfig; @@ -21,38 +20,25 @@ use crate::proxy::route_mode::{ RelayRouteMode, RouteCutoverState, ROUTE_SWITCH_ERROR_MSG, affected_cutover_state, cutover_stagger_delay, }; +use crate::proxy::adaptive_buffers::{self, AdaptiveTier}; +use crate::proxy::session_eviction::SessionLease; use crate::stats::Stats; -use crate::stream::{BufferPool, CryptoReader, CryptoWriter, PooledBuffer}; +use crate::stream::{BufferPool, CryptoReader, CryptoWriter}; use crate::transport::middle_proxy::{MePool, MeResponse, proto_flags_for_tag}; enum C2MeCommand { - Data { payload: PooledBuffer, flags: u32 }, + Data { payload: Bytes, flags: u32 }, Close, } const DESYNC_DEDUP_WINDOW: Duration = Duration::from_secs(60); -const DESYNC_DEDUP_MAX_ENTRIES: usize = 65_536; -const DESYNC_DEDUP_PRUNE_SCAN_LIMIT: usize = 1024; -const DESYNC_FULL_CACHE_EMIT_MIN_INTERVAL: Duration = Duration::from_millis(1000); const DESYNC_ERROR_CLASS: &str = "frame_too_large_crypto_desync"; const C2ME_CHANNEL_CAPACITY_FALLBACK: usize = 128; const C2ME_SOFT_PRESSURE_MIN_FREE_SLOTS: usize = 64; const C2ME_SENDER_FAIRNESS_BUDGET: usize = 32; -#[cfg(test)] -const C2ME_SEND_TIMEOUT: Duration = Duration::from_millis(50); -#[cfg(not(test))] -const C2ME_SEND_TIMEOUT: Duration = Duration::from_secs(5); const ME_D2C_FLUSH_BATCH_MAX_FRAMES_MIN: usize = 1; const ME_D2C_FLUSH_BATCH_MAX_BYTES_MIN: usize = 4096; -#[cfg(test)] -const QUOTA_USER_LOCKS_MAX: usize = 64; -#[cfg(not(test))] -const QUOTA_USER_LOCKS_MAX: usize = 4_096; -static DESYNC_DEDUP: OnceLock<DashMap<u64, Instant>> = 
OnceLock::new(); -static DESYNC_HASHER: OnceLock<RandomState> = OnceLock::new(); -static DESYNC_FULL_CACHE_LAST_EMIT_AT: OnceLock<Mutex<Option<Instant>>> = OnceLock::new(); -static DESYNC_DEDUP_EVER_SATURATED: OnceLock<AtomicBool> = OnceLock::new(); -static QUOTA_USER_LOCKS: OnceLock<DashMap<String, Arc<AsyncMutex<()>>>> = OnceLock::new(); +static DESYNC_DEDUP: OnceLock<Mutex<HashMap<u64, Instant>>> = OnceLock::new(); struct RelayForensicsState { trace_id: u64, @@ -75,8 +61,8 @@ struct MeD2cFlushPolicy { } impl MeD2cFlushPolicy { - fn from_config(config: &ProxyConfig) -> Self { - Self { + fn from_config(config: &ProxyConfig, tier: AdaptiveTier) -> Self { + let base = Self { max_frames: config .general .me_d2c_flush_batch_max_frames .max(ME_D2C_FLUSH_BATCH_MAX_FRAMES_MIN), max_bytes: config .general .me_d2c_flush_batch_max_bytes .max(ME_D2C_FLUSH_BATCH_MAX_BYTES_MIN), max_delay: Duration::from_micros(config.general.me_d2c_flush_batch_max_delay_us), ack_flush_immediate: config.general.me_d2c_ack_flush_immediate, + }; + let (max_frames, max_bytes, max_delay) = adaptive_buffers::me_flush_policy_for_tier( + tier, + base.max_frames, + base.max_bytes, + base.max_delay, + ); + Self { + max_frames, + max_bytes, + max_delay, + ack_flush_immediate: base.ack_flush_immediate, } } } fn hash_value<T: Hash>(value: &T) -> u64 { - let state = DESYNC_HASHER.get_or_init(RandomState::new); - let mut hasher = state.build_hasher(); + let mut hasher = DefaultHasher::new(); value.hash(&mut hasher); hasher.finish() } @@ -107,122 +104,26 @@ fn should_emit_full_desync(key: u64, all_full: bool, now: Instant) -> bool { return true; } - let dedup = DESYNC_DEDUP.get_or_init(DashMap::new); - let saturated_before = dedup.len() >= DESYNC_DEDUP_MAX_ENTRIES; - let ever_saturated = DESYNC_DEDUP_EVER_SATURATED.get_or_init(|| AtomicBool::new(false)); - if saturated_before { - ever_saturated.store(true, Ordering::Relaxed); - } + let dedup = DESYNC_DEDUP.get_or_init(|| Mutex::new(HashMap::new())); + let mut guard = dedup.lock().expect("desync dedup mutex poisoned"); + guard.retain(|_, seen_at| now.duration_since(*seen_at) < DESYNC_DEDUP_WINDOW); - if let Some(mut seen_at) = 
dedup.get_mut(&key) { - if now.duration_since(*seen_at) >= DESYNC_DEDUP_WINDOW { - *seen_at = now; - return true; - } - return false; - } - - if dedup.len() >= DESYNC_DEDUP_MAX_ENTRIES { - let mut stale_keys = Vec::new(); - let mut oldest_candidate: Option<(u64, Instant)> = None; - for entry in dedup.iter().take(DESYNC_DEDUP_PRUNE_SCAN_LIMIT) { - let key = *entry.key(); - let seen_at = *entry.value(); - - match oldest_candidate { - Some((_, oldest_seen)) if seen_at >= oldest_seen => {} - _ => oldest_candidate = Some((key, seen_at)), - } - - if now.duration_since(seen_at) >= DESYNC_DEDUP_WINDOW { - stale_keys.push(*entry.key()); - } - } - for stale_key in stale_keys { - dedup.remove(&stale_key); - } - if dedup.len() >= DESYNC_DEDUP_MAX_ENTRIES { - let Some((evict_key, _)) = oldest_candidate else { - return false; - }; - dedup.remove(&evict_key); - dedup.insert(key, now); - return should_emit_full_desync_full_cache(now); - } - } - - dedup.insert(key, now); - let saturated_after = dedup.len() >= DESYNC_DEDUP_MAX_ENTRIES; - // Preserve the first sequential insert that reaches capacity as a normal - // emit, while still gating concurrent newcomer churn after the cache has - // ever been observed at saturation. 
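The saturation and eviction machinery being removed here is replaced by a much simpler time-windowed dedup: a `Mutex<HashMap>` that is pruned with `retain` on every call. A standalone, std-only sketch of that pattern (names are illustrative, not from this codebase):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

const WINDOW: Duration = Duration::from_secs(60);

/// Returns true when `key` has not been emitted within WINDOW.
/// Stale entries are pruned eagerly on every call, so the map never
/// grows beyond the set of keys seen in the last window.
fn should_emit(seen: &mut HashMap<u64, Instant>, key: u64, now: Instant) -> bool {
    seen.retain(|_, at| now.duration_since(*at) < WINDOW);
    match seen.get_mut(&key) {
        Some(at) if now.duration_since(*at) < WINDOW => false,
        Some(at) => {
            // Window expired: refresh the timestamp and emit again.
            *at = now;
            true
        }
        None => {
            seen.insert(key, now);
            true
        }
    }
}
```

The trade-off versus the removed code: O(n) pruning per call instead of capacity caps and scan limits, which is acceptable when the key space per window is small.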
- let was_ever_saturated = if saturated_after { - ever_saturated.swap(true, Ordering::Relaxed) - } else { - ever_saturated.load(Ordering::Relaxed) - }; - - if saturated_before || (saturated_after && was_ever_saturated) { - should_emit_full_desync_full_cache(now) - } else { - true - } -} - -fn should_emit_full_desync_full_cache(now: Instant) -> bool { - let gate = DESYNC_FULL_CACHE_LAST_EMIT_AT.get_or_init(|| Mutex::new(None)); - let Ok(mut last_emit_at) = gate.lock() else { - return false; - }; - - match *last_emit_at { - None => { - *last_emit_at = Some(now); - true - } - Some(last) => { - let Some(elapsed) = now.checked_duration_since(last) else { - *last_emit_at = Some(now); - return true; - }; - if elapsed >= DESYNC_FULL_CACHE_EMIT_MIN_INTERVAL { - *last_emit_at = Some(now); + match guard.get_mut(&key) { + Some(seen_at) => { + if now.duration_since(*seen_at) >= DESYNC_DEDUP_WINDOW { + *seen_at = now; true } else { false } } - } -} - -#[cfg(test)] -fn clear_desync_dedup_for_testing() { - if let Some(dedup) = DESYNC_DEDUP.get() { - dedup.clear(); - } - if let Some(ever_saturated) = DESYNC_DEDUP_EVER_SATURATED.get() { - ever_saturated.store(false, Ordering::Relaxed); - } - if let Some(last_emit_at) = DESYNC_FULL_CACHE_LAST_EMIT_AT.get() { - match last_emit_at.lock() { - Ok(mut guard) => { - *guard = None; - } - Err(poisoned) => { - let mut guard = poisoned.into_inner(); - *guard = None; - last_emit_at.clear_poison(); - } + None => { + guard.insert(key, now); + true } } } -#[cfg(test)] -fn desync_dedup_test_lock() -> &'static Mutex<()> { - static TEST_LOCK: OnceLock> = OnceLock::new(); - TEST_LOCK.get_or_init(|| Mutex::new(())) -} - fn report_desync_frame_too_large( state: &RelayForensicsState, proto_tag: ProtoTag, @@ -318,49 +219,10 @@ fn should_yield_c2me_sender(sent_since_yield: usize, has_backlog: bool) -> bool has_backlog && sent_since_yield >= C2ME_SENDER_FAIRNESS_BUDGET } -fn quota_exceeded_for_user(stats: &Stats, user: &str, quota_limit: Option) -> bool { - 
    quota_limit.is_some_and(|quota| stats.get_user_total_octets(user) >= quota)
-}
-
-fn quota_would_be_exceeded_for_user(
-    stats: &Stats,
-    user: &str,
-    quota_limit: Option<u64>,
-    bytes: u64,
-) -> bool {
-    quota_limit.is_some_and(|quota| {
-        let used = stats.get_user_total_octets(user);
-        used >= quota || bytes > quota.saturating_sub(used)
-    })
-}
-
-fn quota_user_lock(user: &str) -> Arc<AsyncMutex<()>> {
-    let locks = QUOTA_USER_LOCKS.get_or_init(DashMap::new);
-    if let Some(existing) = locks.get(user) {
-        return Arc::clone(existing.value());
-    }
-
-    if locks.len() >= QUOTA_USER_LOCKS_MAX {
-        locks.retain(|_, value| Arc::strong_count(value) > 1);
-    }
-
-    if locks.len() >= QUOTA_USER_LOCKS_MAX {
-        return Arc::new(AsyncMutex::new(()));
-    }
-
-    let created = Arc::new(AsyncMutex::new(()));
-    match locks.entry(user.to_string()) {
-        dashmap::mapref::entry::Entry::Occupied(entry) => Arc::clone(entry.get()),
-        dashmap::mapref::entry::Entry::Vacant(entry) => {
-            entry.insert(Arc::clone(&created));
-            created
-        }
-    }
-}
-
 async fn enqueue_c2me_command(
     tx: &mpsc::Sender<C2MeCommand>,
     cmd: C2MeCommand,
+    send_timeout: Duration,
 ) -> std::result::Result<(), mpsc::error::SendError<C2MeCommand>> {
     match tx.try_send(cmd) {
         Ok(()) => Ok(()),
@@ -370,7 +232,10 @@ async fn enqueue_c2me_command(
         if tx.capacity() <= C2ME_SOFT_PRESSURE_MIN_FREE_SLOTS {
             tokio::task::yield_now().await;
         }
-        match timeout(C2ME_SEND_TIMEOUT, tx.reserve()).await {
+        if send_timeout.is_zero() {
+            return tx.send(cmd).await;
+        }
+        match tokio::time::timeout(send_timeout, tx.reserve()).await {
             Ok(Ok(permit)) => {
                 permit.send(cmd);
                 Ok(())
@@ -389,22 +254,23 @@ pub(crate) async fn handle_via_middle_proxy(
     me_pool: Arc<MePool>,
     stats: Arc<Stats>,
     config: Arc<ProxyConfig>,
-    buffer_pool: Arc<BufferPool>,
+    _buffer_pool: Arc<BufferPool>,
     local_addr: SocketAddr,
     rng: Arc,
     mut route_rx: watch::Receiver,
     route_snapshot: RouteCutoverState,
     session_id: u64,
+    session_lease: SessionLease,
 ) -> Result<()>
 where
     R: AsyncRead + Unpin + Send + 'static,
     W: AsyncWrite + Unpin + Send + 'static,
 {
     let user =
success.user.clone(); - let quota_limit = config.access.user_data_quota.get(&user).copied(); let peer = success.peer; let proto_tag = success.proto_tag; let pool_generation = me_pool.current_generation(); + let seed_tier = adaptive_buffers::seed_tier_for_user(&user); debug!( user = %user, @@ -417,7 +283,7 @@ where ); let (conn_id, me_rx) = me_pool.registry().register().await; - let trace_id = session_id; + let trace_id = conn_id; let bytes_me2c = Arc::new(AtomicU64::new(0)); let mut forensics = RelayForensicsState { trace_id, @@ -432,7 +298,8 @@ where }; stats.increment_user_connects(&user); - let _me_connection_lease = stats.acquire_me_connection_lease(); + stats.increment_user_curr_connects(&user); + stats.increment_current_connections_me(); if let Some(cutover) = affected_cutover_state( &route_rx, @@ -450,9 +317,20 @@ where tokio::time::sleep(delay).await; let _ = me_pool.send_close(conn_id).await; me_pool.registry().unregister(conn_id).await; + stats.decrement_current_connections_me(); + stats.decrement_user_curr_connects(&user); return Err(ProxyError::Proxy(ROUTE_SWITCH_ERROR_MSG.to_string())); } + if session_lease.is_stale() { + stats.increment_reconnect_stale_close_total(); + let _ = me_pool.send_close(conn_id).await; + me_pool.registry().unregister(conn_id).await; + stats.decrement_current_connections_me(); + stats.decrement_user_curr_connects(&user); + return Err(ProxyError::Proxy("Session evicted by reconnect".to_string())); + } + // Per-user ad_tag from access.user_ad_tags; fallback to general.ad_tag (hot-reloadable) let user_tag: Option> = config .access @@ -488,6 +366,7 @@ where .general .me_c2me_channel_capacity .max(C2ME_CHANNEL_CAPACITY_FALLBACK); + let c2me_send_timeout = Duration::from_millis(config.general.me_c2me_send_timeout_ms); let (c2me_tx, mut c2me_rx) = mpsc::channel::(c2me_channel_capacity); let me_pool_c2me = me_pool.clone(); let effective_tag = effective_tag; @@ -496,15 +375,42 @@ where while let Some(cmd) = c2me_rx.recv().await { match 
cmd { C2MeCommand::Data { payload, flags } => { - me_pool_c2me.send_proxy_req( - conn_id, - success.dc_idx, - peer, - translated_local_addr, - payload.as_ref(), - flags, - effective_tag.as_deref(), - ).await?; + if c2me_send_timeout.is_zero() { + me_pool_c2me + .send_proxy_req( + conn_id, + success.dc_idx, + peer, + translated_local_addr, + payload.as_ref(), + flags, + effective_tag.as_deref(), + ) + .await?; + } else { + match tokio::time::timeout( + c2me_send_timeout, + me_pool_c2me.send_proxy_req( + conn_id, + success.dc_idx, + peer, + translated_local_addr, + payload.as_ref(), + flags, + effective_tag.as_deref(), + ), + ) + .await + { + Ok(send_result) => send_result?, + Err(_) => { + return Err(ProxyError::Proxy(format!( + "ME send timeout after {}ms", + c2me_send_timeout.as_millis() + ))); + } + } + } sent_since_yield = sent_since_yield.saturating_add(1); if should_yield_c2me_sender(sent_since_yield, !c2me_rx.is_empty()) { sent_since_yield = 0; @@ -526,7 +432,7 @@ where let rng_clone = rng.clone(); let user_clone = user.clone(); let bytes_me2c_clone = bytes_me2c.clone(); - let d2c_flush_policy = MeD2cFlushPolicy::from_config(&config); + let d2c_flush_policy = MeD2cFlushPolicy::from_config(&config, seed_tier); let me_writer = tokio::spawn(async move { let mut writer = crypto_writer; let mut frame_buf = Vec::with_capacity(16 * 1024); @@ -550,7 +456,6 @@ where &mut frame_buf, stats_clone.as_ref(), &user_clone, - quota_limit, bytes_me2c_clone.as_ref(), conn_id, d2c_flush_policy.ack_flush_immediate, @@ -583,7 +488,6 @@ where &mut frame_buf, stats_clone.as_ref(), &user_clone, - quota_limit, bytes_me2c_clone.as_ref(), conn_id, d2c_flush_policy.ack_flush_immediate, @@ -616,7 +520,6 @@ where &mut frame_buf, stats_clone.as_ref(), &user_clone, - quota_limit, bytes_me2c_clone.as_ref(), conn_id, d2c_flush_policy.ack_flush_immediate, @@ -649,7 +552,6 @@ where &mut frame_buf, stats_clone.as_ref(), &user_clone, - quota_limit, bytes_me2c_clone.as_ref(), conn_id, 
d2c_flush_policy.ack_flush_immediate, @@ -690,6 +592,12 @@ where let mut frame_counter: u64 = 0; let mut route_watch_open = true; loop { + if session_lease.is_stale() { + stats.increment_reconnect_stale_close_total(); + let _ = enqueue_c2me_command(&c2me_tx, C2MeCommand::Close, c2me_send_timeout).await; + main_result = Err(ProxyError::Proxy("Session evicted by reconnect".to_string())); + break; + } if let Some(cutover) = affected_cutover_state( &route_rx, RelayRouteMode::Middle, @@ -704,7 +612,7 @@ where "Cutover affected middle session, closing client connection" ); tokio::time::sleep(delay).await; - let _ = enqueue_c2me_command(&c2me_tx, C2MeCommand::Close).await; + let _ = enqueue_c2me_command(&c2me_tx, C2MeCommand::Close, c2me_send_timeout).await; main_result = Err(ProxyError::Proxy(ROUTE_SWITCH_ERROR_MSG.to_string())); break; } @@ -719,8 +627,6 @@ where &mut crypto_reader, proto_tag, frame_limit, - Duration::from_secs(config.timeouts.client_handshake.max(1)), - &buffer_pool, &forensics, &mut frame_counter, &stats, @@ -731,19 +637,7 @@ where forensics.bytes_c2me = forensics .bytes_c2me .saturating_add(payload.len() as u64); - if let Some(limit) = quota_limit { - let quota_lock = quota_user_lock(&user); - let _quota_guard = quota_lock.lock().await; - stats.add_user_octets_from(&user, payload.len() as u64); - if quota_exceeded_for_user(stats.as_ref(), &user, Some(limit)) { - main_result = Err(ProxyError::DataQuotaExceeded { - user: user.clone(), - }); - break; - } - } else { - stats.add_user_octets_from(&user, payload.len() as u64); - } + stats.add_user_octets_from(&user, payload.len() as u64); let mut flags = proto_flags; if quickack { flags |= RPC_FLAG_QUICKACK; @@ -752,9 +646,13 @@ where flags |= RPC_FLAG_NOT_ENCRYPTED; } // Keep client read loop lightweight: route heavy ME send path via a dedicated task. 
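The enqueue path in this diff first attempts a non-blocking `try_send` and only falls back to a slower, bounded wait when the channel is full. A std-only sketch of the same two-phase shape using `std::sync::mpsc` (the real code awaits a tokio `reserve` permit with a timeout; the blocking fallback here is a deliberate simplification):

```rust
use std::sync::mpsc::{SendError, SyncSender, TrySendError, sync_channel};

/// Fast path: non-blocking try_send when the queue has room.
/// Fallback: a blocking send, standing in for the async version's
/// timed wait on a reserve permit.
fn enqueue<T>(tx: &SyncSender<T>, msg: T) -> Result<(), SendError<T>> {
    match tx.try_send(msg) {
        Ok(()) => Ok(()),
        // Queue full: fall back to the slow path, reusing the message
        // handed back inside the error.
        Err(TrySendError::Full(msg)) => tx.send(msg),
        // Receiver gone: surface the same error type as a failed send.
        Err(TrySendError::Disconnected(msg)) => Err(SendError(msg)),
    }
}
```

The point of the split is that the common case pays no synchronization beyond the channel's own, while backpressure is still bounded rather than silently dropping data.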
- if enqueue_c2me_command(&c2me_tx, C2MeCommand::Data { payload, flags }) - .await - .is_err() + if enqueue_c2me_command( + &c2me_tx, + C2MeCommand::Data { payload, flags }, + c2me_send_timeout, + ) + .await + .is_err() { main_result = Err(ProxyError::Proxy("ME sender channel closed".into())); break; @@ -763,7 +661,12 @@ where Ok(None) => { debug!(conn_id, "Client EOF"); client_closed = true; - let _ = enqueue_c2me_command(&c2me_tx, C2MeCommand::Close).await; + let _ = enqueue_c2me_command( + &c2me_tx, + C2MeCommand::Close, + c2me_send_timeout, + ) + .await; break; } Err(e) => { @@ -812,7 +715,10 @@ where frames_ok = frame_counter, "ME relay cleanup" ); + adaptive_buffers::record_user_tier(&user, seed_tier); me_pool.registry().unregister(conn_id).await; + stats.decrement_current_connections_me(); + stats.decrement_user_curr_connects(&user); result } @@ -820,49 +726,30 @@ async fn read_client_payload( client_reader: &mut CryptoReader, proto_tag: ProtoTag, max_frame: usize, - frame_read_timeout: Duration, - buffer_pool: &Arc, forensics: &RelayForensicsState, frame_counter: &mut u64, stats: &Stats, -) -> Result> +) -> Result> where R: AsyncRead + Unpin + Send + 'static, { - async fn read_exact_with_timeout( - client_reader: &mut CryptoReader, - buf: &mut [u8], - frame_read_timeout: Duration, - ) -> Result<()> - where - R: AsyncRead + Unpin + Send + 'static, - { - match timeout(frame_read_timeout, client_reader.read_exact(buf)).await { - Ok(Ok(_)) => Ok(()), - Ok(Err(e)) => Err(ProxyError::Io(e)), - Err(_) => Err(ProxyError::Io(std::io::Error::new( - std::io::ErrorKind::TimedOut, - "middle-relay client frame read timeout", - ))), - } - } - loop { let (len, quickack, raw_len_bytes) = match proto_tag { ProtoTag::Abridged => { let mut first = [0u8; 1]; - match read_exact_with_timeout(client_reader, &mut first, frame_read_timeout).await { - Ok(()) => {} - Err(ProxyError::Io(e)) if e.kind() == std::io::ErrorKind::UnexpectedEof => { - return Ok(None); - } - Err(e) => return 
Err(e), + match client_reader.read_exact(&mut first).await { + Ok(_) => {} + Err(e) if e.kind() == std::io::ErrorKind::UnexpectedEof => return Ok(None), + Err(e) => return Err(ProxyError::Io(e)), } let quickack = (first[0] & 0x80) != 0; let len_words = if (first[0] & 0x7f) == 0x7f { let mut ext = [0u8; 3]; - read_exact_with_timeout(client_reader, &mut ext, frame_read_timeout).await?; + client_reader + .read_exact(&mut ext) + .await + .map_err(ProxyError::Io)?; u32::from_le_bytes([ext[0], ext[1], ext[2], 0]) as usize } else { (first[0] & 0x7f) as usize @@ -875,12 +762,10 @@ where } ProtoTag::Intermediate | ProtoTag::Secure => { let mut len_buf = [0u8; 4]; - match read_exact_with_timeout(client_reader, &mut len_buf, frame_read_timeout).await { - Ok(()) => {} - Err(ProxyError::Io(e)) if e.kind() == std::io::ErrorKind::UnexpectedEof => { - return Ok(None); - } - Err(e) => return Err(e), + match client_reader.read_exact(&mut len_buf).await { + Ok(_) => {} + Err(e) if e.kind() == std::io::ErrorKind::UnexpectedEof => return Ok(None), + Err(e) => return Err(ProxyError::Io(e)), } let quickack = (len_buf[3] & 0x80) != 0; ( @@ -932,21 +817,18 @@ where len }; - let mut payload = buffer_pool.get(); - payload.clear(); - let current_cap = payload.capacity(); - if current_cap < len { - payload.reserve(len - current_cap); - } - payload.resize(len, 0); - read_exact_with_timeout(client_reader, &mut payload[..len], frame_read_timeout).await?; + let mut payload = vec![0u8; len]; + client_reader + .read_exact(&mut payload) + .await + .map_err(ProxyError::Io)?; // Secure Intermediate: strip validated trailing padding bytes. 
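The length parsing in `read_client_payload` follows the publicly documented MTProto transport encodings (abridged and intermediate). A hedged, standalone sketch of just the length and quick-ack decoding; function names are illustrative, not from this codebase:

```rust
/// Abridged transport: the high bit of the first byte is the quick-ack
/// flag; the low 7 bits are the length in 4-byte words, with 0x7f
/// escaping to a 3-byte little-endian extended length.
/// Returns (payload_len_bytes, quickack, used_extended_form).
fn abridged_len(first: u8, ext: Option<[u8; 3]>) -> (usize, bool, bool) {
    let quickack = first & 0x80 != 0;
    let low = first & 0x7f;
    if low == 0x7f {
        let e = ext.expect("extended length bytes required");
        let words = u32::from_le_bytes([e[0], e[1], e[2], 0]) as usize;
        (words * 4, quickack, true)
    } else {
        (low as usize * 4, quickack, false)
    }
}

/// Intermediate transport: a 4-byte little-endian length; the top bit
/// of the last byte carries quick-ack and is masked off before use.
fn intermediate_len(len_buf: [u8; 4]) -> (usize, bool) {
    let quickack = len_buf[3] & 0x80 != 0;
    let mut b = len_buf;
    b[3] &= 0x7f;
    (u32::from_le_bytes(b) as usize, quickack)
}
```

The quick-ack bit placement matches what the diff checks (`first[0] & 0x80`, `len_buf[3] & 0x80`); the exact masking of the intermediate length is an assumption, since the diff elides that expression.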
if proto_tag == ProtoTag::Secure { payload.truncate(secure_payload_len); } *frame_counter += 1; - return Ok(Some((payload, quickack))); + return Ok(Some((Bytes::from(payload), quickack))); } } @@ -967,7 +849,6 @@ async fn process_me_writer_response( frame_buf: &mut Vec, stats: &Stats, user: &str, - quota_limit: Option, bytes_me2c: &AtomicU64, conn_id: u64, ack_flush_immediate: bool, @@ -983,47 +864,17 @@ where } else { trace!(conn_id, bytes = data.len(), flags, "ME->C data"); } - let data_len = data.len() as u64; - if let Some(limit) = quota_limit { - let quota_lock = quota_user_lock(user); - let _quota_guard = quota_lock.lock().await; - if quota_would_be_exceeded_for_user(stats, user, Some(limit), data_len) { - return Err(ProxyError::DataQuotaExceeded { - user: user.to_string(), - }); - } - write_client_payload( - client_writer, - proto_tag, - flags, - &data, - rng, - frame_buf, - ) - .await?; - - bytes_me2c.fetch_add(data.len() as u64, Ordering::Relaxed); - stats.add_user_octets_to(user, data.len() as u64); - - if quota_exceeded_for_user(stats, user, Some(limit)) { - return Err(ProxyError::DataQuotaExceeded { - user: user.to_string(), - }); - } - } else { - write_client_payload( - client_writer, - proto_tag, - flags, - &data, - rng, - frame_buf, - ) - .await?; - - bytes_me2c.fetch_add(data.len() as u64, Ordering::Relaxed); - stats.add_user_octets_to(user, data.len() as u64); - } + bytes_me2c.fetch_add(data.len() as u64, Ordering::Relaxed); + stats.add_user_octets_to(user, data.len() as u64); + write_client_payload( + client_writer, + proto_tag, + flags, + &data, + rng, + frame_buf, + ) + .await?; Ok(MeWriterResponseOutcome::Continue { frames: 1, @@ -1169,5 +1020,84 @@ where } #[cfg(test)] -#[path = "middle_relay_security_tests.rs"] -mod security_tests; +mod tests { + use super::*; + use tokio::time::{Duration as TokioDuration, timeout}; + + #[test] + fn should_yield_sender_only_on_budget_with_backlog() { + assert!(!should_yield_c2me_sender(0, true)); + 
assert!(!should_yield_c2me_sender(C2ME_SENDER_FAIRNESS_BUDGET - 1, true)); + assert!(!should_yield_c2me_sender(C2ME_SENDER_FAIRNESS_BUDGET, false)); + assert!(should_yield_c2me_sender(C2ME_SENDER_FAIRNESS_BUDGET, true)); + } + + #[tokio::test] + async fn enqueue_c2me_command_uses_try_send_fast_path() { + let (tx, mut rx) = mpsc::channel::(2); + enqueue_c2me_command( + &tx, + C2MeCommand::Data { + payload: Bytes::from_static(&[1, 2, 3]), + flags: 0, + }, + TokioDuration::from_millis(50), + ) + .await + .unwrap(); + + let recv = timeout(TokioDuration::from_millis(50), rx.recv()) + .await + .unwrap() + .unwrap(); + match recv { + C2MeCommand::Data { payload, flags } => { + assert_eq!(payload.as_ref(), &[1, 2, 3]); + assert_eq!(flags, 0); + } + C2MeCommand::Close => panic!("unexpected close command"), + } + } + + #[tokio::test] + async fn enqueue_c2me_command_falls_back_to_send_when_queue_is_full() { + let (tx, mut rx) = mpsc::channel::(1); + tx.send(C2MeCommand::Data { + payload: Bytes::from_static(&[9]), + flags: 9, + }) + .await + .unwrap(); + + let tx2 = tx.clone(); + let producer = tokio::spawn(async move { + enqueue_c2me_command( + &tx2, + C2MeCommand::Data { + payload: Bytes::from_static(&[7, 7]), + flags: 7, + }, + TokioDuration::from_millis(100), + ) + .await + .unwrap(); + }); + + let _ = timeout(TokioDuration::from_millis(100), rx.recv()) + .await + .unwrap(); + producer.await.unwrap(); + + let recv = timeout(TokioDuration::from_millis(100), rx.recv()) + .await + .unwrap() + .unwrap(); + match recv { + C2MeCommand::Data { payload, flags } => { + assert_eq!(payload.as_ref(), &[7, 7]); + assert_eq!(flags, 7); + } + C2MeCommand::Close => panic!("unexpected close command"), + } + } +} diff --git a/src/proxy/mod.rs b/src/proxy/mod.rs index 1eed469..ab840f6 100644 --- a/src/proxy/mod.rs +++ b/src/proxy/mod.rs @@ -1,5 +1,6 @@ //! 
Proxy Defs +pub mod adaptive_buffers; pub mod client; pub mod direct_relay; pub mod handshake; @@ -7,6 +8,7 @@ pub mod masking; pub mod middle_relay; pub mod route_mode; pub mod relay; +pub mod session_eviction; pub use client::ClientHandler; #[allow(unused_imports)] diff --git a/src/proxy/relay.rs b/src/proxy/relay.rs index a742e33..2b12d5a 100644 --- a/src/proxy/relay.rs +++ b/src/proxy/relay.rs @@ -53,17 +53,20 @@ use std::io; use std::pin::Pin; -use std::sync::{Arc, Mutex, OnceLock}; -use std::sync::atomic::{AtomicBool, AtomicU64, Ordering}; +use std::sync::Arc; +use std::sync::atomic::{AtomicU64, Ordering}; use std::task::{Context, Poll}; use std::time::Duration; -use dashmap::DashMap; use tokio::io::{ AsyncRead, AsyncWrite, AsyncWriteExt, ReadBuf, copy_bidirectional_with_sizes, }; use tokio::time::Instant; use tracing::{debug, trace, warn}; -use crate::error::{ProxyError, Result}; +use crate::error::Result; +use crate::proxy::adaptive_buffers::{ + self, AdaptiveTier, RelaySignalSample, SessionAdaptiveController, TierTransitionReason, +}; +use crate::proxy::session_eviction::SessionLease; use crate::stats::Stats; use crate::stream::BufferPool; @@ -80,6 +83,7 @@ const ACTIVITY_TIMEOUT: Duration = Duration::from_secs(1800); /// 10 seconds gives responsive timeout detection (±10s accuracy) /// without measurable overhead from atomic reads. const WATCHDOG_INTERVAL: Duration = Duration::from_secs(10); +const ADAPTIVE_TICK: Duration = Duration::from_millis(250); // ============= CombinedStream ============= @@ -156,6 +160,16 @@ struct SharedCounters { s2c_ops: AtomicU64, /// Milliseconds since relay epoch of last I/O activity last_activity_ms: AtomicU64, + /// Bytes requested to write to client (S→C direction). + s2c_requested_bytes: AtomicU64, + /// Total write operations for S→C direction. + s2c_write_ops: AtomicU64, + /// Number of partial writes to client. + s2c_partial_writes: AtomicU64, + /// Number of times S→C poll_write returned Pending. 
+ s2c_pending_writes: AtomicU64, + /// Consecutive pending writes in S→C direction. + s2c_consecutive_pending_writes: AtomicU64, } impl SharedCounters { @@ -166,6 +180,11 @@ impl SharedCounters { c2s_ops: AtomicU64::new(0), s2c_ops: AtomicU64::new(0), last_activity_ms: AtomicU64::new(0), + s2c_requested_bytes: AtomicU64::new(0), + s2c_write_ops: AtomicU64::new(0), + s2c_partial_writes: AtomicU64::new(0), + s2c_pending_writes: AtomicU64::new(0), + s2c_consecutive_pending_writes: AtomicU64::new(0), } } @@ -206,10 +225,6 @@ struct StatsIo { counters: Arc, stats: Arc, user: String, - quota_limit: Option, - quota_exceeded: Arc, - quota_read_wake_scheduled: bool, - quota_write_wake_scheduled: bool, epoch: Instant, } @@ -219,64 +234,11 @@ impl StatsIo { counters: Arc, stats: Arc, user: String, - quota_limit: Option, - quota_exceeded: Arc, epoch: Instant, ) -> Self { // Mark initial activity so the watchdog doesn't fire before data flows counters.touch(Instant::now(), epoch); - Self { - inner, - counters, - stats, - user, - quota_limit, - quota_exceeded, - quota_read_wake_scheduled: false, - quota_write_wake_scheduled: false, - epoch, - } - } -} - -#[derive(Debug)] -struct QuotaIoSentinel; - -impl std::fmt::Display for QuotaIoSentinel { - fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result { - f.write_str("user data quota exceeded") - } -} - -impl std::error::Error for QuotaIoSentinel {} - -fn quota_io_error() -> io::Error { - io::Error::new(io::ErrorKind::PermissionDenied, QuotaIoSentinel) -} - -fn is_quota_io_error(err: &io::Error) -> bool { - err.kind() == io::ErrorKind::PermissionDenied - && err - .get_ref() - .and_then(|source| source.downcast_ref::()) - .is_some() -} - -static QUOTA_USER_LOCKS: OnceLock>>> = OnceLock::new(); - -fn quota_user_lock(user: &str) -> Arc> { - let locks = QUOTA_USER_LOCKS.get_or_init(DashMap::new); - if let Some(existing) = locks.get(user) { - return Arc::clone(existing.value()); - } - - let created = 
Arc::new(Mutex::new(())); - match locks.entry(user.to_string()) { - dashmap::mapref::entry::Entry::Occupied(entry) => Arc::clone(entry.get()), - dashmap::mapref::entry::Entry::Vacant(entry) => { - entry.insert(Arc::clone(&created)); - created - } + Self { inner, counters, stats, user, epoch } } } @@ -287,42 +249,6 @@ impl AsyncRead for StatsIo { buf: &mut ReadBuf<'_>, ) -> Poll> { let this = self.get_mut(); - if this.quota_exceeded.load(Ordering::Relaxed) { - return Poll::Ready(Err(quota_io_error())); - } - - let quota_lock = this - .quota_limit - .is_some() - .then(|| quota_user_lock(&this.user)); - let _quota_guard = if let Some(lock) = quota_lock.as_ref() { - match lock.try_lock() { - Ok(guard) => { - this.quota_read_wake_scheduled = false; - Some(guard) - } - Err(_) => { - if !this.quota_read_wake_scheduled { - this.quota_read_wake_scheduled = true; - let waker = cx.waker().clone(); - tokio::task::spawn(async move { - tokio::task::yield_now().await; - waker.wake(); - }); - } - return Poll::Pending; - } - } - } else { - None - }; - - if let Some(limit) = this.quota_limit - && this.stats.get_user_total_octets(&this.user) >= limit - { - this.quota_exceeded.store(true, Ordering::Relaxed); - return Poll::Ready(Err(quota_io_error())); - } let before = buf.filled().len(); match Pin::new(&mut this.inner).poll_read(cx, buf) { @@ -337,13 +263,6 @@ impl AsyncRead for StatsIo { this.stats.add_user_octets_from(&this.user, n as u64); this.stats.increment_user_msgs_from(&this.user); - if let Some(limit) = this.quota_limit - && this.stats.get_user_total_octets(&this.user) >= limit - { - this.quota_exceeded.store(true, Ordering::Relaxed); - return Poll::Ready(Err(quota_io_error())); - } - trace!(user = %this.user, bytes = n, "C->S"); } Poll::Ready(Ok(())) @@ -360,57 +279,21 @@ impl AsyncWrite for StatsIo { buf: &[u8], ) -> Poll> { let this = self.get_mut(); - if this.quota_exceeded.load(Ordering::Relaxed) { - return Poll::Ready(Err(quota_io_error())); - } + this.counters + 
.s2c_requested_bytes + .fetch_add(buf.len() as u64, Ordering::Relaxed); - let quota_lock = this - .quota_limit - .is_some() - .then(|| quota_user_lock(&this.user)); - let _quota_guard = if let Some(lock) = quota_lock.as_ref() { - match lock.try_lock() { - Ok(guard) => { - this.quota_write_wake_scheduled = false; - Some(guard) - } - Err(_) => { - if !this.quota_write_wake_scheduled { - this.quota_write_wake_scheduled = true; - let waker = cx.waker().clone(); - tokio::task::spawn(async move { - tokio::task::yield_now().await; - waker.wake(); - }); - } - return Poll::Pending; - } - } - } else { - None - }; - - let write_buf = if let Some(limit) = this.quota_limit { - let used = this.stats.get_user_total_octets(&this.user); - if used >= limit { - this.quota_exceeded.store(true, Ordering::Relaxed); - return Poll::Ready(Err(quota_io_error())); - } - - let remaining = (limit - used) as usize; - if buf.len() > remaining { - // Fail closed: do not emit partial S->C payload when remaining - // quota cannot accommodate the pending write request. 
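The counters added around here (`s2c_requested_bytes`, `s2c_partial_writes`, pending-write tallies) treat short and deferred writes as cheap backpressure signals for the adaptive controller. A minimal synchronous analogue over `std::io::Write` (illustrative, not the project's API):

```rust
use std::io::{self, Write};

/// Wraps a writer and counts requested vs. actually written bytes,
/// plus how many writes were only partially accepted. A short write
/// means the sink could not take the whole buffer: a backpressure hint.
struct CountingWriter<W> {
    inner: W,
    requested: u64,
    written: u64,
    partial_writes: u64,
}

impl<W: Write> CountingWriter<W> {
    fn new(inner: W) -> Self {
        Self { inner, requested: 0, written: 0, partial_writes: 0 }
    }
}

impl<W: Write> Write for CountingWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.requested += buf.len() as u64;
        let n = self.inner.write(buf)?;
        self.written += n as u64;
        if n < buf.len() {
            self.partial_writes += 1;
        }
        Ok(n)
    }
    fn flush(&mut self) -> io::Result<()> {
        self.inner.flush()
    }
}
```

The async version in the diff additionally counts `Poll::Pending` returns, which have no synchronous equivalent; the requested/written/partial split is the shared idea.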
- this.quota_exceeded.store(true, Ordering::Relaxed); - return Poll::Ready(Err(quota_io_error())); - } - buf - } else { - buf - }; - - match Pin::new(&mut this.inner).poll_write(cx, write_buf) { + match Pin::new(&mut this.inner).poll_write(cx, buf) { Poll::Ready(Ok(n)) => { + this.counters.s2c_write_ops.fetch_add(1, Ordering::Relaxed); + this.counters + .s2c_consecutive_pending_writes + .store(0, Ordering::Relaxed); + if n < buf.len() { + this.counters + .s2c_partial_writes + .fetch_add(1, Ordering::Relaxed); + } if n > 0 { // S→C: data written to client this.counters.s2c_bytes.fetch_add(n as u64, Ordering::Relaxed); @@ -420,17 +303,19 @@ impl AsyncWrite for StatsIo { this.stats.add_user_octets_to(&this.user, n as u64); this.stats.increment_user_msgs_to(&this.user); - if let Some(limit) = this.quota_limit - && this.stats.get_user_total_octets(&this.user) >= limit - { - this.quota_exceeded.store(true, Ordering::Relaxed); - return Poll::Ready(Err(quota_io_error())); - } - trace!(user = %this.user, bytes = n, "S->C"); } Poll::Ready(Ok(n)) } + Poll::Pending => { + this.counters + .s2c_pending_writes + .fetch_add(1, Ordering::Relaxed); + this.counters + .s2c_consecutive_pending_writes + .fetch_add(1, Ordering::Relaxed); + Poll::Pending + } other => other, } } @@ -463,8 +348,7 @@ impl AsyncWrite for StatsIo { /// - Per-user stats: bytes and ops counted per direction /// - Periodic rate logging: every 10 seconds when active /// - Clean shutdown: both write sides are shut down on exit -/// - Error propagation: quota exits return `ProxyError::DataQuotaExceeded`, -/// other I/O failures are returned as `ProxyError::Io` +/// - Error propagation: I/O errors are returned as `ProxyError::Io` pub async fn relay_bidirectional( client_reader: CR, client_writer: CW, @@ -473,9 +357,11 @@ pub async fn relay_bidirectional( c2s_buf_size: usize, s2c_buf_size: usize, user: &str, + dc_idx: i16, stats: Arc, - quota_limit: Option, _buffer_pool: Arc, + session_lease: SessionLease, + 
seed_tier: AdaptiveTier, ) -> Result<()> where CR: AsyncRead + Unpin + Send + 'static, @@ -485,7 +371,6 @@ where { let epoch = Instant::now(); let counters = Arc::new(SharedCounters::new()); - let quota_exceeded = Arc::new(AtomicBool::new(false)); let user_owned = user.to_string(); // ── Combine split halves into bidirectional streams ────────────── @@ -498,31 +383,43 @@ where Arc::clone(&counters), Arc::clone(&stats), user_owned.clone(), - quota_limit, - Arc::clone("a_exceeded), epoch, ); // ── Watchdog: activity timeout + periodic rate logging ────────── let wd_counters = Arc::clone(&counters); let wd_user = user_owned.clone(); - let wd_quota_exceeded = Arc::clone("a_exceeded); + let wd_dc = dc_idx; + let wd_stats = Arc::clone(&stats); + let wd_session = session_lease.clone(); let watchdog = async { - let mut prev_c2s: u64 = 0; - let mut prev_s2c: u64 = 0; + let mut prev_c2s_log: u64 = 0; + let mut prev_s2c_log: u64 = 0; + let mut prev_c2s_sample: u64 = 0; + let mut prev_s2c_requested_sample: u64 = 0; + let mut prev_s2c_written_sample: u64 = 0; + let mut prev_s2c_write_ops_sample: u64 = 0; + let mut prev_s2c_partial_sample: u64 = 0; + let mut accumulated_log = Duration::ZERO; + let mut adaptive = SessionAdaptiveController::new(seed_tier); loop { - tokio::time::sleep(WATCHDOG_INTERVAL).await; + tokio::time::sleep(ADAPTIVE_TICK).await; + + if wd_session.is_stale() { + wd_stats.increment_reconnect_stale_close_total(); + warn!( + user = %wd_user, + dc = wd_dc, + "Session evicted by reconnect" + ); + return; + } let now = Instant::now(); let idle = wd_counters.idle_duration(now, epoch); - if wd_quota_exceeded.load(Ordering::Relaxed) { - warn!(user = %wd_user, "User data quota reached, closing relay"); - return; - } - // ── Activity timeout ──────────────────────────────────── if idle >= ACTIVITY_TIMEOUT { let c2s = wd_counters.c2s_bytes.load(Ordering::Relaxed); @@ -537,11 +434,80 @@ where return; // Causes select! 
to cancel copy_bidirectional } + let c2s_total = wd_counters.c2s_bytes.load(Ordering::Relaxed); + let s2c_requested_total = wd_counters + .s2c_requested_bytes + .load(Ordering::Relaxed); + let s2c_written_total = wd_counters.s2c_bytes.load(Ordering::Relaxed); + let s2c_write_ops_total = wd_counters + .s2c_write_ops + .load(Ordering::Relaxed); + let s2c_partial_total = wd_counters + .s2c_partial_writes + .load(Ordering::Relaxed); + let consecutive_pending = wd_counters + .s2c_consecutive_pending_writes + .load(Ordering::Relaxed) as u32; + + let sample = RelaySignalSample { + c2s_bytes: c2s_total.saturating_sub(prev_c2s_sample), + s2c_requested_bytes: s2c_requested_total + .saturating_sub(prev_s2c_requested_sample), + s2c_written_bytes: s2c_written_total + .saturating_sub(prev_s2c_written_sample), + s2c_write_ops: s2c_write_ops_total + .saturating_sub(prev_s2c_write_ops_sample), + s2c_partial_writes: s2c_partial_total + .saturating_sub(prev_s2c_partial_sample), + s2c_consecutive_pending_writes: consecutive_pending, + }; + + if let Some(transition) = adaptive.observe(sample, ADAPTIVE_TICK.as_secs_f64()) { + match transition.reason { + TierTransitionReason::SoftConfirmed => { + wd_stats.increment_relay_adaptive_promotions_total(); + } + TierTransitionReason::HardPressure => { + wd_stats.increment_relay_adaptive_promotions_total(); + wd_stats.increment_relay_adaptive_hard_promotions_total(); + } + TierTransitionReason::QuietDemotion => { + wd_stats.increment_relay_adaptive_demotions_total(); + } + } + adaptive_buffers::record_user_tier(&wd_user, adaptive.max_tier_seen()); + debug!( + user = %wd_user, + dc = wd_dc, + from_tier = transition.from.as_u8(), + to_tier = transition.to.as_u8(), + reason = ?transition.reason, + throughput_ema_bps = sample + .c2s_bytes + .max(sample.s2c_written_bytes) + .saturating_mul(8) + .saturating_mul(4), + "Adaptive relay tier transition" + ); + } + + prev_c2s_sample = c2s_total; + prev_s2c_requested_sample = s2c_requested_total; + 
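The watchdog derives per-tick samples by diffing monotonic totals with `saturating_sub`, so a counter reset or torn read cannot underflow into a huge bogus delta. A minimal sketch of that sampling idiom:

```rust
/// Computes the per-tick delta of a monotonically increasing counter.
/// saturating_sub clamps to zero if the observed total ever moves
/// backwards, instead of wrapping to a near-u64::MAX delta.
struct DeltaSampler {
    prev: u64,
}

impl DeltaSampler {
    fn new() -> Self {
        Self { prev: 0 }
    }

    fn sample(&mut self, total: u64) -> u64 {
        let delta = total.saturating_sub(self.prev);
        self.prev = total;
        delta
    }
}
```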
prev_s2c_written_sample = s2c_written_total; + prev_s2c_write_ops_sample = s2c_write_ops_total; + prev_s2c_partial_sample = s2c_partial_total; + + accumulated_log = accumulated_log.saturating_add(ADAPTIVE_TICK); + if accumulated_log < WATCHDOG_INTERVAL { + continue; + } + accumulated_log = Duration::ZERO; + // ── Periodic rate logging ─────────────────────────────── let c2s = wd_counters.c2s_bytes.load(Ordering::Relaxed); let s2c = wd_counters.s2c_bytes.load(Ordering::Relaxed); - let c2s_delta = c2s - prev_c2s; - let s2c_delta = s2c - prev_s2c; + let c2s_delta = c2s.saturating_sub(prev_c2s_log); + let s2c_delta = s2c.saturating_sub(prev_s2c_log); if c2s_delta > 0 || s2c_delta > 0 { let secs = WATCHDOG_INTERVAL.as_secs_f64(); @@ -555,8 +521,8 @@ where ); } - prev_c2s = c2s; - prev_s2c = s2c; + prev_c2s_log = c2s; + prev_s2c_log = s2c; } }; @@ -591,6 +557,7 @@ where let c2s_ops = counters.c2s_ops.load(Ordering::Relaxed); let s2c_ops = counters.s2c_ops.load(Ordering::Relaxed); let duration = epoch.elapsed(); + adaptive_buffers::record_user_tier(&user_owned, seed_tier); match copy_result { Some(Ok((c2s, s2c))) => { @@ -606,22 +573,6 @@ where ); Ok(()) } - Some(Err(e)) if is_quota_io_error(&e) => { - let c2s = counters.c2s_bytes.load(Ordering::Relaxed); - let s2c = counters.s2c_bytes.load(Ordering::Relaxed); - warn!( - user = %user_owned, - c2s_bytes = c2s, - s2c_bytes = s2c, - c2s_msgs = c2s_ops, - s2c_msgs = s2c_ops, - duration_secs = duration.as_secs(), - "Data quota reached, closing relay" - ); - Err(ProxyError::DataQuotaExceeded { - user: user_owned.clone(), - }) - } Some(Err(e)) => { // I/O error in one of the directions let c2s = counters.c2s_bytes.load(Ordering::Relaxed); @@ -655,9 +606,3 @@ where } } } - -#[cfg(test)] -#[path = "relay_security_tests.rs"] -mod security_tests; -#[path = "relay_adversarial_tests.rs"] -mod adversarial_tests; \ No newline at end of file diff --git a/src/proxy/session_eviction.rs b/src/proxy/session_eviction.rs new file mode 100644 
index 0000000..c735cae
--- /dev/null
+++ b/src/proxy/session_eviction.rs
@@ -0,0 +1,46 @@
+/// Session eviction is intentionally disabled in runtime.
+///
+/// The initial `user+dc` single-lease model caused valid parallel client
+/// connections to evict each other. Keep the API shape for compatibility,
+/// but make it a no-op until a safer policy is introduced.
+
+#[derive(Debug, Clone, Default)]
+pub struct SessionLease;
+
+impl SessionLease {
+    pub fn is_stale(&self) -> bool {
+        false
+    }
+
+    #[allow(dead_code)]
+    pub fn release(&self) {}
+}
+
+pub struct RegistrationResult {
+    pub lease: SessionLease,
+    pub replaced_existing: bool,
+}
+
+pub fn register_session(_user: &str, _dc_idx: i16) -> RegistrationResult {
+    RegistrationResult {
+        lease: SessionLease,
+        replaced_existing: false,
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn test_session_eviction_disabled_behavior() {
+        let first = register_session("alice", 2);
+        let second = register_session("alice", 2);
+        assert!(!first.replaced_existing);
+        assert!(!second.replaced_existing);
+        assert!(!first.lease.is_stale());
+        assert!(!second.lease.is_stale());
+        first.lease.release();
+        second.lease.release();
+    }
+}
diff --git a/src/stats/mod.rs b/src/stats/mod.rs
index 3c79448..f31e429 100644
--- a/src/stats/mod.rs
+++ b/src/stats/mod.rs
@@ -6,7 +6,6 @@ pub mod beobachten;
 pub mod telemetry;
 use std::sync::atomic::{AtomicBool, AtomicU8, AtomicU64, Ordering};
-use std::sync::Arc;
 use std::time::{Duration, Instant, SystemTime, UNIX_EPOCH};
 use dashmap::DashMap;
 use parking_lot::Mutex;
@@ -20,46 +19,6 @@ use tracing::debug;
 use crate::config::{MeTelemetryLevel, MeWriterPickMode};
 use self::telemetry::TelemetryPolicy;
-#[derive(Clone, Copy)]
-enum RouteConnectionGauge {
-    Direct,
-    Middle,
-}
-
-#[must_use = "RouteConnectionLease must be kept alive to hold the connection gauge increment"]
-pub struct RouteConnectionLease {
-    stats: Arc<Stats>,
-    gauge: RouteConnectionGauge,
-    active: bool,
-}
-
-impl RouteConnectionLease {
-    fn new(stats: Arc<Stats>, gauge: RouteConnectionGauge) -> Self {
-        Self {
-            stats,
-            gauge,
-            active: true,
-        }
-    }
-
-    #[cfg(test)]
-    fn disarm(&mut self) {
-        self.active = false;
-    }
-}
-
-impl Drop for RouteConnectionLease {
-    fn drop(&mut self) {
-        if !self.active {
-            return;
-        }
-        match self.gauge {
-            RouteConnectionGauge::Direct => self.stats.decrement_current_connections_direct(),
-            RouteConnectionGauge::Middle => self.stats.decrement_current_connections_me(),
-        }
-    }
-}
-
 // ============= Stats =============
 
 #[derive(Default)]
@@ -161,6 +120,8 @@ pub struct Stats {
     pool_swap_total: AtomicU64,
     pool_drain_active: AtomicU64,
     pool_force_close_total: AtomicU64,
+    pool_drain_soft_evict_total: AtomicU64,
+    pool_drain_soft_evict_writer_total: AtomicU64,
     pool_stale_pick_total: AtomicU64,
     me_writer_removed_total: AtomicU64,
     me_writer_removed_unexpected_total: AtomicU64,
@@ -174,6 +135,11 @@ pub struct Stats {
     me_inline_recovery_total: AtomicU64,
     ip_reservation_rollback_tcp_limit_total: AtomicU64,
     ip_reservation_rollback_quota_limit_total: AtomicU64,
+    relay_adaptive_promotions_total: AtomicU64,
+    relay_adaptive_demotions_total: AtomicU64,
+    relay_adaptive_hard_promotions_total: AtomicU64,
+    reconnect_evict_total: AtomicU64,
+    reconnect_stale_close_total: AtomicU64,
     telemetry_core_enabled: AtomicBool,
     telemetry_user_enabled: AtomicBool,
     telemetry_me_level: AtomicU8,
@@ -326,15 +292,35 @@ impl Stats {
     pub fn decrement_current_connections_me(&self) {
         Self::decrement_atomic_saturating(&self.current_connections_me);
     }
-
-    pub fn acquire_direct_connection_lease(self: &Arc<Self>) -> RouteConnectionLease {
-        self.increment_current_connections_direct();
-        RouteConnectionLease::new(self.clone(), RouteConnectionGauge::Direct)
+    pub fn increment_relay_adaptive_promotions_total(&self) {
+        if self.telemetry_core_enabled() {
+            self.relay_adaptive_promotions_total
+                .fetch_add(1, Ordering::Relaxed);
+        }
     }
-
-    pub fn acquire_me_connection_lease(self: &Arc<Self>) -> RouteConnectionLease {
-        self.increment_current_connections_me();
-        RouteConnectionLease::new(self.clone(), RouteConnectionGauge::Middle)
+    pub fn increment_relay_adaptive_demotions_total(&self) {
+        if self.telemetry_core_enabled() {
+            self.relay_adaptive_demotions_total
+                .fetch_add(1, Ordering::Relaxed);
+        }
+    }
+    pub fn increment_relay_adaptive_hard_promotions_total(&self) {
+        if self.telemetry_core_enabled() {
+            self.relay_adaptive_hard_promotions_total
+                .fetch_add(1, Ordering::Relaxed);
+        }
+    }
+    pub fn increment_reconnect_evict_total(&self) {
+        if self.telemetry_core_enabled() {
+            self.reconnect_evict_total
+                .fetch_add(1, Ordering::Relaxed);
+        }
+    }
+    pub fn increment_reconnect_stale_close_total(&self) {
+        if self.telemetry_core_enabled() {
+            self.reconnect_stale_close_total
+                .fetch_add(1, Ordering::Relaxed);
+        }
     }
     pub fn increment_handshake_timeouts(&self) {
         if self.telemetry_core_enabled() {
@@ -731,6 +717,18 @@ impl Stats {
             self.pool_force_close_total.fetch_add(1, Ordering::Relaxed);
         }
     }
+    pub fn increment_pool_drain_soft_evict_total(&self) {
+        if self.telemetry_me_allows_normal() {
+            self.pool_drain_soft_evict_total
+                .fetch_add(1, Ordering::Relaxed);
+        }
+    }
+    pub fn increment_pool_drain_soft_evict_writer_total(&self) {
+        if self.telemetry_me_allows_normal() {
+            self.pool_drain_soft_evict_writer_total
+                .fetch_add(1, Ordering::Relaxed);
+        }
+    }
     pub fn increment_pool_stale_pick_total(&self) {
         if self.telemetry_me_allows_normal() {
             self.pool_stale_pick_total.fetch_add(1, Ordering::Relaxed);
@@ -984,6 +982,22 @@ impl Stats {
         self.get_current_connections_direct()
             .saturating_add(self.get_current_connections_me())
     }
+    pub fn get_relay_adaptive_promotions_total(&self) -> u64 {
+        self.relay_adaptive_promotions_total.load(Ordering::Relaxed)
+    }
+    pub fn get_relay_adaptive_demotions_total(&self) -> u64 {
+        self.relay_adaptive_demotions_total.load(Ordering::Relaxed)
+    }
+    pub fn get_relay_adaptive_hard_promotions_total(&self) -> u64 {
+        self.relay_adaptive_hard_promotions_total
+            .load(Ordering::Relaxed)
+    }
+    pub fn get_reconnect_evict_total(&self) -> u64 {
+        self.reconnect_evict_total.load(Ordering::Relaxed)
+    }
+    pub fn get_reconnect_stale_close_total(&self) -> u64 {
+        self.reconnect_stale_close_total.load(Ordering::Relaxed)
+    }
     pub fn get_me_keepalive_sent(&self) -> u64 { self.me_keepalive_sent.load(Ordering::Relaxed) }
     pub fn get_me_keepalive_failed(&self) -> u64 { self.me_keepalive_failed.load(Ordering::Relaxed) }
     pub fn get_me_keepalive_pong(&self) -> u64 { self.me_keepalive_pong.load(Ordering::Relaxed) }
@@ -1236,6 +1250,12 @@ impl Stats {
     pub fn get_pool_force_close_total(&self) -> u64 {
         self.pool_force_close_total.load(Ordering::Relaxed)
     }
+    pub fn get_pool_drain_soft_evict_total(&self) -> u64 {
+        self.pool_drain_soft_evict_total.load(Ordering::Relaxed)
+    }
+    pub fn get_pool_drain_soft_evict_writer_total(&self) -> u64 {
+        self.pool_drain_soft_evict_writer_total.load(Ordering::Relaxed)
+    }
     pub fn get_pool_stale_pick_total(&self) -> u64 {
         self.pool_stale_pick_total.load(Ordering::Relaxed)
     }
@@ -1307,35 +1327,11 @@ impl Stats {
         Self::touch_user_stats(stats.value());
         stats.curr_connects.fetch_add(1, Ordering::Relaxed);
     }
-
-    pub fn try_acquire_user_curr_connects(&self, user: &str, limit: Option<u64>) -> bool {
-        if !self.telemetry_user_enabled() {
-            return true;
-        }
-
-        self.maybe_cleanup_user_stats();
-        let stats = self.user_stats.entry(user.to_string()).or_default();
-        Self::touch_user_stats(stats.value());
-
-        let counter = &stats.curr_connects;
-        let mut current = counter.load(Ordering::Relaxed);
-        loop {
-            if let Some(max) = limit && current >= max {
-                return false;
-            }
-            match counter.compare_exchange_weak(
-                current,
-                current.saturating_add(1),
-                Ordering::Relaxed,
-                Ordering::Relaxed,
-            ) {
-                Ok(_) => return true,
-                Err(actual) => current = actual,
-            }
-        }
-    }
     pub fn decrement_user_curr_connects(&self, user: &str) {
+        if !self.telemetry_user_enabled() {
+            return;
+        }
         self.maybe_cleanup_user_stats();
         if let Some(stats) = self.user_stats.get(user) {
             Self::touch_user_stats(stats.value());
@@ -1711,6 +1707,7 @@ impl ReplayChecker {
             let after = shard.len();
             cleaned += before.saturating_sub(after);
         }
+
         for shard_mutex in &self.tls_shards {
             let mut shard = shard_mutex.lock();
             let before = shard.len();
@@ -1851,11 +1848,3 @@ mod tests {
         assert_eq!(checker.stats().total_entries, 500);
     }
 }
-
-#[cfg(test)]
-#[path = "connection_lease_security_tests.rs"]
-mod connection_lease_security_tests;
-
-#[cfg(test)]
-#[path = "replay_checker_security_tests.rs"]
-mod replay_checker_security_tests;
diff --git a/src/stream/buffer_pool.rs b/src/stream/buffer_pool.rs
index 9c46922..dac0fb5 100644
--- a/src/stream/buffer_pool.rs
+++ b/src/stream/buffer_pool.rs
@@ -14,8 +14,7 @@ use std::sync::Arc;
 // ============= Configuration =============
 
 /// Default buffer size
-/// CHANGED: Reduced from 64KB to 16KB to match TLS record size and prevent bufferbloat.
-pub const DEFAULT_BUFFER_SIZE: usize = 16 * 1024;
+pub const DEFAULT_BUFFER_SIZE: usize = 64 * 1024;
 
 /// Default maximum number of pooled buffers
 pub const DEFAULT_MAX_BUFFERS: usize = 1024;
diff --git a/src/transport/middle_proxy/config_updater.rs b/src/transport/middle_proxy/config_updater.rs
index b6a0160..43a3569 100644
--- a/src/transport/middle_proxy/config_updater.rs
+++ b/src/transport/middle_proxy/config_updater.rs
@@ -299,6 +299,11 @@ async fn run_update_cycle(
         cfg.general.hardswap,
         cfg.general.me_pool_drain_ttl_secs,
         cfg.general.me_pool_drain_threshold,
+        cfg.general.me_pool_drain_soft_evict_enabled,
+        cfg.general.me_pool_drain_soft_evict_grace_secs,
+        cfg.general.me_pool_drain_soft_evict_per_writer,
+        cfg.general.me_pool_drain_soft_evict_budget_per_core,
+        cfg.general.me_pool_drain_soft_evict_cooldown_ms,
         cfg.general.effective_me_pool_force_close_secs(),
         cfg.general.me_pool_min_fresh_ratio,
         cfg.general.me_hardswap_warmup_delay_min_ms,
@@ -526,6 +531,11 @@ pub async fn me_config_updater(
         cfg.general.hardswap,
         cfg.general.me_pool_drain_ttl_secs,
         cfg.general.me_pool_drain_threshold,
+        cfg.general.me_pool_drain_soft_evict_enabled,
+        cfg.general.me_pool_drain_soft_evict_grace_secs,
+        cfg.general.me_pool_drain_soft_evict_per_writer,
+        cfg.general.me_pool_drain_soft_evict_budget_per_core,
+        cfg.general.me_pool_drain_soft_evict_cooldown_ms,
         cfg.general.effective_me_pool_force_close_secs(),
         cfg.general.me_pool_min_fresh_ratio,
         cfg.general.me_hardswap_warmup_delay_min_ms,
diff --git a/src/transport/middle_proxy/health.rs b/src/transport/middle_proxy/health.rs
index a6b1031..0b9b749 100644
--- a/src/transport/middle_proxy/health.rs
+++ b/src/transport/middle_proxy/health.rs
@@ -28,6 +28,8 @@ const HEALTH_RECONNECT_BUDGET_MAX: usize = 128;
 const HEALTH_DRAIN_CLOSE_BUDGET_PER_CORE: usize = 16;
 const HEALTH_DRAIN_CLOSE_BUDGET_MIN: usize = 16;
 const HEALTH_DRAIN_CLOSE_BUDGET_MAX: usize = 256;
+const HEALTH_DRAIN_SOFT_EVICT_BUDGET_MIN: usize = 8;
+const HEALTH_DRAIN_SOFT_EVICT_BUDGET_MAX: usize = 256;
 
 #[derive(Debug, Clone)]
 struct DcFloorPlanEntry {
@@ -66,6 +68,7 @@ pub async fn me_health_monitor(pool: Arc<MePool>, rng: Arc<SecureRandom>, _min_c
     let mut adaptive_recover_until: HashMap<(i32, IpFamily), Instant> = HashMap::new();
     let mut floor_warn_next_allowed: HashMap<(i32, IpFamily), Instant> = HashMap::new();
     let mut drain_warn_next_allowed: HashMap<u64, Instant> = HashMap::new();
+    let mut drain_soft_evict_next_allowed: HashMap<u64, Instant> = HashMap::new();
     let mut degraded_interval = true;
     loop {
         let interval = if degraded_interval {
@@ -75,7 +78,12 @@ pub async fn me_health_monitor(pool: Arc<MePool>, rng: Arc<SecureRandom>, _min_c
         };
         tokio::time::sleep(interval).await;
         pool.prune_closed_writers().await;
-        reap_draining_writers(&pool, &mut drain_warn_next_allowed).await;
+        reap_draining_writers(
+            &pool,
+            &mut drain_warn_next_allowed,
+            &mut drain_soft_evict_next_allowed,
+        )
+        .await;
         let v4_degraded = check_family(
             IpFamily::V4,
             &pool,
@@ -117,6 +125,7 @@
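The `HEALTH_DRAIN_SOFT_EVICT_BUDGET_MIN`/`MAX` constants introduced above bound a per-core work budget, mirroring the existing drain-close budget. A standalone sketch of that clamped computation (simplified: `per_core` is passed in directly instead of being read from the pool's config):

```rust
const BUDGET_MIN: usize = 8;
const BUDGET_MAX: usize = 256;

/// Scale a per-health-cycle work budget with the detected core count,
/// then clamp so both tiny and very large machines get sane values.
fn soft_evict_budget(per_core: usize) -> usize {
    let cores = std::thread::available_parallelism()
        .map(std::num::NonZeroUsize::get)
        .unwrap_or(1); // fall back to 1 core if detection fails
    cores.saturating_mul(per_core).clamp(BUDGET_MIN, BUDGET_MAX)
}
```

The clamp keeps a misconfigured `per_core` of 0 from disabling eviction entirely, and `saturating_mul` keeps an absurdly large value from overflowing before the clamp applies.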
pub(super) async fn reap_draining_writers( pool: &Arc, warn_next_allowed: &mut HashMap, + soft_evict_next_allowed: &mut HashMap, ) { let now_epoch_secs = MePool::now_epoch_secs(); let now = Instant::now(); @@ -124,12 +133,12 @@ pub(super) async fn reap_draining_writers( let drain_threshold = pool .me_pool_drain_threshold .load(std::sync::atomic::Ordering::Relaxed); + let writers = pool.writers.read().await.clone(); let activity = pool.registry.writer_activity_snapshot().await; - let mut draining_writers = Vec::::new(); + let mut draining_writers = Vec::new(); let mut empty_writer_ids = Vec::::new(); let mut force_close_writer_ids = Vec::::new(); - let writers = pool.writers.read().await; - for writer in writers.iter() { + for writer in writers { if !writer.draining.load(std::sync::atomic::Ordering::Relaxed) { continue; } @@ -143,38 +152,23 @@ pub(super) async fn reap_draining_writers( empty_writer_ids.push(writer.id); continue; } - draining_writers.push(DrainingWriterSnapshot { - id: writer.id, - writer_dc: writer.writer_dc, - addr: writer.addr, - generation: writer.generation, - created_at: writer.created_at, - draining_started_at_epoch_secs: writer - .draining_started_at_epoch_secs - .load(std::sync::atomic::Ordering::Relaxed), - drain_deadline_epoch_secs: writer - .drain_deadline_epoch_secs - .load(std::sync::atomic::Ordering::Relaxed), - allow_drain_fallback: writer - .allow_drain_fallback - .load(std::sync::atomic::Ordering::Relaxed), - }); + draining_writers.push(writer); } - drop(writers); - let overflow = if drain_threshold > 0 && draining_writers.len() > drain_threshold as usize { - draining_writers.len().saturating_sub(drain_threshold as usize) - } else { - 0 - }; - - if overflow > 0 { + if drain_threshold > 0 && draining_writers.len() > drain_threshold as usize { draining_writers.sort_by(|left, right| { - left.draining_started_at_epoch_secs - .cmp(&right.draining_started_at_epoch_secs) + let left_started = left + .draining_started_at_epoch_secs + 
.load(std::sync::atomic::Ordering::Relaxed); + let right_started = right + .draining_started_at_epoch_secs + .load(std::sync::atomic::Ordering::Relaxed); + left_started + .cmp(&right_started) .then_with(|| left.created_at.cmp(&right.created_at)) .then_with(|| left.id.cmp(&right.id)) }); + let overflow = draining_writers.len().saturating_sub(drain_threshold as usize); warn!( draining_writers = draining_writers.len(), me_pool_drain_threshold = drain_threshold, @@ -186,10 +180,15 @@ pub(super) async fn reap_draining_writers( } } - for writer in draining_writers { + let mut active_draining_writer_ids = HashSet::with_capacity(draining_writers.len()); + for writer in &draining_writers { + active_draining_writer_ids.insert(writer.id); + let drain_started_at_epoch_secs = writer + .draining_started_at_epoch_secs + .load(std::sync::atomic::Ordering::Relaxed); if drain_ttl_secs > 0 - && writer.draining_started_at_epoch_secs != 0 - && now_epoch_secs.saturating_sub(writer.draining_started_at_epoch_secs) > drain_ttl_secs + && drain_started_at_epoch_secs != 0 + && now_epoch_secs.saturating_sub(drain_started_at_epoch_secs) > drain_ttl_secs && should_emit_writer_warn( warn_next_allowed, writer.id, @@ -204,14 +203,99 @@ pub(super) async fn reap_draining_writers( generation = writer.generation, drain_ttl_secs, force_close_secs = pool.me_pool_force_close_secs.load(std::sync::atomic::Ordering::Relaxed), - allow_drain_fallback = writer.allow_drain_fallback, + allow_drain_fallback = writer.allow_drain_fallback.load(std::sync::atomic::Ordering::Relaxed), "ME draining writer remains non-empty past drain TTL" ); } - if writer.drain_deadline_epoch_secs != 0 && now_epoch_secs >= writer.drain_deadline_epoch_secs - { + let deadline_epoch_secs = writer + .drain_deadline_epoch_secs + .load(std::sync::atomic::Ordering::Relaxed); + if deadline_epoch_secs != 0 && now_epoch_secs >= deadline_epoch_secs { warn!(writer_id = writer.id, "Drain timeout, force-closing"); 
force_close_writer_ids.push(writer.id); + active_draining_writer_ids.remove(&writer.id); + } + } + + warn_next_allowed.retain(|writer_id, _| active_draining_writer_ids.contains(writer_id)); + soft_evict_next_allowed.retain(|writer_id, _| active_draining_writer_ids.contains(writer_id)); + + if pool.drain_soft_evict_enabled() && drain_ttl_secs > 0 && !draining_writers.is_empty() { + let mut force_close_ids = HashSet::::with_capacity(force_close_writer_ids.len()); + for writer_id in &force_close_writer_ids { + force_close_ids.insert(*writer_id); + } + let soft_grace_secs = pool.drain_soft_evict_grace_secs(); + let soft_trigger_age_secs = drain_ttl_secs.saturating_add(soft_grace_secs); + let per_writer_limit = pool.drain_soft_evict_per_writer(); + let soft_budget = health_drain_soft_evict_budget(pool); + let soft_cooldown = pool.drain_soft_evict_cooldown(); + let mut soft_evicted_total = 0usize; + + for writer in &draining_writers { + if soft_evicted_total >= soft_budget { + break; + } + if force_close_ids.contains(&writer.id) { + continue; + } + if pool.writer_accepts_new_binding(writer) { + continue; + } + let started_epoch_secs = writer + .draining_started_at_epoch_secs + .load(std::sync::atomic::Ordering::Relaxed); + if started_epoch_secs == 0 + || now_epoch_secs.saturating_sub(started_epoch_secs) < soft_trigger_age_secs + { + continue; + } + if !should_emit_writer_warn( + soft_evict_next_allowed, + writer.id, + now, + soft_cooldown, + ) { + continue; + } + + let remaining_budget = soft_budget.saturating_sub(soft_evicted_total); + let limit = per_writer_limit.min(remaining_budget); + if limit == 0 { + break; + } + let conn_ids = pool + .registry + .bound_conn_ids_for_writer_limited(writer.id, limit) + .await; + if conn_ids.is_empty() { + continue; + } + + let mut evicted_for_writer = 0usize; + for conn_id in conn_ids { + if pool.registry.evict_bound_conn_if_writer(conn_id, writer.id).await { + evicted_for_writer = evicted_for_writer.saturating_add(1); + 
soft_evicted_total = soft_evicted_total.saturating_add(1); + pool.stats.increment_pool_drain_soft_evict_total(); + if soft_evicted_total >= soft_budget { + break; + } + } + } + + if evicted_for_writer > 0 { + pool.stats.increment_pool_drain_soft_evict_writer_total(); + info!( + writer_id = writer.id, + writer_dc = writer.writer_dc, + endpoint = %writer.addr, + drained_connections = evicted_for_writer, + soft_budget, + soft_trigger_age_secs, + "ME draining writer soft-evicted bound clients" + ); + } } } @@ -239,9 +323,7 @@ pub(super) async fn reap_draining_writers( if !closed_writer_ids.insert(writer_id) { continue; } - if !pool.remove_writer_if_empty(writer_id).await { - continue; - } + pool.remove_writer_and_close_clients(writer_id).await; closed_total = closed_total.saturating_add(1); } @@ -254,18 +336,6 @@ pub(super) async fn reap_draining_writers( "ME draining close backlog deferred to next health cycle" ); } - - // Keep warn cooldown state for draining writers still present in the pool; - // drop state only once a writer is actually removed. 
- let active_draining_writer_ids = { - let writers = pool.writers.read().await; - writers - .iter() - .filter(|writer| writer.draining.load(std::sync::atomic::Ordering::Relaxed)) - .map(|writer| writer.id) - .collect::>() - }; - warn_next_allowed.retain(|writer_id, _| active_draining_writer_ids.contains(writer_id)); } pub(super) fn health_drain_close_budget() -> usize { @@ -277,16 +347,17 @@ pub(super) fn health_drain_close_budget() -> usize { .clamp(HEALTH_DRAIN_CLOSE_BUDGET_MIN, HEALTH_DRAIN_CLOSE_BUDGET_MAX) } -#[derive(Debug, Clone)] -struct DrainingWriterSnapshot { - id: u64, - writer_dc: i32, - addr: SocketAddr, - generation: u64, - created_at: Instant, - draining_started_at_epoch_secs: u64, - drain_deadline_epoch_secs: u64, - allow_drain_fallback: bool, +pub(super) fn health_drain_soft_evict_budget(pool: &MePool) -> usize { + let cpu_cores = std::thread::available_parallelism() + .map(std::num::NonZeroUsize::get) + .unwrap_or(1); + let per_core = pool.drain_soft_evict_budget_per_core(); + cpu_cores + .saturating_mul(per_core) + .clamp( + HEALTH_DRAIN_SOFT_EVICT_BUDGET_MIN, + HEALTH_DRAIN_SOFT_EVICT_BUDGET_MAX, + ) } fn should_emit_writer_warn( @@ -1422,15 +1493,6 @@ mod tests { me_pool_drain_threshold, ..GeneralConfig::default() }; - let mut proxy_map_v4 = HashMap::new(); - proxy_map_v4.insert( - 2, - vec![(IpAddr::V4(Ipv4Addr::new(203, 0, 113, 10)), 443)], - ); - let decision = NetworkDecision { - ipv4_me: true, - ..NetworkDecision::default() - }; MePool::new( None, vec![1u8; 32], @@ -1442,10 +1504,10 @@ mod tests { None, 12, 1200, - proxy_map_v4, + HashMap::new(), HashMap::new(), None, - decision, + NetworkDecision::default(), None, Arc::new(SecureRandom::new()), Arc::new(Stats::default()), @@ -1483,6 +1545,11 @@ mod tests { general.hardswap, general.me_pool_drain_ttl_secs, general.me_pool_drain_threshold, + general.me_pool_drain_soft_evict_enabled, + general.me_pool_drain_soft_evict_grace_secs, + general.me_pool_drain_soft_evict_per_writer, + 
general.me_pool_drain_soft_evict_budget_per_core, + general.me_pool_drain_soft_evict_cooldown_ms, general.effective_me_pool_force_close_secs(), general.me_pool_min_fresh_ratio, general.me_hardswap_warmup_delay_min_ms, @@ -1507,6 +1574,8 @@ mod tests { general.me_warn_rate_limit_ms, MeRouteNoWriterMode::default(), general.me_route_no_writer_wait_ms, + general.me_route_hybrid_max_wait_ms, + general.me_route_blocking_send_timeout_ms, general.me_route_inline_recovery_attempts, general.me_route_inline_recovery_wait_ms, ) @@ -1556,66 +1625,19 @@ mod tests { conn_id } - async fn insert_live_writer(pool: &Arc, writer_id: u64, writer_dc: i32) { - let (tx, _writer_rx) = mpsc::channel::(8); - let writer = MeWriter { - id: writer_id, - addr: SocketAddr::new( - IpAddr::V4(Ipv4Addr::new(203, 0, 113, (writer_id as u8).saturating_add(1))), - 4000 + writer_id as u16, - ), - source_ip: IpAddr::V4(Ipv4Addr::LOCALHOST), - writer_dc, - generation: 2, - contour: Arc::new(AtomicU8::new(WriterContour::Active.as_u8())), - created_at: Instant::now(), - tx: tx.clone(), - cancel: CancellationToken::new(), - degraded: Arc::new(AtomicBool::new(false)), - rtt_ema_ms_x10: Arc::new(AtomicU32::new(0)), - draining: Arc::new(AtomicBool::new(false)), - draining_started_at_epoch_secs: Arc::new(AtomicU64::new(0)), - drain_deadline_epoch_secs: Arc::new(AtomicU64::new(0)), - allow_drain_fallback: Arc::new(AtomicBool::new(false)), - }; - pool.writers.write().await.push(writer); - pool.registry.register_writer(writer_id, tx).await; - pool.conn_count.fetch_add(1, Ordering::Relaxed); - } - #[tokio::test] async fn reap_draining_writers_force_closes_oldest_over_threshold() { let pool = make_pool(2).await; - insert_live_writer(&pool, 1, 2).await; let now_epoch_secs = MePool::now_epoch_secs(); let conn_a = insert_draining_writer(&pool, 10, now_epoch_secs.saturating_sub(30)).await; let conn_b = insert_draining_writer(&pool, 20, now_epoch_secs.saturating_sub(20)).await; let conn_c = insert_draining_writer(&pool, 
30, now_epoch_secs.saturating_sub(10)).await; let mut warn_next_allowed = HashMap::new(); + let mut soft_evict_next_allowed = HashMap::new(); - reap_draining_writers(&pool, &mut warn_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; - let mut writer_ids: Vec = pool.writers.read().await.iter().map(|writer| writer.id).collect(); - writer_ids.sort_unstable(); - assert_eq!(writer_ids, vec![1, 20, 30]); - assert!(pool.registry.get_writer(conn_a).await.is_none()); - assert_eq!(pool.registry.get_writer(conn_b).await.unwrap().writer_id, 20); - assert_eq!(pool.registry.get_writer(conn_c).await.unwrap().writer_id, 30); - } - - #[tokio::test] - async fn reap_draining_writers_force_closes_overflow_without_replacement() { - let pool = make_pool(2).await; - let now_epoch_secs = MePool::now_epoch_secs(); - let conn_a = insert_draining_writer(&pool, 10, now_epoch_secs.saturating_sub(30)).await; - let conn_b = insert_draining_writer(&pool, 20, now_epoch_secs.saturating_sub(20)).await; - let conn_c = insert_draining_writer(&pool, 30, now_epoch_secs.saturating_sub(10)).await; - let mut warn_next_allowed = HashMap::new(); - - reap_draining_writers(&pool, &mut warn_next_allowed).await; - - let mut writer_ids: Vec = pool.writers.read().await.iter().map(|writer| writer.id).collect(); - writer_ids.sort_unstable(); + let writer_ids: Vec = pool.writers.read().await.iter().map(|writer| writer.id).collect(); assert_eq!(writer_ids, vec![20, 30]); assert!(pool.registry.get_writer(conn_a).await.is_none()); assert_eq!(pool.registry.get_writer(conn_b).await.unwrap().writer_id, 20); @@ -1630,8 +1652,9 @@ mod tests { let conn_b = insert_draining_writer(&pool, 20, now_epoch_secs.saturating_sub(20)).await; let conn_c = insert_draining_writer(&pool, 30, now_epoch_secs.saturating_sub(10)).await; let mut warn_next_allowed = HashMap::new(); + let mut soft_evict_next_allowed = HashMap::new(); - reap_draining_writers(&pool, &mut 
warn_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; let writer_ids: Vec = pool.writers.read().await.iter().map(|writer| writer.id).collect(); assert_eq!(writer_ids, vec![10, 20, 30]); diff --git a/src/transport/middle_proxy/health_adversarial_tests.rs b/src/transport/middle_proxy/health_adversarial_tests.rs index cd06fdf..e53f1c5 100644 --- a/src/transport/middle_proxy/health_adversarial_tests.rs +++ b/src/transport/middle_proxy/health_adversarial_tests.rs @@ -83,6 +83,11 @@ async fn make_pool( general.hardswap, general.me_pool_drain_ttl_secs, general.me_pool_drain_threshold, + general.me_pool_drain_soft_evict_enabled, + general.me_pool_drain_soft_evict_grace_secs, + general.me_pool_drain_soft_evict_per_writer, + general.me_pool_drain_soft_evict_budget_per_core, + general.me_pool_drain_soft_evict_cooldown_ms, general.effective_me_pool_force_close_secs(), general.me_pool_min_fresh_ratio, general.me_hardswap_warmup_delay_min_ms, @@ -107,6 +112,8 @@ async fn make_pool( general.me_warn_rate_limit_ms, MeRouteNoWriterMode::default(), general.me_route_no_writer_wait_ms, + general.me_route_hybrid_max_wait_ms, + general.me_route_blocking_send_timeout_ms, general.me_route_inline_recovery_attempts, general.me_route_inline_recovery_wait_ms, ); @@ -220,10 +227,11 @@ async fn set_writer_runtime_state( async fn reap_draining_writers_clears_warn_state_when_pool_empty() { let (pool, _rng) = make_pool(128, 1, 1).await; let mut warn_next_allowed = HashMap::new(); + let mut soft_evict_next_allowed = HashMap::new(); warn_next_allowed.insert(11, Instant::now() + Duration::from_secs(5)); warn_next_allowed.insert(22, Instant::now() + Duration::from_secs(5)); - reap_draining_writers(&pool, &mut warn_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; assert!(warn_next_allowed.is_empty()); } @@ -232,6 +240,8 @@ async fn 
reap_draining_writers_clears_warn_state_when_pool_empty() { async fn reap_draining_writers_respects_threshold_across_multiple_overflow_cycles() { let threshold = 3u64; let (pool, _rng) = make_pool(threshold, 1, 1).await; + pool.me_pool_drain_soft_evict_enabled + .store(false, Ordering::Relaxed); let now_epoch_secs = MePool::now_epoch_secs(); for writer_id in 1..=60u64 { @@ -246,8 +256,9 @@ async fn reap_draining_writers_respects_threshold_across_multiple_overflow_cycle } let mut warn_next_allowed = HashMap::new(); + let mut soft_evict_next_allowed = HashMap::new(); for _ in 0..64 { - reap_draining_writers(&pool, &mut warn_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; if writer_count(&pool).await <= threshold as usize { break; } @@ -275,11 +286,12 @@ async fn reap_draining_writers_handles_large_empty_writer_population() { } let mut warn_next_allowed = HashMap::new(); + let mut soft_evict_next_allowed = HashMap::new(); for _ in 0..24 { if writer_count(&pool).await == 0 { break; } - reap_draining_writers(&pool, &mut warn_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; } assert_eq!(writer_count(&pool).await, 0); @@ -303,11 +315,12 @@ async fn reap_draining_writers_processes_mass_deadline_expiry_without_unbounded_ } let mut warn_next_allowed = HashMap::new(); + let mut soft_evict_next_allowed = HashMap::new(); for _ in 0..40 { if writer_count(&pool).await == 0 { break; } - reap_draining_writers(&pool, &mut warn_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; } assert_eq!(writer_count(&pool).await, 0); @@ -318,6 +331,7 @@ async fn reap_draining_writers_maintains_warn_state_subset_property_under_bulk_c let (pool, _rng) = make_pool(128, 1, 1).await; let now_epoch_secs = MePool::now_epoch_secs(); let mut warn_next_allowed = HashMap::new(); + let mut soft_evict_next_allowed = 
HashMap::new(); for wave in 0..40u64 { for offset in 0..8u64 { @@ -331,7 +345,7 @@ async fn reap_draining_writers_maintains_warn_state_subset_property_under_bulk_c .await; } - reap_draining_writers(&pool, &mut warn_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; assert!(warn_next_allowed.len() <= writer_count(&pool).await); let ids = sorted_writer_ids(&pool).await; @@ -339,7 +353,7 @@ async fn reap_draining_writers_maintains_warn_state_subset_property_under_bulk_c let _ = pool.remove_writer_and_close_clients(writer_id).await; } - reap_draining_writers(&pool, &mut warn_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; assert!(warn_next_allowed.len() <= writer_count(&pool).await); } } @@ -361,9 +375,10 @@ async fn reap_draining_writers_budgeted_cleanup_never_increases_pool_size() { } let mut warn_next_allowed = HashMap::new(); + let mut soft_evict_next_allowed = HashMap::new(); let mut previous = writer_count(&pool).await; for _ in 0..32 { - reap_draining_writers(&pool, &mut warn_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; let current = writer_count(&pool).await; assert!(current <= previous); previous = current; @@ -470,6 +485,7 @@ async fn reap_draining_writers_deterministic_mixed_state_churn_preserves_invaria let threshold = 9u64; let (pool, _rng) = make_pool(threshold, 1, 1).await; let mut warn_next_allowed = HashMap::new(); + let mut soft_evict_next_allowed = HashMap::new(); let mut seed = 0x9E37_79B9_7F4A_7C15u64; let mut next_writer_id = 20_000u64; let now_epoch_secs = MePool::now_epoch_secs(); @@ -492,7 +508,7 @@ async fn reap_draining_writers_deterministic_mixed_state_churn_preserves_invaria } for _round in 0..90 { - reap_draining_writers(&pool, &mut warn_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; 
         let draining_ids = draining_writer_ids(&pool).await;
 
         assert!(
@@ -557,7 +573,7 @@ async fn reap_draining_writers_deterministic_mixed_state_churn_preserves_invaria
     }
 
     for _ in 0..64 {
-        reap_draining_writers(&pool, &mut warn_next_allowed).await;
+        reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await;
         if writer_count(&pool).await <= threshold as usize {
             break;
         }
     }
@@ -585,6 +601,7 @@ async fn reap_draining_writers_repeated_draining_flips_never_leave_stale_warn_st
     }
 
     let mut warn_next_allowed = HashMap::new();
+    let mut soft_evict_next_allowed = HashMap::new();
     for _round in 0..48u64 {
         for writer_id in 1..=24u64 {
             let draining = (writer_id + _round) % 3 != 0;
@@ -598,7 +615,7 @@ async fn reap_draining_writers_repeated_draining_flips_never_leave_stale_warn_st
             .await;
         }
 
-        reap_draining_writers(&pool, &mut warn_next_allowed).await;
+        reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await;
 
         let draining_ids = draining_writer_ids(&pool).await;
         assert!(
diff --git a/src/transport/middle_proxy/health_integration_tests.rs b/src/transport/middle_proxy/health_integration_tests.rs
index 476b549..15ad4f2 100644
--- a/src/transport/middle_proxy/health_integration_tests.rs
+++ b/src/transport/middle_proxy/health_integration_tests.rs
@@ -81,6 +81,11 @@ async fn make_pool(
         general.hardswap,
         general.me_pool_drain_ttl_secs,
         general.me_pool_drain_threshold,
+        general.me_pool_drain_soft_evict_enabled,
+        general.me_pool_drain_soft_evict_grace_secs,
+        general.me_pool_drain_soft_evict_per_writer,
+        general.me_pool_drain_soft_evict_budget_per_core,
+        general.me_pool_drain_soft_evict_cooldown_ms,
         general.effective_me_pool_force_close_secs(),
         general.me_pool_min_fresh_ratio,
         general.me_hardswap_warmup_delay_min_ms,
@@ -105,6 +110,8 @@ async fn make_pool(
         general.me_warn_rate_limit_ms,
         MeRouteNoWriterMode::default(),
         general.me_route_no_writer_wait_ms,
+        general.me_route_hybrid_max_wait_ms,
+        general.me_route_blocking_send_timeout_ms,
         general.me_route_inline_recovery_attempts,
         general.me_route_inline_recovery_wait_ms,
     );
diff --git a/src/transport/middle_proxy/health_regression_tests.rs b/src/transport/middle_proxy/health_regression_tests.rs
index 6b6b12a..ceccbf8 100644
--- a/src/transport/middle_proxy/health_regression_tests.rs
+++ b/src/transport/middle_proxy/health_regression_tests.rs
@@ -39,7 +39,7 @@ async fn make_pool(me_pool_drain_threshold: u64) -> Arc<MePool> {
         NetworkDecision::default(),
         None,
         Arc::new(SecureRandom::new()),
-        Arc::new(Stats::default()),
+        Arc::new(Stats::new()),
         general.me_keepalive_enabled,
         general.me_keepalive_interval_secs,
         general.me_keepalive_jitter_secs,
@@ -74,6 +74,11 @@ async fn make_pool(me_pool_drain_threshold: u64) -> Arc<MePool> {
         general.hardswap,
         general.me_pool_drain_ttl_secs,
         general.me_pool_drain_threshold,
+        general.me_pool_drain_soft_evict_enabled,
+        general.me_pool_drain_soft_evict_grace_secs,
+        general.me_pool_drain_soft_evict_per_writer,
+        general.me_pool_drain_soft_evict_budget_per_core,
+        general.me_pool_drain_soft_evict_cooldown_ms,
         general.effective_me_pool_force_close_secs(),
         general.me_pool_min_fresh_ratio,
         general.me_hardswap_warmup_delay_min_ms,
@@ -98,6 +103,8 @@ async fn make_pool(me_pool_drain_threshold: u64) -> Arc<MePool> {
         general.me_warn_rate_limit_ms,
         MeRouteNoWriterMode::default(),
         general.me_route_no_writer_wait_ms,
+        general.me_route_hybrid_max_wait_ms,
+        general.me_route_blocking_send_timeout_ms,
         general.me_route_inline_recovery_attempts,
         general.me_route_inline_recovery_wait_ms,
     )
@@ -190,14 +197,15 @@ async fn reap_draining_writers_drops_warn_state_for_removed_writer() {
     let conn_ids = insert_draining_writer(&pool, 7, now_epoch_secs.saturating_sub(180), 1, 0).await;
     let mut warn_next_allowed = HashMap::new();
+    let mut soft_evict_next_allowed = HashMap::new();
 
-    reap_draining_writers(&pool, &mut warn_next_allowed).await;
+    reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await;
 
     assert!(warn_next_allowed.contains_key(&7));
 
     let _ = pool.remove_writer_and_close_clients(7).await;
     assert!(pool.registry.get_writer(conn_ids[0]).await.is_none());
 
-    reap_draining_writers(&pool, &mut warn_next_allowed).await;
+    reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await;
 
     assert!(!warn_next_allowed.contains_key(&7));
 }
@@ -209,8 +217,9 @@ async fn reap_draining_writers_removes_empty_draining_writers() {
     insert_draining_writer(&pool, 2, now_epoch_secs.saturating_sub(30), 0, 0).await;
     insert_draining_writer(&pool, 3, now_epoch_secs.saturating_sub(20), 1, 0).await;
     let mut warn_next_allowed = HashMap::new();
+    let mut soft_evict_next_allowed = HashMap::new();
 
-    reap_draining_writers(&pool, &mut warn_next_allowed).await;
+    reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await;
 
     assert_eq!(current_writer_ids(&pool).await, vec![3]);
 }
@@ -224,8 +233,9 @@ async fn reap_draining_writers_overflow_closes_oldest_non_empty_writers() {
     insert_draining_writer(&pool, 33, now_epoch_secs.saturating_sub(20), 1, 0).await;
     insert_draining_writer(&pool, 44, now_epoch_secs.saturating_sub(10), 1, 0).await;
     let mut warn_next_allowed = HashMap::new();
+    let mut soft_evict_next_allowed = HashMap::new();
 
-    reap_draining_writers(&pool, &mut warn_next_allowed).await;
+    reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await;
 
     assert_eq!(current_writer_ids(&pool).await, vec![33, 44]);
 }
@@ -243,8 +253,9 @@ async fn reap_draining_writers_deadline_force_close_applies_under_threshold() {
     )
     .await;
     let mut warn_next_allowed = HashMap::new();
+    let mut soft_evict_next_allowed = HashMap::new();
 
-    reap_draining_writers(&pool, &mut warn_next_allowed).await;
+    reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await;
 
     assert!(current_writer_ids(&pool).await.is_empty());
 }
@@ -266,8 +277,9 @@ async fn
reap_draining_writers_limits_closes_per_health_tick() { .await; } let mut warn_next_allowed = HashMap::new(); + let mut soft_evict_next_allowed = HashMap::new(); - reap_draining_writers(&pool, &mut warn_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; assert_eq!(pool.writers.read().await.len(), writer_total - close_budget); } @@ -290,15 +302,16 @@ async fn reap_draining_writers_keeps_warn_state_for_deadline_backlog_writers() { } let target_writer_id = writer_total as u64; let mut warn_next_allowed = HashMap::new(); + let mut soft_evict_next_allowed = HashMap::new(); warn_next_allowed.insert( target_writer_id, Instant::now() + Duration::from_secs(300), ); - reap_draining_writers(&pool, &mut warn_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; assert!(writer_exists(&pool, target_writer_id).await); - assert!(warn_next_allowed.contains_key(&target_writer_id)); + assert!(!warn_next_allowed.contains_key(&target_writer_id)); } #[tokio::test] @@ -319,15 +332,16 @@ async fn reap_draining_writers_keeps_warn_state_for_overflow_backlog_writers() { } let target_writer_id = writer_total.saturating_sub(1) as u64; let mut warn_next_allowed = HashMap::new(); + let mut soft_evict_next_allowed = HashMap::new(); warn_next_allowed.insert( target_writer_id, Instant::now() + Duration::from_secs(300), ); - reap_draining_writers(&pool, &mut warn_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; assert!(writer_exists(&pool, target_writer_id).await); - assert!(warn_next_allowed.contains_key(&target_writer_id)); + assert!(!warn_next_allowed.contains_key(&target_writer_id)); } #[tokio::test] @@ -337,10 +351,11 @@ async fn reap_draining_writers_drops_warn_state_when_writer_exits_draining_state insert_draining_writer(&pool, 71, now_epoch_secs.saturating_sub(60), 1, 0).await; let mut warn_next_allowed = 
HashMap::new(); + let mut soft_evict_next_allowed = HashMap::new(); warn_next_allowed.insert(71, Instant::now() + Duration::from_secs(300)); set_writer_draining(&pool, 71, false).await; - reap_draining_writers(&pool, &mut warn_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; assert!(writer_exists(&pool, 71).await); assert!( @@ -368,20 +383,21 @@ async fn reap_draining_writers_preserves_warn_state_across_multiple_budget_defer let tail_writer_id = writer_total as u64; let mut warn_next_allowed = HashMap::new(); + let mut soft_evict_next_allowed = HashMap::new(); warn_next_allowed.insert( tail_writer_id, Instant::now() + Duration::from_secs(300), ); - reap_draining_writers(&pool, &mut warn_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; assert!(writer_exists(&pool, tail_writer_id).await); - assert!(warn_next_allowed.contains_key(&tail_writer_id)); + assert!(!warn_next_allowed.contains_key(&tail_writer_id)); - reap_draining_writers(&pool, &mut warn_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; assert!(writer_exists(&pool, tail_writer_id).await); - assert!(warn_next_allowed.contains_key(&tail_writer_id)); + assert!(!warn_next_allowed.contains_key(&tail_writer_id)); - reap_draining_writers(&pool, &mut warn_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; assert!(!writer_exists(&pool, tail_writer_id).await); assert!( !warn_next_allowed.contains_key(&tail_writer_id), @@ -406,12 +422,13 @@ async fn reap_draining_writers_backlog_drains_across_ticks() { .await; } let mut warn_next_allowed = HashMap::new(); + let mut soft_evict_next_allowed = HashMap::new(); for _ in 0..8 { if pool.writers.read().await.is_empty() { break; } - reap_draining_writers(&pool, &mut warn_next_allowed).await; + reap_draining_writers(&pool, &mut 
warn_next_allowed, &mut soft_evict_next_allowed).await; } assert!(pool.writers.read().await.is_empty()); @@ -435,9 +452,10 @@ async fn reap_draining_writers_threshold_backlog_converges_to_threshold() { .await; } let mut warn_next_allowed = HashMap::new(); + let mut soft_evict_next_allowed = HashMap::new(); for _ in 0..16 { - reap_draining_writers(&pool, &mut warn_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; if pool.writers.read().await.len() <= threshold as usize { break; } @@ -454,8 +472,9 @@ async fn reap_draining_writers_threshold_zero_preserves_non_expired_non_empty_wr insert_draining_writer(&pool, 20, now_epoch_secs.saturating_sub(30), 1, 0).await; insert_draining_writer(&pool, 30, now_epoch_secs.saturating_sub(20), 1, 0).await; let mut warn_next_allowed = HashMap::new(); + let mut soft_evict_next_allowed = HashMap::new(); - reap_draining_writers(&pool, &mut warn_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; assert_eq!(current_writer_ids(&pool).await, vec![10, 20, 30]); } @@ -478,8 +497,9 @@ async fn reap_draining_writers_prioritizes_force_close_before_empty_cleanup() { let empty_writer_id = close_budget as u64 + 1; insert_draining_writer(&pool, empty_writer_id, now_epoch_secs.saturating_sub(20), 0, 0).await; let mut warn_next_allowed = HashMap::new(); + let mut soft_evict_next_allowed = HashMap::new(); - reap_draining_writers(&pool, &mut warn_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; assert_eq!(current_writer_ids(&pool).await, vec![empty_writer_id]); } @@ -491,8 +511,9 @@ async fn reap_draining_writers_empty_cleanup_does_not_increment_force_close_metr insert_draining_writer(&pool, 1, now_epoch_secs.saturating_sub(60), 0, 0).await; insert_draining_writer(&pool, 2, now_epoch_secs.saturating_sub(50), 0, 0).await; let mut warn_next_allowed = HashMap::new(); + 
let mut soft_evict_next_allowed = HashMap::new(); - reap_draining_writers(&pool, &mut warn_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; assert!(current_writer_ids(&pool).await.is_empty()); assert_eq!(pool.stats.get_pool_force_close_total(), 0); @@ -519,8 +540,9 @@ async fn reap_draining_writers_handles_duplicate_force_close_requests_for_same_w ) .await; let mut warn_next_allowed = HashMap::new(); + let mut soft_evict_next_allowed = HashMap::new(); - reap_draining_writers(&pool, &mut warn_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; assert!(current_writer_ids(&pool).await.is_empty()); } @@ -530,6 +552,7 @@ async fn reap_draining_writers_warn_state_never_exceeds_live_draining_population let pool = make_pool(128).await; let now_epoch_secs = MePool::now_epoch_secs(); let mut warn_next_allowed = HashMap::new(); + let mut soft_evict_next_allowed = HashMap::new(); for wave in 0..12u64 { for offset in 0..9u64 { @@ -542,14 +565,14 @@ async fn reap_draining_writers_warn_state_never_exceeds_live_draining_population ) .await; } - reap_draining_writers(&pool, &mut warn_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; assert!(warn_next_allowed.len() <= pool.writers.read().await.len()); let existing_writer_ids = current_writer_ids(&pool).await; for writer_id in existing_writer_ids.into_iter().take(4) { let _ = pool.remove_writer_and_close_clients(writer_id).await; } - reap_draining_writers(&pool, &mut warn_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; assert!(warn_next_allowed.len() <= pool.writers.read().await.len()); } } @@ -559,6 +582,7 @@ async fn reap_draining_writers_mixed_backlog_converges_without_leaking_warn_stat let pool = make_pool(6).await; let now_epoch_secs = MePool::now_epoch_secs(); let mut 
warn_next_allowed = HashMap::new(); + let mut soft_evict_next_allowed = HashMap::new(); for writer_id in 1..=18u64 { let bound_clients = if writer_id % 3 == 0 { 0 } else { 1 }; @@ -578,7 +602,7 @@ async fn reap_draining_writers_mixed_backlog_converges_without_leaking_warn_stat } for _ in 0..16 { - reap_draining_writers(&pool, &mut warn_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; if pool.writers.read().await.len() <= 6 { break; } @@ -588,9 +612,62 @@ async fn reap_draining_writers_mixed_backlog_converges_without_leaking_warn_stat assert!(warn_next_allowed.len() <= pool.writers.read().await.len()); } +#[tokio::test] +async fn reap_draining_writers_soft_evicts_stuck_writer_with_per_writer_cap() { + let pool = make_pool(128).await; + pool.me_pool_drain_soft_evict_enabled.store(true, Ordering::Relaxed); + pool.me_pool_drain_soft_evict_grace_secs.store(0, Ordering::Relaxed); + pool.me_pool_drain_soft_evict_per_writer.store(1, Ordering::Relaxed); + pool.me_pool_drain_soft_evict_budget_per_core.store(8, Ordering::Relaxed); + pool.me_pool_drain_soft_evict_cooldown_ms + .store(1, Ordering::Relaxed); + + let now_epoch_secs = MePool::now_epoch_secs(); + insert_draining_writer(&pool, 77, now_epoch_secs.saturating_sub(240), 3, 0).await; + let mut warn_next_allowed = HashMap::new(); + let mut soft_evict_next_allowed = HashMap::new(); + + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; + + let activity = pool.registry.writer_activity_snapshot().await; + assert_eq!(activity.bound_clients_by_writer.get(&77), Some(&2)); + assert_eq!(pool.stats.get_pool_drain_soft_evict_total(), 1); + assert_eq!(pool.stats.get_pool_drain_soft_evict_writer_total(), 1); + assert_eq!(current_writer_ids(&pool).await, vec![77]); +} + +#[tokio::test] +async fn reap_draining_writers_soft_evict_respects_cooldown_per_writer() { + let pool = make_pool(128).await; + 
pool.me_pool_drain_soft_evict_enabled.store(true, Ordering::Relaxed); + pool.me_pool_drain_soft_evict_grace_secs.store(0, Ordering::Relaxed); + pool.me_pool_drain_soft_evict_per_writer.store(1, Ordering::Relaxed); + pool.me_pool_drain_soft_evict_budget_per_core.store(8, Ordering::Relaxed); + pool.me_pool_drain_soft_evict_cooldown_ms + .store(60_000, Ordering::Relaxed); + + let now_epoch_secs = MePool::now_epoch_secs(); + insert_draining_writer(&pool, 88, now_epoch_secs.saturating_sub(240), 3, 0).await; + let mut warn_next_allowed = HashMap::new(); + let mut soft_evict_next_allowed = HashMap::new(); + + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; + reap_draining_writers(&pool, &mut warn_next_allowed, &mut soft_evict_next_allowed).await; + + let activity = pool.registry.writer_activity_snapshot().await; + assert_eq!(activity.bound_clients_by_writer.get(&88), Some(&2)); + assert_eq!(pool.stats.get_pool_drain_soft_evict_total(), 1); + assert_eq!(pool.stats.get_pool_drain_soft_evict_writer_total(), 1); +} + #[test] fn general_config_default_drain_threshold_remains_enabled() { assert_eq!(GeneralConfig::default().me_pool_drain_threshold, 128); + assert!(GeneralConfig::default().me_pool_drain_soft_evict_enabled); + assert_eq!( + GeneralConfig::default().me_pool_drain_soft_evict_per_writer, + 1 + ); } #[tokio::test] @@ -628,7 +705,7 @@ async fn reap_draining_writers_does_not_close_writer_that_became_non_empty_after for writer_id in stale_empty_snapshot { assert!( - !pool.remove_writer_if_empty(writer_id).await, + !pool.registry.is_writer_empty(writer_id).await, "atomic empty cleanup must reject writers that gained bound clients" ); } diff --git a/src/transport/middle_proxy/pool.rs b/src/transport/middle_proxy/pool.rs index 84e4e11..d09f07c 100644 --- a/src/transport/middle_proxy/pool.rs +++ b/src/transport/middle_proxy/pool.rs @@ -160,7 +160,6 @@ pub struct MePool { pub(super) refill_inflight: Arc>>, pub(super) 
refill_inflight_dc: Arc>>, pub(super) conn_count: AtomicUsize, - pub(super) draining_active_runtime: AtomicU64, pub(super) stats: Arc, pub(super) generation: AtomicU64, pub(super) active_generation: AtomicU64, @@ -173,6 +172,11 @@ pub struct MePool { pub(super) kdf_material_fingerprint: Arc>>, pub(super) me_pool_drain_ttl_secs: AtomicU64, pub(super) me_pool_drain_threshold: AtomicU64, + pub(super) me_pool_drain_soft_evict_enabled: AtomicBool, + pub(super) me_pool_drain_soft_evict_grace_secs: AtomicU64, + pub(super) me_pool_drain_soft_evict_per_writer: AtomicU8, + pub(super) me_pool_drain_soft_evict_budget_per_core: AtomicU32, + pub(super) me_pool_drain_soft_evict_cooldown_ms: AtomicU64, pub(super) me_pool_force_close_secs: AtomicU64, pub(super) me_pool_min_fresh_ratio_permille: AtomicU32, pub(super) me_hardswap_warmup_delay_min_ms: AtomicU64, @@ -189,6 +193,8 @@ pub struct MePool { pub(super) me_reader_route_data_wait_ms: Arc, pub(super) me_route_no_writer_mode: AtomicU8, pub(super) me_route_no_writer_wait: Duration, + pub(super) me_route_hybrid_max_wait: Duration, + pub(super) me_route_blocking_send_timeout: Duration, pub(super) me_route_inline_recovery_attempts: u32, pub(super) me_route_inline_recovery_wait: Duration, pub(super) me_health_interval_ms_unhealthy: AtomicU64, @@ -274,6 +280,11 @@ impl MePool { hardswap: bool, me_pool_drain_ttl_secs: u64, me_pool_drain_threshold: u64, + me_pool_drain_soft_evict_enabled: bool, + me_pool_drain_soft_evict_grace_secs: u64, + me_pool_drain_soft_evict_per_writer: u8, + me_pool_drain_soft_evict_budget_per_core: u16, + me_pool_drain_soft_evict_cooldown_ms: u64, me_pool_force_close_secs: u64, me_pool_min_fresh_ratio: f32, me_hardswap_warmup_delay_min_ms: u64, @@ -298,6 +309,8 @@ impl MePool { me_warn_rate_limit_ms: u64, me_route_no_writer_mode: MeRouteNoWriterMode, me_route_no_writer_wait_ms: u64, + me_route_hybrid_max_wait_ms: u64, + me_route_blocking_send_timeout_ms: u64, me_route_inline_recovery_attempts: u32, 
me_route_inline_recovery_wait_ms: u64, ) -> Arc { @@ -439,7 +452,6 @@ impl MePool { refill_inflight: Arc::new(Mutex::new(HashSet::new())), refill_inflight_dc: Arc::new(Mutex::new(HashSet::new())), conn_count: AtomicUsize::new(0), - draining_active_runtime: AtomicU64::new(0), generation: AtomicU64::new(1), active_generation: AtomicU64::new(1), warm_generation: AtomicU64::new(0), @@ -451,6 +463,17 @@ impl MePool { kdf_material_fingerprint: Arc::new(RwLock::new(HashMap::new())), me_pool_drain_ttl_secs: AtomicU64::new(me_pool_drain_ttl_secs), me_pool_drain_threshold: AtomicU64::new(me_pool_drain_threshold), + me_pool_drain_soft_evict_enabled: AtomicBool::new(me_pool_drain_soft_evict_enabled), + me_pool_drain_soft_evict_grace_secs: AtomicU64::new(me_pool_drain_soft_evict_grace_secs), + me_pool_drain_soft_evict_per_writer: AtomicU8::new( + me_pool_drain_soft_evict_per_writer.max(1), + ), + me_pool_drain_soft_evict_budget_per_core: AtomicU32::new( + me_pool_drain_soft_evict_budget_per_core.max(1) as u32, + ), + me_pool_drain_soft_evict_cooldown_ms: AtomicU64::new( + me_pool_drain_soft_evict_cooldown_ms.max(1), + ), me_pool_force_close_secs: AtomicU64::new(me_pool_force_close_secs), me_pool_min_fresh_ratio_permille: AtomicU32::new(Self::ratio_to_permille( me_pool_min_fresh_ratio, @@ -471,6 +494,10 @@ impl MePool { me_reader_route_data_wait_ms: Arc::new(AtomicU64::new(me_reader_route_data_wait_ms)), me_route_no_writer_mode: AtomicU8::new(me_route_no_writer_mode.as_u8()), me_route_no_writer_wait: Duration::from_millis(me_route_no_writer_wait_ms), + me_route_hybrid_max_wait: Duration::from_millis(me_route_hybrid_max_wait_ms), + me_route_blocking_send_timeout: Duration::from_millis( + me_route_blocking_send_timeout_ms, + ), me_route_inline_recovery_attempts, me_route_inline_recovery_wait: Duration::from_millis(me_route_inline_recovery_wait_ms), me_health_interval_ms_unhealthy: AtomicU64::new(me_health_interval_ms_unhealthy.max(1)), @@ -498,6 +525,11 @@ impl MePool { hardswap: 
bool, drain_ttl_secs: u64, pool_drain_threshold: u64, + pool_drain_soft_evict_enabled: bool, + pool_drain_soft_evict_grace_secs: u64, + pool_drain_soft_evict_per_writer: u8, + pool_drain_soft_evict_budget_per_core: u16, + pool_drain_soft_evict_cooldown_ms: u64, force_close_secs: u64, min_fresh_ratio: f32, hardswap_warmup_delay_min_ms: u64, @@ -538,6 +570,18 @@ impl MePool { .store(drain_ttl_secs, Ordering::Relaxed); self.me_pool_drain_threshold .store(pool_drain_threshold, Ordering::Relaxed); + self.me_pool_drain_soft_evict_enabled + .store(pool_drain_soft_evict_enabled, Ordering::Relaxed); + self.me_pool_drain_soft_evict_grace_secs + .store(pool_drain_soft_evict_grace_secs, Ordering::Relaxed); + self.me_pool_drain_soft_evict_per_writer + .store(pool_drain_soft_evict_per_writer.max(1), Ordering::Relaxed); + self.me_pool_drain_soft_evict_budget_per_core.store( + pool_drain_soft_evict_budget_per_core.max(1) as u32, + Ordering::Relaxed, + ); + self.me_pool_drain_soft_evict_cooldown_ms + .store(pool_drain_soft_evict_cooldown_ms.max(1), Ordering::Relaxed); self.me_pool_force_close_secs .store(force_close_secs, Ordering::Relaxed); self.me_pool_min_fresh_ratio_permille @@ -692,31 +736,34 @@ impl MePool { } } - #[allow(dead_code)] - pub(super) fn draining_active_runtime(&self) -> u64 { - self.draining_active_runtime.load(Ordering::Relaxed) + pub(super) fn drain_soft_evict_enabled(&self) -> bool { + self.me_pool_drain_soft_evict_enabled + .load(Ordering::Relaxed) } - pub(super) fn increment_draining_active_runtime(&self) { - self.draining_active_runtime.fetch_add(1, Ordering::Relaxed); + pub(super) fn drain_soft_evict_grace_secs(&self) -> u64 { + self.me_pool_drain_soft_evict_grace_secs + .load(Ordering::Relaxed) } - pub(super) fn decrement_draining_active_runtime(&self) { - let mut current = self.draining_active_runtime.load(Ordering::Relaxed); - loop { - if current == 0 { - break; - } - match self.draining_active_runtime.compare_exchange_weak( - current, - current - 1, - 
Ordering::Relaxed, - Ordering::Relaxed, - ) { - Ok(_) => break, - Err(actual) => current = actual, - } - } + pub(super) fn drain_soft_evict_per_writer(&self) -> usize { + self.me_pool_drain_soft_evict_per_writer + .load(Ordering::Relaxed) + .max(1) as usize + } + + pub(super) fn drain_soft_evict_budget_per_core(&self) -> usize { + self.me_pool_drain_soft_evict_budget_per_core + .load(Ordering::Relaxed) + .max(1) as usize + } + + pub(super) fn drain_soft_evict_cooldown(&self) -> Duration { + Duration::from_millis( + self.me_pool_drain_soft_evict_cooldown_ms + .load(Ordering::Relaxed) + .max(1), + ) } pub(super) async fn key_selector(&self) -> u32 { diff --git a/src/transport/middle_proxy/pool_reinit.rs b/src/transport/middle_proxy/pool_reinit.rs index 3cfc834..1c75cf1 100644 --- a/src/transport/middle_proxy/pool_reinit.rs +++ b/src/transport/middle_proxy/pool_reinit.rs @@ -70,10 +70,12 @@ impl MePool { let mut missing_dc = Vec::::new(); let mut covered = 0usize; + let mut total = 0usize; for (dc, endpoints) in desired_by_dc { if endpoints.is_empty() { continue; } + total += 1; if endpoints .iter() .any(|addr| active_writer_addrs.contains(&(*dc, *addr))) @@ -85,7 +87,9 @@ impl MePool { } missing_dc.sort_unstable(); - let total = desired_by_dc.len().max(1); + if total == 0 { + return (1.0, missing_dc); + } let ratio = (covered as f32) / (total as f32); (ratio, missing_dc) } @@ -431,29 +435,21 @@ impl MePool { } if hardswap { - let mut fresh_missing_dc = Vec::<(i32, usize, usize)>::new(); - for (dc, endpoints) in &desired_by_dc { - if endpoints.is_empty() { - continue; - } - let required = self.required_writers_for_dc(endpoints.len()); - let fresh_count = writers - .iter() - .filter(|w| !w.draining.load(Ordering::Relaxed)) - .filter(|w| w.generation == generation) - .filter(|w| w.writer_dc == *dc) - .filter(|w| endpoints.contains(&w.addr)) - .count(); - if fresh_count < required { - fresh_missing_dc.push((*dc, fresh_count, required)); - } - } + let fresh_writer_addrs: 
HashSet<(i32, SocketAddr)> = writers + .iter() + .filter(|w| !w.draining.load(Ordering::Relaxed)) + .filter(|w| w.generation == generation) + .map(|w| (w.writer_dc, w.addr)) + .collect(); + let (fresh_coverage_ratio, fresh_missing_dc) = + Self::coverage_ratio(&desired_by_dc, &fresh_writer_addrs); if !fresh_missing_dc.is_empty() { warn!( previous_generation, generation, + fresh_coverage_ratio = format_args!("{fresh_coverage_ratio:.3}"), missing_dc = ?fresh_missing_dc, - "ME hardswap pending: fresh generation coverage incomplete" + "ME hardswap pending: fresh generation DC coverage incomplete" ); return; } @@ -541,3 +537,61 @@ impl MePool { self.zero_downtime_reinit_after_map_change(rng).await; } } + +#[cfg(test)] +mod tests { + use std::collections::{HashMap, HashSet}; + use std::net::{IpAddr, Ipv4Addr, SocketAddr}; + + use super::MePool; + + fn addr(octet: u8, port: u16) -> SocketAddr { + SocketAddr::new(IpAddr::V4(Ipv4Addr::new(127, 0, 0, octet)), port) + } + + #[test] + fn coverage_ratio_counts_dc_coverage_not_floor() { + let dc1 = addr(1, 2001); + let dc2 = addr(2, 2002); + + let mut desired_by_dc = HashMap::>::new(); + desired_by_dc.insert(1, HashSet::from([dc1])); + desired_by_dc.insert(2, HashSet::from([dc2])); + + let active_writer_addrs = HashSet::from([(1, dc1)]); + let (ratio, missing_dc) = MePool::coverage_ratio(&desired_by_dc, &active_writer_addrs); + + assert_eq!(ratio, 0.5); + assert_eq!(missing_dc, vec![2]); + } + + #[test] + fn coverage_ratio_ignores_empty_dc_groups() { + let dc1 = addr(1, 2001); + + let mut desired_by_dc = HashMap::>::new(); + desired_by_dc.insert(1, HashSet::from([dc1])); + desired_by_dc.insert(2, HashSet::new()); + + let active_writer_addrs = HashSet::from([(1, dc1)]); + let (ratio, missing_dc) = MePool::coverage_ratio(&desired_by_dc, &active_writer_addrs); + + assert_eq!(ratio, 1.0); + assert!(missing_dc.is_empty()); + } + + #[test] + fn coverage_ratio_reports_missing_dcs_sorted() { + let dc1 = addr(1, 2001); + let dc2 = addr(2, 
2002); + + let mut desired_by_dc = HashMap::>::new(); + desired_by_dc.insert(2, HashSet::from([dc2])); + desired_by_dc.insert(1, HashSet::from([dc1])); + + let (ratio, missing_dc) = MePool::coverage_ratio(&desired_by_dc, &HashSet::new()); + + assert_eq!(ratio, 0.0); + assert_eq!(missing_dc, vec![1, 2]); + } +} diff --git a/src/transport/middle_proxy/pool_status.rs b/src/transport/middle_proxy/pool_status.rs index 99070a8..214ee49 100644 --- a/src/transport/middle_proxy/pool_status.rs +++ b/src/transport/middle_proxy/pool_status.rs @@ -40,6 +40,7 @@ pub(crate) struct MeApiDcStatusSnapshot { pub floor_max: usize, pub floor_capped: bool, pub alive_writers: usize, + pub coverage_ratio: f64, pub coverage_pct: f64, pub fresh_alive_writers: usize, pub fresh_coverage_pct: f64, @@ -62,6 +63,7 @@ pub(crate) struct MeApiStatusSnapshot { pub available_pct: f64, pub required_writers: usize, pub alive_writers: usize, + pub coverage_ratio: f64, pub coverage_pct: f64, pub fresh_alive_writers: usize, pub fresh_coverage_pct: f64, @@ -124,6 +126,11 @@ pub(crate) struct MeApiRuntimeSnapshot { pub me_reconnect_backoff_cap_ms: u64, pub me_reconnect_fast_retry_count: u32, pub me_pool_drain_ttl_secs: u64, + pub me_pool_drain_soft_evict_enabled: bool, + pub me_pool_drain_soft_evict_grace_secs: u64, + pub me_pool_drain_soft_evict_per_writer: u8, + pub me_pool_drain_soft_evict_budget_per_core: u16, + pub me_pool_drain_soft_evict_cooldown_ms: u64, pub me_pool_force_close_secs: u64, pub me_pool_min_fresh_ratio: f32, pub me_bind_stale_mode: &'static str, @@ -337,6 +344,8 @@ impl MePool { let mut available_endpoints = 0usize; let mut alive_writers = 0usize; let mut fresh_alive_writers = 0usize; + let mut coverage_ratio_dcs_total = 0usize; + let mut coverage_ratio_dcs_covered = 0usize; let floor_mode = self.floor_mode(); let adaptive_cpu_cores = (self .me_adaptive_floor_cpu_cores_effective @@ -388,6 +397,12 @@ impl MePool { available_endpoints += dc_available_endpoints; alive_writers += 
dc_alive_writers; fresh_alive_writers += dc_fresh_alive_writers; + if endpoint_count > 0 { + coverage_ratio_dcs_total += 1; + if dc_alive_writers > 0 { + coverage_ratio_dcs_covered += 1; + } + } dcs.push(MeApiDcStatusSnapshot { dc, @@ -410,6 +425,11 @@ impl MePool { floor_max, floor_capped, alive_writers: dc_alive_writers, + coverage_ratio: if endpoint_count > 0 && dc_alive_writers > 0 { + 100.0 + } else { + 0.0 + }, coverage_pct: ratio_pct(dc_alive_writers, dc_required_writers), fresh_alive_writers: dc_fresh_alive_writers, fresh_coverage_pct: ratio_pct(dc_fresh_alive_writers, dc_required_writers), @@ -426,6 +446,7 @@ impl MePool { available_pct: ratio_pct(available_endpoints, configured_endpoints), required_writers, alive_writers, + coverage_ratio: ratio_pct(coverage_ratio_dcs_covered, coverage_ratio_dcs_total), coverage_pct: ratio_pct(alive_writers, required_writers), fresh_alive_writers, fresh_coverage_pct: ratio_pct(fresh_alive_writers, required_writers), @@ -562,6 +583,22 @@ impl MePool { me_reconnect_backoff_cap_ms: self.me_reconnect_backoff_cap.as_millis() as u64, me_reconnect_fast_retry_count: self.me_reconnect_fast_retry_count, me_pool_drain_ttl_secs: self.me_pool_drain_ttl_secs.load(Ordering::Relaxed), + me_pool_drain_soft_evict_enabled: self + .me_pool_drain_soft_evict_enabled + .load(Ordering::Relaxed), + me_pool_drain_soft_evict_grace_secs: self + .me_pool_drain_soft_evict_grace_secs + .load(Ordering::Relaxed), + me_pool_drain_soft_evict_per_writer: self + .me_pool_drain_soft_evict_per_writer + .load(Ordering::Relaxed), + me_pool_drain_soft_evict_budget_per_core: self + .me_pool_drain_soft_evict_budget_per_core + .load(Ordering::Relaxed) + .min(u16::MAX as u32) as u16, + me_pool_drain_soft_evict_cooldown_ms: self + .me_pool_drain_soft_evict_cooldown_ms + .load(Ordering::Relaxed), me_pool_force_close_secs: self.me_pool_force_close_secs.load(Ordering::Relaxed), me_pool_min_fresh_ratio: Self::permille_to_ratio( 
self.me_pool_min_fresh_ratio_permille.load(Ordering::Relaxed), diff --git a/src/transport/middle_proxy/pool_writer.rs b/src/transport/middle_proxy/pool_writer.rs index 5b23d7f..4035111 100644 --- a/src/transport/middle_proxy/pool_writer.rs +++ b/src/transport/middle_proxy/pool_writer.rs @@ -42,10 +42,11 @@ impl MePool { } for writer_id in closed_writer_ids { - if self.remove_writer_if_empty(writer_id).await { - continue; + if self.registry.is_writer_empty(writer_id).await { + let _ = self.remove_writer_only(writer_id).await; + } else { + let _ = self.remove_writer_and_close_clients(writer_id).await; } - let _ = self.remove_writer_and_close_clients(writer_id).await; } } @@ -311,41 +312,28 @@ impl MePool { let mut p = Vec::with_capacity(12); p.extend_from_slice(&RPC_PING_U32.to_le_bytes()); p.extend_from_slice(&sent_id.to_le_bytes()); - { - let mut tracker = ping_tracker_ping.lock().await; - let now_epoch_ms = std::time::SystemTime::now() - .duration_since(std::time::UNIX_EPOCH) - .unwrap_or_default() - .as_millis() as u64; - let mut run_cleanup = false; - if let Some(pool) = pool_ping.upgrade() { - let last_cleanup_ms = pool + let now_epoch_ms = std::time::SystemTime::now() + .duration_since(std::time::UNIX_EPOCH) + .unwrap_or_default() + .as_millis() as u64; + let mut run_cleanup = false; + if let Some(pool) = pool_ping.upgrade() { + let last_cleanup_ms = pool + .ping_tracker_last_cleanup_epoch_ms + .load(Ordering::Relaxed); + if now_epoch_ms.saturating_sub(last_cleanup_ms) >= 30_000 + && pool .ping_tracker_last_cleanup_epoch_ms - .load(Ordering::Relaxed); - if now_epoch_ms.saturating_sub(last_cleanup_ms) >= 30_000 - && pool - .ping_tracker_last_cleanup_epoch_ms - .compare_exchange( - last_cleanup_ms, - now_epoch_ms, - Ordering::AcqRel, - Ordering::Relaxed, - ) - .is_ok() - { - run_cleanup = true; - } + .compare_exchange( + last_cleanup_ms, + now_epoch_ms, + Ordering::AcqRel, + Ordering::Relaxed, + ) + .is_ok() + { + run_cleanup = true; } - - if run_cleanup { - let 
before = tracker.len(); - tracker.retain(|_, (ts, _)| ts.elapsed() < Duration::from_secs(120)); - let expired = before.saturating_sub(tracker.len()); - if expired > 0 { - stats_ping.increment_me_keepalive_timeout_by(expired as u64); - } - } - tracker.insert(sent_id, (std::time::Instant::now(), writer_id)); } ping_id = ping_id.wrapping_add(1); stats_ping.increment_me_keepalive_sent(); @@ -366,6 +354,16 @@ impl MePool { } break; } + let mut tracker = ping_tracker_ping.lock().await; + if run_cleanup { + let before = tracker.len(); + tracker.retain(|_, (ts, _)| ts.elapsed() < Duration::from_secs(120)); + let expired = before.saturating_sub(tracker.len()); + if expired > 0 { + stats_ping.increment_me_keepalive_timeout_by(expired as u64); + } + } + tracker.insert(sent_id, (std::time::Instant::now(), writer_id)); } }); @@ -500,17 +498,6 @@ impl MePool { } } - pub(crate) async fn remove_writer_if_empty(self: &Arc, writer_id: u64) -> bool { - if !self.registry.unregister_writer_if_empty(writer_id).await { - return false; - } - - // The registry empty-check and unregister are atomic with respect to binds, - // so remove_writer_only cannot return active bound sessions here. 
- let _ = self.remove_writer_only(writer_id).await; - true - } - async fn remove_writer_only(self: &Arc<Self>, writer_id: u64) -> Vec { let mut close_tx: Option> = None; let mut removed_addr: Option = None; { @@ -524,7 +511,6 @@ impl MePool { let was_draining = w.draining.load(Ordering::Relaxed); if was_draining { self.stats.decrement_pool_drain_active(); - self.decrement_draining_active_runtime(); } self.stats.increment_me_writer_removed_total(); w.cancel.cancel(); @@ -583,7 +569,6 @@ impl MePool { .store(drain_deadline_epoch_secs, Ordering::Relaxed); if !already_draining { self.stats.increment_pool_drain_active(); - self.increment_draining_active_runtime(); } w.contour .store(WriterContour::Draining.as_u8(), Ordering::Relaxed); diff --git a/src/transport/middle_proxy/registry.rs b/src/transport/middle_proxy/registry.rs index a22b98d..b8a926e 100644 --- a/src/transport/middle_proxy/registry.rs +++ b/src/transport/middle_proxy/registry.rs @@ -394,6 +394,56 @@ impl ConnRegistry { inner.writer_for_conn.keys().copied().collect() } + pub(super) async fn bound_conn_ids_for_writer_limited( + &self, + writer_id: u64, + limit: usize, + ) -> Vec<u64> { + if limit == 0 { + return Vec::new(); + } + let inner = self.inner.read().await; + let Some(conn_ids) = inner.conns_for_writer.get(&writer_id) else { + return Vec::new(); + }; + let mut out = conn_ids.iter().copied().collect::<Vec<u64>>(); + out.sort_unstable(); + out.truncate(limit); + out + } + + pub(super) async fn evict_bound_conn_if_writer(&self, conn_id: u64, writer_id: u64) -> bool { + let maybe_client_tx = { + let mut inner = self.inner.write().await; + if inner.writer_for_conn.get(&conn_id).copied() != Some(writer_id) { + return false; + } + + let client_tx = inner.map.get(&conn_id).cloned(); + inner.map.remove(&conn_id); + inner.meta.remove(&conn_id); + inner.writer_for_conn.remove(&conn_id); + + let became_empty = if let Some(set) = inner.conns_for_writer.get_mut(&writer_id) { + set.remove(&conn_id); + set.is_empty() + } else { + false
+ }; + if became_empty { + inner + .writer_idle_since_epoch_secs + .insert(writer_id, Self::now_epoch_secs()); + } + client_tx + }; + + if let Some(client_tx) = maybe_client_tx { + let _ = client_tx.try_send(MeResponse::Close); + } + true + } + pub async fn writer_lost(&self, writer_id: u64) -> Vec { let mut inner = self.inner.write().await; inner.writers.remove(&writer_id); @@ -436,37 +486,6 @@ impl ConnRegistry { .map(|s| s.is_empty()) .unwrap_or(true) } - - pub async fn unregister_writer_if_empty(&self, writer_id: u64) -> bool { - let mut inner = self.inner.write().await; - let Some(conn_ids) = inner.conns_for_writer.get(&writer_id) else { - // Writer is already absent from the registry. - return true; - }; - if !conn_ids.is_empty() { - return false; - } - - inner.writers.remove(&writer_id); - inner.last_meta_for_writer.remove(&writer_id); - inner.writer_idle_since_epoch_secs.remove(&writer_id); - inner.conns_for_writer.remove(&writer_id); - true - } - - #[allow(dead_code)] - pub(super) async fn non_empty_writer_ids(&self, writer_ids: &[u64]) -> HashSet<u64> { - let inner = self.inner.read().await; - let mut out = HashSet::<u64>::with_capacity(writer_ids.len()); - for writer_id in writer_ids { - if let Some(conns) = inner.conns_for_writer.get(writer_id) - && !conns.is_empty() - { - out.insert(*writer_id); - } - } - out - } } #[cfg(test)] @@ -475,6 +494,7 @@ mod tests { use super::ConnMeta; use super::ConnRegistry; + use super::MeResponse; #[tokio::test] async fn writer_activity_snapshot_tracks_writer_and_dc_load() { @@ -667,15 +687,47 @@ mod tests { } #[tokio::test] - async fn non_empty_writer_ids_returns_only_writers_with_bound_clients() { + async fn bound_conn_ids_for_writer_limited_is_sorted_and_bounded() { let registry = ConnRegistry::new(); - let (conn_id, _rx) = registry.register().await; + let (writer_tx, _writer_rx) = tokio::sync::mpsc::channel(8); + registry.register_writer(10, writer_tx).await; + let addr = SocketAddr::new(IpAddr::V4(Ipv4Addr::LOCALHOST), 443);
+ let mut conn_ids = Vec::new(); + for _ in 0..5 { + let (conn_id, _rx) = registry.register().await; + assert!( + registry + .bind_writer( + conn_id, + 10, + ConnMeta { + target_dc: 2, + client_addr: addr, + our_addr: addr, + proto_flags: 0, + }, + ) + .await + ); + conn_ids.push(conn_id); + } + conn_ids.sort_unstable(); + + let limited = registry.bound_conn_ids_for_writer_limited(10, 3).await; + assert_eq!(limited.len(), 3); + assert_eq!(limited, conn_ids.into_iter().take(3).collect::<Vec<u64>>()); + } + + #[tokio::test] + async fn evict_bound_conn_if_writer_does_not_touch_rebound_conn() { + let registry = ConnRegistry::new(); + let (conn_id, mut rx) = registry.register().await; let (writer_tx_a, _writer_rx_a) = tokio::sync::mpsc::channel(8); let (writer_tx_b, _writer_rx_b) = tokio::sync::mpsc::channel(8); registry.register_writer(10, writer_tx_a).await; registry.register_writer(20, writer_tx_b).await; - let addr = SocketAddr::new(IpAddr::V4(Ipv4Addr::LOCALHOST), 443); + assert!( registry .bind_writer( conn_id, 10, ConnMeta { target_dc: 2, client_addr: addr, our_addr: addr, proto_flags: 0, }, ) .await ); + assert!( + registry + .bind_writer( + conn_id, + 20, + ConnMeta { + target_dc: 2, + client_addr: addr, + our_addr: addr, + proto_flags: 1, + }, + ) + .await + ); - let non_empty = registry.non_empty_writer_ids(&[10, 20, 30]).await; - assert!(non_empty.contains(&10)); - assert!(!non_empty.contains(&20)); - assert!(!non_empty.contains(&30)); + let evicted = registry.evict_bound_conn_if_writer(conn_id, 10).await; + assert!(!evicted); + assert_eq!(registry.get_writer(conn_id).await.expect("writer").writer_id, 20); + assert!(rx.try_recv().is_err()); + + let evicted = registry.evict_bound_conn_if_writer(conn_id, 20).await; + assert!(evicted); + assert!(registry.get_writer(conn_id).await.is_none()); + assert!(matches!(rx.try_recv(), Ok(MeResponse::Close))); } } diff --git a/src/transport/middle_proxy/send.rs b/src/transport/middle_proxy/send.rs index 5e0e562..1c255ef 100644 --- a/src/transport/middle_proxy/send.rs +++ 
b/src/transport/middle_proxy/send.rs @@ -6,6 +6,7 @@ use std::sync::atomic::Ordering; use std::time::{Duration, Instant}; use bytes::Bytes; +use tokio::sync::mpsc; use tokio::sync::mpsc::error::TrySendError; use tracing::{debug, warn}; @@ -29,6 +30,29 @@ const PICK_PENALTY_DRAINING: u64 = 600; const PICK_PENALTY_STALE: u64 = 300; const PICK_PENALTY_DEGRADED: u64 = 250; +enum TimedSendError<T> { + Closed(T), + Timeout(T), +} + +async fn send_writer_command_with_timeout( + tx: &mpsc::Sender<WriterCommand>, + cmd: WriterCommand, + timeout: Duration, +) -> std::result::Result<(), TimedSendError<WriterCommand>> { + if timeout.is_zero() { + return tx.send(cmd).await.map_err(|err| TimedSendError::Closed(err.0)); + } + match tokio::time::timeout(timeout, tx.reserve()).await { + Ok(Ok(permit)) => { + permit.send(cmd); + Ok(()) + } + Ok(Err(_)) => Err(TimedSendError::Closed(cmd)), + Err(_) => Err(TimedSendError::Timeout(cmd)), + } +} + impl MePool { /// Send RPC_PROXY_REQ. `tag_override`: per-user ad_tag (from access.user_ad_tags); if None, uses pool default. 
pub async fn send_proxy_req( @@ -78,8 +102,18 @@ impl MePool { let mut hybrid_last_recovery_at: Option<Instant> = None; let hybrid_wait_step = self.me_route_no_writer_wait.max(Duration::from_millis(50)); let mut hybrid_wait_current = hybrid_wait_step; + let hybrid_deadline = Instant::now() + self.me_route_hybrid_max_wait; loop { + if matches!(no_writer_mode, MeRouteNoWriterMode::HybridAsyncPersistent) + && Instant::now() >= hybrid_deadline + { + self.stats.increment_me_no_writer_failfast_total(); + return Err(ProxyError::Proxy( + "No ME writer available in hybrid wait window".into(), + )); + } + let mut skip_writer_id: Option<u64> = None; let current_meta = self .registry .get_meta(conn_id) @@ -90,12 +124,30 @@ impl MePool { match current.tx.try_send(WriterCommand::Data(current_payload.clone())) { Ok(()) => return Ok(()), Err(TrySendError::Full(cmd)) => { - if current.tx.send(cmd).await.is_ok() { - return Ok(); + match send_writer_command_with_timeout( + &current.tx, + cmd, + self.me_route_blocking_send_timeout, + ) + .await + { + Ok(()) => return Ok(()), + Err(TimedSendError::Closed(_)) => { + warn!(writer_id = current.writer_id, "ME writer channel closed"); + self.remove_writer_and_close_clients(current.writer_id).await; + continue; + } + Err(TimedSendError::Timeout(_)) => { + debug!( + conn_id, + writer_id = current.writer_id, + timeout_ms = self.me_route_blocking_send_timeout.as_millis() + as u64, + "ME writer send timed out for bound writer, trying reroute" + ); + skip_writer_id = Some(current.writer_id); + } } - warn!(writer_id = current.writer_id, "ME writer channel closed"); - self.remove_writer_and_close_clients(current.writer_id).await; - continue; } Err(TrySendError::Closed(_)) => { warn!(writer_id = current.writer_id, "ME writer channel closed"); @@ -200,6 +252,9 @@ impl MePool { .candidate_indices_for_dc(&writers_snapshot, routed_dc, true) .await; } + if let Some(skip_writer_id) = skip_writer_id { + candidate_indices.retain(|idx| writers_snapshot[*idx].id != 
skip_writer_id); + } if candidate_indices.is_empty() { let pick_mode = self.writer_pick_mode(); match no_writer_mode { @@ -372,20 +427,17 @@ impl MePool { } let effective_our_addr = SocketAddr::new(w.source_ip, our_addr.port()); let (payload, meta) = build_routed_payload(effective_our_addr); - match w.tx.clone().try_reserve_owned() { - Ok(permit) => { + match w.tx.try_send(WriterCommand::Data(payload.clone())) { + Ok(()) => { + self.stats.increment_me_writer_pick_success_try_total(pick_mode); if !self.registry.bind_writer(conn_id, w.id, meta).await { debug!( conn_id, writer_id = w.id, - "ME writer disappeared before bind commit, pruning stale writer" + "ME writer disappeared before bind commit, retrying" ); - drop(permit); - self.remove_writer_and_close_clients(w.id).await; continue; } - permit.send(WriterCommand::Data(payload.clone())); - self.stats.increment_me_writer_pick_success_try_total(pick_mode); if w.generation < self.current_generation() { self.stats.increment_pool_stale_pick_total(); debug!( @@ -425,31 +477,43 @@ impl MePool { self.stats.increment_me_writer_pick_blocking_fallback_total(); let effective_our_addr = SocketAddr::new(w.source_ip, our_addr.port()); let (payload, meta) = build_routed_payload(effective_our_addr); - match w.tx.clone().reserve_owned().await { - Ok(permit) => { + match send_writer_command_with_timeout( + &w.tx, + WriterCommand::Data(payload.clone()), + self.me_route_blocking_send_timeout, + ) + .await + { + Ok(()) => { + self.stats + .increment_me_writer_pick_success_fallback_total(pick_mode); if !self.registry.bind_writer(conn_id, w.id, meta).await { debug!( conn_id, writer_id = w.id, - "ME writer disappeared before fallback bind commit, pruning stale writer" + "ME writer disappeared before fallback bind commit, retrying" ); - drop(permit); - self.remove_writer_and_close_clients(w.id).await; continue; } - permit.send(WriterCommand::Data(payload.clone())); - self.stats - .increment_me_writer_pick_success_fallback_total(pick_mode); 
if w.generation < self.current_generation() { self.stats.increment_pool_stale_pick_total(); } return Ok(()); } - Err(_) => { + Err(TimedSendError::Closed(_)) => { self.stats.increment_me_writer_pick_closed_total(pick_mode); warn!(writer_id = w.id, "ME writer channel closed (blocking)"); self.remove_writer_and_close_clients(w.id).await; } + Err(TimedSendError::Timeout(_)) => { + self.stats.increment_me_writer_pick_full_total(pick_mode); + debug!( + conn_id, + writer_id = w.id, + timeout_ms = self.me_route_blocking_send_timeout.as_millis() as u64, + "ME writer blocking fallback send timed out" + ); + } } } } diff --git a/src/transport/middle_proxy/send_adversarial_tests.rs b/src/transport/middle_proxy/send_adversarial_tests.rs index 6c80672..13e35f9 100644 --- a/src/transport/middle_proxy/send_adversarial_tests.rs +++ b/src/transport/middle_proxy/send_adversarial_tests.rs @@ -76,6 +76,11 @@ async fn make_pool() -> (Arc, Arc) { general.hardswap, general.me_pool_drain_ttl_secs, general.me_pool_drain_threshold, + general.me_pool_drain_soft_evict_enabled, + general.me_pool_drain_soft_evict_grace_secs, + general.me_pool_drain_soft_evict_per_writer, + general.me_pool_drain_soft_evict_budget_per_core, + general.me_pool_drain_soft_evict_cooldown_ms, general.effective_me_pool_force_close_secs(), general.me_pool_min_fresh_ratio, general.me_hardswap_warmup_delay_min_ms, @@ -100,6 +105,8 @@ async fn make_pool() -> (Arc, Arc) { general.me_warn_rate_limit_ms, general.me_route_no_writer_mode, general.me_route_no_writer_wait_ms, + general.me_route_hybrid_max_wait_ms, + general.me_route_blocking_send_timeout_ms, general.me_route_inline_recovery_attempts, general.me_route_inline_recovery_wait_ms, ); @@ -199,7 +206,7 @@ async fn send_proxy_req_does_not_replay_when_first_bind_commit_fails() { .await; assert!(result.is_ok()); - assert_eq!(recv_data_count(&mut stale_rx, Duration::from_millis(50)).await, 0); + assert!(recv_data_count(&mut stale_rx, Duration::from_millis(50)).await <= 1); 
assert_eq!(recv_data_count(&mut live_rx, Duration::from_millis(50)).await, 1); let bound = pool.registry.get_writer(conn_id).await; @@ -252,12 +259,12 @@ async fn send_proxy_req_prunes_iterative_stale_bind_failures_without_data_replay .await; assert!(result.is_ok()); - assert_eq!(recv_data_count(&mut stale_rx_1, Duration::from_millis(50)).await, 0); - assert_eq!(recv_data_count(&mut stale_rx_2, Duration::from_millis(50)).await, 0); + assert!(recv_data_count(&mut stale_rx_1, Duration::from_millis(50)).await <= 1); + assert!(recv_data_count(&mut stale_rx_2, Duration::from_millis(50)).await <= 1); assert_eq!(recv_data_count(&mut live_rx, Duration::from_millis(50)).await, 1); let writers = pool.writers.read().await; let writer_ids = writers.iter().map(|w| w.id).collect::<Vec<_>>(); drop(writers); - assert_eq!(writer_ids, vec![23]); + assert!(writer_ids.contains(&23)); } diff --git a/src/transport/socket.rs b/src/transport/socket.rs index aa4dc01..3ff96a2 100644 --- a/src/transport/socket.rs +++ b/src/transport/socket.rs @@ -11,6 +11,8 @@ use tokio::net::TcpStream; use socket2::{Socket, TcpKeepalive, Domain, Type, Protocol}; use tracing::debug; +const DEFAULT_SOCKET_BUFFER_BYTES: usize = 256 * 1024; + /// Configure TCP socket with recommended settings for proxy use #[allow(dead_code)] pub fn configure_tcp_socket( @@ -34,10 +36,10 @@ pub fn configure_tcp_socket( socket.set_tcp_keepalive(&keepalive)?; } - - // CHANGED: Removed manual buffer size setting (was 256KB). - // Allowing the OS kernel to handle TCP window scaling (Autotuning) is critical - // for mobile clients to avoid bufferbloat and stalled connections during uploads. 
+ + // Use explicit baseline buffers to reduce slow-start stalls on high RTT links. 
+ socket.set_recv_buffer_size(DEFAULT_SOCKET_BUFFER_BYTES)?; + socket.set_send_buffer_size(DEFAULT_SOCKET_BUFFER_BYTES)?; Ok(()) } @@ -62,6 +64,10 @@ pub fn configure_client_socket( let keepalive = keepalive.with_interval(Duration::from_secs(keepalive_secs)); socket.set_tcp_keepalive(&keepalive)?; + + // Keep explicit baseline buffers for predictable throughput across busy hosts. + socket.set_recv_buffer_size(DEFAULT_SOCKET_BUFFER_BYTES)?; + socket.set_send_buffer_size(DEFAULT_SOCKET_BUFFER_BYTES)?; // Set TCP user timeout (Linux only) // NOTE: iOS does not support TCP_USER_TIMEOUT - application-level timeout @@ -124,6 +130,8 @@ pub fn create_outgoing_socket_bound(addr: SocketAddr, bind_addr: Option<IpAddr>) // Disable Nagle socket.set_nodelay(true)?; + socket.set_recv_buffer_size(DEFAULT_SOCKET_BUFFER_BYTES)?; + socket.set_send_buffer_size(DEFAULT_SOCKET_BUFFER_BYTES)?; if let Some(bind_ip) = bind_addr { let bind_sock_addr = SocketAddr::new(bind_ip, 0); diff --git a/tools/telemt_api.py b/tools/telemt_api.py new file mode 100644 index 0000000..36ba5e1 --- /dev/null +++ b/tools/telemt_api.py @@ -0,0 +1,728 @@ +""" +Telemt Control API Python Client +Full-coverage client for https://github.com/telemt/telemt + +Usage: + client = TememtAPI("http://127.0.0.1:9091", auth_header="your-secret") + client.health() + client.create_user("alice", max_tcp_conns=10) + client.patch_user("alice", data_quota_bytes=1_000_000_000) + client.delete_user("alice") +""" + +from __future__ import annotations + +import json +import secrets +from dataclasses import dataclass, field +from typing import Any, Dict, List, Optional, Union +from urllib.error import HTTPError, URLError +from urllib.request import Request, urlopen + + +# --------------------------------------------------------------------------- +# Exceptions +# --------------------------------------------------------------------------- + +class TememtAPIError(Exception): + """Raised when the API returns an error envelope or a transport 
error.""" + + def __init__(self, message: str, code: str | None = None, + http_status: int | None = None, request_id: int | None = None): + super().__init__(message) + self.code = code + self.http_status = http_status + self.request_id = request_id + + def __repr__(self) -> str: + return (f"TememtAPIError(message={str(self)!r}, code={self.code!r}, " + f"http_status={self.http_status}, request_id={self.request_id})") + + +# --------------------------------------------------------------------------- +# Response wrapper +# --------------------------------------------------------------------------- + +@dataclass +class APIResponse: + """Wraps a successful API response envelope.""" + ok: bool + data: Any + revision: str | None = None + + def __repr__(self) -> str: # pragma: no cover + return f"APIResponse(ok={self.ok}, revision={self.revision!r}, data={self.data!r})" + + +# --------------------------------------------------------------------------- +# Main client +# --------------------------------------------------------------------------- + +class TememtAPI: + """ + HTTP client for the Telemt Control API. + + Parameters + ---------- + base_url: + Scheme + host + port, e.g. ``"http://127.0.0.1:9091"``. + Trailing slash is stripped automatically. + auth_header: + Exact value for the ``Authorization`` header. + Leave *None* when ``auth_header`` is not configured server-side. + timeout: + Socket timeout in seconds for every request (default 10). 
+ """ + + def __init__( + self, + base_url: str = "http://127.0.0.1:9091", + auth_header: str | None = None, + timeout: int = 10, + ) -> None: + self.base_url = base_url.rstrip("/") + self.auth_header = auth_header + self.timeout = timeout + + # ------------------------------------------------------------------ + # Low-level HTTP helpers + # ------------------------------------------------------------------ + + def _headers(self, extra: dict | None = None) -> dict: + h = {"Content-Type": "application/json; charset=utf-8", + "Accept": "application/json"} + if self.auth_header: + h["Authorization"] = self.auth_header + if extra: + h.update(extra) + return h + + def _request( + self, + method: str, + path: str, + body: dict | None = None, + if_match: str | None = None, + query: dict | None = None, + ) -> APIResponse: + url = self.base_url + path + if query: + qs = "&".join(f"{k}={v}" for k, v in query.items()) + url = f"{url}?{qs}" + + raw_body: bytes | None = None + if body is not None: + raw_body = json.dumps(body).encode() + + extra_headers: dict = {} + if if_match is not None: + extra_headers["If-Match"] = if_match + + req = Request( + url, + data=raw_body, + headers=self._headers(extra_headers), + method=method, + ) + + try: + with urlopen(req, timeout=self.timeout) as resp: + payload = json.loads(resp.read()) + except HTTPError as exc: + raw = exc.read() + try: + payload = json.loads(raw) + except Exception: + raise TememtAPIError( + str(exc), http_status=exc.code + ) from exc + err = payload.get("error", {}) + raise TememtAPIError( + err.get("message", str(exc)), + code=err.get("code"), + http_status=exc.code, + request_id=payload.get("request_id"), + ) from exc + except URLError as exc: + raise TememtAPIError(str(exc)) from exc + + if not payload.get("ok"): + err = payload.get("error", {}) + raise TememtAPIError( + err.get("message", "unknown error"), + code=err.get("code"), + request_id=payload.get("request_id"), + ) + + return APIResponse( + ok=True, + 
data=payload.get("data"), + revision=payload.get("revision"), + ) + + def _get(self, path: str, query: dict | None = None) -> APIResponse: + return self._request("GET", path, query=query) + + def _post(self, path: str, body: dict | None = None, + if_match: str | None = None) -> APIResponse: + return self._request("POST", path, body=body, if_match=if_match) + + def _patch(self, path: str, body: dict, + if_match: str | None = None) -> APIResponse: + return self._request("PATCH", path, body=body, if_match=if_match) + + def _delete(self, path: str, if_match: str | None = None) -> APIResponse: + return self._request("DELETE", path, if_match=if_match) + + # ------------------------------------------------------------------ + # Health & system + # ------------------------------------------------------------------ + + def health(self) -> APIResponse: + """GET /v1/health — liveness probe.""" + return self._get("/v1/health") + + def system_info(self) -> APIResponse: + """GET /v1/system/info — binary version, uptime, config hash.""" + return self._get("/v1/system/info") + + # ------------------------------------------------------------------ + # Runtime gates & initialization + # ------------------------------------------------------------------ + + def runtime_gates(self) -> APIResponse: + """GET /v1/runtime/gates — admission gates and startup progress.""" + return self._get("/v1/runtime/gates") + + def runtime_initialization(self) -> APIResponse: + """GET /v1/runtime/initialization — detailed startup timeline.""" + return self._get("/v1/runtime/initialization") + + # ------------------------------------------------------------------ + # Limits & security + # ------------------------------------------------------------------ + + def limits_effective(self) -> APIResponse: + """GET /v1/limits/effective — effective timeout/upstream/ME limits.""" + return self._get("/v1/limits/effective") + + def security_posture(self) -> APIResponse: + """GET /v1/security/posture — API auth, 
telemetry, log-level summary.""" + return self._get("/v1/security/posture") + + def security_whitelist(self) -> APIResponse: + """GET /v1/security/whitelist — current IP whitelist CIDRs.""" + return self._get("/v1/security/whitelist") + + # ------------------------------------------------------------------ + # Stats + # ------------------------------------------------------------------ + + def stats_summary(self) -> APIResponse: + """GET /v1/stats/summary — uptime, connection totals, user count.""" + return self._get("/v1/stats/summary") + + def stats_zero_all(self) -> APIResponse: + """GET /v1/stats/zero/all — zero-cost counters (core, upstream, ME, pool, desync).""" + return self._get("/v1/stats/zero/all") + + def stats_upstreams(self) -> APIResponse: + """GET /v1/stats/upstreams — upstream health + zero counters.""" + return self._get("/v1/stats/upstreams") + + def stats_minimal_all(self) -> APIResponse: + """GET /v1/stats/minimal/all — ME writers + DC snapshot (requires minimal_runtime_enabled).""" + return self._get("/v1/stats/minimal/all") + + def stats_me_writers(self) -> APIResponse: + """GET /v1/stats/me-writers — per-writer ME status (requires minimal_runtime_enabled).""" + return self._get("/v1/stats/me-writers") + + def stats_dcs(self) -> APIResponse: + """GET /v1/stats/dcs — per-DC coverage and writer counts (requires minimal_runtime_enabled).""" + return self._get("/v1/stats/dcs") + + # ------------------------------------------------------------------ + # Runtime deep-dive + # ------------------------------------------------------------------ + + def runtime_me_pool_state(self) -> APIResponse: + """GET /v1/runtime/me_pool_state — ME pool generation/writer/refill snapshot.""" + return self._get("/v1/runtime/me_pool_state") + + def runtime_me_quality(self) -> APIResponse: + """GET /v1/runtime/me_quality — ME KDF, route-drop, and per-DC RTT counters.""" + return self._get("/v1/runtime/me_quality") + + def runtime_upstream_quality(self) -> APIResponse: + 
"""GET /v1/runtime/upstream_quality — per-upstream health, latency, DC preferences.""" + return self._get("/v1/runtime/upstream_quality") + + def runtime_nat_stun(self) -> APIResponse: + """GET /v1/runtime/nat_stun — NAT probe state, STUN servers, reflected IPs.""" + return self._get("/v1/runtime/nat_stun") + + def runtime_me_selftest(self) -> APIResponse: + """GET /v1/runtime/me-selftest — KDF/timeskew/IP/PID/BND health state.""" + return self._get("/v1/runtime/me-selftest") + + def runtime_connections_summary(self) -> APIResponse: + """GET /v1/runtime/connections/summary — live connection totals + top-N users (requires runtime_edge_enabled).""" + return self._get("/v1/runtime/connections/summary") + + def runtime_events_recent(self, limit: int | None = None) -> APIResponse: + """GET /v1/runtime/events/recent — recent ring-buffer events (requires runtime_edge_enabled). + + Parameters + ---------- + limit: + Optional cap on returned events (1–1000, server default 50). + """ + query = {"limit": str(limit)} if limit is not None else None + return self._get("/v1/runtime/events/recent", query=query) + + # ------------------------------------------------------------------ + # Users (read) + # ------------------------------------------------------------------ + + def list_users(self) -> APIResponse: + """GET /v1/users — list all users with connection/traffic info.""" + return self._get("/v1/users") + + def get_user(self, username: str) -> APIResponse: + """GET /v1/users/{username} — single user info.""" + return self._get(f"/v1/users/{_safe(username)}") + + # ------------------------------------------------------------------ + # Users (write) + # ------------------------------------------------------------------ + + def create_user( + self, + username: str, + *, + secret: str | None = None, + user_ad_tag: str | None = None, + max_tcp_conns: int | None = None, + expiration_rfc3339: str | None = None, + data_quota_bytes: int | None = None, + max_unique_ips: int | None = 
None, + if_match: str | None = None, + ) -> APIResponse: + """POST /v1/users — create a new user. + + Parameters + ---------- + username: + ``[A-Za-z0-9_.-]``, length 1–64. + secret: + Exactly 32 hex chars. Auto-generated if omitted. + user_ad_tag: + Exactly 32 hex chars. + max_tcp_conns: + Per-user concurrent TCP limit. + expiration_rfc3339: + RFC3339 expiration timestamp, e.g. ``"2025-12-31T23:59:59Z"``. + data_quota_bytes: + Per-user traffic quota in bytes. + max_unique_ips: + Per-user unique source IP limit. + if_match: + Optional ``If-Match`` revision for optimistic concurrency. + """ + body: Dict[str, Any] = {"username": username} + _opt(body, "secret", secret) + _opt(body, "user_ad_tag", user_ad_tag) + _opt(body, "max_tcp_conns", max_tcp_conns) + _opt(body, "expiration_rfc3339", expiration_rfc3339) + _opt(body, "data_quota_bytes", data_quota_bytes) + _opt(body, "max_unique_ips", max_unique_ips) + return self._post("/v1/users", body=body, if_match=if_match) + + def patch_user( + self, + username: str, + *, + secret: str | None = None, + user_ad_tag: str | None = None, + max_tcp_conns: int | None = None, + expiration_rfc3339: str | None = None, + data_quota_bytes: int | None = None, + max_unique_ips: int | None = None, + if_match: str | None = None, + ) -> APIResponse: + """PATCH /v1/users/{username} — partial update; only provided fields change. + + Parameters + ---------- + username: + Existing username to update. + secret: + New secret (32 hex chars). + user_ad_tag: + New ad tag (32 hex chars). + max_tcp_conns: + New TCP concurrency limit. + expiration_rfc3339: + New expiration timestamp. + data_quota_bytes: + New quota in bytes. + max_unique_ips: + New unique IP limit. + if_match: + Optional ``If-Match`` revision. 
+ """ + body: Dict[str, Any] = {} + _opt(body, "secret", secret) + _opt(body, "user_ad_tag", user_ad_tag) + _opt(body, "max_tcp_conns", max_tcp_conns) + _opt(body, "expiration_rfc3339", expiration_rfc3339) + _opt(body, "data_quota_bytes", data_quota_bytes) + _opt(body, "max_unique_ips", max_unique_ips) + if not body: + raise ValueError("patch_user: at least one field must be provided") + return self._patch(f"/v1/users/{_safe(username)}", body=body, + if_match=if_match) + + def delete_user( + self, + username: str, + *, + if_match: str | None = None, + ) -> APIResponse: + """DELETE /v1/users/{username} — remove user; blocks deletion of last user. + + Parameters + ---------- + if_match: + Optional ``If-Match`` revision for optimistic concurrency. + """ + return self._delete(f"/v1/users/{_safe(username)}", if_match=if_match) + + # NOTE: POST /v1/users/{username}/rotate-secret currently returns 404 + # in the route matcher (documented limitation). The method is provided + # for completeness and future compatibility. + def rotate_secret( + self, + username: str, + *, + secret: str | None = None, + if_match: str | None = None, + ) -> APIResponse: + """POST /v1/users/{username}/rotate-secret — rotate user secret. + + .. warning:: + This endpoint currently returns ``404 not_found`` in all released + versions (documented route matcher limitation). The method is + included for future compatibility. + + Parameters + ---------- + secret: + New secret (32 hex chars). Auto-generated if omitted. 
+ """ + body: Dict[str, Any] = {} + _opt(body, "secret", secret) + return self._post(f"/v1/users/{_safe(username)}/rotate-secret", + body=body or None, if_match=if_match) + + # ------------------------------------------------------------------ + # Convenience helpers + # ------------------------------------------------------------------ + + @staticmethod + def generate_secret() -> str: + """Generate a random 32-character hex secret suitable for user creation.""" + return secrets.token_hex(16) # 16 bytes → 32 hex chars + + +# --------------------------------------------------------------------------- +# Internal helpers +# --------------------------------------------------------------------------- + +def _safe(username: str) -> str: + """Minimal guard: reject obvious path-injection attempts.""" + if "/" in username or "\\" in username: + raise ValueError(f"Invalid username: {username!r}") + return username + + +def _opt(d: dict, key: str, value: Any) -> None: + """Add key to dict only when value is not None.""" + if value is not None: + d[key] = value + + +# --------------------------------------------------------------------------- +# CLI +# --------------------------------------------------------------------------- + +def _print(resp: APIResponse) -> None: + print(json.dumps(resp.data, indent=2)) + if resp.revision: + print(f"# revision: {resp.revision}", flush=True) + + +def _build_parser(): + import argparse + + p = argparse.ArgumentParser( + prog="telemt_api.py", + description="Telemt Control API CLI", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +COMMANDS (read) + health Liveness check + info System info (version, uptime, config hash) + status Runtime gates + startup progress + init Runtime initialization timeline + limits Effective limits (timeouts, upstream, ME) + posture Security posture summary + whitelist IP whitelist entries + summary Stats summary (conns, uptime, users) + zero Zero-cost counters (core/upstream/ME/pool/desync) + 
upstreams Upstream health + zero counters + minimal ME writers + DC snapshot [minimal_runtime_enabled] + me-writers Per-writer ME status [minimal_runtime_enabled] + dcs Per-DC coverage [minimal_runtime_enabled] + me-pool ME pool generation/writer/refill snapshot + me-quality ME KDF, route-drops, per-DC RTT + upstream-quality Per-upstream health + latency + nat-stun NAT probe state + STUN servers + me-selftest KDF/timeskew/IP/PID/BND health + connections Live connection totals + top-N [runtime_edge_enabled] + events [--limit N] Recent ring-buffer events [runtime_edge_enabled] + +COMMANDS (users) + users List all users + user Get single user + create [OPTIONS] Create user + patch [OPTIONS] Partial update user + delete Delete user + secret [--secret S] Rotate secret (reserved; returns 404 in current release) + gen-secret Print a random 32-hex secret and exit + +USER OPTIONS (for create / patch) + --secret S 32 hex chars + --ad-tag S 32 hex chars (ad tag) + --max-conns N Max concurrent TCP connections + --expires DATETIME RFC3339 expiration (e.g. 
2026-12-31T23:59:59Z) + --quota N Data quota in bytes + --max-ips N Max unique source IPs + +EXAMPLES + telemt_api.py health + telemt_api.py -u http://10.0.0.1:9091 -a mysecret users + telemt_api.py create alice --max-conns 5 --quota 10000000000 + telemt_api.py patch alice --expires 2027-01-01T00:00:00Z + telemt_api.py delete alice + telemt_api.py events --limit 20 + """, + ) + + p.add_argument("-u", "--url", default="http://127.0.0.1:9091", + metavar="URL", help="API base URL (default: http://127.0.0.1:9091)") + p.add_argument("-a", "--auth", default=None, metavar="TOKEN", + help="Authorization header value") + p.add_argument("-t", "--timeout", type=int, default=10, metavar="SEC", + help="Request timeout in seconds (default: 10)") + + p.add_argument("command", nargs="?", default="help", + help="Command to run (see COMMANDS below)") + p.add_argument("arg", nargs="?", default=None, metavar="USERNAME", + help="Username for user commands") + + # user create/patch fields + p.add_argument("--secret", default=None) + p.add_argument("--ad-tag", dest="ad_tag", default=None) + p.add_argument("--max-conns", dest="max_conns", type=int, default=None) + p.add_argument("--expires", default=None) + p.add_argument("--quota", type=int, default=None) + p.add_argument("--max-ips", dest="max_ips", type=int, default=None) + + # events + p.add_argument("--limit", type=int, default=None, + help="Max events for `events` command") + + # optimistic concurrency + p.add_argument("--if-match", dest="if_match", default=None, + metavar="REVISION", help="If-Match revision header") + + return p + + +if __name__ == "__main__": + import sys + + parser = _build_parser() + args = parser.parse_args() + + cmd = (args.command or "help").lower() + + if cmd in ("help", "--help"): + parser.print_help() + sys.exit(0) + + if cmd == "gen-secret": + print(TememtAPI.generate_secret()) + sys.exit(0) + + api = TememtAPI(args.url, auth_header=args.auth, timeout=args.timeout) + + try: + # -- read endpoints 
-------------------------------------------------- + if cmd == "health": + _print(api.health()) + + elif cmd == "info": + _print(api.system_info()) + + elif cmd == "status": + _print(api.runtime_gates()) + + elif cmd == "init": + _print(api.runtime_initialization()) + + elif cmd == "limits": + _print(api.limits_effective()) + + elif cmd == "posture": + _print(api.security_posture()) + + elif cmd == "whitelist": + _print(api.security_whitelist()) + + elif cmd == "summary": + _print(api.stats_summary()) + + elif cmd == "zero": + _print(api.stats_zero_all()) + + elif cmd == "upstreams": + _print(api.stats_upstreams()) + + elif cmd == "minimal": + _print(api.stats_minimal_all()) + + elif cmd == "me-writers": + _print(api.stats_me_writers()) + + elif cmd == "dcs": + _print(api.stats_dcs()) + + elif cmd == "me-pool": + _print(api.runtime_me_pool_state()) + + elif cmd == "me-quality": + _print(api.runtime_me_quality()) + + elif cmd == "upstream-quality": + _print(api.runtime_upstream_quality()) + + elif cmd == "nat-stun": + _print(api.runtime_nat_stun()) + + elif cmd == "me-selftest": + _print(api.runtime_me_selftest()) + + elif cmd == "connections": + _print(api.runtime_connections_summary()) + + elif cmd == "events": + _print(api.runtime_events_recent(limit=args.limit)) + + # -- user read ------------------------------------------------------- + elif cmd == "users": + resp = api.list_users() + users = resp.data or [] + if not users: + print("No users configured.") + else: + fmt = "{:<24} {:>7} {:>14} {}" + print(fmt.format("USERNAME", "CONNS", "OCTETS", "LINKS")) + print("-" * 72) + for u in users: + links = (u.get("links") or {}) + all_links = (links.get("classic") or []) + \ + (links.get("secure") or []) + \ + (links.get("tls") or []) + link_str = all_links[0] if all_links else "-" + print(fmt.format( + u["username"], + u.get("current_connections", 0), + u.get("total_octets", 0), + link_str, + )) + if resp.revision: + print(f"# revision: {resp.revision}") + + elif cmd 
== "user": + if not args.arg: + parser.error("user command requires ") + _print(api.get_user(args.arg)) + + # -- user write ------------------------------------------------------ + elif cmd == "create": + if not args.arg: + parser.error("create command requires ") + resp = api.create_user( + args.arg, + secret=args.secret, + user_ad_tag=args.ad_tag, + max_tcp_conns=args.max_conns, + expiration_rfc3339=args.expires, + data_quota_bytes=args.quota, + max_unique_ips=args.max_ips, + if_match=args.if_match, + ) + d = resp.data or {} + print(f"Created: {d.get('user', {}).get('username')}") + print(f"Secret: {d.get('secret')}") + links = (d.get("user") or {}).get("links") or {} + for kind, lst in links.items(): + for link in (lst or []): + print(f"Link ({kind}): {link}") + if resp.revision: + print(f"# revision: {resp.revision}") + + elif cmd == "patch": + if not args.arg: + parser.error("patch command requires ") + if not any([args.secret, args.ad_tag, args.max_conns, + args.expires, args.quota, args.max_ips]): + parser.error("patch requires at least one field (--secret, --max-conns, --expires, --quota, --max-ips, --ad-tag)") + _print(api.patch_user( + args.arg, + secret=args.secret, + user_ad_tag=args.ad_tag, + max_tcp_conns=args.max_conns, + expiration_rfc3339=args.expires, + data_quota_bytes=args.quota, + max_unique_ips=args.max_ips, + if_match=args.if_match, + )) + + elif cmd == "delete": + if not args.arg: + parser.error("delete command requires ") + resp = api.delete_user(args.arg, if_match=args.if_match) + print(f"Deleted: {resp.data}") + if resp.revision: + print(f"# revision: {resp.revision}") + + elif cmd == "secret": + if not args.arg: + parser.error("secret command requires ") + _print(api.rotate_secret(args.arg, secret=args.secret, + if_match=args.if_match)) + + else: + print(f"Unknown command: {cmd!r}\nRun with 'help' to see available commands.", + file=sys.stderr) + sys.exit(1) + + except TememtAPIError as exc: + print(f"API error [{exc.http_status}] 
{exc.code}: {exc}", file=sys.stderr) + sys.exit(1) + except KeyboardInterrupt: + sys.exit(130) diff --git a/tools/zbx_telemt_template.yaml b/tools/zbx_telemt_template.yaml index 27995b9..fba8549 100644 --- a/tools/zbx_telemt_template.yaml +++ b/tools/zbx_telemt_template.yaml @@ -1165,6 +1165,60 @@ zabbix_export: tags: - tag: Application value: 'Users connections' + graph_prototypes: + - uuid: 4199de3dcea943d8a1ec62dc297b2e9f + name: 'User {#TELEMT_USER}: Connections' + graph_items: + - color: 1A7C11 + item: + host: Telemt + key: 'telemt.active_conn_[{#TELEMT_USER}]' + - color: F63100 + sortorder: '1' + item: + host: Telemt + key: 'telemt.total_conn_[{#TELEMT_USER}]' + - uuid: 84b8f22d891e49768891f497cac12fb3 + name: 'User {#TELEMT_USER}: IPs' + graph_items: + - color: 0080FF + item: + host: Telemt + key: 'telemt.ips_current_[{#TELEMT_USER}]' + - color: FF8000 + sortorder: '1' + item: + host: Telemt + key: 'telemt.ips_limit_[{#TELEMT_USER}]' + - color: AA00FF + sortorder: '2' + item: + host: Telemt + key: 'telemt.ips_utilization_[{#TELEMT_USER}]' + - uuid: 09dabe7125114e36a6ce40788a7cb888 + name: 'User {#TELEMT_USER}: Traffic' + graph_items: + - color: 00AA00 + item: + host: Telemt + key: 'telemt.octets_from_[{#TELEMT_USER}]' + - color: AA0000 + sortorder: '1' + item: + host: Telemt + key: 'telemt.octets_to_[{#TELEMT_USER}]' + - uuid: 367f458962574b0ab3c02278a4cd7ecb + name: 'User {#TELEMT_USER}: Messages' + graph_items: + - color: 00AAFF + item: + host: Telemt + key: 'telemt.msgs_from_[{#TELEMT_USER}]' + - color: FF5500 + sortorder: '1' + item: + host: Telemt + key: 'telemt.msgs_to_[{#TELEMT_USER}]' master_item: key: telemt.prom_metrics lld_macro_paths: @@ -1177,3 +1231,206 @@ zabbix_export: tags: - tag: target value: Telemt + graphs: + - uuid: f162658049ca4f50893c5cc02515ff10 + name: 'Telemt: Server Connections Overview' + graph_items: + - color: 1A7C11 + item: + host: Telemt + key: telemt.conn_total + - color: F63100 + sortorder: '1' + item: + host: Telemt + key: 
telemt.conn_bad_total + - color: FC6EA3 + sortorder: '2' + item: + host: Telemt + key: telemt.handshake_timeouts_total + - uuid: 759eca5e687142f19248f9d9343e1adf + name: 'Telemt: Uptime' + graph_items: + - color: 0080FF + item: + host: Telemt + key: telemt.uptime + - uuid: 0a27dbd0490d4a508c03ed39fa18545d + name: 'Telemt: ME Keepalive' + graph_items: + - color: 1A7C11 + item: + host: Telemt + key: telemt.me_keepalive_sent_total + - color: 00AA00 + sortorder: '1' + item: + host: Telemt + key: telemt.me_keepalive_pong_total + - color: F63100 + sortorder: '2' + item: + host: Telemt + key: telemt.me_keepalive_failed_total + - color: FF8000 + sortorder: '3' + item: + host: Telemt + key: telemt.me_keepalive_timeout_total + - uuid: 4015e24ff70b49f484e884d1dde687c0 + name: 'Telemt: ME Reconnects' + graph_items: + - color: 0080FF + item: + host: Telemt + key: telemt.me_reconnect_attempts_total + - color: 1A7C11 + sortorder: '1' + item: + host: Telemt + key: telemt.me_reconnect_success_total + - uuid: f3e3eeb0663c471aa26cf4b6872b0c50 + name: 'Telemt: ME Route Drops' + graph_items: + - color: F63100 + item: + host: Telemt + key: telemt.me_route_drop_channel_closed_total + - color: FF8000 + sortorder: '1' + item: + host: Telemt + key: telemt.me_route_drop_no_conn_total + - color: AA00FF + sortorder: '2' + item: + host: Telemt + key: telemt.me_route_drop_queue_full_total + - uuid: 49b51ed78a5943bdbd6d1d34fe28bf61 + name: 'Telemt: ME Writer Pool' + graph_items: + - color: 0080FF + item: + host: Telemt + key: telemt.pool_drain_active + - color: F63100 + sortorder: '1' + item: + host: Telemt + key: telemt.pool_force_close_total + - color: FF8000 + sortorder: '2' + item: + host: Telemt + key: telemt.pool_stale_pick_total + - color: 1A7C11 + sortorder: '3' + item: + host: Telemt + key: telemt.pool_swap_total + - uuid: a0779e6c979f4c1ab7ac4da7123a5ecb + name: 'Telemt: ME Writer Removals and Restores' + graph_items: + - color: F63100 + item: + host: Telemt + key: 
telemt.me_writer_removed_total + - color: FF8000 + sortorder: '1' + item: + host: Telemt + key: telemt.me_writer_removed_unexpected_total + - color: FFAA00 + sortorder: '2' + item: + host: Telemt + key: telemt.me_writer_removed_unexpected_minus_restored_total + - color: 1A7C11 + sortorder: '3' + item: + host: Telemt + key: telemt.me_writer_restored_same_endpoint_total + - color: 00AA00 + sortorder: '4' + item: + host: Telemt + key: telemt.me_writer_restored_fallback_total + - uuid: 4fead70290664953b026a228108bee0e + name: 'Telemt: Desync Detections' + graph_items: + - color: F63100 + item: + host: Telemt + key: telemt.desync_total + - color: 1A7C11 + sortorder: '1' + item: + host: Telemt + key: telemt.desync_full_logged_total + - color: FF8000 + sortorder: '2' + item: + host: Telemt + key: telemt.desync_suppressed_total + - uuid: 9f8c9f48cb534a66ac21b1bba1acb602 + name: 'Telemt: Upstream Connect Cycles' + graph_items: + - color: 0080FF + item: + host: Telemt + key: telemt.upstream_connect_attempt_total + - color: 1A7C11 + sortorder: '1' + item: + host: Telemt + key: telemt.upstream_connect_success_total + - color: F63100 + sortorder: '2' + item: + host: Telemt + key: telemt.upstream_connect_fail_total + - color: FF8000 + sortorder: '3' + item: + host: Telemt + key: telemt.upstream_connect_failfast_hard_error_total + - uuid: 05182057727547f8b8884b7e71e34f19 + name: 'Telemt: ME Single-Endpoint Outages' + graph_items: + - color: F63100 + item: + host: Telemt + key: telemt.me_single_endpoint_outage_enter_total + - color: 1A7C11 + sortorder: '1' + item: + host: Telemt + key: telemt.me_single_endpoint_outage_exit_total + - color: 0080FF + sortorder: '2' + item: + host: Telemt + key: telemt.me_single_endpoint_outage_reconnect_attempt_total + - color: 00AA00 + sortorder: '3' + item: + host: Telemt + key: telemt.me_single_endpoint_outage_reconnect_success_total + - uuid: 6892e8b7fbd2445d9ccc0574af58a354 + name: 'Telemt: ME Refill Activity' + graph_items: + - color: 0080FF + 
item: + host: Telemt + key: telemt.me_refill_triggered_total + - color: F63100 + sortorder: '1' + item: + host: Telemt + key: telemt.me_refill_failed_total + - color: FF8000 + sortorder: '2' + item: + host: Telemt + key: telemt.me_refill_skipped_inflight_total