Fix two additional bugs that made draining writers appear stuck:
1. pool_drain_active gauge drift: when cleanup races caused a writer
to be removed from the Vec without going through remove_writer_only
(the old unconditional ws.retain bug), decrement_pool_drain_active
was never called. The gauge drifted upward permanently, making it
look like draining writers were accumulating even after removal.
Fix: sync the gauge with the actual draining writer count on every
reap_draining_writers cycle.
2. Draining writers stuck forever when drain_ttl_secs=0 and no
per-writer deadline: draining_writer_timeout_expired returned false
immediately when drain_ttl_secs==0, with no fallback. Writers with
bound clients would never be force-removed.
Fix: use a hard upper bound of 600s (10 minutes) as safety net when
drain_ttl_secs is 0, so draining writers can never get stuck
indefinitely.
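
The two fixes above can be sketched roughly as follows. This is a simplified
illustration, not the actual code: the Writer struct, its fields, the
AtomicU64 gauge, and both function signatures are assumptions made for the
example. Only the names draining_writer_timeout_expired, drain_ttl_secs,
pool_drain_active, and the 600s bound come from this message. Measuring the
cap from the moment draining started is also an assumption.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

/// Safety-net bound used when drain_ttl_secs is 0 (600s, as described above).
const DRAIN_HARD_CAP: Duration = Duration::from_secs(600);

/// Hypothetical, simplified writer record (field names are assumptions).
struct Writer {
    draining: bool,
    drain_started_at: Option<Instant>,
    drain_deadline: Option<Instant>,
}

/// Returns true once a draining writer should be force-removed.
/// A per-writer deadline wins if present; otherwise the TTL applies,
/// and a TTL of 0 falls back to the hard cap instead of "never".
fn draining_writer_timeout_expired(w: &Writer, drain_ttl_secs: u64, now: Instant) -> bool {
    if let Some(deadline) = w.drain_deadline {
        return now >= deadline;
    }
    let Some(started) = w.drain_started_at else {
        return false; // not draining yet, nothing to expire
    };
    let ttl = if drain_ttl_secs == 0 {
        DRAIN_HARD_CAP
    } else {
        Duration::from_secs(drain_ttl_secs)
    };
    now.saturating_duration_since(started) >= ttl
}

/// Overwrites the gauge with the real number of draining writers, so any
/// drift from missed decrements is corrected on each reap cycle.
fn sync_pool_drain_active(gauge: &AtomicU64, writers: &[Writer]) -> u64 {
    let actual = writers.iter().filter(|w| w.draining).count() as u64;
    gauge.store(actual, Ordering::Relaxed);
    actual
}
```

Storing the observed count rather than adjusting by deltas means a missed
decrement can only skew the gauge until the next reap cycle.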
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three bugs caused ME writers to not be properly removed when ME
connections flapped:
1. Reader task's unconditional ws.retain() removed writers from the
pool Vec without going through remove_writer_only(), skipping
registry cleanup, quarantine, and refill side effects. Fixed by
moving retain inside the cleanup_done CAS block as shutdown-only
fallback.
2. Draining writers bypassed quarantine entirely because trigger_refill
gated both quarantine and refill. Separated: quarantine now runs for
all removals (flapping endpoint is unstable regardless of drain
state), refill remains non-draining only.
3. connectable_endpoints() returned quarantined endpoints immediately
when all DC endpoints were quarantined, nullifying the circuit
breaker for single-endpoint DCs. Now waits for quarantine expiry
with proper Mutex guard drop before sleep.
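
A minimal sketch of the wait-for-expiry behavior from item 3, showing why
the guard has to be released before sleeping. It uses std::sync::Mutex and
thread::sleep to stay self-contained; the real code may well be async, and
the Quarantine type and this signature of connectable_endpoints are
assumptions for illustration only.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical quarantine table: endpoint -> instant at which quarantine ends.
type Quarantine = Mutex<HashMap<String, Instant>>;

/// Returns the endpoints that are not quarantined. If every endpoint is
/// quarantined, waits for the earliest expiry instead of returning them.
fn connectable_endpoints(all: &[String], quarantine: &Quarantine) -> Vec<String> {
    loop {
        let wait: Duration = {
            let guard = quarantine.lock().unwrap();
            let now = Instant::now();
            let ready: Vec<String> = all
                .iter()
                .filter(|ep| guard.get(*ep).map_or(true, |until| *until <= now))
                .cloned()
                .collect();
            if !ready.is_empty() || all.is_empty() {
                return ready;
            }
            // Everything is quarantined: compute how long until the first one expires.
            let earliest = all
                .iter()
                .filter_map(|ep| guard.get(ep))
                .min()
                .copied()
                .expect("every endpoint has a quarantine entry here");
            earliest.saturating_duration_since(now)
        }; // guard is dropped here, before sleeping
        thread::sleep(wait);
    }
}
```

Scoping the guard to the inner block keeps other tasks free to update the
quarantine table while this caller waits.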
Also normalized the CAS ordering in ping task cleanup to match the
reader task (CAS-first, then pool.upgrade check).
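
For illustration, the CAS-first cleanup order might look like the sketch
below. The Pool struct, the Writers alias, the Cleanup enum, and
task_cleanup are hypothetical names for this example. Only cleanup_done,
remove_writer_only, retain, and pool.upgrade correspond to the description
above, and the real remove_writer_only has side effects (registry cleanup,
quarantine, refill) that are reduced here to a simple log.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex, Weak};

/// Hypothetical shared writer list, identified by writer id.
type Writers = Arc<Mutex<Vec<u64>>>;

/// Hypothetical pool; side_effects stands in for registry/quarantine/refill work.
struct Pool {
    writers: Writers,
    side_effects: Mutex<Vec<u64>>,
}

impl Pool {
    fn remove_writer_only(&self, id: u64) {
        self.writers.lock().unwrap().retain(|w| *w != id);
        self.side_effects.lock().unwrap().push(id);
    }
}

#[derive(Debug, PartialEq)]
enum Cleanup {
    AlreadyDone,
    ViaPool,
    ShutdownFallback,
}

/// CAS first, so exactly one task performs cleanup; then try the pool, and
/// fall back to a bare retain only when the pool is already gone (shutdown).
fn task_cleanup(cleanup_done: &AtomicBool, pool: &Weak<Pool>, ws: &Writers, id: u64) -> Cleanup {
    if cleanup_done
        .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
        .is_err()
    {
        return Cleanup::AlreadyDone; // another task won the race
    }
    if let Some(pool) = pool.upgrade() {
        pool.remove_writer_only(id);
        Cleanup::ViaPool
    } else {
        ws.lock().unwrap().retain(|w| *w != id);
        Cleanup::ShutdownFallback
    }
}
```

Because the retain sits behind the CAS and the failed upgrade, a live pool
always sees removals go through remove_writer_only.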
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add 9091 port mapping to compose.yml to make the REST API reachable
from outside the container. Previously only port 9090 (metrics) was
published, making the documented curl commands non-functional.
Fixes #434
Restored the old functionality and added new features:
- Restored automatic creation of the config with a secret
- Restored automatic creation of the service
- Added removal of the service and telemt via `install.sh uninstall`
- Added full removal, including the config, via `install.sh --purge`
- Added installation of a specific version via `install.sh 3.3.15`
Stdlib-only HTTP client covering all /v1 endpoints with argparse CLI.
Supports If-Match concurrency, typed errors, user CRUD, and all runtime/stats routes.
Usage: ./telemt_api.py help
AI-generated from API.md and only partially tested; use with caution.
- Introduced adversarial tests to validate the behavior of the health monitoring system under various conditions, including the management of draining writers.
- Implemented integration tests to ensure the health monitor correctly handles expired and empty draining writers.
- Added regression tests to verify the functionality of the draining writers' cleanup process, ensuring it adheres to the defined thresholds and budgets.
- Updated the module structure to include the new test files for better organization and maintainability.