Session Drops and Timeout Runbooks
Use this page to diagnose and resolve unexpected session termination in SSH, RDP, and web-access flows.
Common Causes
| Symptom pattern | Common cause |
|---|---|
| Active sessions terminate during rollout or autoscaling | Bastion pod restart or HPA scale-in terminated a pod that held active sessions |
| Sessions close after a fixed idle window | Ingress or load balancer timeout shorter than expected session duration |
| New SSH sessions fail during spikes | CONFIG_MAX_STARTUPS threshold reached on SSH bastion |
| Session ends near a configured policy limit | Session TTL expiry from Gateway SRA configuration |
Runbook 1: Bastion Restart or HPA Scale-In
Diagnostics
- Check for pod restarts and recent scale events in the bastion namespace.
- Review bastion and gateway logs around drop time.
- Confirm session state with `list-sra-sessions`, using both active and completed or terminated filters.
akeyless list-sra-sessions --status-type connecting --status-type connected --status-type completed --status-type terminated --status-type failed
Resolution
- Increase scale-in protection and rollout conservatism for SRA pods.
- Configure PodDisruptionBudgets for gateway and SRA workloads.
- Delay disruptive operations during peak session windows.
For HPA guardrails, see Scaling and HPA Patterns.
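To tie dropped sessions to restart or scale-in evidence, it can help to line up session end times against pod event times. The sketch below is a minimal illustration, not part of any Akeyless tooling: `drops_near_events`, the 60-second window, and the input shapes are all assumptions, and the timestamps would have to be gathered by hand from the session listing and cluster events.

```python
from datetime import datetime, timedelta


def drops_near_events(session_ends, event_times, window=timedelta(seconds=60)):
    """Return (session_id, ended_at, event_at) for sessions that ended
    within `window` of a pod restart or scale event.

    session_ends: dict mapping a session ID to its end time (hypothetical shape).
    event_times:  list of restart/scale-in timestamps (hypothetical shape).
    """
    matches = []
    for session_id, ended_at in session_ends.items():
        for event_at in event_times:
            # abs() tolerates small clock skew between log sources.
            if abs(ended_at - event_at) <= window:
                matches.append((session_id, ended_at, event_at))
                break  # one matching event per session is enough
    return matches


if __name__ == "__main__":
    scale_in = datetime(2024, 1, 1, 12, 0, 0)
    ends = {
        "session-a": scale_in + timedelta(seconds=5),
        "session-b": scale_in + timedelta(minutes=20),
    }
    for session_id, ended_at, event_at in drops_near_events(ends, [scale_in]):
        print(session_id, "ended", ended_at, "near event", event_at)
```

If most dropped sessions match an event, restart or scale-in is the likely cause; if few do, one of the timeout runbooks is a better starting point.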
Runbook 2: Timeout Misalignment
Diagnostics
- Compare configured session TTL with ingress and load balancer idle/response timeout values.
- Check platform defaults for your ingress or load balancer tier.
- Correlate timeout interval with user-reported disconnect timing.
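The comparison in the steps above can be sketched as a small check that flags any hop whose timeout is shorter than the session TTL. This is an illustrative helper only: the function name, the hop labels, and the sample values are assumptions, and the real values must be read from your own ingress and load balancer configuration.

```python
def hops_shorter_than_ttl(session_ttl_s, hop_timeouts_s):
    """Return {hop: timeout} for hops whose timeout (seconds) is below
    the session TTL (seconds), shortest timeout first."""
    short = {
        hop: timeout
        for hop, timeout in hop_timeouts_s.items()
        if timeout < session_ttl_s
    }
    return dict(sorted(short.items(), key=lambda item: item[1]))


if __name__ == "__main__":
    # Hypothetical values collected manually from each layer.
    timeouts = {
        "load_balancer_idle": 60,
        "ingress_read": 3600,
        "ingress_send": 300,
    }
    print(hops_shorter_than_ttl(3600, timeouts))
    # {'load_balancer_idle': 60, 'ingress_send': 300}
```

The shortest flagged timeout is the first candidate to compare against the disconnect timing that users report.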
Resolution
- Set ingress and load balancer timeout values to match or exceed expected SRA session duration.
- If a custom TTL is used, align network timeout values to that TTL.
- Re-test long-lived sessions after timeout changes.
For platform-specific timeout references, see SRA Requirements.
Runbook 3: CONFIG_MAX_STARTUPS Saturation
Diagnostics
- Inspect SSH bastion logs for rejected unauthenticated connection bursts.
- Verify the current `CONFIG_MAX_STARTUPS` value in the deployment environment configuration.
- Check concurrent unauthenticated connection patterns during incident windows.
Resolution
- Increase `CONFIG_MAX_STARTUPS` based on the observed burst profile.
- Reduce unauthenticated connection storms by smoothing client retry behavior.
- Combine with ingress and scaling controls to avoid repeated saturation.
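When sizing the value against an observed burst, it may help to estimate how many connection attempts would be refused at a given level of concurrency. The sketch below assumes that `CONFIG_MAX_STARTUPS` is passed through to the OpenSSH `MaxStartups` option and uses its `start:rate:full` form; that pass-through is an assumption, not something this page confirms. Under OpenSSH semantics, refusal starts at `rate` percent once `start` unauthenticated connections exist and rises linearly to 100 percent at `full`. The function names are hypothetical.

```python
def parse_max_startups(value):
    """Parse a 'start:rate:full' string into three integers.

    Only the three-part form is handled in this sketch.
    """
    start, rate, full = (int(part) for part in value.split(":"))
    if not 0 < start <= full or not 0 <= rate <= 100:
        raise ValueError(f"invalid start:rate:full value: {value!r}")
    return start, rate, full


def drop_probability(unauthenticated, value):
    """Estimate the chance that a new connection is refused, assuming
    OpenSSH MaxStartups semantics (an assumption for CONFIG_MAX_STARTUPS)."""
    start, rate, full = parse_max_startups(value)
    if unauthenticated < start:
        return 0.0
    if unauthenticated >= full:
        return 1.0
    # Linear ramp from `rate` percent at `start` to 100 percent at `full`.
    percent = rate + (100 - rate) * (unauthenticated - start) / (full - start)
    return percent / 100


if __name__ == "__main__":
    for concurrent in (100, 200, 250, 300):
        print(concurrent, drop_probability(concurrent, "200:30:300"))
```

If the observed burst peak sits well inside the ramp, raising `start` and `full` together keeps the peak below the point where refusals begin.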
Example deployment value:
sra:
  env:
    - name: CONFIG_MAX_STARTUPS
      value: "200:30:300"
Runbook 4: Session TTL Expiry
Diagnostics
- Review effective Gateway SRA config for default session TTL.
- Compare session start/end timestamps to TTL policy.
- Confirm whether expiration behavior matches intended security policy.
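The timestamp comparison above can be sketched as a duration check with a small tolerance. This is a minimal illustration: `ended_at_ttl`, the 30-second tolerance, and the sample times are assumptions, and the start and end times would come from your own session records.

```python
from datetime import datetime, timedelta


def ended_at_ttl(started_at, ended_at, ttl, tolerance=timedelta(seconds=30)):
    """Return True when the session duration is within `tolerance`
    of the configured TTL, which suggests TTL expiry ended it."""
    duration = ended_at - started_at
    return abs(duration - ttl) <= tolerance


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0, 0)
    ttl = timedelta(hours=1)
    print(ended_at_ttl(start, datetime(2024, 1, 1, 10, 0, 10), ttl))  # True
    print(ended_at_ttl(start, datetime(2024, 1, 1, 9, 20, 0), ttl))   # False
```

Sessions that end well before the TTL point to one of the other runbooks rather than to TTL policy.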
Resolution
- Update default session TTL if current policy is too short for operational use.
- Reconfirm timeout alignment across ingress, load balancer, and session policy.
- Communicate policy changes to operators and users.
For TTL policy configuration, see Session TTL and Security Controls.
Minimum Incident Dataset to Capture
When escalating an incident, collect:
- Affected cluster name and deployment mode (unified or split).
- Session ID samples and status transitions.
- Bastion/gateway restart evidence and scale-event timestamps.
- Ingress or load balancer timeout values.
- Effective `CONFIG_MAX_STARTUPS` setting.
Runbook 5: RDP Tab Closes Without Error
Diagnostics
- Verify that the `UAM_ADDR` environment variable in your bastion deployment matches your account's region (for example, MEU for European accounts).
- Check bastion authentication logs for errors during the RDP session initiation flow.
- Confirm WebSocket connectivity between the browser and bastion by inspecting browser console errors and network traffic.
- Verify that the account region and authentication service endpoint are correctly configured in Gateway console settings.
Resolution
- Ensure the `UAM_ADDR` environment variable aligns with your account's region. Update the bastion deployment configuration if needed.
- Verify account settings in Gateway console under Remote Access configuration.
- Test browser WebSocket connectivity to the bastion endpoint.
- Re-initiate the RDP session after configuration changes.
Minimum Incident Data for RDP Failures
When escalating RDP connection failures, collect the following in addition to the minimum incident dataset above:
- Bastion deployment `UAM_ADDR` environment variable value.
- Account region setting from Gateway console.
- Bastion authentication log excerpts around the time of RDP session initiation.
Runbook 6: Active SSH Sessions Drop at Fixed Short Intervals
Use this runbook when SSH sessions disconnect around a fixed short interval (for example, 30 to 60 seconds) even while users are continuously active.
Diagnostics
- Compare direct client-to-target SSH behavior with SSH through SRA. If direct SSH is stable but SRA SSH drops, focus on ingress or load balancer path.
- Measure the disconnect interval. A consistent interval often indicates an ingress or backend timeout, not an SSH daemon issue.
- Review ingress controller configuration and annotations for timeout values on the Gateway and bastion routes.
- In Kubernetes environments that use GKE ingress, check whether backend timeout is still on the default `30s` value.
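To judge whether the disconnect interval is consistent, a handful of measured session durations can be checked for tight clustering. The sketch below is one simple way to do that, not a prescribed method: `fixed_interval`, the 5-second spread, and the minimum of three samples are assumptions to adjust for your environment.

```python
from statistics import median


def fixed_interval(durations_s, max_spread_s=5.0, min_samples=3):
    """Return the median duration (seconds) if every sample lies within
    `max_spread_s` of it, otherwise None.

    A tight cluster suggests a fixed timeout somewhere in the network path.
    """
    if len(durations_s) < min_samples:
        return None  # too few samples to call it a pattern
    mid = median(durations_s)
    if all(abs(duration - mid) <= max_spread_s for duration in durations_s):
        return mid
    return None


if __name__ == "__main__":
    print(fixed_interval([30, 31, 29, 30]))  # 30.0
    print(fixed_interval([12, 340, 75]))     # None
```

A returned value close to a known platform default, such as a 30-second backend timeout, is a strong hint about which layer to inspect first.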
Resolution
- Increase ingress or load balancer timeout values to match expected SSH session duration.
- For GKE ingress, configure `BackendConfig` or `GCPBackendPolicy` `timeoutSec` for SRA services.
- If multiple ingress controllers or site-specific ingress resources are used, verify timeout settings are consistent across all SRA-related ingress objects.
- Re-test SSH sessions with sustained activity after timeout updates.
For platform timeout baselines, see SRA Requirements.
