Incident Report: Resolving ERR_TUNNEL_CONNECTION_FAILED on Our Web Service
A detailed account of the outage on June 10, 2024, its root cause, and the steps taken for resolution and prevention.
A postmortem (or post-mortem) is a process designed to help you learn from past incidents. It usually involves analyzing or discussing an event soon after it has occurred.
When I learned about postmortems and their importance, I almost hoped for an outage. I promised myself that when one happened, I would report my analysis of the problem and the steps I took to solve it. This way, I wouldn't rely on chance or luck, or repeat ineffective solutions while expecting different results.
Issue Summary
On Monday, June 10, 2024, from 6:19 AM to 9:10 AM GMT, my website was unreachable. Browser developer tools reported an "ERR_TUNNEL_CONNECTION_FAILED" error. The error pointed to a failed tunnel connection, which left the site without connectivity. Neither my web servers nor my load balancer could be reached. The root cause was a change in the IP addresses of all the servers and the load balancer.
Duration of the Outage:
Start Time: 6:19 AM GMT
End Time: 9:10 AM GMT
Impact:
The web service was down, preventing users from accessing the site.
Users experienced a complete inability to connect.
Approximately 100% of users were affected during this period.
Timeline (All times in GMT)
6:19 AM: Issue detected by monitoring system alert.
6:30 AM: Investigated network-related issues (netstat, curl, ping).
7:11 AM: Configured new IP addresses and connected via SSH.
7:39 AM: Updated domain and subdomain configurations.
7:45 AM: Installed and configured Nginx.
7:58 AM: Installed and configured HAProxy.
8:11 AM: Configured HAProxy SSL termination and HTTPS settings.
8:35 AM: HAProxy SSL termination configuration failed.
8:40 AM: Rolled back to the configuration that preceded SSL termination.
8:43 AM: Restored 100% of traffic online (HTTP - Not Secure).
8:57 AM: Completed new SSL termination and HTTPS configuration.
9:10 AM: Added monitoring to the server and restored 100% traffic online (HTTPS - Secure).
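The timeline above can be turned into a few recovery metrics. The following is a minimal sketch that only does the arithmetic on the timestamps listed in this report; the dictionary keys and the helper name `minutes_between` are my own labels, not part of any monitoring tool.

```python
from datetime import datetime

# Key timestamps copied from the timeline (GMT, same day).
TIMELINE = {
    "detected": "06:19",
    "investigation_started": "06:30",
    "http_restored": "08:43",
    "https_restored": "09:10",
}


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


total_outage = minutes_between(TIMELINE["detected"], TIMELINE["https_restored"])
time_to_investigate = minutes_between(TIMELINE["detected"], TIMELINE["investigation_started"])
time_to_http = minutes_between(TIMELINE["detected"], TIMELINE["http_restored"])
insecure_window = minutes_between(TIMELINE["http_restored"], TIMELINE["https_restored"])

print(total_outage)         # 171 minutes (2 h 51 min) until HTTPS was back
print(time_to_investigate)  # 11 minutes from alert to investigation
print(time_to_http)         # 144 minutes until traffic was back over HTTP
print(insecure_window)      # 27 minutes served over HTTP only
```

Most of the 171 minutes went into rebuilding the servers and load balancer rather than into diagnosis, which is why automation features so heavily in the corrective measures.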
Root Cause
The root cause was a sudden change in the IP addresses of all servers and the load balancer due to a scheduled update without forewarning. This resulted in connectivity issues and the "ERR_TUNNEL_CONNECTION_FAILED" error.
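A change like this can be caught by comparing the addresses I have on record with the addresses the hosts currently hold. The sketch below assumes both sides are already available as plain dictionaries. How the current addresses are obtained (a hosting provider dashboard or API, for example) is outside its scope. The host names and the 192.0.2.x / 198.51.100.x documentation addresses are hypothetical placeholders, not my real infrastructure.

```python
from typing import Dict, Mapping, Optional, Tuple


def find_ip_drift(
    recorded: Mapping[str, str],
    observed: Mapping[str, str],
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {host: (recorded_ip, observed_ip)} for every host whose address differs.

    A host missing from one side is reported with None on that side.
    """
    drift = {}
    for host in sorted(set(recorded) | set(observed)):
        old_ip = recorded.get(host)
        new_ip = observed.get(host)
        if old_ip != new_ip:
            drift[host] = (old_ip, new_ip)
    return drift


# Hypothetical inventory: what the DNS records and config files still point to.
recorded = {"web-01": "192.0.2.10", "web-02": "192.0.2.11", "lb-01": "192.0.2.20"}
# Hypothetical current state after an unannounced update.
observed = {"web-01": "198.51.100.10", "web-02": "198.51.100.11", "lb-01": "198.51.100.20"}

for host, (old_ip, new_ip) in find_ip_drift(recorded, observed).items():
    print(f"{host}: {old_ip} -> {new_ip}")
```

Run on a schedule, a check like this would have reported three changed hosts as soon as the addresses moved, pointing straight at the cause.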
Resolution and Recovery
The monitoring alert fired at 6:19 AM, and by 6:30 AM I was investigating the network with netstat, curl, and ping. By 7:11 AM, I had obtained and configured the new IP addresses for the servers and the load balancer, updated the environment variables, and reconnected via SSH. At 7:39 AM, I updated the domain and subdomain configurations to point to the new IP addresses. At 7:45 AM, I installed and configured the Nginx web server, followed by the HAProxy load balancer at 7:58 AM.
At 8:11 AM, I began configuring SSL termination on HAProxy. That configuration failed at 8:35 AM, so at 8:40 AM I rolled back to the settings that preceded the SSL changes. Traffic was restored over HTTP (not secure) by 8:43 AM. A new SSL termination and HTTPS configuration was completed by 8:57 AM.
Finally, by 9:10 AM, the web service was fully operational with 100% traffic online in secure HTTPS mode.
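The failed HAProxy configuration and the rollback that followed suggest a safer habit: validate a candidate configuration before it replaces the live file. The sketch below is one way to do that, not the procedure I followed during the incident. It relies on HAProxy's configuration check mode (`haproxy -c -f <file>`). The function names and the `.candidate` / `.bak` suffixes are my own choices.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable


def haproxy_check(path: Path) -> bool:
    """Return True if `haproxy -c -f <path>` accepts the configuration file."""
    result = subprocess.run(["haproxy", "-c", "-f", str(path)], capture_output=True)
    return result.returncode == 0


def apply_if_valid(
    live: Path,
    new_text: str,
    validate: Callable[[Path], bool] = haproxy_check,
) -> bool:
    """Write new_text next to the live config and swap it in only if it validates.

    The previous live file is kept as <name>.bak so a manual rollback stays possible.
    Returns True when the new configuration was installed, False otherwise.
    """
    candidate = live.with_name(live.name + ".candidate")
    candidate.write_text(new_text)
    if not validate(candidate):
        candidate.unlink()  # discard the rejected file; the live config is untouched
        return False
    if live.exists():
        shutil.copy2(live, live.with_name(live.name + ".bak"))
    candidate.replace(live)
    return True
```

A call such as `apply_if_valid(Path("/etc/haproxy/haproxy.cfg"), new_config_text)` would leave the working HTTP configuration in place whenever the SSL changes fail validation. The function only swaps files, so the HAProxy service still has to be reloaded separately afterwards.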
Corrective and Preventative Measures
Improvements and Fixes
Automate the configuration process for servers to ensure a quicker recovery time.
Enhance the monitoring system to provide forewarnings about scheduled updates that may change IP addresses.
Improve documentation and communication protocols for IP address changes to avoid sudden outages.
Task List
Patch the Nginx server and update configurations regularly.
Add enhanced monitoring on server memory and IP address changes.
Develop and implement automation scripts for server and load balancer configurations.
Schedule regular training sessions for the team on handling similar outages.
Establish a more robust incident response plan to streamline recovery processes.
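As a starting point for the automation task, the load balancer configuration can be generated from a single inventory of server addresses, so an IP change means editing one place. This is a minimal sketch that renders only an HTTP frontend and backend fragment. It assumes the `global` and `defaults` sections (including `mode http` and timeouts) live elsewhere in the file. The server names, the section names `www-http` and `web-backend`, and the addresses are hypothetical.

```python
import ipaddress
from typing import Mapping


def render_haproxy_fragment(servers: Mapping[str, str], port: int = 80) -> str:
    """Render a frontend/backend fragment that balances HTTP traffic across servers.

    servers maps a server name to its IP address; invalid addresses raise ValueError.
    """
    lines = [
        "frontend www-http",
        "    bind *:80",
        "    default_backend web-backend",
        "",
        "backend web-backend",
        "    balance roundrobin",
    ]
    for name, ip in sorted(servers.items()):
        ipaddress.ip_address(ip)  # fail early on a mistyped address
        lines.append(f"    server {name} {ip}:{port} check")
    return "\n".join(lines) + "\n"


# Hypothetical inventory using documentation-range addresses.
inventory = {"web-01": "198.51.100.10", "web-02": "198.51.100.11"}
print(render_haproxy_fragment(inventory))
```

Keeping the addresses in one inventory also gives the monitoring task something concrete to compare against when it checks for IP changes.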
I am committed to continually improving our technology and operational processes to prevent future outages. I appreciate your patience and apologize for the impact on your users and organization. Thank you for your continued support.