Incident Report: Resolving ERR_TUNNEL_CONNECTION_FAILED on Our Web Service

A detailed account of the outage on June 10, 2024, its root cause, and the steps taken for resolution and prevention.

A postmortem (or post-mortem) is a process designed to help you learn from past incidents. It usually involves analyzing or discussing an event soon after it has occurred.

When I first learned about postmortems and why they matter, I almost hoped for an outage so I could write one. When one finally happened, I committed to documenting my analysis of the problem and the steps I took to resolve it. That way, I would not have to rely on luck or repeat ineffective fixes while expecting different results.

Issue Summary

On Monday, June 10, 2024, from 6:19 AM to 9:10 AM GMT (2 hours 51 minutes), my website was unreachable. In the browser's developer tools, requests failed with an "ERR_TUNNEL_CONNECTION_FAILED" error, which pointed to a failed tunnel connection and left the site without connectivity or server availability. Neither the web servers nor the load balancer were reachable. The root cause was a change in the IP addresses of all the servers and the load balancer.

Duration of the Outage:

  • Start Time: 6:19 AM GMT

  • End Time: 9:10 AM GMT

Impact:

  • The web service was down, preventing users from accessing the site.

  • Users experienced a complete inability to connect.

  • Approximately 100% of users were affected during this period.

Timeline (All times in GMT)

  • 6:19 AM: Issue detected by monitoring system alert.

  • 6:30 AM: Investigated network-related issues (netstat, curl, ping).

  • 7:11 AM: Configured new IP addresses and connected via SSH.

  • 7:39 AM: Updated domain and subdomain configurations.

  • 7:45 AM: Installed and configured Nginx.

  • 7:58 AM: Installed and configured HAProxy.

  • 8:11 AM: Configured HAProxy SSL termination and HTTPS settings.

  • 8:35 AM: HAProxy SSL termination configuration failed.

  • 8:40 AM: Rolled back to the configuration in place before SSL termination.

  • 8:43 AM: Restored 100% of traffic online (HTTP - Not Secure).

  • 8:57 AM: Completed new SSL termination and HTTPS configuration.

  • 9:10 AM: Added monitoring to the server and restored 100% traffic online (HTTPS - Secure).
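
The network checks from the 6:30 AM step can also be scripted. The following is a minimal Python sketch, not the commands actually run during the incident: it attempts a TCP connection to a host and port, which gives roughly the same reachability signal as checking by hand with curl. The function name `port_is_open` is hypothetical.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Calling `port_is_open` for ports 22, 80, and 443 on each server and on the load balancer would quickly show which hosts still answer at their recorded addresses.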

Root Cause

The root cause was a sudden change in the IP addresses of all servers and the load balancer due to a scheduled update without forewarning. This resulted in connectivity issues and the "ERR_TUNNEL_CONNECTION_FAILED" error.
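
A change like this can be caught by comparing the addresses on record with the addresses the hosts currently report. The sketch below is a hypothetical illustration: the host names, the documentation-range IP addresses, and the `find_ip_changes` helper are all assumptions, and how the current addresses are obtained (for example, from the hosting provider) is left out.

```python
from typing import Dict, Optional, Tuple


def find_ip_changes(
    recorded: Dict[str, str], current: Dict[str, str]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {host: (recorded_ip, current_ip)} for every host whose address differs.

    A host missing from `current` is reported with a current_ip of None.
    """
    changes = {}
    for host, old_ip in recorded.items():
        new_ip = current.get(host)
        if new_ip != old_ip:
            changes[host] = (old_ip, new_ip)
    return changes


if __name__ == "__main__":
    # Placeholder inventory; these names and addresses are illustrative only.
    recorded = {"web-01": "192.0.2.10", "web-02": "192.0.2.11", "lb-01": "192.0.2.20"}
    current = {"web-01": "198.51.100.10", "web-02": "198.51.100.11", "lb-01": "198.51.100.20"}
    for host, (old_ip, new_ip) in find_ip_changes(recorded, current).items():
        print(f"{host}: {old_ip} -> {new_ip}")
```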

Resolution and Recovery

After the monitoring alert at 6:19 AM, I began investigating the network issues at 6:30 AM. By 7:11 AM, I had obtained and configured the new IP addresses for the servers and the load balancer, updated the environment variables, and reconnected via SSH. At 7:39 AM, I updated the domain and subdomain configurations to point to the new IP addresses. Between 7:45 AM and 8:11 AM, I installed and configured the Nginx web server and the HAProxy load balancer.

At 8:11 AM, I began configuring SSL termination and HTTPS on HAProxy. That configuration failed at 8:35 AM, so at 8:40 AM I rolled back to the settings in place before the SSL changes. Traffic was back online over HTTP (not secure) by 8:43 AM. A new SSL termination and HTTPS configuration was completed by 8:57 AM.

Finally, by 9:10 AM, the web service was fully operational with 100% traffic online in secure HTTPS mode.
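
The failed SSL configuration and the rollback suggest validating a new HAProxy configuration before it replaces the live one. Here is a minimal sketch of that idea, assuming HAProxy is installed and using its check mode (`haproxy -c -f <file>`). The helper names and the `.new` / `.bak` file layout are hypothetical, and this is not the procedure followed during the incident.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable


def haproxy_config_is_valid(path: Path) -> bool:
    """Run HAProxy in check mode (-c) against the given config file (-f)."""
    result = subprocess.run(["haproxy", "-c", "-f", str(path)], capture_output=True)
    return result.returncode == 0


def apply_config(
    candidate: str,
    live_path: Path,
    validate: Callable[[Path], bool] = haproxy_config_is_valid,
) -> bool:
    """Install `candidate` at `live_path` only if it validates.

    On failure the live file is left untouched. On success the previous
    live file is kept next to it with a ".bak" suffix for manual rollback.
    """
    staged = live_path.with_name(live_path.name + ".new")
    staged.write_text(candidate)
    if not validate(staged):
        staged.unlink()  # discard the rejected candidate
        return False
    if live_path.exists():
        shutil.copy2(live_path, live_path.with_name(live_path.name + ".bak"))
    staged.replace(live_path)  # swap the validated file into place
    return True
```

After a successful swap, HAProxy still has to be reloaded for the new configuration to take effect; that step is deliberately left out of the sketch.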

Corrective and Preventative Measures

Improvements and Fixes

  • Automate the configuration process for servers to ensure a quicker recovery time.

  • Enhance the monitoring system to provide forewarnings about scheduled updates that may change IP addresses.

  • Improve documentation and communication protocols for IP address changes to avoid sudden outages.
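
As one possible starting point for the automation item in this list, the load balancer's backend section could be generated from a list of server addresses, so that an IP change becomes a one-line edit. The sketch below is an illustration under assumptions: the server names, addresses, backend name, and `render_haproxy_backend` function are hypothetical, and the generated section is only a small part of a full HAProxy configuration.

```python
from typing import Dict


def render_haproxy_backend(
    servers: Dict[str, str], port: int = 80, name: str = "web-backend"
) -> str:
    """Render a HAProxy backend section with one health-checked server line per host."""
    lines = [f"backend {name}", "    balance roundrobin"]
    # Sort by server name so the output is stable between runs.
    for server_name, ip in sorted(servers.items()):
        lines.append(f"    server {server_name} {ip}:{port} check")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder servers in a documentation-only address range.
    print(render_haproxy_backend({"web-01": "192.0.2.10", "web-02": "192.0.2.11"}))
```

Keeping the addresses in one dictionary (or a small inventory file) means the Nginx and HAProxy configurations can be regenerated from a single source when the provider reassigns IPs.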

Task List

  • Patch the Nginx server and update configurations regularly.

  • Add enhanced monitoring on server memory and IP address changes.

  • Develop and implement automation scripts for server and load balancer configurations.

  • Schedule regular training sessions for the team on handling similar outages.

  • Establish a more robust incident response plan to streamline recovery processes.

I am committed to continually improving our technology and operational processes to prevent future outages. I appreciate your patience and apologize for the impact on your users and organization. Thank you for your continued support. 😊
