Unlocking the Secrets of HTTP 500: Firewalls, TIME_WAIT, and Performance Mastery

Mar 14, 2025 | DevOps, Software Architecture

Sudden performance drops, unexplained latencies, and timeouts, but no clear cause? This case study shows why a structured root cause analysis is crucial when firewalls misinterpret HTTP 500 responses, terminate TCP connections, and Axis2 AxisFaults create chaos.

1. Initial Situation: A Complex Software Project with Multiple Stakeholders

Background: A High-Criticality Booking System with Strict Requirements

A large software project involving multiple companies was responsible for developing and operating a mission-critical booking system. The software was not only essential for the company’s operations but also extremely expensive to maintain. For this reason, the client decided to implement a process optimization to ensure resources were utilized as efficiently as possible.

The system relied on session tokens to execute business transactions. Since each token was tied to expensive computing resources, it was essential to manage them efficiently. A session pool handled token allocation and ensured that resources were properly utilized without unnecessary waste.
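
The vendor's actual pool was part of a closed Axis2-based component, but the core idea can be sketched in a few lines. The following is purely illustrative; all class and method names are hypothetical:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Illustrative sketch of a bounded pool of expensive session tokens. */
public class SessionTokenPool {

    private final BlockingQueue<String> tokens;

    public SessionTokenPool(int size) {
        this.tokens = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            tokens.add("token-" + i); // in reality: acquired from the backend
        }
    }

    /** Blocks until a token becomes free or the timeout expires. */
    public String borrow(long timeout, TimeUnit unit) throws InterruptedException {
        return tokens.poll(timeout, unit);
    }

    /** Must be called after every transaction, including failed ones. */
    public void release(String token) {
        tokens.offer(token);
    }
}
```

The crucial property is that the pool is bounded: if tokens are not returned reliably, for example on error paths, callers eventually starve.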

Business Impact: Revenue Loss and Customer Dissatisfaction

Performance issues in this system had direct and visible consequences for customers. Slow response times or failures in transaction processing led to a poor user experience, directly impacting customer satisfaction. If customers failed to complete their transactions, it resulted in revenue loss. Given the business-critical nature of this booking system, ensuring stability and performance was not just a technical necessity but a direct business requirement.

Technical Architecture and Testing Strategy

The session pool was provided and maintained by an external vendor and was built on Axis2. The business logic was developed by another partner and operated in a separate data center. The two data centers were interconnected over a network link, which required firewall permissions and routing adjustments.

Prior to the go-live, extensive load testing was conducted to optimize performance and identify potential bottlenecks early.

Dedicated Load Test Areas

  • Server-Side Load Tests: Focused solely on testing the performance of the backend infrastructure.
  • Integration Load Tests: Conducted in a separate reference environment, simulating high-performance traffic across interconnected data centers.

AxisFault Testing Limitation

  • AxisFault errors originated from a backend system outside the test scope, making direct testing impossible.
  • A complex scenario related to session token usage was initially considered irrelevant to the load test — an assumption that later proved costly.

Despite these known gaps in coverage, the system passed all tests and showed no signs of trouble.

Go-Live — Everything Works Perfectly… at First

On the scheduled launch date, the system went live. Performance was stable, all teams were satisfied, and operations proceeded smoothly. However, just a few days later, sudden performance drops began to appear…

2. The First Shock: Unexplained Latencies and Timeouts

At first, the problems were sporadic: On some days, the system ran flawlessly, while on others, severe latency spikes and occasional timeouts occurred. This inconsistent pattern made it difficult to pinpoint the root cause.

Why Was the Issue Hard to Detect?

  • Self-Recovering System: Performance issues disappeared on their own, only to reappear later.
  • No Clear Patterns: Some users reported slow responses, while others experienced timeouts.
  • No Obvious Error Messages: Log analysis showed no single error that pointed directly to the root cause.

Then, the situation escalated. The performance drops became more frequent and severe, triggering a full-scale alert across all teams. The issue had to be resolved before it impacted business-critical processes.

Initial Hypotheses

  1. Application Code Issue? Was the HTTP client misconfigured, or were there connection pooling problems?
  2. Library Bug? Was the version of Apache HttpClient being used possibly faulty?
  3. Network Issue? Were firewalls or routing problems preventing requests from being processed?

3. The Wrong Track — Chasing a Ghost Bug

The development team began a deep investigation of the application. Since requests seemed to be getting lost, the focus shifted first to HTTP communication.

First Suspicion: An Issue in the Application?

The application relied on Apache HttpClient for efficient communication. If the client was not reusing connections properly, for example because session reuse was failing, that alone could explain the latency issues.

Developers meticulously examined the configuration and code, tested different parameters, and searched the issue tracker of the library for similar cases.
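
To give an idea of the kind of settings the team reviewed, a minimal pooled-client configuration might look like this. This is a sketch assuming Apache HttpClient 4.x; the pool sizes and timeouts are illustrative values, not the project's actual configuration:

```java
import java.util.concurrent.TimeUnit;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

public class PooledClientFactory {

    public static CloseableHttpClient create() {
        // Bounded connection pool shared across requests
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(200);          // upper bound across all routes
        cm.setDefaultMaxPerRoute(50); // per backend host

        RequestConfig timeouts = RequestConfig.custom()
                .setConnectTimeout(2_000)           // TCP connect
                .setSocketTimeout(10_000)           // waiting for response data
                .setConnectionRequestTimeout(1_000) // waiting for a pooled connection
                .build();

        return HttpClients.custom()
                .setConnectionManager(cm)
                .setDefaultRequestConfig(timeouts)
                // evict idle and expired connections so dead sockets are not reused
                .evictIdleConnections(30, TimeUnit.SECONDS)
                .evictExpiredConnections()
                .setConnectionTimeToLive(60, TimeUnit.SECONDS)
                .build();
    }
}
```

Settings like these determine whether a half-dead connection is silently handed back to application code, which is why they were the first suspects.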

Contradictions in the Logs

  • Client Logs: Requests were sent successfully.
  • Server Logs: No corresponding requests were received.
  • Network Interruption? This pointed to a likely issue somewhere in between.

At this point, the evidence strongly suggested a network-level problem rather than a fault in the application.

4. The Forensic Breakthrough: Gigabytes of Network Analysis

Since client and server logs contradicted each other, a large-scale network analysis was launched. This required close cooperation between the network and security teams.

Coordinated Network Captures as the Key

  • Multiple Simultaneous Captures: Network teams collected Wireshark and tcpdump captures from five sources:
      • the client system
      • the server system
      • the three firewalls filtering the traffic
  • Days of Monitoring: The issue was elusive, requiring continuous capturing over several days.

The Breakthrough: Thousands of Half-Open Connections

  • Some Firewalls Closed Connections Immediately: They reacted correctly to an HTTP 500 response with a Connection: close header.
  • Others Kept Connections Open Too Long: This produced a backlog of dead connections.
  • TIME_WAIT Blocked Ports: Because connections were not closed cleanly, sockets piled up in TIME_WAIT until the available port tuples were exhausted.

This explained why the system recovered periodically — some firewalls eventually cleaned up connections after 30 minutes, temporarily resolving the problem before it started again.
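
The mechanism is easy to reproduce in isolation. The following toy example (hypothetical host and port, for illustration only) opens and immediately closes many short-lived TCP connections; because the client performs the active close, the closing side accumulates sockets in TIME_WAIT, which can then be observed with the operating system's socket tools:

```java
import java.net.InetSocketAddress;
import java.net.Socket;

/** Toy demo: rapid open/close cycles leave the closing side in TIME_WAIT. */
public class TimeWaitDemo {

    public static void main(String[] args) throws Exception {
        // Hypothetical target; replace with a reachable test endpoint
        InetSocketAddress target = new InetSocketAddress("backend.example", 8080);

        for (int i = 0; i < 1_000; i++) {
            try (Socket socket = new Socket()) {
                socket.connect(target, 1_000); // 1 s connect timeout
            } // close() here is the active close: the local socket enters TIME_WAIT
        }
    }
}
```

In the incident, the same effect played out at much larger scale: every improperly terminated request left a socket behind, and usable port tuples drained faster than the operating system released them.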

Conclusion

This case study highlights the critical importance of structured root cause analysis in complex networked environments. Several key lessons emerged from our investigation:

  • Handling HTTP 500 Errors Correctly: Since HTTP 500 indicates an internal server error, it is advisable to terminate the HTTP connection rather than reuse it. Continuing to use a faulty connection can lead to unpredictable system behavior and performance degradation (see the sketch after this list).
  • The Importance of Connection Reuse in TLS/HTTPS: TLS/HTTPS handshakes introduce significant overhead before data transmission can begin. Efficient reuse of established connections is essential for maintaining high performance and reducing latency. Misconfigurations in this area can create bottlenecks and unnecessary resource consumption.
  • System-Level Constraints and Half-Open Connections: Operating systems enforce a waiting period before reusing previously occupied port tuples. This prevents port exhaustion but can also cause cascading failures in environments with high connection churn. Our case demonstrated that improperly closed connections led to a buildup of TIME_WAIT states, eventually exhausting available resources.
  • Firewalls and Interpretation Challenges: Not all failure scenarios are standardized, allowing firewall vendors to implement their own interpretations. This can create inconsistencies between different firewall models, leading to unpredictable behaviors in distributed systems. Stateful firewalls, for example, may hold connections open longer than expected or prematurely terminate them based on internal heuristics.
  • Edge Cases in Load Testing: While failure scenarios are among the most challenging to anticipate, they are also the most critical. Traditional load tests often focus on expected transaction patterns, while rare but impactful issues — such as misconfigured firewalls or network congestion — are overlooked. Our case showed that an untested error path in session token handling led to persistent service degradation.
  • Timeouts as a Defense Mechanism: In complex networks, well-defined timeouts serve as a safeguard against attack vectors and unintended connection persistence. However, modifying firewall timeouts is not a trivial task, as changes can have widespread implications on security, application stability, and network performance.
  • The Challenge of Organizational Alignment: Synchronizing firewall settings was not just a technical challenge but a political one. The issue was not the ability to make the necessary changes but the need to convince multiple stakeholders — from security teams to business owners — of the necessity of those changes. This highlights the importance of proactive cross-team collaboration in ensuring system resilience.
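
To make the first lesson concrete, the sketch below refuses keep-alive for any 5xx response, so the faulty connection is closed instead of being returned to the pool. It assumes Apache HttpClient 4.x; the class name is hypothetical, and this illustrates the principle rather than the project's actual fix:

```java
import org.apache.http.ConnectionReuseStrategy;
import org.apache.http.HttpResponse;
import org.apache.http.impl.DefaultConnectionReuseStrategy;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.protocol.HttpContext;

/** Refuses connection reuse whenever the server answered with a 5xx status. */
public class NoReuseOnServerError implements ConnectionReuseStrategy {

    private final ConnectionReuseStrategy fallback = DefaultConnectionReuseStrategy.INSTANCE;

    @Override
    public boolean keepAlive(HttpResponse response, HttpContext context) {
        if (response.getStatusLine().getStatusCode() >= 500) {
            return false; // close the socket instead of returning it to the pool
        }
        return fallback.keepAlive(response, context);
    }

    public static CloseableHttpClient newClient() {
        return HttpClients.custom()
                .setConnectionReuseStrategy(new NoReuseOnServerError())
                .build();
    }
}
```

Closing on error costs one extra TLS handshake on the next request, but it keeps connections whose state is unknown out of the pool and behaves consistently regardless of how intermediaries such as firewalls interpret the error.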

