Poor Application Performance: Advanced CPU Analysis with top

by | Oct 6, 2024 | Big Data

Poor application performance often results from CPU overload. This article shows how to use the top command for advanced CPU analysis, focusing on identifying CPU time distribution (User, System, I/O-Wait) and optimizing processes to fix bad application performance in real time.

Recently, an e-commerce website struggled with poor application performance during peak hours. Customers complained about slow page load times and delays, causing frustration and potential business loss. The application had always run smoothly, but suddenly, without significant changes, response times worsened, and users experienced noticeable delays. We suspected CPU overload caused bad application performance, so we investigated using the top command.

This article walks through how we used top to diagnose the problem by breaking down CPU usage and identifying the bottlenecks behind it.

Step 1: Starting top and Initial Observations

To begin troubleshooting the application performance, we launched top on the affected server:

top - 15:23:45 up 7 days,  4:01,  2 users,  load average: 8.32, 6.55, 5.48
Tasks: 210 total, 1 running, 209 sleeping, 0 stopped, 0 zombie
%Cpu(s): 50.0 us, 30.0 sy, 0.0 ni, 10.0 id, 10.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 16384900 total, 235200 free, 15230000 used, 650000 buff/cache
KiB Swap: 4194300 total, 1048576 free, 3145724 used. 32000 avail Mem

Key observations:

  • CPU Usage Breakdown:
    • 50% User Time (us): Half of the CPU is spent on user-level processes, which likely include application-related tasks like handling web requests.
    • 30% System Time (sy): A significant portion of CPU time is spent in the kernel, executing system calls and handling work such as network and disk I/O on behalf of processes.
    • 10% I/O-Wait (wa): For 10% of the time, the CPU is idle while at least one I/O operation (e.g., a disk read or write) is outstanding. This points to storage delays as a likely contributor to the slow responses.
  • High Load Average (8.32):
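    • The 1-minute load average is 8.32. On Linux, the load average counts tasks that are runnable plus tasks in uninterruptible sleep (typically waiting on disk I/O), so it has to be read relative to the number of CPU cores. On a server with 8 cores, 8.32 indicates roughly full utilization; with fewer than 8 cores, tasks are queuing and the system is overloaded. The trend (5.48 over 15 minutes, 6.55 over 5 minutes, 8.32 over 1 minute) also shows that the load is still rising.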
  • Task Distribution:
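    • Only one task was running at the instant of this snapshot, while 209 were sleeping. The sleeping count is not by itself a sign of trouble: most processes on any server sleep while they wait for an event such as a new request or a completed I/O operation. Tasks waiting for a CPU are counted as running, not sleeping, so this one-instant summary can differ sharply from the load average, which is averaged over time and also includes tasks blocked on I/O.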
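
The header values discussed above can also be read programmatically, which helps when snapshots are collected over time. The following is a minimal sketch, assuming the header lines have the same layout as the snapshot in this article; the function names are our own, and the core counts passed to load_per_core are assumptions, since the article does not state how many cores the server had.

```python
import re

CPU_LINE = "%Cpu(s): 50.0 us, 30.0 sy, 0.0 ni, 10.0 id, 10.0 wa, 0.0 hi, 0.0 si, 0.0 st"
TOP_LINE = "top - 15:23:45 up 7 days,  4:01,  2 users,  load average: 8.32, 6.55, 5.48"


def parse_cpu_line(line):
    """Turn a top '%Cpu(s):' summary line into a {label: percent} dict."""
    _, _, rest = line.partition(":")
    return {label: float(value)
            for value, label in re.findall(r"([\d.]+)\s+([a-z]{2})", rest)}


def parse_load_averages(line):
    """Extract the 1-, 5-, and 15-minute load averages from top's first line."""
    _, _, rest = line.partition("load average:")
    return tuple(float(part) for part in rest.split(","))


def load_per_core(load, cores):
    """Normalize a load average by the core count (values near 1.0 mean saturated)."""
    return load / cores


cpu = parse_cpu_line(CPU_LINE)
load1, load5, load15 = parse_load_averages(TOP_LINE)

# Time the CPU was neither idle nor waiting on I/O.
busy = 100.0 - cpu["id"] - cpu["wa"]
print("busy %:", busy)

# Hypothetical core counts, for comparison only.
for cores in (8, 4):
    print(cores, "cores ->", round(load_per_core(load1, cores), 2), "load per core")
```

A load per core around 1.0 matches the "fully utilized" reading for an 8-core server, while the same load on 4 cores would mean roughly two tasks competing for every core.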

Step 2: Deep Dive into CPU Time

To identify the cause of the poor application performance, we analyzed the CPU time in more detail.

  1. User Time (us) – 50%:
    • Half of the CPU is occupied by user processes, likely the web server and application logic. Possible causes for the high usage include:
      • High traffic causing an overload of web requests.
      • Inefficient application logic or resource-heavy operations, such as image processing or unoptimized database queries.
  2. System Time (sy) – 30%:
    • A high percentage of CPU time is used by the kernel, indicating intensive system operations. Possible causes:
      • High-volume network activity.
      • Intensive disk operations, which might contribute to the bad application performance if the server is constantly reading from or writing to disk.
  3. I/O-Wait (wa) – 10%:
    • The CPU is idle 10% of the time while I/O operations are outstanding. This often signals a disk bottleneck, a common cause of slow response times, especially on traditional HDDs.
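
To make the breakdown above more concrete: on Linux, percentages of this kind can be derived from the cumulative tick counters in the first line of /proc/stat, by sampling twice and dividing each counter's increase by the total increase. The sketch below illustrates that calculation; it is not top's actual implementation, and the two sample lines are made-up counters chosen so that the result reproduces the percentages from our snapshot.

```python
# Labels in the order the counters appear on the aggregate "cpu" line of
# /proc/stat: user, nice, system, idle, iowait, irq, softirq, steal.
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate 'cpu' line from a live Linux system."""
    with open(path) as stat_file:
        return stat_file.readline()


def parse_cpu_counters(stat_line):
    """Parse the aggregate 'cpu' line into per-state tick counters."""
    parts = stat_line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(value) for value in parts[1:9])))


def cpu_percentages(before, after):
    """Share of elapsed ticks spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("no ticks elapsed between the two samples")
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Made-up samples (not measured data) that reproduce the article's breakdown.
sample_before = parse_cpu_counters("cpu  1000 0 500 8000 300 0 0 0 0 0")
sample_after = parse_cpu_counters("cpu  1500 0 800 8100 400 0 0 0 0 0")
percentages = cpu_percentages(sample_before, sample_after)
print(percentages)
```

On a real server, calling read_cpu_line() twice with a short pause in between provides the two samples.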

Step 3: Identifying CPU-Heavy Processes

Next, we needed to identify which processes were consuming the most CPU. We sorted the process list in top by CPU usage (Shift+P in the interactive view):

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 9876 www-data  20   0  128576  65412  11020 S  30.0  0.4 100:23.55 apache2
11234 mysql     20   0  323432 243432  54320 S  25.0  1.5 240:12.12 mysqld
10345 www-data  20   0  156432  65432  11230 S  20.0  0.5 120:13.22 apache2

Observations:

  • The two apache2 processes (web server) consumed 30% and 20% of CPU, 50% combined, which could be due to either high traffic or inefficient request handling.
  • The mysqld process (MySQL database) consumed 25% of CPU, suggesting that complex database queries or poor indexing might be slowing down responses.
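
Because a service like Apache runs as several worker processes, it can help to add up %CPU per command name instead of reading rows one by one. The following is a rough sketch that parses the three rows shown above; it assumes top's default twelve-column layout, and the function names are our own. Note that in top's default mode, %CPU is expressed relative to a single core, so totals on a multi-core server can exceed 100.

```python
from collections import defaultdict

# The process rows from the top output above (header line omitted).
ROWS = """\
9876 www-data 20 0 128576 65412 11020 S 30.0 0.4 100:23.55 apache2
11234 mysql 20 0 323432 243432 54320 S 25.0 1.5 240:12.12 mysqld
10345 www-data 20 0 156432 65432 11230 S 20.0 0.5 120:13.22 apache2
"""


def parse_rows(text):
    """Parse top process rows in the default 12-column layout."""
    processes = []
    for line in text.strip().splitlines():
        # Split into at most 12 columns so a command containing spaces stays whole.
        cols = line.split(None, 11)
        processes.append({
            "pid": int(cols[0]),
            "user": cols[1],
            "cpu": float(cols[8]),
            "mem": float(cols[9]),
            "command": cols[11],
        })
    return processes


def cpu_by_command(processes):
    """Sum %CPU per command name, highest consumer first."""
    totals = defaultdict(float)
    for proc in processes:
        totals[proc["command"]] += proc["cpu"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranking = cpu_by_command(parse_rows(ROWS))
for command, cpu in ranking:
    print(command, cpu)
```

Grouped this way, the web server as a whole is clearly the largest CPU consumer, ahead of the database.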

Step 4: Optimizing the System to Resolve Poor Application Performance

Based on this analysis, we took the following actions:

  1. Optimizing the Web Application: We reviewed the application code for inefficiencies, especially in resource-heavy processes like image processing or excessive database queries. Implementing caching mechanisms for frequently requested data and optimizing database queries helped reduce CPU load and improve performance.
  2. System Configuration Adjustments: The high system time indicated inefficient use of system resources. We optimized network settings and tuned the server’s I/O scheduler, reducing CPU usage by the kernel and freeing up resources for the application.
  3. Addressing I/O Bottlenecks: Since the I/O-Wait was a significant factor in the bad application performance, we upgraded the storage from HDD to SSD. This improved the speed of disk read/write operations, reducing I/O-Wait time and enhancing the web application’s overall response time.
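
As an illustration of the caching idea from the first action, the sketch below memoizes an expensive lookup so that repeated requests for the same item do not repeat the work. It is a generic example and not the code of the application described here: load_product and its return value are hypothetical stand-ins for a database query, and the counter exists only to show how often the expensive path runs.

```python
from functools import lru_cache

# Counts how often the expensive path actually executes.
calls = {"count": 0}


@lru_cache(maxsize=1024)
def load_product(product_id):
    """Hypothetical stand-in for an expensive database query."""
    calls["count"] += 1
    # Return an immutable tuple so callers cannot alter the cached value.
    return (product_id, f"product-{product_id}")


# Simulate 1000 requests for the same frequently viewed product.
for _ in range(1000):
    load_product(42)

print("expensive lookups:", calls["count"])
print(load_product.cache_info())
```

An in-process cache like this lives separately in each worker process and needs an invalidation strategy once the underlying data changes, so it fits best for data that is read often and updated rarely.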

Conclusion

This real-world scenario demonstrated how poor application performance can stem from multiple sources at once, including high CPU usage by user processes, high system CPU usage, and significant I/O-Wait time. By using the top command to identify these issues, we implemented targeted optimizations in the application and the system, which resolved the performance problem.

For long-term stability, continuous monitoring of CPU usage and proactive system optimization are essential to prevent the problem from recurring. The top command remains a valuable tool for diagnosing and resolving performance issues in real time.
