SarCheck(TM): Automated Analysis of Solaris sar and ps data

(English text version s6.01.04)


NOTE: This software is scheduled to expire on 08/21/2004 and has not yet been tied to your system's Host ID. To permanently activate SarCheck, please run /opt/sarcheck/bin/analyze -o and send the output to us so that we can generate an activation key for you.
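
For example, the output of that command can be captured in a file (the file name below is arbitrary) and then sent to us:

    /opt/sarcheck/bin/analyze -o > /tmp/sarcheck.hostid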

This is an analysis of the data contained in the file /tmp/rpt. The data was collected on 06/17/2004, from 08:20:01 to 15:40:00, from system 'drew'. There were 44 sar data records used to produce this analysis. The operating system is Solaris 2.7. One processor is configured. 64 megabytes of memory are present.

Data collected by the ps -elf command on 06/17/2004 from 08:20:01 to 15:40:00, and stored in the file /opt/sarcheck/ps/20040617, will also be analyzed.

The default GRAPHDIR was changed with the -gd switch to /tmp/test.

Table of Contents

    SUMMARY
    RECOMMENDATIONS SECTION
    RESOURCE ANALYSIS SECTION
    CAPACITY PLANNING SECTION
    CUSTOM SETTINGS SECTION

SUMMARY

When the data was collected, no CPU bottleneck could be detected. No significant memory bottleneck was seen. No significant I/O bottleneck was seen. A change has been recommended to at least one tunable parameter. Limits to future growth have been noted in the Capacity Planning section.

At least one possible memory leak has been detected. See the Resource Analysis Section for details.

Some of the defaults used by SarCheck's rules have been overridden using the sarcheck_parms file. See the Custom Settings section of the report for more information.

RECOMMENDATIONS SECTION

All recommendations contained in this report are based solely on the conditions which were present when the performance data was collected. It is possible that conditions which were not present at that time may cause some of these recommendations to result in worse performance. To minimize this risk, analyze data from several different days, implement only regularly occurring recommendations, and implement them one at a time.

Change the value of maxpgio from 60 to 65536. The reason for this significant change can be found in the Resource Analysis Section. This parameter can be changed by adding the following line to the /etc/system file: 'set maxpgio = 65536'. NOTE: Don't forget to check /etc/system first to see if there's already a set command modifying this tunable parameter. If there is, modify that command instead of adding another one.

Change the value of slowscan from 100 to 500. This parameter can be changed by adding the following line to the /etc/system file: 'set slowscan = 500'. An increase in the value of slowscan has been recommended due to the presence of significant scanning activity and recommendations made by Adrian Cockcroft on page 336 of the second edition of his Sun Performance and Tuning book. NOTE: Don't forget to check /etc/system first to see if there's already a set command modifying this tunable parameter. If there is, modify that command instead of adding another one.

More information on how to change tunable parameters is available in the System Administration Guide. We recommend making a copy of /etc/system before making changes, and understanding how to use boot -a in case your changes to /etc/system create an unbootable system.
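
As an illustration of the workflow described above (the backup file name here is an example only), the changes might be applied from a root shell like this:

    # Check whether either parameter is already set in /etc/system:
    grep 'set maxpgio' /etc/system
    grep 'set slowscan' /etc/system

    # Make a backup copy before editing:
    cp /etc/system /etc/system.20040617

    # If no existing 'set' commands were found above, append the
    # recommended values:
    echo 'set maxpgio = 65536' >> /etc/system
    echo 'set slowscan = 500' >> /etc/system

    # The new values take effect at the next reboot.  If the edit leaves
    # the system unbootable, boot -a allows the backup copy to be named
    # when the boot program asks for the system file.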

RESOURCE ANALYSIS SECTION

Average CPU utilization was only 3.5 percent. This indicates that spare capacity exists within the CPU. If any performance problems were seen during the monitoring period, they were not caused by a lack of CPU power. User CPU as measured by the %usr column in the sar -u data averaged 2.84 percent and system CPU (%sys) averaged 0.66 percent. The sys/usr ratio averaged 0.23 : 1. CPU utilization peaked at 16 percent from 08:30:01 to 08:40:01. A CPU upgrade is not recommended because the current CPU had significant unused capacity.
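
If /tmp/rpt is a binary sar data file, the CPU numbers summarized above (the %usr, %sys, %wio, and %idle columns) can be inspected directly with:

    sar -u -f /tmp/rpt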

The CPU was waiting for I/O an average of 2.5 percent of the time. This confirms the lack of a regularly occurring I/O bottleneck. During some intervals, however, more than 4.0 percent of the CPU's time was spent waiting for disk I/O. This indicates the possibility of an intermittent I/O bottleneck.

The following graph shows average CPU utilization on 06/17/2004; the y-axis shows the average percent of time that the processor was busy.

Graph of CPU utilization

The CPU was idle (neither busy nor waiting for I/O) and apparently had nothing to do an average of 94.0 percent of the time. If overall performance was good, this means that on average, the CPU was lightly loaded. If performance was generally unacceptable, the bottleneck may have been caused by remote file I/O which cannot be directly measured with sar and cannot be considered by SarCheck.

The run queue had an average depth of 1.2. This indicates that there was not likely to be a performance problem caused by processes waiting for the CPU. Average run queue depth (when occupied) peaked at 2.2 from 11:00:02 to 11:10:00. During that interval, the queue was occupied 1 percent of the time.

The peak run queue occupancy seen was 2 percent from 08:20:01 to 08:40:01. The following graph shows both the run queue length and occupancy. The occupancy is shown as %runocc/10, so a run queue occupied 100 percent of the time would be shown as a vertical line reaching a height of 10.0.
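
The raw run queue figures behind this graph (the runq-sz and %runocc columns) can be listed with:

    sar -q -f /tmp/rpt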

Graph of run queue length

The average cache hit ratio of logical reads was 97.7 percent, and the average cache hit ratio of logical writes was 86.1 percent. Despite the room for improvement seen in the hit ratios, no recommendation has been made because disk activity was light.
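
The buffer cache hit ratios above come from the %rcache and %wcache columns of the buffering data, viewable with:

    sar -b -f /tmp/rpt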

The pageout daemon used only 0.004 percent of the CPU.

The fsflush daemon used only 0.02 percent of the CPU. This indicates that it is probably not using enough of the CPU to cause a problem.

In the event of a system crash, an average of 32 seconds worth of data will be lost because it will not have been written to disk. This is controlled by the autoup and tune_t_fsflushr parameters. This statistic has been calculated using the formula: autoup + (tune_t_fsflushr / 2).
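
For example, at the common Solaris default values of autoup = 30 and tune_t_fsflushr = 5 (the individual values are not shown in this report), the formula yields 30 + (5 / 2) = 32.5, consistent with the 32 seconds reported above.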

DNLC statistics show that the average DNLC miss rate was 0.5 per second and the average hit rate was 12.1 per second. The percentage of hits during the monitoring period was 96.2. The peak DNLC miss rate was 3.5 per second from 08:20:01 to 08:30:01.

Graph of DNLC statistics

SarCheck saw activity which resembled usage of the 'find' command. Data from those time periods has been ignored because it can result in misleading recommendations. The DNLC hit ratio was recalculated without those time periods and was approximately 96.24 percent. This hit ratio was high enough to indicate that the DNLC size should not be changed.
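
The name lookup and inode activity underlying these DNLC figures (the namei/s and iget/s columns) can be listed with:

    sar -a -f /tmp/rpt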

The value of maxuprc was 917 and the size of the process table as reported by sar was 922. There is no reason to change the value of maxuprc or max_nprocs based on this data.

The number of active inodes fit in the inode cache. This indicates that the inode cache was large enough to efficiently meet the needs of the system. Peak used/max statistics for the inode cache during the monitoring period were 2901/4236. The value of %ufs_ipf was always zero, indicating that there was no page flushing, and the size of the inode cache was large enough.

As seen in the following graph, at least 2.2 percent of the system's memory, or 1.41 megabytes, was always unused during sar sampling. The value of cachefree was 228 pages, or 1.78 megabytes of memory, and the value of lotsfree was 114 pages, or 0.89 megabytes of memory. This indicates that while the system is not in need of memory, there isn't an unusually large quantity of physical memory that remains unused. Please note that measurements of unused memory are not true high-water marks of memory usage and only reflect what was happening when sar sampled system activity.

Graph of megabytes of free memory remaining

The average page scanning rate was 20.15 per second. Page scanning peaked at 83.54 per second from 08:40:01 to 08:50:02. The page daemon scanning rate indicates that an intermittent memory bottleneck may have existed. Peak resource utilization statistics can be used to help understand performance problems. If performance was worse during the period of high scanning activity, then the bottleneck may have been caused by a lack of available memory. The threshold at which page scanning is considered to be a problem has been calculated at 81.0 per second. This calculation is based on the values of handspreadpages and autoup, and is optimized for sar sampling rates of 10 - 60 minutes.
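
The page scanning rate appears in the pgscan/s column of the paging data, viewable with:

    sar -g -f /tmp/rpt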

Graph of page scanning rate

The values of slowscan and fastscan were 100 and 3649. Recommendations for changes to these parameters can be found in the Recommendations Section of this report.

The value of fastscan, the parameter that limits the effect the page scanner has on filesystem throughput, was 3649. Recent testing indicates that the value of fastscan should be set so high that throughput is unaffected. As an experiment on non-production systems, consider setting the value of fastscan to 65536 or higher and let us know if it made a difference.

The value of maxpgio, the parameter that limits the number of page-outs per second, is set to 60. Recent testing indicates that the value of maxpgio should be set so high that it is effectively eliminated. The recommendation to increase maxpgio to 65536 will prevent the page scanner from limiting the number of writes per second.

The average system-wide local I/O rate as measured by the r+w/s column in the sar -d data was 3.07 per second. This I/O rate peaked at 15 per second from 12:20:01 to 12:30:02.
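
The per-device activity behind this figure, including the r+w/s column cited above, can be listed with:

    sar -d -f /tmp/rpt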

Graph of Total Disk I/O rate

The following graph shows the average percent busy and service time for 2 disks, not sorted with the -dbusy or -dserv switches.

Graph of up to 5 disks, not sorted by percent busy or service time

The -dtoo switch has been used to format disk statistics into the following table.

Disk Device Statistics

Disk      Average        Peak           Queue Depth      Average Service
Device    Percent Busy   Percent Busy   (when occupied)  Time (ms)
dad0      3.3            16.0           0.3              13.7
sd2       0.0            0.0            0.0              0.0

The device dad0 was busy an average of 3.3 percent of the time and had an average queue depth of 0.3 (when occupied). The average service time reported for this device and its accompanying disk subsystem was 13.7 milliseconds. This is relatively fast considering that queuing time is included in this statistic. Service time is the delay between the time a request was sent to a device and the time that the device signaled completion of the request.

During multiple time intervals, ps -elf data indicated a peak of 63 processes present. This was the largest number of processes seen with ps -elf, but it is not likely to be the absolute peak because the operating system does not store the true "high-water mark" for this statistic. There were an average of 59.9 processes present.

Graph of the number of processes present

The -ptoo switch has been used to format ps data into the following table.

Interesting ps -elf data

Command                     User  Process ID  Percent CPU  Memory Growth  Memory Use
/opt/NSCPcom/.netscape.bin  drw   352         0.41         613.0 pg/hr    4765 pages
                                                           (4.789 mb/hr)  (37.227 mb)

A possible memory leak was seen in /opt/NSCPcom/.netscape.bin, owned by drw, pid 352. Between 08:30:01 and 12:20:01, this process grew from 2415 to 4765 pages. Memory usage grew at an average rate of 613.0 pages/hr during that interval. On this system the page size is 8 kilobytes.
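
To watch a suspect process directly, the SZ column of ps -elf (reported in pages) can be sampled over time; steady growth suggests a leak. The following is a minimal sketch, assuming the Solaris ps -elf column layout (PID in field 4, SZ in field 10) and using pid 352 from the example above:

    # Sample the size of pid 352 every 10 minutes:
    while :
    do
        date
        ps -elf | awk '$4 == 352 { print "SZ (pages):", $10 }'
        sleep 600
    done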

More information on performance analysis and tuning can be found in The System Administration Guide Volumes 1 & 2, and in Adrian Cockcroft's Sun Performance and Tuning.

CAPACITY PLANNING SECTION

This section is designed to provide the user with a rudimentary linear capacity planning model and should be used for rough approximations only. These estimates assume that an increase in workload will affect the usage of all resources equally. These estimates should be used on days when the load is heaviest to determine approximately how much spare capacity remains at peak times.
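
As an illustration of this linear model, a resource that peaks at 16 percent utilization (the peak CPU figure in this report) could in principle absorb roughly (100 / 16) - 1, or about five times, its current workload; results like this are stated conservatively below as 'at least 100 percent'.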

Based on the data available in this single sar report, the system should be able to support a moderate increase in workload at peak times, and memory is likely to be the first resource bottleneck. See the following paragraphs for additional information.

Graph of remaining room for growth

The CPU can support an increase in workload of at least 100 percent at peak times. Because some swap space was used and significant page scanning or swapping statistics were not seen, the amount of memory present can probably handle a moderate increase in workload. The busiest disk can support a workload increase of at least 100 percent at peak times. For more information on peak CPU and disk utilization, refer to the Resource Analysis section of this report.

The process table, measured by sar -v, can hold at least twice as many entries as were seen.

CUSTOM SETTINGS SECTION

The default SYSUSR threshold was changed in the sarcheck_parms file from 2.5 to 0.2 percent. This value is likely to compromise the accuracy of the analysis.

The default AVGWIO threshold was changed in the sarcheck_parms file from 15.0 to 4.0 percent. This value is likely to compromise the accuracy of the analysis.

The default HSIZE was changed in the sarcheck_parms file from 0.75 to 1.00.

The default CAPCPU threshold was changed in the sarcheck_parms file from 90.0 to 80.0 percent.

The default LGPROC threshold was changed in the sarcheck_parms file from 2048 to 12345 pages.

Please note: In no event can Aptitune Corporation be held responsible for any damages, including incidental or consequential damages, in connection with or arising out of the use or inability to use this software. All trademarks belong to their respective owners. This software is licensed exclusively for use on a single system by: Your Company. This software must be activated by 08/21/2004 (mm/dd/yyyy). Code version: 6.01.04 for Solaris SPARC 64-bit. Serial number: 00099999.

This software is updated frequently. For information on the latest version, contact the party from whom SarCheck was originally purchased, or visit our web site.

(c) copyright 1994-2004 by Aptitune Corporation, Plaistow NH 03865, USA, All Rights Reserved. http://www.sarcheck.com/

Statistics for system: drew
Statistic                            Value        Start of peak  End of peak  Date of peak
Statistics collected on:             06/17/2004
Average CPU utilization:             3.5%
Peak CPU utilization:                16%          08:30:01       08:40:01     06/17/2004
Average user CPU utilization:        2.8%
Average sys CPU utilization:         0.7%
Average waiting for I/O:             2.5%
Peak waiting for I/O:                13.0%        12:20:01       12:30:02     06/17/2004
Average run queue depth:             1.2
Peak run queue depth:                2.2          11:00:02       11:10:00     06/17/2004
Calculated DNLC hit ratio:           96.24%
Pct of phys memory unused:           2.2%
Average page scanning rate:          20.1 / sec
Peak page scanning rate:             83.5 / sec   08:40:01       08:50:02     06/17/2004
Page scanning threshold:             81.0 / sec
Average cache read hit ratio:        97.7%
Average cache write hit ratio:       86.1%
Average systemwide I/O rate:         3.07 / sec
Peak systemwide I/O rate:            15.00 / sec  12:20:01       12:30:02     06/17/2004
Disk device w/highest peak:          dad0
Avg pct busy for that disk:          3.3%
Peak pct busy for that disk:         16.0%        12:20:01       12:30:02     06/17/2004
Avg number of processes seen by ps:  59.9
Max number of processes seen by ps:  63           Multiple peaks
Approx CPU capacity remaining:       100%+
Approx I/O bandwidth remaining:      100%+
Remaining process tbl capacity:      100%+
Can memory support add'l load:       Moderate