SarCheck®: Automated Analysis of Solaris sar and ps data

(English text version s6.02.09)


This is an analysis of the data contained in the file sar14B. The data was collected on 03/14/2007, from 00:10:00 to 23:50:00, from system 'SRVSUN1'. There were 71 sar data records used to produce this analysis. The operating system is Solaris 8. Six processors are configured, and 12288 megabytes of memory were present.

This program will attempt to analyze additional data that was collected on 03/14/2007 by the ps -elf command and the dnlcmon agent. The times and dates of this data will be matched with those in the sar report file sar14B. The ps and dnlcmon data file 20070314 will be used.

DIAGNOSTIC MESSAGE: The number of disks in SarCheck's disk table is 41 and the table is 0.911 percent full. The number of entries in SarCheck's ps table is 321 and the table is 3.567 percent full.

Command line used to produce this report: analyze -dtoo -ptoo -t -html -png -dbusy -diag -gd /tmp -pf 20070314 sar14B

Table of Contents

SUMMARY

When the data was collected, occasional CPU bottlenecks may have existed. No memory bottleneck was seen. At least one disk drive was busy enough to suggest an impending performance bottleneck. A change has been recommended to at least one tunable parameter. Limits to future growth have been noted in the Capacity Planning section.

Some of the defaults used by SarCheck's rules have been overridden using the sarcheck_parms file. See the Custom Settings section of the report for more information.

RECOMMENDATIONS SECTION

All recommendations contained in this report are based solely on the conditions which were present when the performance data was collected. It is possible that conditions which were not present at that time may cause some of these recommendations to result in worse performance. To minimize this risk, analyze data from several different days, implement only regularly occurring recommendations, and implement them one at a time.

Because occasionally heavy CPU utilization was seen, adjusting process priorities with the priocntl or nice command, optimizing applications, or improved job scheduling may help to prevent occasional performance degradation. A CPU upgrade may not be necessary due to the infrequent nature of the heavy load.
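For example, a non-critical batch job could be started at a lower priority with the nice command (the script name below is purely illustrative):

    /usr/bin/nice -n 10 /opt/batch/nightly_report.sh

Processes that are already running can be reprioritized with renice or priocntl; see the priocntl(1) man page for the options that apply to the scheduling class in use.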

There may be an advantage in moving some of the I/O load from these busy disks: sd2048, sd2050, sd2198, and sd2204.

Due to heavy CPU utilization by the fsflush daemon, an increase in the value of autoup from 30 to 60 is recommended. This parameter can be changed by adding the following line to the /etc/system file: 'set autoup = 60'. NOTE: Check /etc/system first to see if a set command already modifies this tunable parameter. If one exists, modify that command instead of adding another one.

More information on how to change tunable parameters is available in the System Administration Guide. We recommend making a copy of /etc/system before making changes, and understanding how to use boot -a in case your changes to /etc/system create an unbootable system.
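As a rough sketch, the autoup change recommended above could be applied as follows (the backup file name is arbitrary):

    cp /etc/system /etc/system.pre_autoup
    grep autoup /etc/system
    # If no existing 'set autoup' line is found, append the new setting:
    echo 'set autoup = 60' >> /etc/system

The new value takes effect at the next reboot. If the modified file prevents the system from booting, boot -a allows an alternate or backup copy of /etc/system to be specified.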

RESOURCE ANALYSIS SECTION

While average CPU utilization was only 51.6 percent, occasional bursts in excess of 70 percent were seen. Depending on their frequency, periods of performance degradation due to CPU loading may have been felt by end users. User CPU as measured by the %usr column in the sar -u data averaged 33.8 percent and system CPU (%sys) averaged 17.8 percent. The sys/usr ratio averaged 0.53 : 1. CPU utilization peaked at 80 percent from 08:30:01 to 08:50:00. Peak resource utilization statistics can be used to help understand performance problems. If performance was worse during the period of peak CPU utilization, then the CPU may be a performance bottleneck.
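These figures can be compared against the raw sar data. Assuming the standard daily data file location for the 14th, a command such as the following displays CPU utilization for the peak interval:

    sar -u -s 08:30 -e 08:55 -f /var/adm/sa/sa14

The -s and -e options limit the output to the time range of interest.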

Analysis of the sar -u '%wio' and '%idle' statistics is suppressed because multiple processors were seen. These statistics are meaningless and possibly misleading when multiple processors are in use.

The following graph shows average CPU utilization on 03/14/2007; the y-axis shows the average percentage of time that the processors were busy.

Graph of CPU utilization

The run queue had an average depth of 1.9. This indicates that there was not likely to be a performance problem caused by processes waiting for the CPU. Average run queue depth (when occupied) peaked at 2.8 from 21:30:00 to 21:50:01. During that interval, the queue was occupied 8 percent of the time.

The peak run queue occupancy seen was 25 percent from 08:30:01 to 08:50:00. The following graph shows both the run queue length and occupancy. The occupancy is shown as %runocc/100, so a run queue occupied 100 percent of the time would be shown as a vertical line reaching a height of 1.0.

Graph of run queue length

The average cache hit ratio of logical reads was 99.9 percent, and the average cache hit ratio of logical writes was 96.2 percent. These statistics, and the lack of any significant memory bottleneck, indicate that there is little to gain by changing the value of bufhwm.

The pageout daemon did not use any CPU time.

The fsflush daemon used 11.45 percent of a CPU. To reduce its CPU demands, an increase in the value of autoup is recommended. Raising the value of autoup is a compromise: it will increase the amount of data that could be lost in the event of a crash, but will reduce CPU utilization and disk I/O rates.

The tune_t_fsflushr and/or autoup recommendations in the Recommendations Section will change the amount of data that would be lost during a system crash from an average of 32 seconds' worth to an average of 62 seconds' worth.

DNLC statistics show that the average DNLC miss rate was 25.2 per second and the average hit rate was 11723.6 per second. The percentage of hits during the monitoring period was 99.8. The peak DNLC miss rate was 619.8 per second from 04:10:00 to 04:30:00.

Graph of DNLC statistics

The size of the directory name lookup cache was 612352 entries. This information comes from the data file which also contains the ps -elf statistics, and was collected at the same time as the sar data.
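The DNLC hit percentage can also be cross-checked with vmstat, which reports cumulative name lookup statistics since boot:

    vmstat -s | grep 'name lookups'

Because this counter is cumulative rather than an interval rate, it will not match the per-interval figures above exactly.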

The value of maxuprc is 29995 and the size of the process table as reported by sar was 30000. There is no reason to change the value of maxuprc or max_nprocs based on this data.

The number of active inodes generally fit in the inode cache, indicating that the cache was large enough to meet the needs of the system efficiently. Peak used/max statistics for the inode cache during the monitoring period were 697223/734025. The percentage of igets with page flushes (%ufs_ipf) peaked at 1.28 percent from 03:30:00 to 03:50:00; this non-zero peak suggests that a larger inode cache might help during that interval. Peak resource utilization statistics can be used to help understand performance problems. If performance was worse during the period of high page flushing activity, then an increase in the size of the inode cache may result in a noticeable performance improvement. Non-zero data was seen in the %ufs_ipf column of the sar -g report, yet the value of ufs_ninode was not smaller than the value of the maxsize_reached field in netstat -k for that time period.
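The comparison described above can be repeated by hand. The maxsize and maxsize_reached fields of the inode cache appear in the netstat -k kernel statistics dump, for example:

    netstat -k | grep -i maxsize

The current value of ufs_ninode can be checked in /etc/system (if it has been set explicitly) or, as a further sketch, read from the running kernel with 'echo "ufs_ninode/D" | mdb -k'.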

The average page out request rate was 5.03 per second. The page out rate peaked at 44.09 per second from 20:50:00 to 21:10:00. Peak resource utilization statistics can be used to help understand performance problems. If performance was worse during the period of high page out activity, then a lack of available memory may be the performance bottleneck. The average page in request rate was 342.38 per second. The page in rate peaked at 713.54 per second from 00:30:01 to 00:50:00.

The value of maxpgio, the parameter that limits the number of page-outs per second, is set to 65536. Notable peak paging rates were seen but maxpgio is already set high enough that there's no value in increasing it further.

The average page scanning rate was 0.00 per second.

The values of slowscan and fastscan were 100 and 131072. The value of handspreadpages was 131072. No changes are recommended because no problems were seen.

The average system-wide local I/O rate as measured by the r+w/s column in the sar -d data was 17.15 per second. This I/O rate peaked at 1314 per second from 03:50:00 to 04:10:00.

Graph of Total Disk I/O rate

The following graph shows the average percent busy and service time for 5 disks, sorted by percent busy.

Graph of up to 5 busiest disks

The following disk analysis has been sorted by the average percent of time the disk was busy.

NOTE: 41 disks were present. By default, the presence of more than 12 disks causes SarCheck to only report on the busiest disks. This is meant to control the verbosity of this report. To see all disks included in the report, use the -d option.

The -dtoo switch has been used to format disk statistics into the following table.

Disk Device Statistics
Results sorted by Average Percent Busy

  Disk      Average         Peak            Queue Depth       Average Service
  Device    Percent Busy    Percent Busy    (when occupied)   Time (ms)
  sd2198    39.1            99.0            1.4               3.5
  sd2048    28.1            71.0            0.8               3.2
  sd2050    21.2            54.0            0.4               1.3
  sd2204    13.1            86.0            0.6               3.7

The device sd2198 was busy an average of 39.1 percent of the time and had an average queue depth of 1.4 (when occupied). The average %busy data indicates that the device cannot support a much heavier load without becoming a performance bottleneck. During the peak interval from 00:30:01 to 00:50:00, the disk was 99.0 percent busy. Peak disk busy statistics can be used to help understand performance problems. If performance was worse when the disk was busiest, then that disk may be the performance bottleneck. The average service time reported for this device and its accompanying disk subsystem was 3.5 milliseconds. This is indicative of a very fast disk or a disk controller with cache. Service time is the delay between the time a request was sent to a device and the time that the device signaled completion of the request.

The device sd2048 was busy an average of 28.1 percent of the time and had an average queue depth of 0.8 (when occupied). This indicates that the device is not a performance bottleneck. During the peak interval from 00:50:00 to 01:10:00, the disk was 71.0 percent busy. The average service time reported for this device and its accompanying disk subsystem was 3.2 milliseconds. This is indicative of a very fast disk or a disk controller with cache.

The device sd2050 was busy an average of 21.2 percent of the time and had an average queue depth of 0.4 (when occupied). This indicates that the device is not a performance bottleneck. During the peak interval from 10:30:01 to 10:50:00, the disk was 54.0 percent busy. The average service time reported for this device and its accompanying disk subsystem was 1.3 milliseconds. This is indicative of a very fast disk or a disk controller with cache.

The device sd2204 was busy an average of 13.1 percent of the time and had an average queue depth of 0.6 (when occupied). During the peak interval from 07:50:00 to 08:10:01, the disk was 86.0 percent busy. The average service time reported for this device and its accompanying disk subsystem was 3.7 milliseconds. This is indicative of a very fast disk or a disk controller with cache.
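To examine the peak interval for sd2198 directly, the underlying disk data can be reviewed with sar, again assuming the standard daily data file location:

    sar -d -s 00:30 -e 00:55 -f /var/adm/sa/sa14

Comparing the %busy, avque, and avserv columns for that interval with quieter periods helps confirm whether this disk was the limiting resource.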

At 14:50:01, ps -elf data indicated that a peak of 2504 processes were present. This was the largest number of processes seen with ps -elf, but it is not likely to be the absolute peak because the operating system does not store the true "high-water mark" for this statistic. An average of 1002.8 processes were present.
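A rough point-in-time equivalent of this count can be taken at any time (the result includes the ps header line, so it is one higher than the actual number of processes):

    ps -elf | wc -l

The proc-sz column of the sar -v report shows the corresponding process table usage for each sampling interval.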

Graph of the number of processes present

No runaway processes, memory leaks, or suspiciously large processes were detected in the data contained in file 20070314. No table was generated because no unusual resource utilization was seen in the ps data.

More information on performance analysis and tuning can be found in The System Administration Guide Volumes 1 & 2, and in Adrian Cockcroft's Sun Performance and Tuning.

CAPACITY PLANNING SECTION

This section is designed to provide the user with a rudimentary linear capacity planning model and should be used for rough approximations only. These estimates assume that an increase in workload will affect the usage of all resources equally. They should be applied to data from the days when the load is heaviest to determine approximately how much spare capacity remains at peak times.

Based on the limited data available in this single sar report, the system cannot support an increase in workload at peak times without some level of performance degradation. See the following paragraphs for additional information on the capacity of individual system resources. Implementation of some of the suggestions in the recommendations section may help to increase the system's capacity.

Graph of remaining room for growth

The CPU can support an increase in workload of approximately 0 percent at peak times. The busiest disk can support a workload increase of approximately 0 percent at peak times. Due to the lack of swap utilization and page scanning, the amount of memory present can support a significantly greater load. The process table, measured by sar -v, can hold at least twice as many entries as were seen. For more information on specific resource utilization, refer to the Resource Analysis section of this report.
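To illustrate the linear model, and assuming that remaining capacity is computed against the thresholds shown in the Custom Settings section, the CPU figure follows from the peak utilization of 80 percent and the CAPCPU threshold of 80.0 percent: (80.0 - 80) / 80 = 0 percent additional workload before the threshold is reached. For comparison, a peak of 60 percent against the same threshold would leave roughly (80 - 60) / 60, or about 33 percent, of headroom.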

CUSTOM SETTINGS SECTION

The default HSIZE was changed in the sarcheck_parms file from 0.75 to 1.00.

The default CAPCPU threshold was changed in the sarcheck_parms file from 90.0 to 80.0 percent.

The default MLRATE threshold was changed in the sarcheck_parms file from 256.0 to 2001.0 pages per hour.

The default LGPROC threshold was changed in the sarcheck_parms file from 2048 to 800000 pages.

The DCRP entry in the sarcheck_parms file has enabled the reporting of all possible runaway processes.

Please note: In no event can Aptitune Corporation be held responsible for any damages, including incidental or consequent damages, in connection with or arising out of the use or inability to use this software. All trademarks belong to their respective owners. Evaluation copy for: Your Company. This software expires on 06/15/2007 (mm/dd/yyyy). Code version: 6.02.09 for Solaris SPARC 64-bit. Serial number: 00069602.

Thank you for trying this evaluation copy of SarCheck. To order a licensed version of this software, just type 'analyze -o' at the prompt to produce the order form and follow the instructions.

(c) copyright 1994-2007 by Aptitune Corporation, Plaistow NH 03865, USA, All Rights Reserved. http://www.sarcheck.com/

Statistics for system: SRVSUN1
                                                     Start of       End of         Date of
                                                     peak interval  peak interval  peak interval
Statistics collected on:             03/14/2007
Average CPU utilization:             51.6%
Peak CPU utilization:                80%             08:30:01       08:50:00       03/14/2007
Average user CPU utilization:        33.8%
Average sys CPU utilization:         17.8%
Average waiting for I/O:             13.4%
Peak waiting for I/O:                33.0%           00:30:01       00:50:00       03/14/2007
Average run queue depth:             1.9
Peak run queue depth:                2.8             21:30:00       21:50:01       03/14/2007
Calculated DNLC hit ratio:           99.79%
Average page out rate:               5.03 / sec
Peak page out rate:                  44.09 / sec     20:50:00       21:10:00       03/14/2007
Average page in rate:                342.38 / sec
Peak page in rate:                   713.54 / sec    00:30:01       00:50:00       03/14/2007
Average page scanning rate:          0.0 / sec
Peak page scanning rate:             0.0 / sec
Page scanning threshold:             81.0 / sec
Average cache read hit ratio:        99.9%
Average cache write hit ratio:       96.2%
Average systemwide I/O rate:         17.15 / sec
Peak systemwide I/O rate:            1314.00 / sec   03:50:00       04:10:00       03/14/2007
Disk device w/highest peak:          sd2198
Avg pct busy for that disk:          39.1%
Peak pct busy for that disk:         99.0%           00:30:01       00:50:00       03/14/2007
Avg number of processes seen by ps:  1002.8
Max number of processes seen by ps:  2504            14:50:01                      03/14/2007
Approx CPU capacity remaining:       0.0%
Approx I/O bandwidth remaining:      0.0%
Remaining process tbl capacity:      100%+
Can memory support add'l load:       Yes