SarCheck®: Automated Analysis of Solaris sar and ps data

(English text version s7.01.05)


NOTE: This software is scheduled to expire on 04/05/2010 and has not yet been tied to your system's Host ID. To permanently activate SarCheck, please run /opt/sarcheck/bin/analyze -o and send the output to us so that we can generate an activation key for you.

This is an analysis of the data contained in the file solsar27. The data was collected on 12/27/2009, from 07:00:00 to 14:55:01, from sun4u system 'sys1227x'. There were 95 sar data records used to produce this analysis. The operating system was Solaris 10. According to psrinfo -p, one physical processor was present. 2044 megabytes of memory were present.
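If the hardware details need to be confirmed independently of SarCheck, the same facts are available from standard Solaris commands; the following is a generic sketch rather than part of this report:

    psrinfo -p                        # number of physical processors
    prtconf | grep -i 'memory size'   # installed memory in megabytes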

This program will attempt to analyze additional data that was collected on 12/27/2009 by the ps -elf command and the scsolagent. The times and dates of this data will be matched with those in the sar report file solsar27.

DIAGNOSTIC MESSAGE: The number of disks in SarCheck's disk table is 3 and the table is 0.067 percent full. The number of entries in SarCheck's ps table is 138 and the table is 1.840 percent full.

Command line used to produce this report: analyze -diag -dtoo -ptoo -t -html -png -pf solps27 solsar27

SUMMARY

When the data was collected, no CPU bottleneck could be detected. No memory bottleneck was seen. No significant I/O bottleneck was seen. Limits to future growth have been noted in the Capacity Planning section.

Some of the defaults used by SarCheck's rules have been overridden using the sarcheck_parms file. See the Custom Settings section of the report for more information.

RECOMMENDATIONS SECTION

All recommendations contained in this report are based solely on the conditions which were present when the performance data was collected. It is possible that conditions which were not present at that time may cause some of these recommendations to result in worse performance. To minimize this risk, analyze data from several different days, implement only regularly occurring recommendations, and implement them one at a time.

Change the value of maxpgio from 40 to 1024. The reason for this significant change can be found in the Resource Analysis Section. This parameter can be changed by adding the following line to the /etc/system file: 'set maxpgio = 1024'.

Change the value of segmapsize to 220200960. This will produce a segmap cache size of 210.0 megabytes. This change is recommended because the segmap hit ratio was only 89.15 percent and there may be room for improvement. This parameter can be changed by adding the following line to the /etc/system file: 'set segmapsize = 220200960'. NOTE: Don't forget to check /etc/system first to see if there's already a set command modifying this tunable parameter. If there is, modify that command instead of adding another one.

More information on how to change tunable parameters is available in the System Administration Guide. We recommend making a copy of /etc/system before making changes, and understanding how to use boot -a in case your changes to /etc/system create an unbootable system.
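As an illustration of that advice (not SarCheck output), a check for existing entries might look like this:

    grep -i maxpgio /etc/system
    grep -i segmapsize /etc/system

If neither tunable is already set, the recommended additions to /etc/system would simply be the two lines quoted above:

    set maxpgio = 1024
    set segmapsize = 220200960

After a reboot, the value actually in use by the kernel can be confirmed with mdb; for example, for maxpgio:

    echo 'maxpgio/D' | mdb -k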

RESOURCE ANALYSIS SECTION

Average CPU utilization was only 17.4 percent. This indicates that spare capacity exists within the CPU. If any performance problems were seen during the monitoring period, they were not caused by a lack of CPU power. User CPU as measured by the %usr column in the sar -u data averaged 14.18 percent and system CPU (%sys) averaged 3.22 percent. The sys/usr ratio averaged 0.23 : 1. CPU utilization peaked at 42 percent from 12:25:00 to 12:30:00. A CPU upgrade is not recommended because the current CPU had significant unused capacity.
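If solsar27 is a binary sar data file, the CPU figures above can be re-listed interval by interval with sar itself; if it is an ASCII sar report, the same columns can be read from it directly:

    sar -u -f solsar27                # %usr, %sys and %idle for each 5-minute interval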

Starting with Solaris 10, sar -u '%wio' statistics are no longer reported because they are not considered to be accurate or reliable.

The following graph shows average CPU utilization; the y-axis shows the average percentage of time that the processors were busy.

Graph of CPU utilization

The run queue had an average depth of 1.1. This indicates that there was not likely to be a performance problem caused by processes waiting for the CPU.

The peak run queue occupancy seen was 20 percent from 08:05:00 to 08:10:00. The following graph shows both the run queue length and the run queue occupancy. The occupancy is shown as %runocc/100, so a run queue occupied 100 percent of the time would be shown as a vertical line reaching a height of 1.0.

Graph of run queue length
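The run queue figures come from the runq-sz and %runocc columns of sar -q; again assuming solsar27 is a binary sar data file, they can be reviewed with:

    sar -q -f solsar27                # runq-sz and %runocc (plus swpq-sz and %swpocc) per interval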

The average cache hit ratio of logical reads (sar -b) was 100.0 percent, and the average cache hit ratio of logical writes (sar -b) was 64.8 percent. The average buffer cache hit ratio reported by kstat was 99.49 percent. The average rate of buffer cache misses was 0.24 per second. Some of the buffer cache statistics reported by sar were poor but the lack of physical reads and writes suggests that an increase in the value of bufhwm would not be likely to improve performance.
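The sar and kstat figures quoted above can be cross-checked directly; this is a sketch using standard Solaris 10 facilities, with the kstat counter names described only loosely:

    sar -b -f solsar27                # bread/s, lread/s, %rcache, bwrit/s, lwrit/s, %wcache
    kstat -n biostats                 # buffer cache hit and lookup counters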

The pageout daemon did not use any CPU time.

The fsflush daemon used only 0.22 percent of a CPU. This indicates that it is probably not using enough of the CPU to cause a problem.

In the event of a system crash, an average of 30 seconds' worth of data will be lost because it will not have been written to disk. This is controlled by the autoup and tune_t_fsflushr parameters. This statistic has been calculated using the formula: autoup + (tune_t_fsflushr / 2).
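As a worked example of that formula, with the common Solaris 10 defaults of autoup=30 and tune_t_fsflushr=1 (assumed values, not read from this system), 30 + (1 / 2) = 30.5, which is reported as 30 seconds. The values in effect on a running system can be read with mdb:

    echo 'autoup/D' | mdb -k
    echo 'tune_t_fsflushr/D' | mdb -k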

DNLC statistics show that the average DNLC miss rate was 42.1 per second, the average hit rate was 101.6 per second, and the average negative hit rate was 0.9 per second. The percentage of hits (including negative hits) during the monitoring period was 70.9. Negative cache hits occur when a name that was previously looked up and not found is looked up again and the cached 'not found' entry is returned. The peak DNLC miss rate was 756.5 per second from 13:10:00 to 13:20:00. A very high peak DNLC miss rate frequently means that the find command, or something similar, has flushed the directory name lookup cache. This peak can be seen in the following graph.

Graph of DNLC statistics

SarCheck saw activity which resembled usage of the 'find' command. Data from those time periods has been ignored because it can result in misleading recommendations. The DNLC hit ratio was recalculated without those time periods and was approximately 99.73 percent. This hit ratio was high enough to indicate that the DNLC size should not be changed.

The size of the directory name lookup cache was 70554 entries. This information comes from the data file which also contains the ps -elf statistics, and was collected at the same time as the sar data.
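The DNLC counters and its configured size can also be inspected directly with standard Solaris tools; the commands below are a generic sketch rather than part of this report:

    kstat -n dnlcstats                # hits, misses, negative cache hits and related counters
    echo 'ncsize/D' | mdb -k          # configured number of DNLC entries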

The value of maxuprc was 16357 and the size of the process table as reported by sar was 30000. There is no reason to change the value of maxuprc or max_nprocs based on this data.
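Both values can be verified outside of SarCheck; the proc-sz column of sar -v shows used/configured process table slots, and maxuprc can be read from the running kernel. This is a generic sketch, again assuming solsar27 is a binary sar data file:

    sar -v -f solsar27                # proc-sz column, e.g. 77/30000
    echo 'maxuprc/D' | mdb -k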

The size of the inode cache was considerably larger than necessary. If this is usually the case, a small amount of memory is being wasted. Peak used/max statistics for the inode cache during the monitoring period were 16184/129797. The value of %ufs_ipf was always zero, indicating that there was no page flushing, and the size of the inode cache was large enough.

The ratio of exec to fork system calls was 1.15. This indicates that PATH variables are efficient.
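The ratio is simply exec/s divided by fork/s from the sar -c data; if solsar27 is a binary sar data file, the underlying columns can be listed with:

    sar -c -f solsar27                # scall/s, fork/s, exec/s and related system call rates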

The average page out request rate was 0.01 per second. The page out rate peaked at 0.03 per second from 14:05:00 to 14:10:00. The average page in request rate was 11.28 per second. The page in rate peaked at 595.20 per second from 14:30:01 to 14:35:00.

The value of maxpgio, the parameter that limits the number of page-outs per second, is set to 40. Significant paging activity was seen, and the recommended increase to 1024 is designed to prevent maxpgio from becoming a bottleneck. The recommendation has been made despite the lack of significant memory pressure because it is based on peak paging rates; significant paging can occur even when there is plenty of memory present.

As seen in the following graph, at least 66.4 percent of the system's memory, or 1357.13 megabytes, was always unused during sar sampling. The value of lotsfree was 8158 pages, or 31.87 megabytes of memory. This indicates that the system has plenty of memory installed, although it may not be used in the most effective way. Please note that measurements of unused memory are not true high-water marks of memory usage and only reflect what was happening when the size of the free list was sampled.

Graph of megabytes of free memory remaining

The average page scanning rate was 0.00 per second. Peak resource utilization statistics can help in understanding performance problems: if performance is worse during periods of high scanning activity, the bottleneck may be a lack of available memory.

The segmap cache hit ratio averaged 89.15 percent. An average of 41.53 hits per second and 5.05 misses per second were seen during the monitoring period. The rate of misses peaked at 87.80 per second from 13:10:00 to 13:20:00 and the segmap hit ratio during this peak was 88.96 percent. The size of the segmap cache was 128.0 megabytes. A recommendation was made to increase the size of the cache to 210.0 megabytes. This recommendation was made because the hit ratio was below the threshold of 90.0 percent and there may be room for improvement.
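For reference, the recommended segmapsize of 220200960 bytes is exactly 210 megabytes (220200960 / 1048576 = 210), versus the current 128-megabyte cache. The raw segmap counters behind the hit ratio are available from kstat; the counter names are described only loosely here as a sketch:

    kstat -n segmap                   # segmap getmap, fault and reclaim counters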

The average system-wide local I/O rate as measured by the r+w/s column in the sar -d data was 2.57 per second. This I/O rate peaked at 29 per second from 11:20:00 to 11:40:00.

Graph of Total Disk I/O rate

The following graph shows the average and peak percent busy and service time for 3 disks, not sorted with the -dbusy or -dserv switches.

Graph of up to 5 disks, not sorted by percent busy or service time

The -dtoo switch has been used to format disk statistics into the following table.

Disk Device Statistics
Disk Device   Average        Peak           Queue Depth     Average Service
              Percent Busy   Percent Busy   when occupied   Time (ms)
cmdk0         0.3            3.0            0.1             1.5
sd0           0.0            0.0            0.0             0.0
sd1           1.0            17.0           0.2             311.1

The device cmdk0 was busy an average of 0.3 percent of the time and had an average queue depth of 0.1 (when occupied). The average service time reported for this device and its accompanying disk subsystem was 1.5 milliseconds. This is indicative of a very fast disk or a disk controller with cache. Service time is the delay between the time a request was sent to a device and the time that the device signaled completion of the request.

The device sd1 was busy an average of 1.0 percent of the time and had an average queue depth of 0.2 (when occupied). NOTE: The average service time reported for this device and its accompanying disk subsystem was 311.1 milliseconds. This is so slow that the device is not likely to be a typical disk drive; CD-ROM drives and other non-traditional devices have been known to report very slow service times. If this is a typical disk device, it is very slow by modern standards. The poor performance may be due to the disk itself, the location of multiple filesystems on the disk, or the disk controller.
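If there is doubt about what kind of device sd1 actually is, its identity and live service times can be cross-checked with iostat; this is generic Solaris usage, not SarCheck output:

    iostat -En                        # vendor, product and error/inquiry information per device
    iostat -xn 5                      # live per-device service times (asvc_t) and %b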

During multiple time intervals, ps -elf data indicated a peak of 77 processes present. This was the largest number of processes seen with ps -elf, but it is not likely to be the absolute peak because the operating system does not store a true "high-water mark" for this statistic. An average of 73.8 processes were present.
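A quick spot check of the current process count can be made at any time (the total includes the ps header line, so subtract one):

    ps -elf | wc -l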

Graph of the number of processes present

No runaway processes, memory leaks, or suspiciously large processes were detected in the data contained in file solps27. No table was generated because no unusual resource utilization was seen in the ps data.

The tcp listen drop rate was zero. The value of tcp_conn_req_max_q was 128 and does not need to be changed based on this.

The tcp listen drop Q0 rate was zero. The value of tcp_conn_req_max_q0 was 1024 and does not need to be changed based on this.
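The listen-drop counters and both queue limits can be checked directly with standard Solaris 10 commands; the following is a sketch, not SarCheck output:

    ndd -get /dev/tcp tcp_conn_req_max_q
    ndd -get /dev/tcp tcp_conn_req_max_q0
    netstat -s -P tcp | grep -i listendrop    # tcpListenDrop and tcpListenDropQ0 counters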

For all interfaces, the input error rate and collision rate were always zero percent.

Network interface statistics
Interface   Ipkt      Ipkt       Ierr   Ierr   Opkt      Opkt       Coll   Coll
Name        Avg       Peak       Avg    Peak   Avg       Peak       Avg    Peak
lo0         62.95     546.50     0.00   0.00   62.95     546.50     0.00   0.00
bge1        613.59    2252.60    0.00   0.00   614.59    2251.47    0.00   0.00
bge2        3262.37   11623.54   0.00   0.00   3230.21   11522.81   0.00   0.00
bge3        5.52      14.10      0.00   0.00   5.52      14.10      0.00   0.00

For interface lo0, an average of 62.95 input packets per second and 62.95 output packets per second were processed. The rate of input packets processed peaked at 546.50 per second during multiple time intervals. The rate of output packets processed peaked at 546.50 per second during multiple time intervals.

For interface bge1, an average of 613.59 input packets per second and 614.59 output packets per second were processed. The rate of input packets processed peaked at 2252.60 per second from 14:10:00 to 14:14:13. The rate of output packets processed peaked at 2251.47 per second from 14:10:00 to 14:14:13.

For interface bge2, an average of 3262.37 input packets per second and 3230.21 output packets per second were processed. The rate of input packets processed peaked at 11623.54 per second from 14:10:00 to 14:14:13. The rate of output packets processed peaked at 11522.81 per second from 14:10:00 to 14:14:13.

For interface bge3, an average of 5.52 input packets per second and 5.52 output packets per second were processed. The rate of input packets processed peaked at 14.10 per second during multiple time intervals. The rate of output packets processed peaked at 14.10 per second during multiple time intervals.

CAPACITY PLANNING SECTION

This section is designed to provide the user with a rudimentary linear capacity planning model and should be used for rough approximations only. These estimates assume that an increase in workload will affect the usage of all resources equally. These estimates should be used on days when the load is heaviest to determine approximately how much spare capacity remains at peak times.

Based on the data available, the system should be able to support a 90 percent increase in workload at peak times before the first resource bottleneck affects performance or reliability, and that bottleneck is likely to be CPU utilization. See the following paragraph for additional information.

Graph of remaining room for growth

The CPU can support an increase in workload of approximately 90 percent at peak times. The busiest disk can support a workload increase of at least 100 percent at peak times. Due to the lack of swap utilization and page scanning, the amount of memory present can support a significantly greater load. The process table, measured by sar -v, can hold at least twice as many entries as were seen. For more information on specific resource utilization, refer to the Resource Analysis section of this report.
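As a rough cross-check of the CPU figure, assuming the linear model compares peak utilization against the 80 percent CAPCPU threshold described in the Custom Settings section, (80 / 42 - 1) x 100 is approximately 90 percent, which agrees with the 90.5 percent of remaining CPU capacity shown in the summary table at the end of this report.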

CUSTOM SETTINGS SECTION

The default SYSUSR threshold was changed in the sarcheck_parms file from 2.5 to 0.2 percent. This value is likely to compromise the accuracy of the analysis.

The default HSIZE was changed in the sarcheck_parms file from 0.75 to 1.00.

The default CAPCPU threshold was changed in the sarcheck_parms file from 85.0 to 80.0 percent.

Please note: In no event can Aptitune Corporation be held responsible for any damages, including incidental or consequent damages, in connection with or arising out of the use or inability to use this software. All trademarks belong to their respective owners. This software licensed exclusively for use on a single system by: Your Company. This software must be activated by 04/05/2010 (mm/dd/yyyy). Code version: 7.01.05 for Solaris SPARC64. Serial number: 00022333.

This software is updated frequently. For information on the latest version, contact the party from whom SarCheck was originally purchased, or visit our web site.

(c) copyright 1994-2010 by Aptitune Corporation, Portsmouth NH 03801, USA, All Rights Reserved. http://www.sarcheck.com/

Statistics for system: sys1227x
                                                  Start of peak   End of peak     Date of peak
                                                  interval        interval        interval
Statistics collected on:            12/27/2009
Average CPU utilization:            17.4%
Peak CPU utilization:               42%           12:25:00        12:30:00        12/27/2009
Average user CPU utilization:       14.2%
Average sys CPU utilization:        3.2%
Average run queue depth:            1.1
Peak run queue depth:               1.2           Multiple peaks  Multiple peaks
Calculated DNLC hit ratio:          70.91%
Pct of phys memory unused:          66.4%
Average page out rate:              0.01 / sec
Peak page out rate:                 0.03 / sec    14:05:00        14:10:00        12/27/2009
Average page in rate:               11.28 / sec
Peak page in rate:                  595.20 / sec  14:30:01        14:35:00        12/27/2009
Average page scanning rate:         0.00 / sec
Peak page scanning rate:            0.03 / sec
Page scanning threshold:            100.0 / sec
Average segmap cache miss rate:     5.05 / sec
Peak segmap cache miss rate:        87.80 / sec   13:10:00        13:20:00        12/27/2009
Segmap hit ratio:                   89.15%
Average cache read hit ratio:       100.0%
Average cache write hit ratio:      64.8%
Avg kstat buffer cache hit ratio:   99.5%
Avg kstat buffer cache miss rate:   0.2 / sec
Average systemwide I/O rate:        2.6 / sec
Peak systemwide I/O rate:           29.0 / sec    11:20:00        11:40:00        12/27/2009
Disk device w/highest peak:         sd1
Avg pct busy for that disk:         1.0%
Peak pct busy for that disk:        17.0%         14:00:00        14:20:00        12/27/2009
Avg number of processes seen by ps: 73.8
Max number of processes seen by ps: 77            Multiple peaks  Multiple peaks
Average TCP Listen Drop rate:       0.00
Peak TCP Listen Drop rate:          0.00
Average TCP Listen Drop Q0 rate:    0.00
Peak TCP Listen Drop Q0 rate:       0.00
Approx CPU capacity remaining:      90.5%
Approx I/O bandwidth remaining:     100%+
Remaining process tbl capacity:     100%+
Can memory support add'l load:      Yes