One of the most common problems we encounter is disk block errors. The best place to start the investigation is the log files.
Incident:
Mar 7 18:00:49 sarge scsi: [ID 107833 kern.notice] Requested Block: 27982496 Error Block: 27982496
Mar 7 18:00:47 sarge scsi: [ID 799468 kern.info] ssd127 at scsi_vhci0: name g60060e8005436200000043620000015e, bus address g60060e8005436200000043620000015e
Mar 7 18:00:47 sarge genunix: [ID 834635 kern.info] /scsi_vhci/ssd@g60060e8005436200000043620000015e (ssd127) multipath status: optimal, path /pci@25,700000/SUNW,qlc@0/fp@0,0 (fp1) to target address: w50060e8005436211,37 is online Load balancing: round-robin
Mar 7 18:00:48 sarge scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/ssd@g60060e80054362000000436200000045 (ssd78):
Mar 7 18:00:48 sarge scsi: [ID 107833 kern.notice] Requested Block: 6463744 Error Block: 6463744
Mar 7 18:00:48 sarge scsi: [ID 107833 kern.notice] Vendor: HITACHI Serial Number: 50 043620045
Mar 7 18:00:48 sarge scsi: [ID 107833 kern.notice] Sense Key: Unit Attention
Mar 7 18:00:48 sarge scsi: [ID 107833 kern.notice] ASC: 0x3f (reported LUNs data has changed), ASCQ: 0xe, FRU: 0x0
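To judge whether an error persists, it can help to count how often each error block shows up in the log. The following is a minimal sketch, not a definitive tool: the function name error_blocks is made up for illustration, and it assumes the log lines look like the excerpt above.

```shell
#!/bin/sh
# error_blocks: read syslog lines on stdin and print each reported
# "Error Block" number with the number of times it appears.
# (Hypothetical helper; assumes the message format shown in the excerpt.)
error_blocks() {
    awk '/Error Block:/ {
        for (i = 1; i <= NF; i++)
            if ($i == "Error" && $(i + 1) == "Block:")
                count[$(i + 2)]++
    }
    END { for (b in count) print b, count[b] }' | sort -n
}

# Demo on two lines taken from the excerpt.
error_blocks <<'EOF'
Mar 7 18:00:49 sarge scsi: [ID 107833 kern.notice] Requested Block: 27982496 Error Block: 27982496
Mar 7 18:00:48 sarge scsi: [ID 107833 kern.notice] Requested Block: 6463744 Error Block: 6463744
EOF
```

On a live system the input would typically be the system log, for example error_blocks < /var/adm/messages, assuming the messages are logged there.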
Once we have confirmed that the error exists and persists, we check the disk connection and configuration status.
The cfgadm command provides configuration administration operations on dynamically reconfigurable hardware resources. These operations include displaying status (-l), initiating testing (-t), invoking configuration state changes (-c), invoking hardware-specific functions (-x), and obtaining configuration administration help messages (-h). Configuration administration is performed at attachment points, which are places where system software supports dynamic reconfiguration of hardware resources during continued operation of the operating system.
sarge:~ # cfgadm
Ap_Id Type Receptacle Occupant Condition
c0 scsi-bus connected configured unknown
c1 scsi-bus connected configured unknown
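With many attachment points, scanning the listing by eye gets tedious. The sketch below filters cfgadm-style output for attachment points that are not both connected and configured. The function name not_ready_aps is hypothetical, it assumes the five-column layout shown above, and the c2 line in the demo is invented for illustration.

```shell
#!/bin/sh
# not_ready_aps: read cfgadm output on stdin and print the Ap_Id of every
# attachment point whose receptacle is not "connected" or whose occupant
# is not "configured". (Hypothetical helper; assumes the column order
# Ap_Id Type Receptacle Occupant Condition.)
not_ready_aps() {
    awk '$1 != "Ap_Id" && NF >= 4 && ($3 != "connected" || $4 != "configured") { print $1 }'
}

# Demo: c0 and c1 are from the listing above, c2 is a made-up example.
not_ready_aps <<'EOF'
Ap_Id Type Receptacle Occupant Condition
c0 scsi-bus connected configured unknown
c1 scsi-bus connected configured unknown
c2 scsi-bus connected unconfigured unknown
EOF
```

On a real host the same filter could be fed directly, as in cfgadm | not_ready_aps.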
The mpathadm command enables multipathing discovery and management. It is implemented as a set of subcommands, many with their own options, which are described in the section for each subcommand. Options not associated with a particular subcommand are described under OPTIONS. The mpathadm subcommands operate on a direct-object. The direct-objects initiator-port, target-port, and logical-unit in the subcommands are consistent with SCSI standard definitions.
To list available multipathing support:
sarge:~ # mpathadm list mpath-support
mpath-support: libmpscsi_vhci.so
To view the properties of a supported multipathing facility:
sarge:~ # mpathadm show mpath-support libmpscsi_vhci.so
To list the initiator ports:
sarge:~ # mpathadm list initiator-port
To disable a single path (use enable instead of disable to bring it back):
sarge:~ # mpathadm disable path -i 2000000173018713 -t 20030003ba27d095 \
-l /dev/rdsk/c4t60003BA27D2120004204AC2B000DAB00d0s2
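Before and after disabling a path, it is worth checking how many paths each logical unit still has. The following is a rough sketch that filters the output of mpathadm list lu for logical units with fewer operational paths than total paths. The function name degraded_lus is hypothetical, and the "Total Path Count" / "Operational Path Count" line format is an assumption about what the command prints; the demo input is invented.

```shell
#!/bin/sh
# degraded_lus: read "mpathadm list lu"-style output on stdin and print
# every logical unit whose operational path count is below its total.
# (Hypothetical helper; the input format is an assumption.)
degraded_lus() {
    awk '
        /^[ \t]*\/dev\/rdsk\// { lu = $1 }
        /Total Path Count:/ { total = $NF }
        /Operational Path Count:/ {
            if ($NF + 0 < total + 0) print lu, $NF "/" total
        }
    '
}

# Demo with invented sample output: one path of two is down.
degraded_lus <<'EOF'
/dev/rdsk/c4t60003BA27D2120004204AC2B000DAB00d0s2
        Total Path Count: 2
        Operational Path Count: 1
EOF
```

On a live system this would be run as mpathadm list lu | degraded_lus; no output means every listed logical unit has all of its paths operational.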
Support for many third-party devices is not contained in the default version of the configuration file at /kernel/drv/scsi_vhci.conf. The following shows the changes necessary to bring EMC Symmetrix support into the multipathing software:
sarge:~ # less /kernel/drv/scsi_vhci.conf
The problem is that if 'Channel A' or 'Channel B' is disabled, we lose all contact with the disk. What we want is failover: if 'A' fails, all traffic is routed through 'Channel B', and vice versa (the usual fibre-channel behaviour).
When the disk becomes available again (after modunload/modload), we can see it in the log files.
The problem 'resolved' itself when three things occurred!
These were:
1) The '/kernel/drv/fp.conf' file had 2 entries in it for fibre-channel - as if there was a dual-port card present. In our case we only had the one port, so I commented out one of the entries.
2) The 'mpathadm show lu ...' command showed the 'Current Load Balance' as round-robin. This was changed to 'none'.
3) It seems that Sun recently released a patch fixing some problems with Qlogic cards. I tend to run 'pca' to patch my systems, and wasn't really paying too much attention to it I'm afraid! I think the patch was 113042.
After rebooting and reconfiguring the system, the FC card seemed to work correctly when one of the channels was disabled. As far as we can tell, running Solaris 10 with 2 FC cards should work pretty much out of the box with respect to failover.
Useful information, most common events on UNIX/Linux Server Administration and Free/Open Source Software Platforms.
Tuesday, May 10, 2011
Performance Problem - Memory and High CPU utilization - UNIX and Linux Basic
One of the most common problems encountered by Unix/Linux system administrators is high memory/CPU utilization. The first step for any admin is to run vmstat to see whether the system is paging or swapping.
bash# vmstat -[option] (or top/mpstat on Linux systems)
The option can be -p to report paging activity in detail, -S to report on swapping rather than paging activity, or numeric arguments giving the sampling interval and report count, as in vmstat 5 1 (an interval of 5 seconds, a single report).
We also have to check whether any telnet sessions are still alive. To find out:
bash# ps -leaf | grep ttyp
Compare the output with 'who | grep ttyp' to see whether there are unaccounted sessions.
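The comparison between the two listings can be done mechanically. This is a minimal sketch under stated assumptions: the function name unaccounted_ttys is made up, and it expects two files that each hold one terminal name per line (one extracted from ps, one from who).

```shell
#!/bin/sh
# unaccounted_ttys: print terminals that appear in the first file (from ps)
# but not in the second (from who). Each file holds one tty name per line.
# (Hypothetical helper for illustration.)
unaccounted_ttys() {
    ps_list=$(mktemp) || return 1
    who_list=$(mktemp) || { rm -f "$ps_list"; return 1; }
    sort -u "$1" > "$ps_list"
    sort -u "$2" > "$who_list"
    # comm -23 keeps lines that are unique to the first sorted file.
    comm -23 "$ps_list" "$who_list"
    rm -f "$ps_list" "$who_list"
}
```

One possible way to build the two inputs, assuming the sessions use ttyp devices: ps -e -o tty= | grep ttyp > /tmp/ps_ttys and who | awk '{print $2}' | grep ttyp > /tmp/who_ttys, followed by unaccounted_ttys /tmp/ps_ttys /tmp/who_ttys.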
% /usr/ucb/ps aux
USER PID %CPU %MEM SZ RSS TT S START TIME COMMAND
root 16755 0.1 1.0 1448 1208 pts/0 O 17:33:35 0:00 /usr/ucb/ps uax
root 3 0.1 0.0 0 0 ? S May 24 6:19 fsflush
root 1 0.1 0.6 2232 680 ? S May 24 3:10 /etc/init -
###prstat###---See man page for further details---
The prstat utility iteratively examines all active processes on the system and reports statistics based on the selected output mode and sort order. prstat provides options to examine only processes matching specified PIDs, UIDs, zone IDs, CPU IDs, and processor set IDs.
root:~ # prstat -t (or prstat -a; note that prstat estimates memory usage too high)
vmstat (Virtual Memory Statistics) is a system monitoring tool that collects and displays summary information about OS memory, processes, interrupts, paging and block I/O. Users of vmstat can specify a sampling interval, which permits observing system activity in near-real time.
###vmstat###---See man page for further details---
root:~ # vmstat 5 5
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr s1 sd sd sd in sy cs us sy id
0 0 0 36548016 22532120 257 889 590 5 5 0 0 0 6 0 3 7464 5758 5109 5 1 94
0 0 0 33598232 17679640 1152 2039 349 0 0 0 0 0 22 0 7 30496 22545 14116 18 4 77
0 0 0 33652768 17658800 2108 4781 164 2 2 0 0 0 3 0 5 37047 23362 17242 21 5 74
0 0 0 33604288 17606696 989 2411 211 2 2 0 0 0 10 0 0 29848 16244 11399 17 4 79
0 0 0 33602216 17595896 102 1073 8 0 0 0 0 0 4 3 3 30349 15371 12069 16 4 80
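A quick way to read this output is to look at the sr (scan rate) column: a scan rate that stays above zero is commonly taken as a sign that the page scanner is hunting for memory. The sketch below extracts the highest sr value from vmstat output. The function name max_scan_rate is hypothetical, and it assumes the Solaris column layout shown above, where sr is the 12th field; Linux vmstat uses different columns.

```shell
#!/bin/sh
# max_scan_rate: read Solaris vmstat output on stdin and print the highest
# value seen in the sr column (field 12). Header lines are skipped because
# their first field is not a number.
# (Hypothetical helper; assumes the column order r b w swap free re mf pi po fr de sr.)
max_scan_rate() {
    awk '$1 ~ /^[0-9]+$/ { if ($12 + 0 > max) max = $12 + 0 }
         END { print max + 0 }'
}

# Demo on two rows from the sample above.
max_scan_rate <<'EOF'
 r b w swap free re mf pi po fr de sr s1 sd sd sd in sy cs us sy id
 0 0 0 36548016 22532120 257 889 590 5 5 0 0 0 6 0 3 7464 5758 5109 5 1 94
 0 0 0 33598232 17679640 1152 2039 349 0 0 0 0 0 22 0 7 30496 22545 14116 18 4 77
EOF
```

On a live system this could be run as vmstat 5 5 | max_scan_rate. Keep in mind that the first data row of vmstat reports averages since boot rather than current activity.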
###sar###--See man page for further details---
The sar command writes to standard output the contents of selected cumulative activity counters in the operating system. It can also extract and write to standard output records previously saved in a file.
In general, the syntax for invoking sar is sar -flags interval number. This causes a specific number of data points to be gathered every interval seconds. When looking at memory statistics, the most important flags are -g, -p, and -r. Here's an example of the output generated:
root:~ # sar -gpr 5 5
SunOS 5.10 Generic_118833-33 sun4u 03/02/2011
12:07:52 pgout/s ppgout/s pgfree/s pgscan/s %ufs_ipf
atch/s pgin/s ppgin/s pflt/s vflt/s slock/s
freemem freeswap
Average 0.00 0.00 0.00 0.00 0.00
Average 95.26 0.08 0.16 402.15 778.84 0.40
Average 2326386 69272492
Flag  Field     Meaning
-g    pgout/s   Page-out requests per second
      ppgout/s  Pages paged out per second
      pgfree/s  Pages placed on the free list per second by the page scanner
      pgscan/s  Pages scanned per second by the page scanner
      %ufs_ipf  The percentage of cached filesystem pages taken off the free list while they still contained valid data; these pages are flushed and cannot be reclaimed
-p    atch/s    Page faults per second that are satisfied by reclaiming a page from the free list (this is sometimes called an attach)
      pgin/s    The number of page-in requests per second
      ppgin/s   The number of pages paged in per second
      pflt/s    The number of page faults caused by protection errors (illegal access to page, copy-on-write faults) per second
-r    freemem   The average amount of free memory
      freeswap  The number of disk blocks available in paging space
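Of these fields, pgscan/s is often the first one to look at when memory pressure is suspected. As a small sketch, the function below pulls the average pgscan/s out of sar output. The name avg_pgscan is hypothetical, and it assumes sar was run with -g only, so that the Average line holds pgout/s, ppgout/s, pgfree/s, pgscan/s and %ufs_ipf in that order.

```shell
#!/bin/sh
# avg_pgscan: read "sar -g" output on stdin and print the pgscan/s value
# from the Average line (fifth field: label plus four columns).
# (Hypothetical helper; assumes -g is the only flag, so there is one Average line.)
avg_pgscan() {
    awk '$1 == "Average" { print $5 }'
}

# Demo on the -g header and Average line from the sample above.
avg_pgscan <<'EOF'
12:07:52 pgout/s ppgout/s pgfree/s pgscan/s %ufs_ipf
Average 0.00 0.00 0.00 0.00 0.00
EOF
```

On a live system this could be run as sar -g 5 5 | avg_pgscan.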
###memstat###---See man page for further details---
The memstat command identifies what is using up virtual memory: it lists all the processes, executables, and shared libraries that are consuming it. It is helpful for seeing how shared memory is used and which 'old' libs are loaded.
# memstat 5 2
memory ---------- paging----------executable- -anonymous---filesys -- --- cpu --
free re mf pi po fr de sr epi epo epf api apo apf fpi fpo fpf us sy wt id
49584 0 1 5 0 0 0 0 0 0 0 0 0 0 5 0 0 1 1 1 98
56944 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 100
On Linux systems we can also use cat /proc/12329/status (where 12329 is the PID of the process of interest) and top to get performance information.
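The status file is long, and usually only a few memory fields matter. The following sketch reduces it to the process name and the main memory figures. The function name vm_summary is made up for illustration, and the field names (Name, VmSize, VmRSS, VmSwap) are those of the Linux /proc status format; VmSwap may be missing on older kernels.

```shell
#!/bin/sh
# vm_summary: read a Linux /proc/<pid>/status file on stdin and print
# the process name and main memory fields as key=value lines.
# (Hypothetical helper; Linux only.)
vm_summary() {
    awk '/^(Name|VmSize|VmRSS|VmSwap):/ {
        key = $1
        sub(/:$/, "", key)   # drop the trailing colon from the field name
        $1 = ""              # remove the key, leaving the value
        sub(/^ +/, "")       # trim the leading space left behind
        print key "=" $0
    }'
}

# Demo: summarise the current shell, if a readable status file exists.
if [ -r "/proc/$$/status" ]; then
    vm_summary < "/proc/$$/status"
fi
```

For another process, replace $$ with its PID, for example vm_summary < /proc/12329/status.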