Applies to:
Solaris SPARC Operating System – Version: 8.0 and later [Release: 8.0 and later]
Solaris x64/x86 Operating System – Version: 8 6/00 U1 and later [Release: 8.0 and later]
Oracle Solaris Express – Version: 2010.11 and later [Release: 11.0 and later]
Information in this document applies to any platform.
Goal
Shortages of memory and virtual swap can cause slow system performance, hangs, failure to start new processes (fork failure), cluster timeouts, and therefore unplanned outages. Monitoring resource usage is critical for system availability.
Solution
Physical Memory Shortages
Memory shortages can be caused by excessive kernel or application memory allocation and by memory leaks. During a memory shortage, the page daemon wakes up and starts scanning and stealing pages to bring the value of freemem, a kernel global variable, back above the lotsfree kernel threshold. Systems with memory shortages slow down because memory pages may have to be read back from the swap disk before processes can continue executing.
High kernel memory allocation can be monitored by using mdb’s memstat command. It reports kernel, application and file system memory usage:
# echo "::memstat"|mdb -k
Page Summary                Pages                MB  %Tot
------------     ----------------  ----------------  ----
Kernel 18330 143 7% < Kernel Memory
ZFS File Data 4 0 0% < ZFS cache (see below)
Anon 36405 284 14% < Application memory: heap, stack, COW
Exec and libs 1747 13 1% < Application libraries
Page cache 3482 27 1% < File system cache
Free (cachelist) 3241 25 1% < Free memory with vnode info intact
Free (freelist) 195422 1526 76% < Free memory
Total 258627 2020
Physical 254812 1990
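As a rough sketch of how this output could be checked automatically, the helper below pulls the %Tot value for one row out of ::memstat output. The function name memstat_pct is hypothetical and not part of Solaris; it only assumes that each row starts with the category name and contains one field ending in "%", as in the sample above.

```shell
#!/bin/sh
# memstat_pct CATEGORY  (hypothetical helper, not a Solaris command)
# Reads ::memstat output on stdin and prints the %Tot value, without
# the percent sign, for the row that starts with CATEGORY.
memstat_pct() {
    awk -v cat="$1" '
        index($0, cat) == 1 {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9]+%$/) { sub(/%/, "", $i); print $i; exit }
        }'
}

# Possible use on a live system (requires root for mdb -k):
#   echo "::memstat" | mdb -k | memstat_pct Kernel
```

A monitoring job could compare the printed value against a site-specific limit and raise an alert when kernel memory takes an unusually large share.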
If the system is running ZFS, the ZFS cache is also listed. ZFS uses kernel memory to cache file system blocks. You can monitor ZFS cache memory usage using:
# kstat -n arcstats
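The arcstats output is long, so a small filter can help. The sketch below assumes the kstat output contains lines of the form "size <bytes>" and "c_max <bytes>" (current ARC size and its upper limit); arc_mb is a hypothetical helper name.

```shell
#!/bin/sh
# arc_mb  (hypothetical helper)
# Reads "kstat -n arcstats" output on stdin and prints the current ARC
# size and the ARC maximum, both in MB, on one line.
arc_mb() {
    awk '$1 == "size"  { size = $2 }
         $1 == "c_max" { cmax = $2 }
         END { printf "%d %d\n", size / 1048576, cmax / 1048576 }'
}

# Possible use on a live system:
#   kstat -n arcstats | arc_mb
```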
kstat reports kernel memory usage in pages (8 KB on SPARC, 4 KB on x86). It also reports memory in use by the kernel and pages locked by applications.
# kstat -n system_pages
module: unix                            instance: 0
name:   system_pages                    class:    pages
…
freemem 8337355 < available free memory
..
lotsfree 257271 < Paging starts when freemem drops below lotsfree
minfree 64317 < swapping will start if freemem drops below minfree
pageslocked 4424860 < pages locked excluding pp_kernel (kernel pages)
pagestotal 16465378 < total pages configured
physmem 16487075 < total pages usable by Solaris
pp_kernel 4740398 < memory allocated in kernel
—
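The thresholds annotated above can be turned into a simple check. This is a minimal sketch, assuming the input has "name value" lines as in the sample; page_state and pages_to_mb are hypothetical helper names, and the page size must be supplied by the caller (for example from the pagesize command).

```shell
#!/bin/sh
# page_state  (hypothetical helper)
# Reads "kstat -n system_pages" output on stdin and classifies the system:
#   SWAPPING  freemem is below minfree
#   PAGING    freemem is below lotsfree
#   OK        otherwise
page_state() {
    awk '$1 == "freemem"  { f = $2 }
         $1 == "lotsfree" { l = $2 }
         $1 == "minfree"  { m = $2 }
         END {
             if (f + 0 < m + 0)      print "SWAPPING"
             else if (f + 0 < l + 0) print "PAGING"
             else                    print "OK"
         }'
}

# pages_to_mb PAGES PAGESIZE_BYTES  (hypothetical helper)
# Converts a page count to whole megabytes.
pages_to_mb() {
    awk -v p="$1" -v s="$2" 'BEGIN { printf "%d\n", p * s / 1048576 }'
}

# Possible use on a live system:
#   kstat -n system_pages | page_state
#   pages_to_mb 258627 "$(pagesize)"
```

With 8 KB pages, the 258627 total pages in the ::memstat sample earlier convert to 2020 MB, which matches the MB column of that output.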
kmastat reports memory usage in kernel slab caches. These caches are used by various kernel subsystems and drivers for allocating memory.
# echo "::kmastat"|mdb -k
cache                        buf    buf    buf    memory     alloc alloc
name                        size in use  total    in use   succeed  fail
------------------------- ------ ------ ------ --------- --------- -----
..
kmem_slab_cache 56 2455 2465 139264 2571 0
kmem_bufctl_cache 24 5463 5763 139264 6400 0
kmem_bufctl_audit_cache 128 0 0 0 0 0
kmem_va_8192 8192 74 96 786432 74 0
kmem_va_16384 16384 2 16 262144 2 0
kmem_va_24576 24576 5 10 262144 5 0
kmem_va_32768 32768 1 8 262144 1 0
kmem_va_40960 40960 0 0 0 0 0
kmem_va_49152 49152 0 0 0 0 0
kmem_va_57344 57344 0 0 0 0 0
kmem_va_65536 65536 0 0 0 0 0
kmem_alloc_8 8 97210 98649 794624 3884007 0
kmem_alloc_16 16 29932 30988 499712 9786629 0
kmem_alloc_24 24 43651 44409 1073152 69596060 0
kmem_alloc_32 32 11512 12954 417792 71088529 0
…
To isolate high kernel memory allocation and leaks, turn on kernel memory auditing by setting the following tunable in the /etc/system file, then reboot:
set kmem_flags=0x1
Continue to run kmastat on a regular basis and monitor the growth of kernel caches. Force a system panic when kernel memory allocation reaches an alarming level. Send the kernel core dump located in the /var/crash directory to Oracle Support for analysis:
Sun SPARC(R) Enterprise Mx000 (OPL) Servers: How to deal with a hung or unresponsive domain?
The best way to avoid outages due to kernel memory leaks is to keep kernel patches up to date.
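Watching cache growth by eye is tedious, so two saved ::kmastat snapshots can be compared mechanically. The sketch below is one possible approach, not an Oracle tool: kmastat_growth is a hypothetical helper, and it assumes the "memory in use" column (the fifth field) is printed in plain bytes as in the sample above. Releases that print sizes with unit suffixes would need extra parsing.

```shell
#!/bin/sh
# kmastat_growth OLD_FILE NEW_FILE  (hypothetical helper)
# Compares two saved ::kmastat outputs and prints, for every cache whose
# "memory in use" value grew, the cache name and the growth in bytes.
# Header, separator and elided lines are skipped because their fifth
# field is not a plain number.
kmastat_growth() {
    awk '
        NR == FNR { if ($5 ~ /^[0-9]+$/) old[$1] = $5; next }
        ($5 ~ /^[0-9]+$/) && ($1 in old) && ($5 + 0 > old[$1] + 0) {
            print $1, $5 - old[$1]
        }' "$1" "$2"
}

# Possible use on a live system:
#   echo "::kmastat" | mdb -k > kmastat.1
#   sleep 600
#   echo "::kmastat" | mdb -k > kmastat.2
#   kmastat_growth kmastat.1 kmastat.2
```

A cache that appears in every comparison with steadily increasing values is a candidate for closer inspection.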
To monitor application memory usage consider using:
$ prstat -s rss -can 100
$ ps -eo 'addr zone user s pri pid ppid pcpu pmem vsz rss stime time nlwp psr args'
To see which memory segment in the process has high memory allocation:
$ pmap -xs <pid>
Continued growth in application memory usage is a sign of a memory leak. You may ask the application vendor for diagnostic tools, or consider linking to libumem(3LIB), which offers a rich set of debugging facilities. See the article on how to use it. You can also monitor application malloc() calls using DTrace scripts.
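To make "continued growth" concrete, RSS samples for one process can be collected over time and checked for a steady upward trend. The following is only a sketch under simple assumptions: rss_trend is a hypothetical helper, its input is one RSS value (in KB) per line in time order, and "growing" is taken to mean that no sample is lower than the previous one and the last is higher than the first.

```shell
#!/bin/sh
# rss_trend  (hypothetical helper)
# Reads RSS samples, one number per line in time order, and prints
# "GROWING <delta>" when the series never decreases and ends higher than
# it started, otherwise "NOT_GROWING <delta>". <delta> is last minus first.
rss_trend() {
    awk 'NR == 1 { first = $1; prev = $1; mono = 1; next }
         { if ($1 + 0 < prev + 0) mono = 0; prev = $1 }
         END {
             d = prev - first
             s = (mono && d > 0) ? "GROWING" : "NOT_GROWING"
             print s, d
         }'
}

# Possible collection loop for process <pid>, sampled every 10 minutes:
#   while true; do ps -o rss= -p <pid> >> rss.samples; sleep 600; done
#   rss_trend < rss.samples
```

A growing series does not prove a leak, since caches and working sets also grow legitimately, but it marks the process for closer inspection with pmap or libumem.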
Process allocation (via malloc()) requested size distribution plot:
dtrace -n 'pid$target::malloc:entry { @ = quantize(arg0); }' -p PID
Process allocation (via malloc()) by user stack trace and total requested size:
dtrace -n 'pid$target::malloc:entry
{ @[ustack()] = sum(arg0); }' -p PID
Virtual Memory Shortages
Processes use virtual memory. A process's virtual address space is made up of a number of memory segments: text, data, stack, heap and COW segments. When a process accesses a virtual address that is not yet mapped, a page fault occurs that brings the data into physical memory. The faulted virtual address is then mapped to physical memory. Every page belongs to a memory segment and has a backing store to which it can be migrated during memory shortages. Text and data segments are backed by the executable file on the file system. Stack, heap, COW (copy-on-write) and shared memory pages are anonymous (Anon) pages and are backed by virtual swap.
DISM (Dynamic Intimate Shared Memory) requires swap reservation because its memory can be locked and unlocked by the process.
When a process starts touching pages, anon structures are allocated, but no physical disk swap is allocated. In Solaris, swap allocation happens only when memory is short and pages need to be migrated to the swap device to keep up with workload memory demand. That is why “swap -l”, which reports physical disk swap allocation, shows the same value in the “blocks” and “free” columns under normal conditions.
Solaris can run without physical disk swap because of the swapfs abstraction, which behaves as if real swap space were backing each page. Solaris works with virtual swap, which is composed of physical memory and physical disk swap. When no physical disk swap is configured, swap reservation happens against physical memory. Reserving swap against memory has a drawback: the system cannot malloc() more than the physical memory configured. The advantage of running without physical disk swap is that a malicious program cannot perform huge mallocs and therefore cannot make the system crawl due to memory shortages.
Virtual swap = Physical memory + Physical Disk swap
Available virtual swap is reported by:
- vmstat: swap
- swap -s
Disk-backed swap is reported by:
- swap -l
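The “swap -s” summary line can be reduced to two numbers for monitoring. The sketch below assumes the usual single-line format in which the used and available figures are followed by the words "used," and "available"; swap_s_summary is a hypothetical helper, and the figures in the usage comment are illustrative, not measured.

```shell
#!/bin/sh
# swap_s_summary  (hypothetical helper)
# Reads "swap -s" output on stdin and prints the available virtual swap
# in KB, followed by the percentage of virtual swap already used.
swap_s_summary() {
    awk '{
             for (i = 1; i < NF; i++) {
                 if ($(i + 1) ~ /^used/)      used  = $i + 0
                 if ($(i + 1) ~ /^available/) avail = $i + 0
             }
         }
         END {
             if (used + avail > 0)
                 printf "%d %d\n", avail, used * 100 / (used + avail)
         }'
}

# Possible use on a live system:
#   swap -s | swap_s_summary
# Illustrative input line:
#   total: 286256k bytes allocated + 63584k reserved = 349840k used, 1045384k available
```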
Per-process virtual swap reservation can be displayed with:
pmap -S <pid>
prstat can report the virtual memory usage (SIZE) of a process. However, SIZE covers all memory segments, not just anon memory:
- prstat -s size -can 100 15
- prstat -s size -ca -p <pidlist> -n 100 15
You can dump the process address space, showing all segments, using:
pmap -xs <pid>
When a process calls malloc()/sbrk(), only virtual swap is reserved. Reservation is made against physical disk swap first. If that is exhausted or not configured, reservation is made against physical memory. If both are exhausted, malloc() fails. To make sure malloc() does not fail for lack of virtual swap, configure a large physical disk swap in the form of a disk or file. You can monitor swap reservation via “swap -s” and the swap column of vmstat, as described above.
On a system with plenty of memory, “swap -l” reports the same value in the “blocks” and “free” columns.
A large value in the “free” column of “swap -l” does not mean that plenty of virtual swap is available and that malloc() will not fail. “swap -l” reports only physical disk swap allocation, not virtual swap usage. It is “swap -s” and the swap column of vmstat that report how much virtual swap is available for reservation.
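To see how much physical disk swap is actually allocated, the “swap -l” columns can be subtracted. This is a minimal sketch, assuming the usual layout with a header line, block counts in the fourth and fifth columns, and 512-byte blocks; disk_swap_used_mb is a hypothetical helper, and the device line in the usage comment is illustrative.

```shell
#!/bin/sh
# disk_swap_used_mb  (hypothetical helper)
# Reads "swap -l" output on stdin, sums the blocks and free columns over
# all swap devices, and prints the allocated disk swap in whole MB.
# Assumes 512-byte blocks.
disk_swap_used_mb() {
    awk 'NR > 1 && $4 ~ /^[0-9]+$/ { blocks += $4; free += $5 }
         END { printf "%d\n", (blocks - free) * 512 / 1048576 }'
}

# Possible use on a live system:
#   swap -l | disk_swap_used_mb
# Illustrative input:
#   swapfile             dev  swaplo blocks   free
#   /dev/dsk/c0t0d0s1   136,1     16 4194304 4194304
```

A result of 0 corresponds to the normal case described above, where nothing has been paged out to disk. It says nothing about how much virtual swap remains for reservation.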
Script to monitor memory usage:
#!/bin/ksh
# Script monitors kernel and application memory usage
PATH=/bin:/usr/bin:/usr/sbin; export PATH
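# SIGKILL cannot be trapped, so it is not listed here.
trap "cleanup" HUP INT QUIT TERM USR1 USR2
# Stop the background samplers and exit. Named cleanup to avoid
# confusion with the system command /usr/sbin/killall.
cleanup()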
{
for PID in $PIDLIST
do
kill -9 $PID 2>/dev/null
done
exit
}
restart()
{
for PID in $PIDLIST
do
kill -9 $PID 2>/dev/null
done
}
DIR=DATA.`date +%Y%m%d-%T`
TS=`date +%Y%m%d-%T`
mkdir $DIR
cd $DIR
while true
do
TS=`date +%Y%m%d-%T`
echo $TS >> mem.out
echo "output of ::memstat" >> mem.out
echo "::memstat" | mdb -k >> mem.out
echo "output of kstat -n arcstats (ZFS ARC memory usage)" >> mem.out
kstat -n arcstats >> mem.out
echo "output of ::kmastat" >> mem.out
echo "::kmastat" | mdb -k >> mem.out
echo "output of swap -s and swap -l" >> mem.out
echo "swap -s" >> mem.out
swap -s >> mem.out
echo "swap -l" >> mem.out
swap -l >> mem.out
echo "output of ps" >> mem.out
/usr/bin/ps -eo 'addr zone user s pri pid ppid pcpu pmem vsz rss stime time nlwp psr args' >> mem.out
#
# start vmstat, mpstat and prstat in the background
#
PIDLIST=""
echo $TS >> vmstat.out
vmstat 5 >> vmstat.out &
PIDLIST="$PIDLIST $!"
echo $TS >> mpstat.out
mpstat 5 >> mpstat.out &
PIDLIST="$PIDLIST $!"
echo $TS >> prstat.out
prstat -s rss -can 100 >> prstat.out &
PIDLIST="$PIDLIST $!"
sleep 600 # every 10 minutes
restart
done