Linux: What to do when the oom-killer is being triggered when you have plenty of memory

Does this sound familiar to you?

Your Linux box has plenty of memory, but the kernel “oom-killer” keeps getting triggered and killing processes. The stack trace logged by the kernel shows that out_of_memory was triggered inside _do_fork (for me, it was mysqld forks that were triggering the problem; though I’d been running mysqld on my server for years without any issues, it started getting killed on fork when I started running mongod as well). Maybe you’ve tried adding swap space, but that didn’t help either.

Maybe you’ve tried searching the web for a solution to this problem, but your Google-fu has not been good enough to find one. Well, you’re in luck, because I figured it out, and I’m going to share the solution with you.

The Linux kernel divides memory into different zones, and some allocations are required to come from specific zones. In particular, as I understand it, the kernel’s own data structures that describe a process, such as the task structure that gets allocated whenever a new process is created, have to come from the low zones that the kernel documentation calls “lowmem.” When a program forks, there needs to be enough free lowmem for those allocations; if there isn’t, either the fork will fail or some other process will get killed to make room for the new one.

If the OOM killer is getting triggered on your box when a process tries to fork, that means that you need to tell the kernel to reserve more memory for the lowmem zone. This is done by changing vm.lowmem_reserve_ratio with sysctl. Quoting from the kernel’s “vm.txt” documentation file:

lowmem_reserve_ratio

For some specialised workloads on highmem machines it is dangerous for
the kernel to allow process memory to be allocated from the "lowmem"
zone.  This is because that memory could then be pinned via the mlock()
system call, or by unavailability of swapspace.

And on large highmem machines this lack of reclaimable lowmem memory
can be fatal.

So the Linux page allocator has a mechanism which prevents allocations
which _could_ use highmem from using too much lowmem.  This means that
a certain amount of lowmem is defended from the possibility of being
captured into pinned user memory.

(The same argument applies to the old 16 megabyte ISA DMA region.  This
mechanism will also defend that region from allocations which could use
highmem or lowmem).

The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is
in defending these lower zones.

If you have a machine which uses highmem or ISA DMA and your
applications are using mlock(), or if you are running with no swap then
you probably should change the lowmem_reserve_ratio setting.

The lowmem_reserve_ratio is an array. You can see them by reading this file.
-
% cat /proc/sys/vm/lowmem_reserve_ratio
256     256     32
-
Note: # of this elements is one fewer than number of zones. Because the highest
      zone's value is not necessary for following calculation.

But, these values are not used directly. The kernel calculates # of protection
pages for each zones from them. These are shown as array of protection pages
in /proc/zoneinfo like followings. (This is an example of x86-64 box).
Each zone has an array of protection pages like this.

-
Node 0, zone      DMA
  pages free     1355
        min      3
        low      3
        high     4
	:
	:
    numa_other   0
        protection: (0, 2004, 2004, 2004)
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  pagesets
    cpu: 0 pcp: 0
        :
-
These protections are added to score to judge whether this zone should be used
for page allocation or should be reclaimed.

In this example, if normal pages (index=2) are required to this DMA zone and
watermark[WMARK_HIGH] is used for watermark, the kernel judges this zone should
not be used because pages_free(1355) is smaller than watermark + protection[2]
(4 + 2004 = 2008). If this protection value is 0, this zone would be used for
normal page requirement. If requirement is DMA zone(index=0), protection[0]
(=0) is used.

zone[i]'s protection[j] is calculated by following expression.

(i < j): zone[i]->protection[j]
  = (total sums of managed_pages from zone[i+1] to zone[j] on the node)
    / lowmem_reserve_ratio[i];
(i = j):
   (should not be protected. = 0;
(i > j):
   (not necessary, but looks 0)

The default values of lowmem_reserve_ratio[i] are
    256 (if zone[i] means DMA or DMA32 zone)
    32  (others).
As above expression, they are reciprocal number of ratio.
256 means 1/256. # of protection pages becomes about "0.39%" of total managed
pages of higher zones on the node.

If you would like to protect more pages, smaller values are effective.
The minimum value is 1 (1/1 -> 100%).
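
To make the arithmetic in that documentation concrete, here is a minimal Python sketch of the protection formula and the watermark check it describes. The function names and the managed-page counts are made up for illustration (the counts are chosen so that the result matches the protection array in the zoneinfo excerpt quoted above), and the real kernel code handles more cases than this.

```python
def protection_table(managed_pages, reserve_ratio):
    """Return protection[i][j] for every zone pair, per the quoted formula.

    managed_pages: managed page count per zone, lowest zone first.
    reserve_ratio: lowmem_reserve_ratio values (one fewer than the zones).
    """
    zones = len(managed_pages)
    table = []
    for i in range(zones):
        row = [0] * zones  # entries with i >= j stay 0
        for j in range(i + 1, zones):
            # Sum of managed pages in zone[i+1] .. zone[j], inclusive.
            higher = sum(managed_pages[i + 1 : j + 1])
            row[j] = higher // reserve_ratio[i]
        table.append(row)
    return table


def zone_allowed(pages_free, watermark, protection):
    """Follow the documentation's wording: the zone is skipped when
    pages_free is smaller than watermark + protection."""
    return pages_free >= watermark + protection


# Hypothetical managed-page counts for DMA, DMA32, Normal, Movable.
managed = [3976, 513024, 0, 0]

dma_default = protection_table(managed, [256, 256, 32])[0]
dma_halved = protection_table(managed, [128, 128, 32])[0]
print(dma_default)  # [0, 2004, 2004, 2004]
print(dma_halved)   # [0, 4008, 4008, 4008]

# The documentation's example: a Normal request (index 2) against the DMA
# zone with 1355 free pages and a high watermark of 4.
print(zone_allowed(1355, 4, dma_default[2]))  # False: 1355 < 4 + 2004
```

Halving a ratio doubles the number of protected pages, which is why smaller values make the kernel defend the lower zones more aggressively.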

That’s a lot of mumbo-jumbo, eh? Here’s what I did to apply it in the real world:

1. Run “sysctl vm.lowmem_reserve_ratio” to find out what the current setting is. On my computer, it was “256 256 32”, which matches the defaults listed in the documentation quoted above (256 for the DMA and DMA32 zones, 32 for the others); the number of values can differ from system to system, since there is one fewer than the number of zones.
2. Create a file called /etc/sysctl.d/00-reserve-ratio.conf with one line in it: “vm.lowmem_reserve_ratio = 128 128 32”, i.e., halve the first two numbers from the output of the previous step. (If /etc/sysctl.d doesn’t exist on your system, it configures sysctl through some other mechanism, and you’ll need to figure out what that is and make the equivalent change there.)
3. Reboot your computer.
4. Confirm that “sysctl vm.lowmem_reserve_ratio” now returns “128 128 32” (or whatever you put in 00-reserve-ratio.conf).
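
The steps above can be sketched in Python as follows. This is only an illustration under a few assumptions: the helper name halved_conf_line is made up, it halves only the first two values (which is what I did by hand), and it prints the line rather than writing anything to /etc/sysctl.d for you.

```python
from pathlib import Path

RATIO_FILE = Path("/proc/sys/vm/lowmem_reserve_ratio")


def halved_conf_line(current):
    """Build a sysctl.d line with the first two ratio values halved."""
    values = [int(v) for v in current.split()]
    # Smaller values protect more pages; the documented minimum is 1.
    halved = [max(1, v // 2) for v in values[:2]] + values[2:]
    return "vm.lowmem_reserve_ratio = " + " ".join(str(v) for v in halved)


if RATIO_FILE.exists():
    # On a Linux box, start from the values currently in effect.
    print(halved_conf_line(RATIO_FILE.read_text()))
else:
    # Elsewhere, fall back to the values from my machine.
    print(halved_conf_line("256 256 32"))
```

The printed line is what would go into 00-reserve-ratio.conf; check it against the output of “sysctl vm.lowmem_reserve_ratio” before using it.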

I don’t claim to be a Linux kernel expert. I don’t 100% understand everything that’s going on here. Maybe somebody else has spelled this out somewhere on the web and I just couldn’t find it. All I’m saying is that the fix described above solved the problem for me, and maybe it’ll solve the problem for you as well. If so, comment below and let me know!
