Lately at work I’ve been helping some teams migrate their workloads from on-prem / EC2 infrastructure to Kubernetes. It’s been a good boot camp on Kubernetes.

In this post I will go over one of the issues we faced recently, related to resource assignments for JVM applications. I will explain how running in Kubernetes forces us to think about capacity planning more than we’re used to, and changes some of the assumptions we made before containers.

I will assume familiarity with basic Kubernetes concepts (you know what nodes, pods, etc. are) and the JVM.

Help! My application is getting OOM killed

Resource limits are one of the most frequent pain points for teams new to Kubernetes. Our cluster configuration is relatively (too?) aggressive in promoting small, canonical microservices that scale horizontally. Developers are used to being liberal when allocating memory for JVMs that usually run on their very own EC2 instance. Finding themselves suddenly constrained within pods under 5GB often results in various flavours of OOM kills or OutOfMemoryErrors.

An engineer from one of our product teams found himself in this situation last week and was asking for help configuring their deployment. The Kubernetes manifest declared:

resources:
  limits:
    memory: 4Gi
    cpu: 1
  requests:
    memory: 1Gi
    cpu: 1

And the relevant part of the Dockerfile was:

FROM openjdk:8u181-jre-slim
...
ENTRYPOINT exec java -Xmx1512m -Xms1g [...] -cp app:app/lib/* com.schibsted.yada.Yada

Pods were being OOMKilled at inconvenient times, so he came asking for help in sizing his pods and JVM.

What are pod requests / limits in Kubernetes

Let’s first see what the resources parameters are for.

When we ask Kubernetes to run our application, the Kubernetes scheduler looks for nodes in the cluster where our pods can run. It searches based on multiple criteria, the most basic being whether a node has enough memory and CPU to run the container. This is what the parameters in the resources section are for. requests sets the minimum amount of resources each container in our Pod needs to start successfully. Kubernetes will schedule our pods only on suitable nodes; if no node has enough spare capacity, the pod becomes Unschedulable and stays Pending. In the snippet above, the application declares a minimum of 1Gi of memory and 1 CPU as requests.
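A quick way to see this in practice is asking kubectl why a pod is stuck (the pod name is of course a placeholder):

# Pods that the scheduler could not place stay in Pending
kubectl get pods --field-selector=status.phase=Pending

# The Events section will typically show a FailedScheduling event
# explaining which resource (memory, cpu) was insufficient
kubectl describe pod <pod-name>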

But requests is not a hard limit. Once it’s running, if the container needs additional memory / CPU it can ask the kernel for extra capacity from its node. This elasticity is useful to handle bursts of load, but also creates a risk that greedy pods hog too many resources. limits is there to control this by fixing a maximum memory / CPU that each container in the pod will be allowed to use.

The Kubernetes documentation has a good explanation of the tradeoffs that developers can play with by using requests and limits:

By configuring memory requests and limits for the Containers that run in
your cluster, you can make efficient use of the memory resources
available on your cluster’s Nodes. By keeping a Pod’s memory request
low, you give the pod a good chance of being scheduled. By having a
memory limit that is greater than the memory request, you accomplish two
things:

- The pod can have bursts of activity where it makes use of memory
  that happens to be available.
- The amount of memory a pod can use during a burst is limited to
  some reasonable amount

Why is our Pod getting OOM killed?

Coming back to our Dockerfile, the first thing that catches our attention is:

ENTRYPOINT exec java -Xmx1512m -Xms1g [...] -cp app:app/lib/* com.schibsted.yada.Yada

The JVM will allocate 1GiB of heap up front, consuming the full capacity that we reserved for the container through requests. Given that the JVM needs additional memory (code cache, off-heap, thread stacks, GC data structures…), as does the operating system, our containers are born undersized.

It’s evident that requests is too small.

Wait: didn’t Hotspot 1.8 have problems tracking memory in containers?

This is the first thing to suspect when containerized JVMs are getting OOM killed. Our teams have hit this a few times. Long story short: when you use a Hotspot JVM at any version below 8u131, you can start a JVM inside a container limited to 2GB of memory and find that the JVM tries to start with a much larger heap and gets killed. The container is exceeding its memory limits, but it’s the JVM’s fault for being too greedy. Why?

The cause of this problem was the way the JVM used to retrieve the available memory. It looked at a system file that reports the host’s memory capacity, not the container’s (or, more precisely, not the limit of the kernel control group that is used to implement the container). This issue was dealt with in JDK-8189497. From 8u131, the Hotspot JVM added a couple of experimental flags that let you instruct the JVM to derive its heap sizes from the cgroup memory limit. From 8u191 this behaviour became the default. We’re on 8u181, so the flags are available to us.
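For reference, the opt-in on versions between 8u131 and 8u191 looks roughly like this. It’s a sketch based on our Dockerfile above, not what we actually deploy, since we keep the explicit -Xmx / -Xms flags:

FROM openjdk:8u181-jre-slim
...
# Size the max heap from the cgroup memory limit instead of the host's RAM
# (MaxRAMFraction=2 lets the heap grow up to half of the container limit)
ENTRYPOINT exec java -XX:+UnlockExperimentalVMOptions \
  -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=2 \
  [...] -cp app:app/lib/* com.schibsted.yada.Yada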

However, this issue was only relevant when the JVM had to calculate the min/max heap sizes by itself. In our case we’re passing the -Xmx and -Xms flags explicitly, so we’re not affected. We can confirm this because the committed heap metrics stay within the 1.5GiB bound, as we’ll see below.

What would be a reasonable requests value then?

Here is a snapshot of the application under normal load. I’m showing the committed value as this represents memory actually reserved by the JVM, as opposed to used memory (which is always at most the committed value).

metric                                                                   memory
jvm_memory_committed_bytes{area="nonheap",id="Code Cache",}              0.05 GiB
jvm_memory_committed_bytes{area="nonheap",id="Metaspace",}               0.07 GiB
jvm_memory_committed_bytes{area="nonheap",id="Compressed Class Space",}  0.01 GiB
jvm_memory_committed_bytes{area="heap",id="PS Eden Space",}              0.52 GiB
jvm_memory_committed_bytes{area="heap",id="PS Survivor Space",}          0.01 GiB
jvm_memory_committed_bytes{area="heap",id="PS Old Gen",}                 1.06 GiB
total committed                                                          1.72 GiB
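As an aside, these are Micrometer-style metric names, so the totals can be obtained with PromQL queries along these lines (the pod label is an assumption that depends on how your scrape config names things):

# Total committed memory (heap + non-heap pools) per pod
sum by (pod) (jvm_memory_committed_bytes)

# Committed heap only
sum by (pod) (jvm_memory_committed_bytes{area="heap"})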

Notice that our JVM heap is not 1GiB but ~1.5GiB. This is to be expected, as the JVM may expand the heap once the application starts doing actual work and puts pressure on it, especially under load. However, thanks to the explicit -Xmx flag, we can trust that the heap won’t grow beyond ~1.5GiB. requests=1Gi is therefore too low. Still, limits=4Gi should give plenty of headroom to prevent an OOM kill.

What else could be pushing our memory footprint over limits?

One answer was outside the JVM: tmpfs mounts. These volumes look like ordinary directories in our filesystem, but they actually reside in memory and thus contribute to the overall memory consumption. This application is based on Kafka Streams, which uses a RocksDB instance for an internal cache that is persisted to disk. The team decided to use the /tmp mount to avoid IO overhead, which was a good idea, but did not account for the implicit tradeoff in additional memory usage. To make things worse, the /tmp folder was also storing the application’s logs. All in all, RocksDB and logs added almost 2GiB to the container’s memory footprint. With limits set to 4Gi we were already flirting with the OOM killer.

This is the typical change of assumptions that appears when moving from physical machines or VMs (which tend to be large) to containers (which in our cluster tend to be much smaller.)

Our second measure was to relocate the RocksDB storage to disk (choosing less memory utilization over IO load) and to ensure that the application kept log files small and frequently rotated. After this change, the /tmp directory stays under 20MiB.
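Verifying this from inside the pod is straightforward; something along these lines shows both the size of the tmpfs mount and how much of it is actually in use (and therefore paid for in memory):

root@myapp-5567b547f-tk54j:/# df -h /tmp     # confirms /tmp is a tmpfs mount and shows its size
root@myapp-5567b547f-tk54j:/# du -sh /tmp    # how much of it we are actually filling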

OOM killer keeps acting

While these changes made the situation better for a while, OOM kills kept happening.

So far we’ve just focused on the heap, but we know that the JVM uses memory for other purposes (some of which appear above: Code Cache, Metaspace, Compressed Class Space…). Let’s take a look at the real memory usage of the JVM process.

root@myapp-5567b547f-tk54j:/# cat /proc/1/status | grep Rss
RssAnon:         3677552 kB
RssFile:           15500 kB
RssShmem:           5032 kB

I believe these three add up to the RSS (Resident Set Size), which tells us how much memory our JVM process has in main memory at this point in time. Note that the JVM alone consumes ~3.5GiB of our 4GiB limit.
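If you prefer not to do the arithmetic by hand, the kernel also reports the total directly as VmRSS, which should match the sum of the three Rss* components:

root@myapp-5567b547f-tk54j:/# grep VmRSS /proc/1/status
root@myapp-5567b547f-tk54j:/# awk '/^Rss/ {sum += $2} END {print sum " kB"}' /proc/1/status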

And we haven’t looked at the operating system either. In containers, memory information is found under the /sys/fs/cgroup tree, so let’s check there:

root@myapp-5567b547f-tk54j:/# cat /sys/fs/cgroup/memory/memory.stat
cache 435699712
rss 3778998272
rss_huge 3632267264
shmem 9814016
mapped_file 5349376
dirty 143360
writeback 8192
swap 0
pgpgin 2766200501
pgpgout 2766061173
pgfault 2467595
pgmajfault 379
inactive_anon 2650112
active_anon 3786137600
inactive_file 198434816
active_file 227450880
unevictable 0
hierarchical_memory_limit 4294967296
hierarchical_memsw_limit 8589934592
total_cache 435699712
total_rss 3778998272
total_rss_huge 3632267264
total_shmem 9814016
total_mapped_file 5349376
total_dirty 143360
total_writeback 8192
total_swap 0
total_pgpgin 2766200501
total_pgpgout 2766061173
total_pgfault 2467595
total_pgmajfault 379
total_inactive_anon 2650112
total_active_anon 3786137600
total_inactive_file 198434816
total_active_file 227450880
total_unevictable 0

All these values are in bytes. The full Resident Set Size for our container is calculated from the rss + mapped_file rows: ~3.5GiB.

We see that mapped_file, which includes the tmpfs mounts, is low since we moved the RocksDB data out of /tmp. So in practice, the non-JVM footprint is small: almost all of the ~3.5GiB of RSS is consumed by our JVM. Add the page cache (the cache row, ~0.4GiB), which also counts against the cgroup limit, and our container is running with only a few hundred MiB to spare, so it’s likely that any burst in memory usage will tip us over the limit. cat /sys/fs/cgroup/memory/memory.failcnt will tell us how many times we have already hit it.
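A few sibling files in the same cgroup directory make it easy to see how close to the limit we are sailing (these are cgroup v1 paths, which is what our nodes use):

root@myapp-5567b547f-tk54j:/# cat /sys/fs/cgroup/memory/memory.usage_in_bytes   # current usage counted against the limit
root@myapp-5567b547f-tk54j:/# cat /sys/fs/cgroup/memory/memory.limit_in_bytes   # the 4Gi we set in limits
root@myapp-5567b547f-tk54j:/# cat /sys/fs/cgroup/memory/memory.failcnt          # how many times we already hit that limit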

Interestingly, while our JVM has an RSS of ~3.5GiB, the metrics we showed above report only ~1.72GiB of committed memory (heap plus the non-heap pools). This means we have almost 2GiB unaccounted for in the JVM. The first suspect should be off-heap memory, but a quick look at the JVM metrics rules this out:

jvm_buffer_total_capacity_bytes{id="direct",} 284972.0
jvm_buffer_total_capacity_bytes{id="mapped",} 0.0

That’s barely 0.3MiB, negligible in this JVM. Understanding what in the JVM is consuming the additional 2GiB will take another post.
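If you want a head start on that investigation, the JVM’s Native Memory Tracking is the natural tool: it breaks down native memory by subsystem (thread stacks, GC structures, code cache, internal allocations, …). A rough sketch of how to enable it, noting that it adds some overhead, and that jcmd ships with the JDK, so a jre-slim image would need it added:

# Add to the java flags in the ENTRYPOINT
-XX:NativeMemoryTracking=summary

# Then, from inside the container, ask the JVM (PID 1 here) for the breakdown
jcmd 1 VM.native_memory summary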

For now I will wrap up this post by highlighting a couple of things I learned from this exercise, and some action points that we could already apply.

requests is a guarantee, limits is an obligation

There is a subtle change of semantics when we go from requests to limits. For the application developer, requests is a guarantee offered by Kubernetes that any scheduled pod will get at least that minimum amount of memory. limits is an obligation to stay under a maximum amount of memory, which will be enforced by the kernel.

In other words: containers can’t rely on being able to grow from their initial requests capacity to the maximum allowance set in limits.

This is problematic. Maybe it was the JVM that needed more capacity in order to grow the heap. If it doesn’t get it, we can expect at best a saturated heap with constant GC cycles, application pauses and increased CPU load. At worst, an OutOfMemoryError. Maybe it was the OS that needed more capacity for $purposes. In any case, if we fail to increase memory when needed, we will be implicitly asking the OOM killer to find a process to kill. The JVM has most tickets in that raffle, as it’s by far the biggest memory consumer.

Suggested changes

First, although not related to the OOM kill: just set -Xms = -Xmx and ensure your JVM reserves all the heap it will use up front.

Having both the JVM and Docker (or cgroups, to be precise) growing memory dynamically just makes it harder to reason about capacity or to understand this type of issue.
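Applied to the Dockerfile above, that change would look something like this (keeping the same heap size as an example):

ENTRYPOINT exec java -Xmx1512m -Xms1512m [...] -cp app:app/lib/* com.schibsted.yada.Yada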

Second, requests should be bigger than -Xmx, with enough headroom for the rest of the JVM and the OS. How much more? It depends on what runs inside your container. A simple microservice will probably be fine with requests ~25% higher than -Xmx. But does your microservice (or any library inside it) use off-heap memory? Write logs to a tmpfs volume? Read/write volumes shared with other containers in the pod? All of these (and other factors) impact your memory footprint. You need to know this with reasonable detail if you want to avoid surprises.

Third, setting limits much higher than requests is meant to handle bursts of activity where your pod makes use of “memory that happens to be available”. Notice the wording. If getting that extra memory is a necessity for your app to do its job, don’t gamble and just reserve it up-front with requests.

Fourth, running applications inside containers forces us to have a solid idea of their memory requirements. I left unanswered why our JVM uses ~3.5GiB despite having a 1.5GiB max heap and no off-heap memory; I’ll try to write a follow-up to go deeper into this. But, assuming we know our memory footprint, my intuition (as a non-Kubernetes expert) is that we should prefer to have requests == limits, at least for memory.
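Putting the second and fourth points together, the resources section for this service would end up looking something like this. The 4Gi figure is only illustrative: the right number has to come from measuring the full footprint (JVM RSS, page cache, tmpfs), as we did above, not from -Xmx alone:

resources:
  requests:
    memory: 4Gi   # sized from the measured footprint, with headroom beyond -Xmx
    cpu: 1
  limits:
    memory: 4Gi   # equal to requests: we don't rely on memory that "happens to be available"
    cpu: 1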

One last learning: don’t try to predict OS footprint

I mentioned that we should ensure that our requests leaves extra memory for the operating system. My temptation when I first looked at this data was to use the values we got from /sys/fs/cgroup/memory/memory.stat to predict the future OS footprint (mostly page cache, ~0.4GiB here) while running this application. By reading through the kernel and Docker documentation I realized this would probably not be wise.

When the kernel enforces the limits assigned to a container, it counts all of the RSS as well as part of the page cache (the cache row; see sections 2.2.1 and 2.3 of the kernel’s cgroup memory documentation for concrete accounting details). The page cache is used by the OS to speed up access to data on disk by keeping the most accessed pages in memory (which counts towards memory utilization). But what happens if several containers access the same file? The Docker docs explain this clearly:

Accounting for memory in the page cache is very complex. If two
processes in different control groups both read the same file
(ultimately relying on the same blocks on disk), the corresponding
memory charge is split between the control groups. It’s nice, but it
also means that when a cgroup is terminated, it could increase the
memory usage of another cgroup, because they are not splitting the cost
anymore for those memory pages.

The memory footprint of our container therefore depends on both its neighbours and which files on the node’s disk they read. Because containers come and go, the share of memory accounted to a given container may change. If we’re already flirting with our limits, this might push us into OOM kill territory.


Any thoughts? Send me an email!