Memory leaks and java.lang.OutOfMemoryError

Memory leaks and java.lang.OutOfMemoryError

Ron Hitchens
We are trying to stand up an internal production RatPack service and are having severe problems with java.lang.OutOfMemoryError failures.  We've done quite a lot of investigation and it appears that some sort of leak of direct memory buffers is happening.

Let me first explain what our app is doing, then the failure modes we've seen and finally ask for your help to find a solution or workaround.

Our RatPack app is a REST service that processes publishing content (mostly books).  It has the usual endpoints that can query or fetch content as XML or JSON (this all works fine), but its primary function (for now, at least) is to ingest content deliveries.  These deliveries are POSTed as zip files that contain a relatively large XML file and some number of binaries (PDFs and images).  The app writes the POST body to disk and then opens it via the JDK zip API.  It validates the structure, then sends the binaries to Amazon S3 (synchronously at the moment, using the Amazon SDK).  Finally, the XML payload is transformed/enhanced and sent to an XML store via a backend REST service (we're using the RatPack async HTTP client for that).  This XML payload can be relatively large (tens or even hundreds of megabytes of book content plus metadata).

To summarize: large zip files posted (potentially as big as 1GB), binaries extracted and sent to S3, large XML built in memory and sent to another REST service.  All happening in one long-running request.  The deployment environment is Amazon ECS.
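
For context, the ingest step is roughly this shape (a simplified sketch, not our actual code; stageToS3 is a placeholder for the Amazon SDK upload call):

  import java.io.File;
  import java.io.InputStream;
  import java.util.Enumeration;
  import java.util.zip.ZipEntry;
  import java.util.zip.ZipFile;

  public class DeliveryIngest {
    // Open the spooled POST body as a zip, validate, and push the binaries onward.
    void ingest(File spooledDelivery) throws Exception {
      try (ZipFile zip = new ZipFile(spooledDelivery)) {
        Enumeration<? extends ZipEntry> entries = zip.entries();
        while (entries.hasMoreElements()) {
          ZipEntry entry = entries.nextElement();
          if (entry.getName().endsWith(".xml")) {
            continue;  // the XML payload is transformed and sent to the XML store separately
          }
          try (InputStream in = zip.getInputStream(entry)) {
            stageToS3(entry.getName(), in, entry.getSize());  // synchronous S3 upload via the Amazon SDK
          }
        }
      }
    }

    void stageToS3(String key, InputStream content, long length) {
      // placeholder for the Amazon SDK putObject call
    }
  }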

There are two OS environments in play here.  We develop on Macs and have only seen problems there when tuning the amount of direct memory so low that it can't buffer and process a single delivery.  With enough direct memory, it seems to remain stable indefinitely.  However, when deploying the same code on Linux, even with plenty of direct memory space allocated, the RatPack app will eventually start throwing OutOfMemoryError.  Once this starts happening, most deliveries, unless they are small, will throw this error.  Eventually everything fails, and the app never seems to recover.  This strongly suggests a leak of direct memory buffers.  The question is: where, and why only on Linux?

We thought we'd found a workaround by disabling direct memory and using heap buffers, adding this JVM property: -Dio.netty.noUnsafe.  On the Mac, this worked brilliantly.  But we quickly discovered that on Linux, RatPack unconditionally picks the EPoll SocketChannel implementation (ChannelImplDetector.java).  This trips up Netty: setting this property on Linux makes everything fail.

Suggestion #1: It would be helpful if the RatPack code honored -Dio.netty.noUnsafe and did not choose EPoll when it is set.
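
Something along these lines is what we have in mind (just a sketch, not RatPack's actual code; Epoll.isAvailable() is Netty's own availability check, and we're assuming the property is set as -Dio.netty.noUnsafe=true):

  import io.netty.channel.ServerChannel;
  import io.netty.channel.epoll.Epoll;
  import io.netty.channel.epoll.EpollServerSocketChannel;
  import io.netty.channel.socket.nio.NioServerSocketChannel;

  public class TransportChooser {
    // Fall back to the NIO transport when unsafe/direct memory is disabled
    // or the native epoll transport is unavailable.
    static Class<? extends ServerChannel> serverChannelClass() {
      boolean noUnsafe = Boolean.getBoolean("io.netty.noUnsafe");
      if (!noUnsafe && Epoll.isAvailable()) {
        return EpollServerSocketChannel.class;
      }
      return NioServerSocketChannel.class;
    }
  }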

We need to receive large uploads, up to about 1GB.  We've noticed that setting the maxContentLength config value higher allocates more direct buffer space even if it's not needed.

Suggestion #2: Would it be possible to receive the request body into a linked list of chunks, so that you only allocate what's needed for what's actually received?  Copying the chunks to a final single buffer is a price worth paying.  Having the option to place the final result in a heap buffer would be useful as well.  Even better would be to automatically place bodies larger than n in the heap.
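
Roughly the shape we're imagining (a sketch only, not RatPack's API; Unpooled.wrappedBuffer builds a composite view over the chunks without copying):

  import java.util.ArrayList;
  import java.util.List;

  import io.netty.buffer.ByteBuf;
  import io.netty.buffer.Unpooled;

  public class ChunkAccumulator {
    // Hold body chunks as they arrive instead of pre-allocating maxContentLength
    // up front, so only memory for what has actually been received is held.
    private final List<ByteBuf> chunks = new ArrayList<>();

    public void onChunk(ByteBuf chunk) {
      chunks.add(chunk);  // takes ownership; chunks are released when the composite is released
    }

    public ByteBuf assemble() {
      // Composite view over the received chunks; copy into a single (possibly heap)
      // buffer here instead if contiguous bytes are required.
      return Unpooled.wrappedBuffer(chunks.toArray(new ByteBuf[0]));
    }
  }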

In our particular use case, we receive a large POST body and then write it to disk, at which point we no longer need the POST body for the remainder of response processing.

Suggestion #3: Providing a way to explicitly release the request body when finished with it could be a huge help to scalability for cases like this.
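
To be clear about what we mean, the underlying Netty mechanics would be something like the following (a sketch, assuming the handler can get at the body's ByteBuf; the RatPack-facing API is up to you):

  import java.io.File;
  import java.io.FileOutputStream;
  import java.io.IOException;

  import io.netty.buffer.ByteBuf;
  import io.netty.util.ReferenceCountUtil;

  public class BodySpooler {
    // Write the request body to disk, then release the backing buffer so its
    // direct memory can be reclaimed before the (long) rest of the request runs.
    static void spoolAndRelease(ByteBuf body, File target) throws IOException {
      try (FileOutputStream out = new FileOutputStream(target)) {
        body.readBytes(out, body.readableBytes());  // copy the bytes to the file
      } finally {
        ReferenceCountUtil.release(body);  // drop our reference so the buffer can be reused
      }
    }
  }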

Lastly, we're at a loss to explain why direct memory buffer space seems to be leaking away on Linux.  As previously mentioned, we haven't seen this on Mac, only on Linux, which is suggestive.

It seems unlikely that there is a general problem with Netty using EPoll on Linux that would manifest like this.  Is it possible that RatPack code is live-leaking buffers?  I haven't had the bandwidth to study all the RatPack code myself.

We are running RatPack version 0.9.18.  I know there is a newer version, but I don't want to perturb things by upgrading until this problem is identified.

Any thoughts or suggestions would be greatly appreciated.  We are under some significant pressure by the client to get this issue resolved ASAP.  A reliable workaround would be a lifesaver at this point.  Otherwise we need to seriously consider moving off of RatPack, which we really don't want to do.

Thanks very much.

Ron Hitchens
[email protected]

Re: Memory leaks and java.lang.OutOfMemoryError

Ron Hitchens
So I believe I've discovered the root cause of this memory leak problem.  It appears that RatPack is leaking Netty message buffers.

I suspected that I/O buffers were not being released properly at the end of a request, but after some discussion on the Slack channel (thanks @ldaley) we thought we'd eliminated that possibility, because the DefaultResponseTransmitter class does balanced retain/release calls.  After eliminating every other possibility we could think of, though, I revisited that code.  After staring at it for a while, I noticed this in the NettyHandlerAdapter class, which is the receiver of messages from Netty:

  public void channelRead0(final ChannelHandlerContext ctx, final FullHttpRequest nettyRequest) throws Exception {
    if (!nettyRequest.decoderResult().isSuccess()) {
      sendError(ctx, HttpResponseStatus.BAD_REQUEST);
      nettyRequest.release();
      return;
    }
   ...

This is the Sad Path: If we don't want this message we release it, which decrements the reference count to zero.

But on the Happy Path, further down that method, the nettyRequest message is passed to the constructor of DefaultResponseTransmitter.  Inside there, retain() is called to increment the reference count.  Later, when the request is done, release() is called on the message to decrement the reference count.  Which seems perfectly fine.

But then I thought to myself: "Self, why is there both retain() and release() on the Happy Path, but only release() on the Sad Path?"

So I looked at the Netty code which calls NettyHandlerAdapter.channelRead0, which is in the superclass.  It looks like this:

    public void channelRead(ChannelHandlerContext ctx, Object msg) throws Exception {
        boolean release = true;
        try {
            if (acceptInboundMessage(msg)) {
                @SuppressWarnings("unchecked")
                I imsg = (I) msg;
                channelRead0(ctx, imsg);             // message is passed to RatPack here
            } else {
                release = false;
                ctx.fireChannelRead(msg);
            }
        } finally {
            if (autoRelease && release) {
                ReferenceCountUtil.release(msg);    // Auto-release happens here, if autoRelease == true
            }
        }
    }

But hey, wait a minute.  On the Sad Path, we also did a release().  Wouldn't that decrement the count below zero?  What is the value of autoRelease anyway?  That turns out to be set in the superclass constructor, which is called from NettyHandlerAdapter's constructor:

  public NettyHandlerAdapter(Registry serverRegistry, Handler handler) throws Exception {
    super(false);
    ...

OK, this message was not auto-released, so doing an explicit release() on the Sad Path was correct.  But what about the Happy Path?

Bingo.  On the Happy Path, the message arrives with a reference count of 1 and, because autoRelease is false, Netty never releases it.  RatPack bumps the ref count to 2 when it receives the message and decrements it back to 1 when finished, so the count never reaches zero.  That leaves the buffer marked as still in use after RatPack is done with it.  It's leaking Netty buffers.
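
To make the arithmetic concrete, here's a tiny standalone illustration of the same pattern using a plain Netty buffer (not RatPack code, just the Netty reference-counting API):

  import io.netty.buffer.ByteBuf;
  import io.netty.buffer.ByteBufAllocator;

  public class RefCountDemo {
    public static void main(String[] args) {
      ByteBuf buf = ByteBufAllocator.DEFAULT.directBuffer(16);  // refCnt == 1 on allocation
      buf.retain();                      // what DefaultResponseTransmitter does: refCnt == 2
      buf.release();                     // the balanced release when the request finishes: refCnt == 1
      System.out.println(buf.refCnt());  // prints 1: still marked in use, never reclaimed (the leak)
      buf.release();                     // the missing final release: refCnt == 0, memory reclaimed
    }
  }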

Our profiling and testing support this conclusion: we could see memory slowly leaking away until the JVM hit its limits.

Happily, the fix is very simple.  A two-line change to NettyHandlerAdapter makes a huge difference.  Our app now happily gobbles and processes large POST bodies indefinitely, with no apparent memory growth after reaching a reasonable plateau.

The fix:

1) Change NettyHandlerAdapter's constructor to pass 'true' to the superclass constructor:

  public NettyHandlerAdapter(Registry serverRegistry, Handler handler) throws Exception {
    super(true);
    ...

2) Remove or comment out the line on the Sad Path that explicitly releases the message, because Netty will now do it automatically on return from the call:

  public void channelRead0(final ChannelHandlerContext ctx, final FullHttpRequest nettyRequest) throws Exception {
    if (!nettyRequest.decoderResult().isSuccess()) {
      sendError(ctx, HttpResponseStatus.BAD_REQUEST);
  //    nettyRequest.release();
      return;
    }
   ...

This issue has caused us no end of pain over the last few weeks.  It was really tough to track down, but we're glad to now have a smoking gun and a simple fix.  We have a patched version of the NettyHandlerAdapter class deployed in our environment.  If you could get this fix into the code base we'd really appreciate it.

Thanks.
[email protected], @ronhitchens

Re: Memory leaks and java.lang.OutOfMemoryError

Luke Daley
Administrator
Thanks for hunting this down Ron. I've just pushed a fix.

https://github.com/ratpack/ratpack/commit/bfc7362b3b69e44a3a1e1f24786004987197bc83

I'm still thinking about how we can safeguard against request body leaks. This is going to be more complicated with the upcoming lazy body stuff.

Re: Memory leaks and java.lang.OutOfMemoryError

Tom Akehurst
Great analysis, thanks for posting.

We've been seeing this also and managed to work around it by adding this to our (OpenJDK only, I think) JVM opts:
-XX:MaxDirectMemorySize=256m
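
If it's useful to anyone, a quick way to watch direct-buffer usage from inside the JVM is the JDK's BufferPoolMXBean (plain JDK API; it may not capture everything Netty allocates, depending on how Netty is configured):

  import java.lang.management.BufferPoolMXBean;
  import java.lang.management.ManagementFactory;

  public class DirectMemoryProbe {
    // Print how much direct buffer memory the JVM thinks is in use.
    // A steady climb that never comes back down points at a leak.
    public static void main(String[] args) {
      for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
        if ("direct".equals(pool.getName())) {
          System.out.printf("direct buffers: count=%d used=%d bytes capacity=%d bytes%n",
              pool.getCount(), pool.getMemoryUsed(), pool.getTotalCapacity());
        }
      }
    }
  }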

Re: Memory leaks and java.lang.OutOfMemoryError

Luke Daley
Administrator
If you suspect a buffer leak, you can use http://netty.io/4.0/api/io/netty/util/ResourceLeakDetector.html#setLevel(io.netty.util.ResourceLeakDetector.Level)

You can run with `-Dio.netty.leakDetectionLevel=PARANOID` in a test env.
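
The same thing can be set programmatically (via the setLevel method linked above) before any buffers are allocated, e.g.:

  import io.netty.util.ResourceLeakDetector;

  public class LeakDetection {
    public static void enableParanoid() {
      // Equivalent to -Dio.netty.leakDetectionLevel=PARANOID. PARANOID tracks every
      // allocated buffer, so keep it to test environments only.
      ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);
    }
  }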

Re: Memory leaks and java.lang.OutOfMemoryError

Tom Akehurst
Hmmm... it seems setting MaxDirectMemorySize isn't helping us after all. We're now seeing a consistent leak on Heroku.

Enabling leak detection as per your instructions periodically logs leaks. Oddly, the stack trace always contains one of a couple of our Handlers. We have dozens, and I'd have expected this problem to be spread across them fairly evenly.

We've tried patching Ratpack with both your (Luke's) fix and Ron's, but neither has completely sorted the problem (Ron's seemed to slow the leak a bit).

Do you have any further ideas what we might try?

Re: Memory leaks and java.lang.OutOfMemoryError

Ron Hitchens
I don't have any advice for you, Tom.  Both fixes should have had the same net result.  With a JVM profiler, we saw a pretty flat memory usage pattern after everything was warmed up.

Re: Memory leaks and java.lang.OutOfMemoryError

Luke Daley
Administrator
In reply to this post by Tom Akehurst

Tom, is this something different to the leak you found when the client closes the connection before the response is sent?

Re: Memory leaks and java.lang.OutOfMemoryError

Marcin Erdmann
As far as I know, it's the leak when the client closes the connection before the response is sent.