We recently ran into the “java.lang.OutOfMemoryError: PermGen space” issue, and the experience has been a real eye-opener. In SMART we use classloader isolation to separate multiple tenants. We started testing it in beta a few weeks back and found that every two days our server went down with OutOfMemoryError: PermGen space. We had six tenants running with very little data. This was worrying because it happened even when the server was not being accessed and the load on it was minimal. It was evident to us that the classloaders were leaking; the question was why they were not being garbage collected. To summarize our findings, the following had to be fixed before the classloaders could be garbage collected:
- JCS cache clearing
- Solr Searcher threads
- Static Variables
- Threads and Threadpools
Tracking the leaks
Before I talk in detail about each item in the list above, let me describe how we tracked down these leaks. The major problem in tracking and fixing memory leaks is recreating the problem consistently; if you can recreate it, half the problem is solved. It took us some time and a lot of outside-the-box thinking, but we were able to pinpoint the exact steps needed to recreate the problem: remove the tenants from the JCS cache and reload them into the cache. Doing this four or five times reproduced our OutOfMemoryError.
When we first started seeing this issue, we added the standard Java parameters to dump the heap when the process ran out of memory (on HotSpot, the flag for this is -XX:+HeapDumpOnOutOfMemoryError).
This helped us identify the ClassLoader not being garbage collected as the problem. But a dump occurred only about every two days, at irregular intervals depending on how the server was used, and every time we put in a fix we had to wait two days to see whether it had really worked.
First lesson learnt: you don’t have to run out of memory to dump the heap. A very useful tool for heap analysis is jmap, which comes with JDK 6. Two really useful jmap commands are:
jmap -permstat <pid>
This command shows the classloaders in the PermGen space and whether each one is live or dead. This is very useful for checking whether the classloaders have been garbage collected.
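If you want a rough in-process view of the same thing between jmap runs, the standard java.lang.management beans expose class load/unload counts and the class-metadata memory pools. The following is only a sketch, not what we ran in SMART: the class name ClassSpaceMonitor is made up for illustration, and matching pools by the names "Perm Gen" or "Metaspace" is an assumption about HotSpot's naming that may not hold on other JVMs.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

/** Hypothetical helper: one-line snapshot of class counts and PermGen/Metaspace usage. */
final class ClassSpaceMonitor {

    static String snapshot() {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        StringBuilder out = new StringBuilder();
        // A loaded count that keeps growing while the unloaded count stays flat
        // hints that classloaders are not being collected.
        out.append("loaded=").append(classes.getLoadedClassCount());
        out.append(" unloaded=").append(classes.getUnloadedClassCount());

        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            // Assumption: HotSpot pool names such as "PS Perm Gen" or "Metaspace".
            if (name.contains("Perm Gen") || name.contains("Metaspace")) {
                MemoryUsage usage = pool.getUsage();
                if (usage != null) {
                    out.append(" [").append(name).append(" used=")
                       .append(usage.getUsed() / 1024).append("K]");
                }
            }
        }
        return out.toString();
    }
}
```

Logging ClassSpaceMonitor.snapshot() before and after a tenant is evicted gives a quick hint; jmap -permstat is still the tool that tells you which classloaders are live or dead.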
jmap -dump:format=b,file=heap.bin <pid>
This dumps the heap into the file heap.bin, which can then be examined to find out why the classloader is not being garbage collected. For this we used VisualVM, another very useful tool. It can open the dumped heap and show the nearest GC root that is holding on to an object and preventing it from being garbage collected.
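The same kind of dump can also be triggered from inside the JVM, which is handy when you want a dump at a precise point in a test. Here is a minimal sketch, assuming a HotSpot JVM where the com.sun.management.HotSpotDiagnosticMXBean is available; the HeapDumper class and its method name are hypothetical and not part of SMART.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.io.IOException;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: writes a heap dump on demand (HotSpot only). */
final class HeapDumper {

    /**
     * Dumps the heap to the given file. The file must not exist yet, and newer
     * JDKs expect the name to end in ".hprof". With liveOnly set, only
     * reachable objects are written.
     */
    static File dump(File target, boolean liveOnly) {
        try {
            HotSpotDiagnosticMXBean bean = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            bean.dumpHeap(target.getAbsolutePath(), liveOnly);
            return target;
        } catch (IOException e) {
            // Wrapped so callers do not have to declare a checked exception.
            throw new IllegalStateException("heap dump failed: " + target, e);
        }
    }
}
```

For example, HeapDumper.dump(new File("heap.hprof"), true) produces a file that VisualVM can open just like the one written by jmap.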
With these tools it became an iterative process of:
- Recreate the problem with a single tenant
- We reduced the JCS timeout to as little as 2 minutes
- The tenant was now removed from the cache every 2 minutes
- Dump heap using the jmap command
- Examine the heap using VisualVM
- Find the object holding the classloader and fix the leak.
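The loop above can be approximated in a small standalone harness. This is only an illustration of the idea, under several assumptions: throwaway URLClassLoader instances stand in for tenant classloaders, the PINNED list stands in for whatever accidentally holds a reference (a static variable, a cache entry, a thread), and System.gc() is only a request, so a loader that survives a few attempts is a suspect rather than proof of a leak. All names here are hypothetical.

```java
import java.lang.ref.WeakReference;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical harness: cycles throwaway loaders and counts which survive GC. */
final class LoaderLeakCheck {

    /** Stands in for any accidental strong reference to a tenant loader. */
    static final List<ClassLoader> PINNED = new ArrayList<ClassLoader>();

    /** Creates and drops one loader per cycle, keeping only weak references. */
    static List<WeakReference<ClassLoader>> cycle(int times, boolean leak) {
        List<WeakReference<ClassLoader>> refs =
                new ArrayList<WeakReference<ClassLoader>>();
        for (int i = 0; i < times; i++) {
            ClassLoader tenantLoader = new URLClassLoader(new URL[0], null);
            if (leak) {
                PINNED.add(tenantLoader); // simulate the leak
            }
            refs.add(new WeakReference<ClassLoader>(tenantLoader));
        }
        return refs;
    }

    /** Requests GC a few times, then counts loaders that are still reachable. */
    static int survivors(List<WeakReference<ClassLoader>> refs) {
        for (int attempt = 0; attempt < 5; attempt++) {
            System.gc();
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        int alive = 0;
        for (WeakReference<ClassLoader> ref : refs) {
            if (ref.get() != null) {
                alive++;
            }
        }
        return alive;
    }
}
```

Calling survivors(cycle(5, true)) always reports all five loaders as alive, because PINNED still references them. With leak set to false the count normally drops to zero, and anything that stays alive is what you would then go looking for in the heap dump.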