Friday, October 2, 2015

Memory Leak, Alarm Server, and Valgrind

Note - the output listed here comes from after the process was followed, so you won't see the real memory leaks.  Sorry I didn't retain the originals for posterity's sake.

I had a problem.  My alarm server was having some major issues, and would stop functioning at random times.  There was no core dump to check for crossed signals (none of the odd null-termination issues I'm used to).  However, after a little digging, I found that the process was getting sniped - meaning the Linux kernel's OOM killer was terminating it because the system was running out of memory.  Not cool.  I didn't really know how to go about fixing the problem, so I band-aided the thing by putting a cron job in place to reboot the device once per week.  I ran with it in that configuration while working on other priorities for a couple of months.  Also not cool, but it kept the security system online while I went and handled my growing project list.

So, I started researching ways to isolate the memory leaks.  One tool stood out as the best fit: it is command-line based, and works with a natively-compiled ELF binary.

Valgrind

Valgrind does a lot of things, and memory leak detection is just a small slice.  I downloaded the source, then compiled and installed it (using the standard "./configure && make && make install" steps).  After many Google searches, I finally came up with the following command to launch the server inside of Valgrind for memory leak detection:
    G_SLICE=always-malloc G_DEBUG=gc-friendly valgrind -v --tool=memcheck --leak-check=full --num-callers=40 --track-origins=yes --read-var-info=yes ./eventserver/server/eventserver -bf ./eventserver/server/eventserver.conf > valgrind.log 2>&1
    
Essentially, I wanted to track everything related to memory interaction - I was (after all) looking for a memory leak - and see if that would help me identify the issue.  I ran it in this state for a while (in a screen session, to be precise) to let it build up stats.  When I hit Control-C to terminate the application, I opened up the valgrind.log file and found a few stanzas similar to:
    ==12262== 891 (108 direct, 783 indirect) bytes in 9 blocks are definitely lost in loss record 86 of 93
    ==12262==    at 0x482CF8C: malloc (vg_replace_malloc.c:236)
    ==12262==    by 0xAA59: eventserver_key_set (config.c)
    ==12262==    by 0xAB55: eventserver_sensor_create (config.c)
    ==12262==    by 0x4D4DDA9: arduino_handle_request (arduino.c)
    ==12262==    by 0xC795: eventserver_process_hook_socket_handle_incoming (process.c)
    ==12262==    by 0xDC0D: eventserver_main_connection_handler (main.c)
    ==12262==    by 0xE29D: main (main.c)
    ==12262==
    ==12262== 65,550 bytes in 5 blocks are possibly lost in loss record 91 of 93
    ==12262==    at 0x482CF8C: malloc (vg_replace_malloc.c:236)
    ==12262==    by 0x4DA5A23: my_malloc (in /usr/lib/libmysqlclient.so.16.0.0)
    
I wasn't sure what I was looking at yet, but I knew it looked similar to a gdb stack trace.  The unfortunate thing was that the traces pointed at things allocated using malloc, and at memory pointer problems involving strchr.  Since the files referenced called those functions numerous times, I didn't have much to go on.  I went back to Google, and found some additional information.  Apparently, by passing some compilation flags to gcc, you can add gdb debugging information, such as line numbers.  I manually added the following to the CFLAGS in all of my makefiles:
    -g -O0
    
The -O0 (capital "O" and the number zero) is an optimization flag that means "don't optimize anything, seriously, leave a lot of junk in the binary" (I swear that's what it says in the man page for gcc).  The -g option means "add gdb debugger information".  You can run with -O1, which gets you close to the right line numbers, but it might not always be accurate.  However, -O0 runs molasses-slow, so use it sparingly.  I thought I'd do it anyway.
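As a quick aside on reading the Valgrind output: a "definitely lost" record means Valgrind couldn't find any remaining pointer to the allocation when the program shut down - the memory was allocated, never freed, and the last pointer to it was dropped.  Here's a trivial, made-up example (nothing to do with the eventserver code) that produces that kind of record:
    #include <stdlib.h>
    #include <string.h>

    /* The caller ignores the returned pointer, so the copy is never freed.
     * With --leak-check=full, Valgrind reports these allocations as
     * "definitely lost" because no pointer to them remains at exit. */
    static char *make_copy(const char *src)
    {
        char *copy = malloc(strlen(src) + 1);
        if (copy != NULL)
            strcpy(copy, src);
        return copy;
    }

    int main(void)
    {
        for (int i = 0; i < 9; i++)
            make_copy("leaked on every pass");  /* return value dropped: leak */
        return 0;
    }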

I started the application again, let it run for a few days, and broke out of it once again.  This time, I had line numbers in the output.  Hallelujah!
    ==12262== 891 (108 direct, 783 indirect) bytes in 9 blocks are definitely lost in loss record 86 of 93
    ==12262==    at 0x482CF8C: malloc (vg_replace_malloc.c:236)
    ==12262==    by 0xAA59: eventserver_key_set (config.c:219)
    ==12262==    by 0xAB55: eventserver_sensor_create (config.c:256)
    ==12262==    by 0x4D4DDA9: arduino_handle_request (arduino.c:228)
    ==12262==    by 0xC795: eventserver_process_hook_socket_handle_incoming (process.c:257)
    ==12262==    by 0xDC0D: eventserver_main_connection_handler (main.c:494)
    ==12262==    by 0xE29D: main (main.c:669)
    ==12262==
    ==12262== 65,550 bytes in 5 blocks are possibly lost in loss record 91 of 93
    ==12262==    at 0x482CF8C: malloc (vg_replace_malloc.c:236)
    ==12262==    by 0x4DA5A23: my_malloc (in /usr/lib/libmysqlclient.so.16.0.0)
    
This was great news!  Suddenly, I could look back through the stack traces and find out where things were happening.  Valgrind initially identified a couple hundred memory leak contexts, and a couple hundred memory errors (seen with the "definitely lost in loss record ## of ##" lines).  On closer inspection, I realized that the vast majority of these had a fairly similar stack trace and line number (maybe one or two lines off).  I opened that file and jumped to the line number and.....

... I had been using the wrong variable to allocate a chunk of memory.  I made the fix, re-compiled, and instantly narrowed the memory error contexts down to about three basic problems: places where I was calling functions that require NULL-terminated strings on variables that might not have been terminated, plus a memory leak inside the MySQL client library.  I fixed the NULL-termination errors, recompiled yet again, and was left with only the MySQL leak (which appears to be standard with that version of the library, which is very old).  No more memory leaks!
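In case it helps anyone else, the two classes of bugs looked roughly like the sketch below.  This is a contrived reconstruction - the function and variable names are made up and it is not the actual eventserver code - but it shows the shape of both mistakes and their fixes:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Bug class 1: sizing an allocation from the wrong variable.  The buffer
     * is sized from "key" but filled from "value", so the copy can run past
     * the end of the allocation (Valgrind reports the overflow as an
     * invalid write). */
    char *store_value_buggy(const char *key, const char *value)
    {
        char *buf = malloc(strlen(key) + 1);    /* wrong variable drives the size */
        strcpy(buf, value);                     /* may write past the allocation  */
        return buf;
    }

    char *store_value_fixed(const char *key, const char *value)
    {
        (void)key;                              /* the key no longer drives the size */
        char *buf = malloc(strlen(value) + 1);  /* size from the string being copied */
        if (buf != NULL)
            strcpy(buf, value);
        return buf;                             /* caller must free() this */
    }

    /* Bug class 2: handing a buffer that may not be NULL-terminated to a
     * function that requires termination (strlen, strchr, printf "%s", ...). */
    void print_reading_buggy(const char *raw)
    {
        printf("reading: %s\n", raw);           /* runs off the end if there is no '\0' */
    }

    void print_reading_fixed(const char *raw, size_t raw_len)
    {
        char safe[64];
        size_t n = raw_len < sizeof(safe) - 1 ? raw_len : sizeof(safe) - 1;
        memcpy(safe, raw, n);
        safe[n] = '\0';                         /* guarantee the terminator */
        printf("reading: %s\n", safe);
    }

    int main(void)
    {
        char *v = store_value_fixed("sensor_id", "42");
        print_reading_fixed("72.5F plus trailing junk", 5);  /* prints only the first 5 bytes */
        free(v);
        return 0;
    }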
