Friday, October 2, 2015

Memory Leak, Alarm Server, and Valgrind

Note - the output listed here comes from after the process was followed, so you won't see the real memory leaks.  Sorry I didn't retain the originals for posterity's sake.

I had a problem.  My alarm server was having some major issues, and would stop functioning at random times.  There was no core dump to check for crossed signals (none of the odd null-termination issues I'm used to).  However, after a little digging, I found that the process was getting sniped - meaning the Linux kernel's OOM killer was terminating it because the system was running out of memory.  Not cool.  I didn't really know how to go about fixing the problem, so I band-aided the thing by putting a cron job in place to reboot the device once per week.  I ran with it in that configuration while working on other priorities for a couple of months.  Also not cool, but it kept the security system online while I went and handled my growing project list.

So, I started researching ways to isolate the memory leaks.  One tool stood out as the best fit: it is command-line based, and works with a natively-compiled ELF binary.

Valgrind

Valgrind does a lot of things, and memory leak detection is just a small slice.  I downloaded the source, then compiled and installed it (using the standard "./configure && make && make install" steps).  After many Google searches, I finally came up with the following command to launch the server inside of Valgrind for memory leak detection:
    G_SLICE=always-malloc G_DEBUG=gc-friendly valgrind -v --tool=memcheck --leak-check=full --num-callers=40 --track-origins=yes --read-var-info=yes ./eventserver/server/eventserver -bf ./eventserver/server/eventserver.conf > valgrind.log 2>&1
    
Essentially, I wanted to track everything related to memory interaction - I was (after all) looking for a memory leak - and see if that would help me identify the issue.  I ran it in this state for a while (in a screen session, to be precise) to let it build up stats.  When I hit Control-C to terminate the application, I opened up the valgrind.log file and found a few stanzas similar to:
    ==12262== 891 (108 direct, 783 indirect) bytes in 9 blocks are definitely lost in loss record 86 of 93
    ==12262==    at 0x482CF8C: malloc (vg_replace_malloc.c:236)
    ==12262==    by 0xAA59: eventserver_key_set (config.c)
    ==12262==    by 0xAB55: eventserver_sensor_create (config.c)
    ==12262==    by 0x4D4DDA9: arduino_handle_request (arduino.c)
    ==12262==    by 0xC795: eventserver_process_hook_socket_handle_incoming (process.c)
    ==12262==    by 0xDC0D: eventserver_main_connection_handler (main.c)
    ==12262==    by 0xE29D: main (main.c)
    ==12262==
    ==12262== 65,550 bytes in 5 blocks are possibly lost in loss record 91 of 93
    ==12262==    at 0x482CF8C: malloc (vg_replace_malloc.c:236)
    ==12262==    by 0x4DA5A23: my_malloc (in /usr/lib/libmysqlclient.so.16.0.0)
    
I wasn't sure what I was looking at yet, but I knew it looked similar to a gdb stack trace.  The unfortunate thing was that the traces pointed at things allocated using malloc, and at memory pointer problems involving strchr.  Since the files referenced called those functions numerous times, I didn't have much to go on.  I went back to Google, and found some additional information.  Apparently, by passing some compilation flags to gcc, you can add gdb debugging information, such as line numbers.  I manually added the following to the CFLAGS in all of my makefiles:
    -g -O0
    
The -O0 (capital "O" and the number zero) is an optimization flag that means "don't optimize anything, seriously, leave a lot of junk in the binary" (I swear that's what it says in the man page for gcc).  The -g option means "add gdb debugger information".  You can run with -O1, which gets you close to the right line numbers, but it might not always be accurate.  However, -O0 runs molasses-slow, so use it sparingly.  I thought I'd do it anyway.
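As a quick aside on reading the Valgrind output: a "definitely lost" record means Valgrind couldn't find any remaining pointer to the allocation when the program shut down - the memory was allocated, never freed, and the last pointer to it was dropped.  Here's a trivial, made-up example (nothing to do with the eventserver code) that produces that kind of record:
    #include <stdlib.h>
    #include <string.h>

    /* The caller ignores the returned pointer, so the copy is never freed.
     * With --leak-check=full, Valgrind reports these allocations as
     * "definitely lost" because no pointer to them remains at exit. */
    static char *make_copy(const char *src)
    {
        char *copy = malloc(strlen(src) + 1);
        if (copy != NULL)
            strcpy(copy, src);
        return copy;
    }

    int main(void)
    {
        for (int i = 0; i < 9; i++)
            make_copy("leaked on every pass");  /* return value dropped: leak */
        return 0;
    }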

I started the application again, let it run for a few days, and broke out of it once again.  This time, I had line numbers in the output.  Hallelujah!
    ==12262== 891 (108 direct, 783 indirect) bytes in 9 blocks are definitely lost in loss record 86 of 93
    ==12262==    at 0x482CF8C: malloc (vg_replace_malloc.c:236)
    ==12262==    by 0xAA59: eventserver_key_set (config.c:219)
    ==12262==    by 0xAB55: eventserver_sensor_create (config.c:256)
    ==12262==    by 0x4D4DDA9: arduino_handle_request (arduino.c:228)
    ==12262==    by 0xC795: eventserver_process_hook_socket_handle_incoming (process.c:257)
    ==12262==    by 0xDC0D: eventserver_main_connection_handler (main.c:494)
    ==12262==    by 0xE29D: main (main.c:669)
    ==12262==
    ==12262== 65,550 bytes in 5 blocks are possibly lost in loss record 91 of 93
    ==12262==    at 0x482CF8C: malloc (vg_replace_malloc.c:236)
    ==12262==    by 0x4DA5A23: my_malloc (in /usr/lib/libmysqlclient.so.16.0.0)
    
This was great news!  Suddenly, I could look back through the stack traces and find out where things were happening.  Valgrind initially identified a couple hundred memory leak contexts, and a couple hundred memory errors (seen with the "definitely lost in loss record ## of ##" lines).  On closer inspection, I realized that the vast majority of these had a fairly similar stack trace and line number (maybe one or two lines off).  I opened that file and jumped to the line number and.....

... I had been using the wrong variable to allocate a chunk of memory.  I made the fix, re-compiled, and instantly narrowed the memory error contexts down to about three basic problems: places where I was calling functions that require NULL-terminated strings on variables that might not have been terminated, plus a memory leak inside the MySQL client library.  I fixed the NULL-termination errors, recompiled yet again, and was left with only the MySQL leak (which appears to be standard with that version of the library, which is very old).  No more memory leaks!
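In case it helps anyone else, the two classes of bugs looked roughly like the sketch below.  This is a contrived reconstruction - the function and variable names are made up and it is not the actual eventserver code - but it shows the shape of both mistakes and their fixes:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Bug class 1: sizing an allocation from the wrong variable.  The buffer
     * is sized from "key" but filled from "value", so the copy can run past
     * the end of the allocation (Valgrind reports the overflow as an
     * invalid write). */
    char *store_value_buggy(const char *key, const char *value)
    {
        char *buf = malloc(strlen(key) + 1);    /* wrong variable drives the size */
        strcpy(buf, value);                     /* may write past the allocation  */
        return buf;
    }

    char *store_value_fixed(const char *key, const char *value)
    {
        (void)key;                              /* the key no longer drives the size */
        char *buf = malloc(strlen(value) + 1);  /* size from the string being copied */
        if (buf != NULL)
            strcpy(buf, value);
        return buf;                             /* caller must free() this */
    }

    /* Bug class 2: handing a buffer that may not be NULL-terminated to a
     * function that requires termination (strlen, strchr, printf "%s", ...). */
    void print_reading_buggy(const char *raw)
    {
        printf("reading: %s\n", raw);           /* runs off the end if there is no '\0' */
    }

    void print_reading_fixed(const char *raw, size_t raw_len)
    {
        char safe[64];
        size_t n = raw_len < sizeof(safe) - 1 ? raw_len : sizeof(safe) - 1;
        memcpy(safe, raw, n);
        safe[n] = '\0';                         /* guarantee the terminator */
        printf("reading: %s\n", safe);
    }

    int main(void)
    {
        char *v = store_value_fixed("sensor_id", "42");
        print_reading_fixed("72.5F plus trailing junk", 5);  /* prints only the first 5 bytes */
        free(v);
        return 0;
    }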
