Thursday, March 22, 2012

Linux memory tweaks for improved responsiveness

Memory management in the Linux kernel has changed over the years.  Sometimes a kernel upgrade or a hardware upgrade, where you expect better performance, leads to disappointment instead.

For a while now, I've been struggling with our Ubuntu 10.04 application server (managing 35 desktop sessions), where we experience "morning sluggishness".  Between 6am and noonish, the system was just not as responsive as it could be, but in the afternoons it became quite zippy.  I'd been explaining this away with things like "everyone is logging in in the morning, so the system sees a heavy load early in the day" or "everyone does more work in the morning and takes it easy after lunch".

I decided to analyze memory usage patterns combined with I/O wait times.  Between 8am and noon, I/O wait averaged 2-3%; between noon and 4pm it averaged less than 1%.  When I graphed memory usage, I noticed that our nightly backups were consuming a huge amount of buffered memory (understandably), and that these buffers were slowly reclaimed over time, getting close to the cache size around noonish every day.  This couldn't be coincidence: performance steadily improved as the buffers from the previous night's backup were freed up.  Makes total sense, but the question was: how do I modify this behaviour?
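The post doesn't say which tool produced those I/O wait averages. As a rough sketch of one way to get a comparable number, the helper below (iowait_pct is a made-up name) takes two samples of the aggregate "cpu" line from /proc/stat and reports the share of CPU time spent in iowait between them. It assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal) and leaves out the guest fields, since those are already counted in user and nice.

```shell
#!/bin/sh
# iowait_pct "<earlier cpu line>" "<later cpu line>"
# Prints the percentage of CPU time spent in iowait between two samples
# of the aggregate "cpu" line from /proc/stat.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            # Fields 2..9 are user, nice, system, idle, iowait, irq, softirq, steal.
            total = 0
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
            if (NR == 1) { t0 = total; w0 = $6 }   # $6 is the iowait counter
            else         { t1 = total; w1 = $6 }
        }
        END {
            dt = t1 - t0
            if (dt <= 0) { print "0.00"; exit }
            printf "%.2f\n", 100 * (w1 - w0) / dt
        }'
}

# Example use: two samples taken 60 seconds apart.
#   a=$(grep '^cpu ' /proc/stat); sleep 60; b=$(grep '^cpu ' /proc/stat)
#   iowait_pct "$a" "$b"
```

Run from cron at a fixed interval and logged with a timestamp, something like this would be enough to compare mornings against afternoons.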

Googling around, I came across this link

I already knew about vm.swappiness, and already have it set to 10.

But vm.vfs_cache_pressure and vm.drop_caches were new to me.  So I thought, if we can drop the caches right after the backup, and give a little more priority to the inode/dentry caches - this should reduce our I/O wait.

So I set vm.vfs_cache_pressure to 50 ( sysctl -w vm.vfs_cache_pressure=50 ; the default is 100, and lower values make the kernel more inclined to hold on to the inode/dentry caches), and added "echo 3 > /proc/sys/vm/drop_caches" to the end of my backup script.  The results are quite dramatic.  So far this morning, I have an average I/O wait of 0.19%, down from 2-3%!  (And the system is very zippy.)  This graph gives a good visual on the memory usage...
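For anyone wanting to try the same thing, here is a minimal sketch of what the tail end of a backup script could look like. The function name drop_page_caches and its optional path argument are my own invention (the argument exists only so the function can be dry-run against a scratch file); the sync beforehand is an addition too, since drop_caches only frees clean cache entries and leaves dirty pages alone.

```shell
#!/bin/sh
# Hypothetical helper for the end of a backup script; must run as root
# when pointed at the real /proc file.
drop_page_caches() {
    # Optional argument lets you test against a scratch file instead.
    target="${1:-/proc/sys/vm/drop_caches}"

    # Write dirty pages out first so more of the cache is clean and droppable.
    sync

    # 1 = page cache, 2 = dentries and inodes, 3 = both.
    echo 3 > "$target"
}

# As the last step of the backup script:
#   drop_page_caches
#
# sysctl -w does not survive a reboot. A line like this in /etc/sysctl.conf
# would make the vfs_cache_pressure setting persistent:
#   vm.vfs_cache_pressure = 50
```

Dropping caches is non-destructive, but whatever was cached has to be read from disk again, so doing it right after the backup (when nobody is logged in) seems like the sensible moment.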

Notice: every day at 00:00 the backup kicks in.  Tues/Wed show the slow reclaiming of buffers.  Thurs is where I drop the caches right after the backup, and now the caches reach the previous day's peak by 8am!

Could this be the end of my users complaining about a slow system?  Doubt it...  They never notice when it's fast, but always (and only) notice when it's slow.

Update:  After two days now, amazing: average daytime I/O wait is down to < 0.10%.