Monday, April 25, 2011

ext3 commit overload

At CGA, we used to face problems usually tagged as "weird".
Sometimes, these turn out to be implementation problems.

Well, the Incident Management folks needed help with a server located several
km away from the office. The server has a looong history of faulty services, mainly
related to poor performance.
The Incident Management folks had done good work trying to debug the server's
behaviour: switches, tagged ports, cable links, disk I/O... But the problem
still persisted, and it was affecting users' daily work. That turned it into an
SLA problem!
We have two servers in each educational center, real or virtual:
- one for security services: firewall, DHCP, cache and content filtering
(via DansGuardian), etc.
- another one for user data and user-centric services: Helvia, Moodle, and users'
home directories exported via NFS, with LDAP user profiles.

Educational centers differ in user volume: from almost testimonial usage
to *really* intensive. This particular center has a high ratio of NFS usage,
and the Incident Management folks asked me to help find the root cause of the
server's poor performance.

Server c0 is a vanilla Debian 4.0 box, with nfs-kernel-server and ext3 disks.
The problem is disk contention: the top command shows load averages from 18 to 53.
That's scary: these values need to be close to zero.

 
Let's go:  
 
First round: after discarding network and hardware (RAID 5) problems, I had to
check whether the filesystems needed some tuning, mainly the noatime fstab mount
option. After studying the situation, and with the disaster recovery scenario in
place, the simplest move was to remount the /home partition with noatime enabled.
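A minimal sketch of that step (the device name in the fstab comment is illustrative, not taken from the original server):

```shell
# Remount /home in place with noatime, so reads stop generating
# access-time metadata writes (no downtime needed)
mount -o remount,noatime /home

# Check the active mount options:
grep ' /home ' /proc/mounts

# To survive a reboot, add noatime to the /home line in /etc/fstab, e.g.:
#   /dev/sda7  /home  ext3  defaults,noatime  0  2
```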
No results :-(
  
Second round: this is a typical write-performance problem on journaled
filesystems, but some other filesystems (including commercial ones) don't have
a good set of debugging tools. Fortunately, Linux does.
I needed to know which process was hitting disk performance, so I simply had
to activate the vm.block_dump sysctl (also accessible via
/proc/sys/vm/block_dump). BUT it's a performance killer! Think about it:
if the problem is that some process is squashing disk performance,
activating vm.block_dump means that EVERY disk access is logged to /var/log/syslog
and /var/log/debug. OOOUch!
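Toggling the switch itself is a one-liner; the point is to keep it on for as short a window as possible:

```shell
# WARNING: on a production box, make sure syslog writes land somewhere
# cheap (e.g. a RAM disk) first, or the logging will hammer the disks.
sysctl -w vm.block_dump=1      # start logging every block read/write
# ... reproduce the load for a short while ...
sysctl -w vm.block_dump=0      # stop as soon as possible

# Equivalent via procfs:
#   echo 1 > /proc/sys/vm/block_dump
```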
Keep in mind that this is a production server, with user data, and SLA time
under fire!
The best approach is to create a RAM disk and redirect the syslogd.conf entries
to it, so logging doesn't affect server performance.
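A sketch of that RAM-disk redirect; the mount point, tmpfs size, and log file name are my choices, not from the original setup:

```shell
# Mount a small tmpfs so the debug logging never touches the real disks
mkdir -p /var/log/ramlog
mount -t tmpfs -o size=64m tmpfs /var/log/ramlog

# In /etc/syslog.conf, point the kernel facility at the RAM disk;
# the leading "-" asks syslogd for asynchronous (non-synced) writes:
#   kern.*    -/var/log/ramlog/kern.log
# Then reload syslogd (Debian 4.0 ships sysklogd):
/etc/init.d/sysklogd reload
```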
This is an example of the resulting log contents:

kjournald(2008): WRITE block 8672240 on sda7
kjournald(2008): WRITE block 8671568 on sda7
kjournald(2008): WRITE block 4267496 on sda7
kjournald(2008): WRITE block 14760 on sda7
kjournald(2008): WRITE block 14768 on sda7
kjournald(2008): WRITE block 14776 on sda7
kjournald(2008): WRITE block 14784 on sda7
kjournald(2008): WRITE block 14792 on sda7
kjournald(2008): WRITE block 14800 on sda7
kjournald(2008): WRITE block 14808 on sda7
kjournald(2008): WRITE block 14816 on sda7
kjournald(2008): WRITE block 14824 on sda7
kjournald(2008): WRITE block 14832 on sda7
kjournald(2008): WRITE block 14840 on sda7
kjournald(2008): WRITE block 14848 on sda7
kjournald(2008): WRITE block 14856 on sda7
kjournald(2008): WRITE block 14864 on sda7
kjournald(2008): WRITE block 14872 on sda7
kjournald(1129): WRITE block 31440 on sda1
pdflush(166): WRITE block 64 on sda6
pdflush(166): WRITE block 136 on sda6
pdflush(166): WRITE block 184 on sda6
pdflush(166): WRITE block 224 on sda6
pdflush(166): WRITE block 280 on sda6
pdflush(166): WRITE block 336 on sda6
pdflush(166): WRITE block 352 on sda6
pdflush(166): WRITE block 472 on sda6
pdflush(166): WRITE block 504 on sda6
pdflush(166): WRITE block 600 on sda6
pdflush(166): WRITE block 664 on sda6
pdflush(166): WRITE block 728 on sda6
pdflush(166): WRITE block 880 on sda6
pdflush(166): WRITE block 4560 on sda6
pdflush(166): WRITE block 6144 on sda6
pdflush(166): WRITE block 6272 on sda6
pdflush(166): WRITE block 6328 on sda6
pdflush(166): WRITE block 262232 on sda1
pdflush(166): WRITE block 262264 on sda1
pdflush(166): WRITE block 262296 on sda1
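With thousands of such lines, a quick way to see which process writes most is to count WRITE entries per process name:

```shell
# Count WRITE lines per process in the block_dump log
# (log path assumes the RAM-disk setup above)
awk -F'(' '/WRITE block/ {print $1}' /var/log/ramlog/kern.log \
  | sort | uniq -c | sort -rn
# For the excerpt above this yields pdflush and kjournald at the top
```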

As you can see, the problem is write contention. pdflush and kjournald,
by default, commit data to disk every 5 seconds. If the server is under heavy
load, those 5 seconds may not be enough to write all the data to disk, so even
more contention happens!
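For reference, that 5-second default is tunable; this was not the fix applied here, just an illustration of the knob:

```shell
# ext3 batches journal commits every commit= seconds (default 5).
# Raising it means fewer commit storms, at the cost of a larger
# window of unsynced data if the server crashes.
mount -o remount,commit=30 /home
```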
 
Well, the real problem was NFS's sync option on the exported filesystem. Every
client that needed to write data to the server's disk was causing several fsync()
calls, incurring an I/O penalty while writing to disk. The solution was simple:
export the NFS filesystem with the async option enabled.
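A sketch of that change; the client subnet and extra options are illustrative, not from the original /etc/exports:

```shell
# async lets the server acknowledge writes before they reach the disk:
# far less fsync() pressure, at the cost of possible data loss if the
# server crashes before flushing.
echo '/home 192.168.0.0/24(rw,async,no_subtree_check)' >> /etc/exports

# Re-export without restarting the NFS server:
exportfs -ra
```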
 
Voilà! Problem solved.
 
 

 
 
