Sys Admin

Disk performance, Opterons and Smartraid 5.

I wrote to the Linux Kernel Mailing List earlier this year complaining that I was getting shocking performance from a SmartRAID V card in an Opteron server running Linux 2.6. It’s worth noting that performance seems much better with a 32 bit distribution and kernel installed.

I had a similar experience yesterday trying to recover data from an identical box, this time using the i2o_block driver rather than dpt_i2o.

At seemingly random times, the cpu would jump to 100% utilisation in io_wait, and the machine would seem to lock up as before. This happened after installing 2.6.18.1 too. After 10 minutes, the machine would spring back to life, and the cpu would be mainly idle again.

Eventually I spotted that some files were ‘always’ hard to rsync off this computer, and some were ‘always’ easy.

Then I noticed that the device was 52% fragmented ! It looks like this was the reason behind the bad performance. Where the slightest bit of fragmentation was present, IO performance went out of the window.

With cheap cards like the Adaptec SmartRAID V (reported by lspci as ‘RAID bus controller: Adaptec (formerly DPT) SmartRAID V Controller (rev 01)’) and the AMD Opteron Processor 246, with ext3 and the Linux i2o_block or dpt_i2o driver, the only way not to get depressing performance is to hugely overspec the space requirements on your drive, and keep an eye on how fragmented the partitions are.
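One way to keep an eye on fragmentation is to read the ‘non-contiguous’ figure that e2fsck prints in its summary line. This is only a minimal sketch: noncontiguous_pct is a hypothetical helper name, /dev/sda1 is a placeholder device, and the sample line in the comment is illustrative rather than taken from the machine described above.

```shell
#!/bin/bash
# Hypothetical helper: extract the "non-contiguous" percentage from an
# e2fsck summary line read on standard input. Illustrative input:
#   /dev/sda1: 2356/1281696 files (52.0% non-contiguous), 33672/2560351 blocks
noncontiguous_pct()
{
    sed -n 's/.*(\([0-9.]*\)% non-contiguous).*/\1/p'
}

# Usage (read-only check, best run against an unmounted filesystem):
#   e2fsck -fn /dev/sda1 | noncontiguous_pct
```

Keeping the parsing in its own function means it can be fed saved fsck output as well as a live check.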

Additionally, we have no problem with Adaptec AAC-RAID cards running 64 bit linux, and strongly recommend these over the i2o_block supported cards.

Are security risks over-hyped?

According to this week’s Computer Weekly, 79% of ‘top IT professionals’ surveyed by recruitment consultant PSD think that IT security risks are over-hyped.

All IT support desks suffer a similar problem – very few people notice when everything is going well, but everyone notices when something is going wrong. Security support desks will suffer a similar fate.

If IT security risks received enough exposure, then corporate and desktop computers would all be patched up to date to prevent identity theft and the seizure of trade secrets, applications would be designed to prevent customer data loss, spam botnets would not exist, and corporate defacement attacks would not happen.

The article offers some kind of explanation: “IT recruiter Mark Sullivan said, ‘The trouble with security threats is that there has not been a massive attack on the internet recently. IT security prevents large losses from happening and maybe that is not put across strongly enough.’”

If this is true, then this approach will mean we suffer another large, headline-grabbing security breach at a major firm. If this happens in the world of e-commerce, then it will continue to frighten people away from shopping on-line, and this is bad news for everyone in the industry.

Handling timeout properly

I love Sysadmin magazine; it has been my first exposure to lots of really good technologies, which is why I think it’s a terrible shame when I come across an oversight. I think this month’s Shell techniques article on handling timeout is missing the most useful technique – the unix alarm signal.

An alarm signal allows you to send a special signal to a running process after a given length of time, in order to handle the unintended or negative effects of time passing, such as users not interacting with your application within the intended time, or a network or inter-process communication failing to receive a response.

Handling an alarm signal in bash is the same as handling other signals – such as safely handling the interrupt signal (signal 2), which interrupts a process when you hit control and c.

You ‘catch’ an alarm signal in a bash script using ‘trap’ :

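#!/bin/bash

# Called when the alarm signal (14) arrives.
timeup()
{
echo "Time up .. abort!"
exit 1
}

# Called when the interrupt signal (2) arrives, e.g. from control and c.
interrupted()
{
echo "interrupted !"
exit 1
}

trap timeup 14
trap interrupted 2

# Background timer: send this script an alarm after 5 seconds.
sleep 5 && kill -s 14 $$ &

for i in 1 2 3 4 5 6 7;
do
echo $i
sleep 1
done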

Here the ‘sleep’ command that runs in the background forces the script to be completed within 5 seconds, otherwise an alarm signal will be sent to the script. This is useful because it allows you as a sysadmin to create a script which handles a timeout differently to a deliberate interruption. A handler which also catches signal 2 demonstrates how this can look :

factory:~ andy$ sh dingdong.sh
1
2
^Cinterrupted !
factory:~ andy$ sh dingdong.sh
1
2
3
4
5
Time up .. abort!

Changing the first ‘sleep’ line to a period longer than the script currently takes to run (e.g. 10 seconds) allows it to finish without one of these signals being generated.

factory:~ andy$ sh dingdong.sh
1
2
3
4
5
6
7

This technique is much more flexible than using read -t to handle timeout, as it allows a more complex tidy-up procedure to be invoked when a timeout is encountered. You can also handle the timeout of, for example, a DNS lookup or wget, as well as user input.
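Applied to a single command such as wget, the same idea might be sketched as below. The function name run_with_timeout and the exit status 124 are my own choices for illustration, not anything from the magazine article, and the sketch assumes bash 4 or later for $BASHPID.

```shell
#!/bin/bash
# run_with_timeout SECONDS COMMAND [ARGS...]
# Hypothetical helper: run COMMAND, but abort it if it has not finished
# within SECONDS. Returns the command's exit status, or 124 on timeout.
run_with_timeout()
{
    local limit=$1
    shift
    (
        self=$BASHPID            # pid of this subshell (needs bash 4+)
        "$@" &                   # the command being timed
        cmd_pid=$!

        # Tidy-up procedure, invoked when the alarm signal arrives.
        trap 'echo "Time up .. abort!" >&2; kill "$cmd_pid" 2>/dev/null; exit 124' ALRM

        # Background timer: send this subshell an alarm after $limit seconds.
        ( sleep "$limit"; kill -s ALRM "$self" ) >/dev/null 2>&1 &
        timer_pid=$!

        status=0
        wait "$cmd_pid" || status=$?
        kill "$timer_pid" 2>/dev/null   # command finished first: cancel the timer
        exit "$status"
    )
}

# Usage:
#   run_with_timeout 10 wget -q http://www.example.com/
```

Running everything in a subshell keeps the trap and the exit away from the calling script, so the caller only sees a return status.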

A bad week for my disks

A power-cut took out the disk with my /home on it on my desktop at work (Seagate Barracuda SATA), something undetermined killed the disk in my laptop (Toshiba disk inside PowerBook), and now the disk in my home desktop has started making an interesting noise – a little like a puppy (Maxtor SCSI inside Ultra 60).

When buying disks (this is for internal use rather than resale, and we have standardised on Seagate SATA models at work), we have grown to expect (and try to cater for) a 2-ish% failure rate of new models. We actually expect a handful of disks to be DOA whenever we buy any. We don’t build any such expectations into orders of other components. This is before any disks die in service, which they tend to like to do.

Looking after desktop storage is relatively easy when you export user writable areas over a network. RAID mirroring and decent disks go a long way towards continuity of storage service. But here you’re really just masking the underlying problem – magnetic storage is terrible today. This is also no help for mobile computer users, and it seems disks inside mobile computers are most at risk.

Companies like Bitmicro are bringing out drop-in replacements for your typical desktop disks that use solid state electronics instead of magnetic methods. They are slow, though. The Barracuda disk in my desktop boasts a sustained read speed of 150MB/s (though hdparm claims that’s closer to 80). The Bitmicro device is only capable of 28MB/s.

There is a tonne of opportunity for development in this area. If the storage can be battery backed, instead of relying on some high-end controller to take a write-cache, then great, we’re better protected from the sort of data corruption that can occur after a power failure. If transfer can get faster, then the line between RAM and permanent storage starts to blur. Imagine if you simply had to cater for what we call swap today – this would make memory requirement planning and upgrades easy. More so if these next-generation disk drives are hot-swap.

But, for today, my experiences of the previous week exist to remind me – backups are really important. Take lots of them.

Taming Syslog-ng

Syslog-NG is a reimplementation of the traditional syslog protocol and application in unixy operating systems. Its use to systems administrators is that it gives companies and users the chance to fully centralise their logging in one place, which makes auditing, enforcing policy, compliance measuring, and gathering marketing statistics much easier.

Some potentially non-obvious hints to help fellow admins on the way :

– When I explained that the config file was a sprawling mess to a friend, he told me to think of it as a config style for a simple router. Messages are delivered, and match given conditions (a source, and optionally a filter). You can then bend these messages across different interfaces (or, logfiles). One ‘complete’ configuration needs to be made up from at least one message matching rule, and one logfile, so might look like this :

filter f_myapplication { facility(local3); };
destination l_myapplication { file("/var/log/application.log"); };
log { source(s_all); filter(f_myapplication); destination(l_myapplication); };

– Understand just how important the ‘flags’ rule is. I started logging one of our application logs without really realising what the catch-all rules were doing. I found I was logging application notes in /var/log/hosts/hostname.log, /var/log/messages, and the application log :

log { source(s_all); destination(hosts); };

is a good catch-all because it will almost certainly catch any log and file it in a sensible place (e.g. today’s notes from x host) – but when you are generating hundreds of GB of logs every day from your network, such a scattering of identical logs is a bad situation. I now have application-specific logs at the top of my file and specify flags(final); in the ‘log’ line – this means ‘stop here’. It probably follows that the most regularly hit log lines should be at the top of the list for performance.
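Put together, the ordering described above might look like the fragment below. It is a sketch that reuses the names from the earlier examples (s_all, f_myapplication, l_myapplication and hosts are assumed to be defined elsewhere in the config).

```
# Application-specific rule first; flags(final) stops further processing
# of any message that matches it.
log { source(s_all); filter(f_myapplication); destination(l_myapplication); flags(final); };

# Catch-all rule last; it only sees messages not claimed above.
log { source(s_all); destination(hosts); };
```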

– syslog-ng can completely obliterate the requirement for log rotation. This is definitely exciting, even if you are running syslog-ng locally instead of spooling log notes across your management LAN. Rotation is a pain: the night-time scripts which do the rotation and then send cluster-bombs of hangup signals might go wrong, catch-all configuration in tools such as logrotate needs to be understood by everyone, and you may find you need to do your logfile compression at a different time of day to your rotation, which just creates extra scripting jobs.

Logging straight into the ‘rotated’ or archive position makes much more sense. A destination directive such as :

destination l_redline { file("/var/log/application-combined/$R_YEAR/$R_MONTH/$R_YEAR-$R_MONTH-$R_DAY-application.log"); };

means an end to the overnight logrotate hour and compression – you can compress on an ad-hoc basis so that the maximum amount of uncompressed logs are available to you if you wish, and you always know that the logfiles will rotate at exactly the right moment, when the day changes (which is right in most of this company’s contexts..)
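Ad-hoc compression of the dated files could be sketched as below. compress_old_logs is a hypothetical helper name, and the sketch assumes the file naming from the destination above and that syslog-ng has finished writing to every file except today’s.

```shell
#!/bin/bash
# compress_old_logs DIR
# Hypothetical helper: gzip every dated logfile under DIR except today's,
# which syslog-ng may still be writing to.
compress_old_logs()
{
    local dir=$1
    local today
    today=$(date +%Y-%m-%d)
    find "$dir" -type f -name '*-application.log' \
        ! -name "${today}-application.log" -exec gzip {} +
}

# Usage:
#   compress_old_logs /var/log/application-combined
```

Because the date is part of the filename, the helper can run at any time of day without coordinating with syslog-ng.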

Building Perl modules on Solaris 10

I am writing this article in the hope that it is useful to someone searching Google for the answer.

I installed gcc from sunfreeware.org. When you do this, you must also install perl from sunfreeware.org to build CPAN modules. Put /usr/local/bin in your path before /usr/bin, or remove the vendor version of perl by pkgrm’ing the perl packages listed in /var/sadm/pkg.
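The path change can be sketched as follows, assuming the sunfreeware packages installed into /usr/local/bin as described; the CPAN command in the comment is one common way to start the build, not necessarily the one used here.

```shell
#!/bin/bash
# Put the sunfreeware tools ahead of the vendor ones for this shell session.
PATH=/usr/local/bin:$PATH
export PATH

# Show which perl will now be found first.
command -v perl || echo "perl not found in PATH"

# Then build the module with the CPAN shell, for example:
#   perl -MCPAN -e 'install HTML::Parser'
```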

Trying to install HTML::Parser, I still suffered from lots of breakage :

/usr/include/sys/siginfo.h:259: error: parse error before “ctid_t”
/usr/include/sys/siginfo.h:292: error: parse error before ‘}’ token
/usr/include/sys/siginfo.h:292: error: ISO C forbids data definition with no type or storage class
/usr/include/sys/siginfo.h:261: error: previous declaration of `__proc’
/usr/include/sys/siginfo.h:398: error: conflicting types for `__fault’
/usr/include/sys/siginfo.h:267: error: previous declaration of `__fault’
[...]

There isn’t much information about fixing this. You need to replace the vendor headers with those from gcc :

cd /usr/local/lib/gcc-lib/sparc-sun-solaris2.10/3.3.2/install-tools
./mkheaders

Then it works.

Monitoring SOA Drift with Nagios

At a recent UKNOF meeting, Nominet spoke of a test that they had written to ensure that all of the .uk domain name servers were serving the same ‘version’ of the UK zone file.

Such a script would solve the SOA-drift problem that we have at work. We often find that DNS zone change notifications do not get picked up when notifies are sent from the Unix to the Windows DNS architecture (which the Active Directory guys need us to do).

I have implemented my own version of this simple test for Nagios, written in Perl, and release it to the community in the hope that it is useful.

The script can be downloaded from http://www.andyd.net/media/check-dns-soamatch.pl.txt

The script needs to know which domain name is being monitored, and where you consider the ‘master’ dns server to be. The script then asks the master which other nameservers are specified in the zone with NS resource records, and then compares the SOA serial number of all specified nameservers, treating the serial offered by the master as ‘correct’.
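The comparison described above can be sketched in shell with dig. This is not the released Perl script, only an illustration of the logic: soa_serial, compare_serials and check_zone are hypothetical names, and returning 2 on a mismatch follows the usual Nagios convention for CRITICAL rather than anything confirmed about the script.

```shell
#!/bin/bash
# soa_serial SERVER DOMAIN: print the SOA serial SERVER reports for DOMAIN.
# In dig's short output the serial is the third field of the SOA record.
soa_serial()
{
    dig +short SOA "$2" @"$1" | awk '{ print $3 }'
}

# compare_serials MASTER_SERIAL: read "nameserver serial" pairs on stdin,
# report each mismatch, and return 2 if any were found.
compare_serials()
{
    local master=$1 ns serial errors=0
    while read -r ns serial; do
        if [ "$serial" != "$master" ]; then
            echo "$ns serves Serial $serial not $master"
            errors=$((errors + 1))
        fi
    done
    [ "$errors" -eq 0 ] || return 2
}

# check_zone MASTER DOMAIN: ask MASTER for the NS records, then compare
# every listed nameserver's serial against the master's.
check_zone()
{
    local master=$1 domain=$2 ns
    for ns in $(dig +short NS "$domain" @"$master"); do
        echo "$ns $(soa_serial "$ns" "$domain")"
    done | compare_serials "$(soa_serial "$master" "$domain")"
}

# Usage:
#   check_zone 10.0.1.4 big-shop.com
```

Keeping compare_serials free of network calls means the mismatch logic can be checked with canned input before pointing it at real nameservers.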

The nagios check should be run as :

perl check-dns-soamatch.pl -q [master dns server] -n [domain name]

If you run the script from the command line, you can see what is going on by adding the -d command line option (debug).

When running in debug mode, if everything works you will see this :

$ perl check-dns-soamatch.pl -d -q 10.0.1.4 -n big-shop.com
Nameserver to query: ns0.hd.big-shop.com
Nameserver to query: ns1.hd.big-shop.com
Master serial number from 10.0.1.4 is 2006060604
Checking server … ns0.hd.big-shop.com
Serial number from ns0.hd.big-shop.com is 2006060604
Checking server … ns1.hd.big-shop.com
Serial number from ns1.hd.big-shop.com is 2006060604
Everything OK testing domain big-shop.com

If SOA drift has occurred you will see this :

Nameserver to query: piggie.hd.big-shop.com
Nameserver to query: deliverance.hd.big-shop.com
Master serial number from 10.0.1.4 is 2006060610
Checking server … piggie.hd.big-shop.com
Serial number from piggie.hd.big-shop.com is 2006060609
Checking server … deliverance.hd.big-shop.com
Serial number from deliverance.hd.big-shop.com is 2006060703
piggie.hd.big-shop.com serves Serial 2006060609 not 2006060610 deliverance.hd.big-shop.com serves Serial 2006060703 not 2006060610 (2 errors).

The ‘stain’ options in the script can be ignored unless you are also running my http-post interface into Nagios (which is not formally released … but the code is on sourceforge).

Improvement suggestions welcome !

Book: Time Management for Systems Administrators

The key message of this book is that appropriate planning and prioritisation are central to managing time effectively.

Limoncelli offers a system called ‘The Cycle’ which is simple to implement straight away and tune to your workplace and methods. Small changes to your daily plan will result in more time to finish interesting projects. The key messages I took from this book are :

  • Do not check your email in the morning as your first job.
  • Do spend the first five minutes of the day planning your todo list for the day. This replaces your ‘global’ todo list and should contain the ‘right’ amount of work for one business day. Plan time against each todo item, and then prioritise it. If the work cannot be done, ‘manage’ your todo list and move the item to tomorrow.
  • Do use a single personal organiser, and carry it with you everywhere. If using an electronic organiser, check different calendaring software; you may find Date Book 5 better than the built in software for the cycle.
  • Keep your organiser by your bed when you sleep – if you remember something important which prevents you from sleeping, you can record it and rest more easily (this helps me enormously).
  • Document procedures in a step by step manner. When the process is documented it can be automated or delegated. A wiki is a simple way to keep documents up to date and give other people access to the documentation.
  • Always use a ticket/job system to record work.
  • Break projects with several stages into each stage for your todo list.
  • Respond to user requests. This might be done automatically by your software.
  • Email is a single touch mechanism. Receive an email, reply to it, or create a job in your todo list.
  • Manage interruptions – if you work in a team, share the role of handling inbound calls, tickets, and monitoring alerts.

This system helps you meet deadlines, and also create better deadline estimates.