Anyone responsible for operating multiple websites or a large-scale web-publishing enterprise has to have asked themselves these questions:
- Are my sites up?
- Are they actually functioning correctly?
- Are we making any money?
- Where did I put the aspirins/Tylenol/emergency whiskey?
More importantly, other people in your organization are asking you (at least) the first three questions on a regular basis. Not being able to answer with authority leads to many a sleepless night, with occasional trips to the computer to poke at a few of the pages, to see if they are still there. This is no way to live.
There are several different types of automated website monitoring, system tracking, and alert systems available, and I'll be going through the pros and cons of anything I find that looks either interesting or useful.
In addition, I'll touch on other topics related to this: SEO, statistics munging, and the Things that Make a Site Monitoring Service rock.
There is hope, then. Doing extremely repetitive, detail-oriented tasks thousands of times per hour is what computers excel at, and as mentioned, many people have come up with great services and pieces of software to do these things for you.
The problems come when you have to set them up, make sure they are monitoring the correct things, and make sure they inform the right people when your systems are faltering ... and that is something that computers are NOT very good at. Furthermore, a computer-programmer-oriented approach to the actual configuration interface often results in almost as much stress as having unknown, flaky failures on your systems in the first place. Finally, the messages you get and the information you see in your reports have to actually make sense ... few humans want to read enormous tables of numbers each day and try to relate them to the ones they saw yesterday. The reporting interface has to show you the patterns that emerge out of the complex data that the monitoring software is generating, without overwhelming you.
Features of an Alert System
Alerts, or notifications, are the attempts of a monitoring program to notify someone that something anomalous is happening to the monitored system. There is no point in having an automated test of any page or system if there is no way to remedy problems that occur. For the most part, our systems' best response is to try to bring the potential problem to the attention of a human being.
Two features are crucial to making a notification system effective. The first, obviously, is to be able to get through to someone who can address and fix the problem, as soon after it is detected as possible. The second is to make some judgment about how serious the problem is, so that the humans who are being notified are not flooded with alerts that do NOT represent problems, a situation in which real notices will be ignored.
Email is the most common way of attempting to fulfill the first requirement. Most if not all modern web monitoring systems allow the webmaster to select particular email addresses that should be sent an automated message when the monitoring system detects anomalies of a particular type or severity. Obviously, this relies on the email service working for that user, which can be a bit tricky if email is being handled by the same machine(s) that are currently exhibiting problems. It is frustrating to wade through the smoking wreckage of a web server or database, finally restore it to some working function, and only THEN receive a bunch of messages that would have kept the wreckage from smoking so furiously in the first place.
At a minimum, the system that handles email for anyone who is in charge of dealing with a notification should be hosted on a different system from whatever web services are being monitored. A better solution would be to have problems escalate through several different types of messaging solutions, in order to maximize the chance of the alert getting through. The more sophisticated (also more expensive) web monitoring solutions offer several means beyond email to get in touch with responsible parties ... Instant Messenger, ICQ, Skype, SMS text messaging, pager or a call on the traditional telephone network are all options that I've seen advertised by these services.
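The escalation idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the channel names and the `send_email`, `send_sms`, and `place_phone_call` functions are hypothetical stand-ins, not the API of any real monitoring service, and a real system would probably also wait for a human to acknowledge the alert before it stops escalating.

```python
from typing import Callable, Iterable, List, Tuple

# A channel is a (name, send) pair; send returns True if delivery succeeded.
Channel = Tuple[str, Callable[[str], bool]]


def escalate(message: str, channels: Iterable[Channel]) -> List[str]:
    """Try each channel in order until one reports success.

    Returns the names of the channels that were attempted, in order.
    """
    attempted: List[str] = []
    for name, send in channels:
        attempted.append(name)
        try:
            if send(message):
                break  # delivered, no need to escalate further
        except Exception:
            # A broken channel (say, a dead mail server) must not
            # prevent the next channel from being tried.
            continue
    return attempted


# Hypothetical stand-ins for real transports.
def send_email(msg: str) -> bool:
    raise ConnectionError("mail is hosted on the machine that is down")


def send_sms(msg: str) -> bool:
    return True


def place_phone_call(msg: str) -> bool:
    return True


tried = escalate(
    "web01 is unreachable",
    [("email", send_email), ("sms", send_sms), ("phone", place_phone_call)],
)
print(tried)  # ['email', 'sms']
```

Here the email attempt fails because mail lives on the broken machine, so the alert falls through to SMS, and the phone call is never needed.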
The second requirement, that of intelligent filtering of notifications, is generally more complicated, and often requires some trial and error by the person responsible for configuring web monitoring before it works correctly. Obviously, the first step is to differentiate between alerts of different severity levels, and send each different one to someone who is qualified to handle the problem. A useful next step is to be able to differentiate messages based on the particular sub-system being monitored, so as to be able to tell a database administrator when the database has a problem, versus telling a webmaster when the web server is down, or the networking people when the machine is completely unreachable.
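One way to picture this kind of filtering is a small routing table keyed on sub-system, with a minimum severity for each. The sketch below is an assumption-laden example: the severity names, the `ROUTES` table, and the example.com addresses are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed severity scale, lowest first.
SEVERITIES = ["info", "warning", "critical"]


@dataclass(frozen=True)
class Alert:
    subsystem: str  # e.g. "database", "webserver", "network"
    severity: str   # one of SEVERITIES
    text: str


# Hypothetical routing table: subsystem -> (minimum severity, recipients).
ROUTES: Dict[str, Tuple[str, List[str]]] = {
    "database": ("warning", ["dba@example.com"]),
    "webserver": ("warning", ["webmaster@example.com"]),
    "network": ("critical", ["netops@example.com"]),
}

# Anything we have no rule for goes to a catch-all address.
FALLBACK = ["oncall@example.com"]


def recipients_for(alert: Alert) -> List[str]:
    """Return who should hear about this alert; an empty list means nobody."""
    if alert.subsystem not in ROUTES:
        return FALLBACK
    minimum, people = ROUTES[alert.subsystem]
    if SEVERITIES.index(alert.severity) >= SEVERITIES.index(minimum):
        return people
    return []


print(recipients_for(Alert("database", "critical", "replication stopped")))
print(recipients_for(Alert("network", "warning", "packet loss at 2%")))
```

The first alert goes to the database administrator; the second is below the networking team's threshold and goes to nobody. Getting those thresholds right is where the trial and error comes in.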
In a perfect world, the system monitor could actually fix the problem itself, and send you a happy 'dealt with it' message. Since this is unlikely to occur in the immediate future, we have to assume that various responsible humans will have to receive these messages. So an important feature of any web monitoring alert system is to intrude upon the attention of the user only when something truly urgent is happening. The goal is to avoid 'message fatigue', in which the user gets so many messages from the monitoring system that they just automatically delete them, or assume that they are more of the same ... what a waste to have a truly important event missed because it is buried in a list of hundreds of 'system is ok' messages!
A simple way to keep from fatiguing the attention span (and flooding the email box) of the user is simply to remember some history about the set of alerts, and group them as a single event, like 'the web page changed significantly each of the last 20 times I looked at it', instead of sending 20 messages that say 'the web page changed'. Fewer messages mean that the ones the user does see are apt to receive more attention.
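As a rough sketch of that grouping, the class below remembers how many times each alert has fired and only speaks up on the first occurrence and then on every Nth repeat. The class name `AlertGrouper` and the `summary_every` parameter are invented for this example; real products will have their own (and probably smarter) rules, such as time windows.

```python
from collections import defaultdict
from typing import Dict, Optional


class AlertGrouper:
    """Collapse repeated alerts with the same key into occasional summaries."""

    def __init__(self, summary_every: int = 20) -> None:
        self.summary_every = summary_every
        self.counts: Dict[str, int] = defaultdict(int)

    def observe(self, key: str) -> Optional[str]:
        """Record one occurrence; return a message only when one is worth sending."""
        self.counts[key] += 1
        n = self.counts[key]
        if n == 1:
            return f"{key} (first occurrence)"
        if n % self.summary_every == 0:
            return f"{key} (seen {n} times since last reset)"
        return None  # stay quiet, this is more of the same

    def reset(self, key: str) -> None:
        """Call when the condition clears, so the next occurrence is reported."""
        self.counts.pop(key, None)


grouper = AlertGrouper(summary_every=20)
messages = []
for _ in range(20):
    message = grouper.observe("the web page changed")
    if message is not None:
        messages.append(message)

print(len(messages))  # 2
print(messages[-1])   # the web page changed (seen 20 times since last reset)
```

Twenty checks produce two messages instead of twenty: one when the change is first noticed, and one summary.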
There is potentially even more value in having the alerted parties give feedback to the web monitoring system, allowing it to 'learn' when and how to alert the user, and in what circumstances. A mechanism for accepting feedback and learning from it can be as simple as offering the notified party a way to respond to the message, with the option of saying 'yes this is important', 'please do not bug me with this, it is normal', or 'do not bug me about this unless it occurs a lot'. As in any system that relies on explicit user feedback to configure itself, however, it requires a great deal of buy-in from all of the humans involved. After all, if nobody ever responds, the thing can't learn, and the whole exercise is an expensive and useless failure. And it requires patience on their part too, as it takes time and a certain amount of work by all parties involved before the system is fully responding to their preferences.
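The simplest version of such a feedback mechanism just stores the last answer each user gave for each kind of alert. The sketch below assumes the three responses quoted above and a made-up `frequent_threshold` for what counts as 'a lot'; it is not how any particular service does it, and nothing here is "learning" in any sophisticated sense.

```python
from enum import Enum
from typing import Dict


class Feedback(Enum):
    IMPORTANT = "yes this is important"
    NORMAL = "please do not bug me with this, it is normal"
    ONLY_IF_FREQUENT = "do not bug me about this unless it occurs a lot"


class FeedbackPolicy:
    """Decide whether to notify, based on what the user told us last time."""

    def __init__(self, frequent_threshold: int = 5) -> None:
        # Assumed meaning of "a lot": this many occurrences or more.
        self.frequent_threshold = frequent_threshold
        self.preferences: Dict[str, Feedback] = {}
        self.counts: Dict[str, int] = {}

    def record_feedback(self, alert_type: str, feedback: Feedback) -> None:
        self.preferences[alert_type] = feedback

    def should_notify(self, alert_type: str) -> bool:
        self.counts[alert_type] = self.counts.get(alert_type, 0) + 1
        # With no feedback yet, err on the side of telling someone.
        preference = self.preferences.get(alert_type, Feedback.IMPORTANT)
        if preference is Feedback.NORMAL:
            return False
        if preference is Feedback.ONLY_IF_FREQUENT:
            return self.counts[alert_type] >= self.frequent_threshold
        return True


policy = FeedbackPolicy(frequent_threshold=3)
policy.record_feedback("page changed", Feedback.ONLY_IF_FREQUENT)
decisions = [policy.should_notify("page changed") for _ in range(4)]
print(decisions)  # [False, False, True, True]
```

Note the default: an alert type nobody has responded to is still treated as important, so a system with no buy-in degrades to sending everything rather than sending nothing.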
December 22, 2004 in Commentary