Monitoring is pretty straightforward – green means all is well. But getting to this point can be a whole different ball game

Feature Monitoring seems easy in principle. There is nothing particularly complex about the software or the protocols it uses to query systems and send alerts, nor about deciding what to monitor or configuring the product you choose.

Yet, while it's common to find a monitoring setup that's done reasonably well, it's very rare to find one done brilliantly.

The basic thing you want from your monitoring is to show that when everything is working within tolerance, all indicators are green. It seems obvious, but it is far from easy to achieve.

Let's look at a scenario from a company this correspondent worked with a few years ago, one that had a really impressive monitoring regime.

The network manager wanted to perform overnight maintenance that would take down the global WAN links to the main data center. This was a 1,200-person company with a dozen sites on four continents and a fairly standard setup of north of 1,000 virtual servers. He emailed the night shift warning them that certain alarms would occur, powered off a handful of key systems, spent 90 minutes making the required changes, powered everything back on, watched until everything on the management console turned green, emailed the night shift to say "done" and went home.

If you can get home at 1am, confident that the green indicators mean you have nothing else to test, and that the Delhi office will start work in an hour or two just as you are falling asleep, it's fair to say the monitoring was done well.

So how do you do it?

First of all, you must recognize that while change is essential in any business, it is the enemy of the monitoring regime. Ninety-nine point nine percent coverage in your monitoring is simply not enough – it is absolutely essential that any new device or system connected to the infrastructure be added to the monitoring regime. Miss one system and your monitoring is instantly undermined. So before you even consider putting in any significant effort to get everything onto the platform, develop and test your change policy and process for adding new systems and removing old ones. If that process is not absolutely nailed down by the time everything is being monitored, you will never be able to maintain 100 percent system coverage.
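To make the point concrete, here is a minimal sketch of the kind of nightly coverage check that process could feed, assuming you can export one hostname per line from both your asset register and your monitoring platform; the file names are illustrative, not from any particular product.

```python
# Minimal coverage-gap check: inventory vs what the monitoring platform knows.
# File names below are illustrative assumptions.

def read_hosts(path: str) -> set[str]:
    """Load one hostname per line, ignoring blanks and comments."""
    with open(path) as f:
        return {line.strip().lower() for line in f
                if line.strip() and not line.strip().startswith("#")}

inventory = read_hosts("cmdb_export.txt")       # everything that exists
monitored = read_hosts("monitoring_hosts.txt")  # everything being watched

unmonitored = inventory - monitored   # coverage gaps: add these to the platform
stale = monitored - inventory         # decommissioned kit still being polled

print(f"Coverage: {len(monitored & inventory)}/{len(inventory)} hosts")
for host in sorted(unmonitored):
    print("MISSING FROM MONITORING:", host)
for host in sorted(stale):
    print("NO LONGER IN INVENTORY:", host)
```

Anything this flags becomes a change-process failure to chase, not a judgment call.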

One of the easiest things to underestimate is the effort it takes to get everything into the monitoring platform. In our example it took one person, dedicated to the task, a little over a year – and he certainly wasn't lazy. More than 1,000 servers, network kit at a dozen sites, internet routers, firewalls, Wi-Fi access points, telephone systems: the numbers add up. And that is assuming you design the monitoring properly before you start configuring it – protocols, SNMP credentials, alert levels, alerting mechanisms and so on – rather than having to revisit devices later.
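That design work is easier if it is written down as data before anyone touches the tool. Here is a small sketch of what such a per-device-class template might capture; the field names and thresholds are purely illustrative assumptions, not any vendor's settings.

```python
# Design decisions worth recording per device class before configuration starts.
from dataclasses import dataclass, field

@dataclass
class MonitoringTemplate:
    device_class: str                  # e.g. "linux-server", "edge-router"
    protocol: str                      # e.g. "snmp-v3", "wmi", "agent"
    credential_ref: str                # pointer into a secrets store, never the secret itself
    poll_interval_s: int = 300
    warn_disk_pct: int = 80            # amber threshold
    crit_disk_pct: int = 95            # red threshold, pages the on-call
    alert_channels: list[str] = field(default_factory=lambda: ["email"])

TEMPLATES = [
    MonitoringTemplate("linux-server",   "snmp-v3", "vault:snmp/ro", 300, 80, 95, ["email", "sms"]),
    MonitoringTemplate("windows-server", "wmi",     "vault:wmi/ro",  300, 80, 95, ["email", "sms"]),
]
```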

As you add systems to the monitoring regime, you need more states than just "on" and "off" – that is, you can't simply flip something from "not monitored" to "monitored", because the inevitable result is that something will turn red at some point while you are halfway through configuring the item you just added. New items should sit in a state in which they exist and are visible (if they are not there, you cannot configure them) but do not contribute to status indicators or alerts. Once each item is completely configured and seems to be behaving, you can switch it to "live" mode.
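A minimal sketch of that idea, independent of any particular product – the state and function names below are assumptions for illustration:

```python
# Newly added items are visible but silent until deliberately promoted to live.
from enum import Enum, auto

class MonitorState(Enum):
    DISCOVERED = auto()   # exists in the platform, still being configured
    LIVE = auto()         # fully configured; contributes to dashboards and alerts
    RETIRED = auto()      # kept for history, never alerts

def should_alert(state: MonitorState, check_failed: bool) -> bool:
    """Only LIVE items are allowed to turn the console red or page anyone."""
    return state is MonitorState.LIVE and check_failed

# Halfway through configuring a new server, a failing check stays silent:
assert should_alert(MonitorState.DISCOVERED, check_failed=True) is False
assert should_alert(MonitorState.LIVE, check_failed=True) is True
```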

The next reason this particular company did such a good job of monitoring might be considered controversial by some: a lack of democracy. It's hard to imagine anyone with IT experience who hasn't encountered multiple instances of people or departments wanting an exception: the legal department claims the data in its document management system is too confidential to risk letting the monitoring system poll it; developers don't want their kit monitored, for vague and undisclosed reasons of "sensitivity" and "confidentiality"; an application's support team insists we don't need to monitor it because the company that supports it is already doing so; the R&D team doesn't want to give the IT people access to its proprietary systems to hook them into the monitoring.

The answer should be a simple “no”. The story your monitoring console gives you should be unequivocal, and the moment you succumb to someone asking for an exception is the moment when everything goes wrong.

The best companies this correspondent has worked for could all be categorized as "benevolent dictatorships". Senior management (often the owner) would have an idea, ask their trusted lieutenants and techies for suggestions, usually tweak the idea based on what people said, and then say: "Okay, go ahead, and tell me if anyone tries to get in the way." If you are trying to do monitoring properly and are told that you need to write papers and persuade people to agree, try to think of something else to do that is less painful than banging your head against a wall.

Another reason why this company was successful is that despite having extensive infrastructure, it had a very talented centralized infrastructure team. While there were IT staff – sometimes several – in each of the offices around the world, the network, server, storage and monitoring systems were all managed centrally. It was truly a joy to work in such a way that you could get by without having to constantly coordinate remote teams to configure credentials on servers or routers, or to chase them when other tasks were a priority. Distributed support and distributed teams don’t preclude success, of course – you just have to accept that there will be a much greater coordination effort in this scenario.

How long is a piece of string?

Perhaps the second trickiest part of doing your monitoring well – behind hitting the "100 percent" goal – is deciding how far you need to go with it. The obvious starting point is system availability and performance – up/down state, disk capacity, memory usage, network link utilization, LAN port errors – and these will be the default starting point. Precisely how deeply you monitor is, however, largely a "How long is a piece of string?" question, and the correct answer to "How far do we go?" is unique in each case.
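As a sketch of that obvious starting point for a single host, using only Python's standard library – the host address and the threshold are illustrative assumptions:

```python
# Basic availability and capacity checks for one host.
import shutil
import subprocess

def host_up(host: str) -> bool:
    """Basic reachability: one ICMP echo (assumes a Unix-like ping binary)."""
    return subprocess.run(["ping", "-c", "1", "-W", "2", host],
                          capture_output=True).returncode == 0

def disk_ok(path: str = "/", warn_pct: int = 80) -> bool:
    """True while disk usage stays below the amber threshold."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100 < warn_pct

print("up:", host_up("192.0.2.10"), "| disk below threshold:", disk_ok("/"))
```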

In reality, you go up the layers as far as it takes to be happy that "all green" equals "all is well". So for your main web server, for example, you wouldn't stop at "Can I ping it and is the LAN interface passing traffic?" You will have synthetic transactions that test the web layer and the database that underpins it, preferably with remote agents out on the internet somewhere, to prove that all the layers are functioning properly remotely and not just internally. Of course, doing extensive testing of this kind can mean the console takes several minutes to finally go completely green, but that is far better than having to do manual testing, or skimping on monitoring for the sake of time. And in most cases the console will show a steady progression of red spots turning green one by one, rather than 15 minutes of waiting with nothing happening, then (hopefully) BAM! – a sea of success.
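A minimal sketch of such a synthetic transaction, assuming a hypothetical public URL and a health endpoint that exercises the database behind the web tier; real monitoring products wrap this up for you, but the principle is the same.

```python
# Synthetic transactions: prove the web tier and the database behind it respond.
import sys
import requests   # third-party: pip install requests

CHECKS = [
    # (name, url, substring expected in the response body) - all illustrative
    ("web tier",  "https://www.example.com/",          "<title>"),
    ("db-backed", "https://www.example.com/health/db", '"status": "ok"'),
]

def run_checks(timeout_s: float = 10.0) -> bool:
    all_green = True
    for name, url, expected in CHECKS:
        try:
            resp = requests.get(url, timeout=timeout_s)
        except requests.RequestException as exc:
            print(f"[RED]   {name}: {exc}")
            all_green = False
            continue
        ok = resp.status_code == 200 and expected in resp.text
        print(f"[{'GREEN' if ok else 'RED'}]  {name}: HTTP {resp.status_code}")
        all_green = all_green and ok
    return all_green

if __name__ == "__main__":
    # Non-zero exit lets whatever scheduler runs this raise the alert.
    sys.exit(0 if run_checks() else 1)
```

Run the same script from an agent outside your own network as well as inside, so "green" means the service works for the people who actually use it.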

There is one last thing you need to do before you can trust your monitoring, and it is something that is so often overlooked or at least done poorly: testing. Most of us simply don't test our monitoring properly.

Why? Simple: to test it properly you have to break things. If you want to prove conclusively that the complex monitoring of your CRM system is set up correctly, the only absolutely sure way to do it is to fail the system in every way possible. Yes, you can implement artificial means to emulate a failure, but there can still be that lingering doubt as to whether they are really correct and whether an actual failure will look the same to the monitoring as the emulated one did.

While it is unrealistic to ask for additional downtime on the main systems, you can be a little smart and work with the owners of those systems to persuade them to let you do your monitoring tests the next time they have a period of planned downtime. It is a great compromise approach because you get to try out the scenarios for real, but nobody has to beg the business for more downtime than it already has planned – or inflict it on customers if it is a customer-facing system.
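A sketch of what such a test might look like during an agreed window – both the service name and the monitoring API endpoint below are hypothetical placeholders, so substitute whatever your platform actually exposes.

```python
# Failure injection during a planned downtime window: break the service for
# real, then prove the monitoring noticed before handing the window back.
import subprocess
import time
import requests

SERVICE = "crm-app"                                    # hypothetical unit name
ALERT_API = "https://monitor.example.com/api/alerts"   # hypothetical endpoint

def alert_raised(host: str, timeout_s: int = 600, poll_s: int = 30) -> bool:
    """Poll the monitoring platform until it reports a firing alert for host."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        alerts = requests.get(ALERT_API, params={"host": host}, timeout=10).json()
        if any(a.get("state") == "firing" for a in alerts):
            return True
        time.sleep(poll_s)
    return False

subprocess.run(["systemctl", "stop", SERVICE], check=True)
try:
    assert alert_raised("crm-app-01"), "Monitoring never went red - fix it before go-live"
finally:
    subprocess.run(["systemctl", "start", SERVICE], check=True)
```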

Monitoring, then, requires rigor and determination, thoroughness and testing. But when you get it to where it should be, and keep it there, it will completely change your outlook. ®

