Welcome

Welcome to the corner of my site where I focus on monitoring ideas and issues. I’ve been working with monitoring software and techniques for over 12 years now (almost half of my IT career), which includes work using SolarWinds, Tivoli, Nagios, Zenoss, HP OpenView, Shinken, BMC Patrol, SiteScope and a bunch of other tools that I’m probably overlooking.

Below you will find posts that offer tips, tricks, techniques and philosophies related to getting monitoring done. Want to call me out about something I said? Comment away! Want to know what I think about a topic or application? Use the Contact page to shoot me a note.

Enjoy!

 

Time for a Monitoring Tool? What to Look For.

You might think that implementing a network monitoring tool is like every other rollout. You would be wrong.

This article originally appeared in TechTarget SearchNetworking

Oh, so you’re installing a new network monitoring tool, huh? No surprise there, right? What, was it time for a rip-and-replace? Is your team finally moving away from monitoring in silos? Perhaps there were a few too many ‘Let me Google that for you’ moments with the old vendor’s support line?

Let’s face it. There are any number of reasons that could have led you to this point. What’s important is that you’re here. Now, you may think a new monitoring implementation is no different than any other rollout. There are some similarities, but there are also some critical elements that are very different. How you handle these can mean the difference between success and failure.

I’ve found there are three primary areas that are often overlooked when it comes to deploying a network monitoring application. This isn’t an exhaustive list, but taking your time with these three things will pay off in the end.

Scope–First, consider how far and how deep you need the monitoring to go. This will affect every other aspect of your rollout, so take your time thinking this through. When deciding how far, ask yourself the following questions:

  • Do I need to monitor all sites, or just the primary data center?
  • How about the development, test or quality assurance systems?
  • Do I need to monitor servers or just network devices?
  • If I do need to include servers, should I cover every OS or just the main one(s)?
  • What about devices in DMZs?
  • What about small remote sites across low-speed connections?

And when considering how deep to go, ask these questions:

  • Do I need to also monitor up/down for non-routable interfaces (e.g., EtherChannel connections, multiprotocol label switching links, etc.)?
  • Do I need to monitor items that are normally down and alert when they’re up (e.g., cold standby servers, cellular wide area network links, etc.)?
  • Do I need to be concerned about virtual elements like host resource consumption by virtual machine, storage, security, log file aggregation and custom, home-grown applications?

Protocols and permissions–After you’ve decided which systems to monitor and what data to collect, you need to consider the methods to use. Protocols such as Simple Network Management Protocol (SNMP), Windows Management Instrumentation (WMI), syslog and NetFlow each have their own permissions and connection points in the environment.

For example, many organizations plan to use SNMP for hardware monitoring, only to discover it’s not enabled on dozens (or hundreds) of systems. Alternatively, they find it is enabled, but the community strings are inconsistent, undocumented or unset. Then they go to monitor in the DMZ and realize that the security policy won’t allow SNMP across the firewall.

Additionally, remember that different collection methods have different access schemes. For example, WMI uses a Windows account on the target machine. If it’s not there, has the wrong permissions or is locked, monitoring won’t work. Meanwhile, SNMP uses a simple community string that can be different on each machine.
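As a sketch of how you might audit this before rollout (the host names, field names, and inventory format below are all invented for illustration, not taken from any real tool), a pre-flight check could flag systems whose planned collection method is missing the credential it depends on:

```python
# Hypothetical pre-flight check: flag hosts whose planned collection
# method lacks the credential that method requires.

REQUIRED_CREDENTIAL = {
    "snmp": "community_string",   # SNMP v1/v2c needs a community string
    "wmi": "windows_account",     # WMI needs a Windows account on the target
}

def missing_credentials(inventory):
    """Return (host, method) pairs that cannot be polled as planned."""
    problems = []
    for host in inventory:
        needed = REQUIRED_CREDENTIAL.get(host["method"])
        if needed and not host.get(needed):
            problems.append((host["name"], host["method"]))
    return problems

inventory = [
    {"name": "core-sw-01", "method": "snmp", "community_string": "public"},
    {"name": "dmz-web-01", "method": "snmp"},    # SNMP never configured
    {"name": "corpsrv042", "method": "wmi"},     # no Windows account on file
]

print(missing_credentials(inventory))
# [('dmz-web-01', 'snmp'), ('corpsrv042', 'wmi')]
```

Running something like this against your device list before the rollout surfaces the “SNMP isn’t even enabled” surprises while they’re still cheap to fix.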

Architecture–Finally, consider the architecture of the tools you’re considering. This breaks down to connectivity and scalability.

First, let’s consider connectivity. Agent-based platforms have on-device agents that collect and store data locally, then forward large data sets at regular intervals. Each collector bundles and sends this data to a manager-of-managers, which passes it to the repository. Meanwhile, agentless solutions use a collector that directly polls source devices and forwards the information to the data store.

You need to understand the connectivity architecture of these various tools so you can effectively handle DMZs, remote sites, secondary data centers and the like. You also need to look at the connectivity limitations of various tools, such as how many devices each collector can support and how much data will be traversing the wire, so you can design a monitoring implementation that doesn’t cripple your network or collapse under its own weight.
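A back-of-the-envelope estimate helps frame that conversation. Every number below is an assumed round figure for illustration, not a vendor specification:

```python
# Rough polling-load estimate for an agentless collector.
# All inputs are illustrative assumptions.

devices = 500
elements_per_device = 20     # interfaces, disks, sensors, ...
bytes_per_poll = 300         # rough size of one request/response pair
interval_seconds = 300       # 5-minute polling cycle

polls_per_second = devices * elements_per_device / interval_seconds
bandwidth_bps = polls_per_second * bytes_per_poll * 8

print(f"{polls_per_second:.1f} polls/sec, ~{bandwidth_bps/1000:.0f} kbit/s")
# 33.3 polls/sec, ~80 kbit/s
```

Eighty kbit/s is nothing in a data center, but it can matter a great deal on a small remote site’s low-speed link, which is exactly why the connectivity design has to come before the purchase order.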

Next comes scalability. Understand what kind of load the monitoring application will tolerate, and what your choices are to expand when — yes, when, not if — you hit that limit. To be honest, this is a tough one, and many vendors hope you’ll accept some form of an “it-really-depends” response.

In all fairness, it does depend, and some things are simply impossible to predict. For example, I once had a client who wanted to implement syslog monitoring on 4,000 devices. It ended up generating upwards of 20 million messages per hour. That was not a foreseeable outcome.
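For a sense of scale, the arithmetic on that incident works out like this:

```python
# Syslog volume from the example above: 20 million messages per hour
# across 4,000 devices.

messages_per_hour = 20_000_000
devices = 4_000

per_second = messages_per_hour / 3600
per_device_per_hour = messages_per_hour / devices

print(f"{per_second:,.0f} messages/sec overall")
print(f"{per_device_per_hour:,.0f} messages per device per hour")
# 5,556 messages/sec overall
# 5,000 messages per device per hour
```

Over 5,500 messages per second, sustained, is the kind of ingest rate that takes out an undersized collector or database no matter what the brochure promised.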

By taking these key elements of a monitoring tool implementation into consideration, you should be able to avoid most of the major missteps many monitoring rollouts suffer from. And the good news is that from there, the same techniques that serve you well during other implementations will help here. Ask lots of questions; meet with customers in similar situations (environment size, business sector, etc.); set up a proof of concept first; engage experienced professionals to assist as necessary; and be prepared — both financially and psychologically — to adapt as wrinkles crop up. Because they will.

IT Monitoring Scalability Planning: 3 Roadblocks

Planning for growth is key to effective IT monitoring, but it can be stymied by certain mindsets. Here’s how to overcome them.

This essay originally appeared on NetworkComputing.com

As IT professionals, planning for growth is something we do all day almost unconsciously. Whether it’s a snippet of code, provisioning the next server, or building out a network design, we’re usually thinking: Will it handle the load? How long until I’ll need a newer, faster, or bigger one? How far will this scale?

Despite this almost compulsive concern with scalability, there are still areas of IT where growth tends to be an afterthought. One of these happens to be my area of specialization: IT monitoring. So, I’d like to address growth planning (or non-planning) as it pertains to monitoring by highlighting several mindsets that typically hinder this important, but often surprisingly overlooked element, and showing how to deal with each.

The fire drill mindset
This occurs when something bad has already happened, either because there was no monitoring solution in place or because the existing toolset didn’t scale far enough to detect a critical failure, and so it was missed. Regardless, the result is usually a focus on finding a tool that would have caught the problem that already occurred, and finding it fast.

However, short of a TARDIS, there’s no way to implement an IT monitoring tool that will help avoid a problem after it occurs. Furthermore, moving too quickly as a result of a crisis can mean you don’t take the time to plan for future growth, focusing instead solely on solving the current problem.

My advice is to stop, take a deep breath, and collect yourself. Start by quickly, but intelligently developing a short list of possible tools that will both solve the current problem and scale with your environment as it grows. Next, ask the vendors if they have free (or cheap) licenses for in-house demoing and proofs of concept.

Then, and this is where you should let the emotion surrounding the failure creep back in, get a proof-of-concept environment set up quickly and start testing. Finally, make a smart decision based on all the factors important to you and your environment. (Hint: one of which should always be scalability.) Then implement the tool right away.

The bargain hunter
The next common pitfall that often prevents better growth planning when implementing a monitoring tool is the bargain-hunter mindset. This usually occurs not because of a crisis, but when there is pressure to find the cheapest solution for the current environment.

How do you overcome this mindset? Consider the following scenario: If your child currently wears a size 3 shoe, you absolutely don’t want to buy a size 5 today, right? But you should also recognize that your child is going to grow. So, buying enough size 3 shoes for the next five years is not a good strategy, either.

Also, if financials really are one of the top priorities preventing you from better preparing for future growth, remember that the cheapest time to buy the right-sized solution for your current and future environment is now. Buying a solution for your current environment alone because “that’s all we need” is going to result in your spending more money later for the right-sized solution you will need in the future. (I’m not talking about incrementally more, but start-all-over-again more.)

My suggestion is to use your company’s existing business growth projections to calculate how big of a monitoring solution you need. If your company foresees 10% revenue growth each year over the next three years and then 5% each year after that, and you are willing to consider completely replacing your monitoring solution after five years, then buy a product that can scale to at least 40% more than the size you currently need.
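Worked out, the compounding in that example looks like this (a quick sketch; the actual factor lands a bit above a rounded 40%):

```python
# Five-year sizing target: 10% growth for three years,
# then 5% growth for the remaining two years.

factor = (1.10 ** 3) * (1.05 ** 2)
print(f"Scale target: {factor:.2f}x current size")
# Scale target: 1.47x current size
```

In other words, about 47% more capacity than today, which is why rounding the target down is the wrong direction to err in.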

The dollar auction
The dollar auction mindset happens when there is already a tool in place — a tool that wasn’t cheap and that a lot of time was spent perfecting. The problem is, it’s no longer perfect. It needs to be replaced because company growth has expanded beyond its scalability, but the idea of walking away from everything invested in it is a hard pill to swallow.

Really, this isn’t so much of a mindset that prevents preparing for future growth as it is something that’s all too often overlooked as an important lesson: If only you had better planned for future growth the first time around. The reality is that if you’re experiencing this mindset, you need a new solution. However, don’t make the same mistake. This time, take scalability into account.

Whether you’re suffering from one of these mindsets or another that is preventing you from better preparing your IT monitoring for future growth, remember, scalability is key to long term success.

Don’t Tell Me It’s Complicated

HT to my hero and writing inspiration Seth Godin. His post here got me started, and his style is something I have wanted to emulate for years now.


Please

don’t tell me that it – monitoring – is complicated.

Don’t tell me you’re a snowflake – unique in your need for 1200 alert rules.

Don’t tell me “but our company is different. WE create value for our shareholders. Not like your other clients.”

Don’t tell me you can’t do it because…

Because

I’ve been creating monitoring solutions for over a decade.

I’ve designed solutions that scaled to 250,000 systems, in 5,000 locations.

I work at a company that has written millions of lines of code to do this one thing, and do it well.

So please, don’t tell me it’s complicated.

Tell me what you need. What you want. What you wish you could have.

And then LISTEN to what I have to say. Because I’ve seen this before. I’ve done this before. And it’s NOT complicated. It’s also not easy.

But it is simple.

The Four Questions: What is Being Monitored On My System?

This article originally appeared here on PacketPushers.net. I’m re-posting here for posterity.


In the last two posts in this series, I described two of the four (ok, really five) questions that monitoring professionals are frequently asked: Why did I get an alert? and Why didn’t I get an alert? My goal in this post is to give you the tools you need to answer the third question: What is being monitored on my system?

Reader’s Note: While this article uses examples specific to the SolarWinds monitoring platform, my goal is to provide information and techniques which can be translated to any toolset.

Not so fast, my friend!

It’s 1:35pm. Your first two callers—the one who wanted to know why they got a particular alert and the one who wanted to know why they didn’t get an alert—are finally a distant memory, and you’ve managed to squeeze out some productive work setting up a monitor that will detect when your cell phone backup circuit will…

That’s when your manager ambles over and looks at you expectantly over the cube wall.

“I just met with the manager of the APP-X support team,” he tells you. “They want a matrix of what is monitored on their system.”

To his credit he adds, “I checked on the system in the Reports section, but nothing jumped out at me. Did I overlook something?”

This question is solved with a combination of foresight (knowing you are going to get asked this question), preparation, and know-how.

It is also one of the questions where the steps are extremely tool-specific. An agent-based solution will have a lot of this information embedded in the agent, whereas with an agentless solution like SolarWinds you will probably find what you need in a central database or configuration system.

Understand that this is a question that you can answer, and with some preparation you can have the answer with the push of a button (or two). But like so many solutions in this series, preparation now will save you from desperation later.

It’s also important to recognize that this type of report is absolutely essential, both to you and to the owner of the systems.

<PHILOSOPHY>

I believe strongly that monitoring is more than just watching the elements which will create alerts (be they tickets, emails, pager messages, or an ahoogah noise over the loudspeakers). Your monitoring scope should cover elements which are used for capacity planning, performance analysis, and forensics after a critical event. For example, you may never alert on the size of the swap drive, but you will absolutely want to know what its size was at the time of a crash. For that reason, knowing what is monitored is essential, even if you won’t alert on some of those elements.

</PHILOSOPHY>

The knee bone’s connected to the…

In order to answer this question, the first thing you should do is break down the areas of monitoring data. That can include:

  1. Hardware information for the primary device – CPU and RAM are primary items but there are other aspects. The key is that you are ONLY interested in hardware that affects the whole “chassis” or box, not sub-elements like cards, disks, ports, etc.
  2. Hardware sub-elements – You may find that you have one, more than one, or none. Network cards, disks, external mount points, and VLANs are just a few examples.
  3. Specialized hardware elements – Fans, power supplies, temperature sensors, and the like.
  4. Application components – PerfMon counters, services, processes, logfile monitors and more—all the things that make up a holistic view of an application.

And now… I’m going to stop. While there are certainly many more items on that list, if you can master the concept of those first few bullets, adding more should come fairly easily.

It should be noted that this type of report is not a fire-and-forget affair. It’s more of a labor of love that you will come back to and refine over time.

I also need to point out that this will likely not be a one-size-fits-all-device-types solution. The report you create for network devices like routers and access points may need to be radically different from the one you build for server-type devices. Virtual hosts may need data points that have no relevance to anything else. And specialty devices like load balancers, UPS-es, or environmental systems are in a class of their own.

Finally, in order to get what you want, you also have to understand how the data is stored, and be extremely comfortable interacting with that system. Because the tool I have handy is SolarWinds, that’s the data structure we’re going to use here.

As I mentioned earlier, this type of report is push-button simple on some toolsets. If that’s the case for you, then feel free to stop reading and walk away with the knowledge that you will be asked this question on a regular basis, and you should be prepared.

For those using a toolset where answering this question requires effort, read on!

select nodes.statusled, nodes.nodeid, nodes.caption,
 s1.LED, s1.elementtype, s1.element, s1.Description
 from nodes
 left join (
select '01' as elementorder, 'HW' as elementtype, APM_HardwareAlertData.nodeid, APM_HardwareAlertData.HardwareStatus as LED, APM_HardwareAlertData.Model as element, APM_HardwareAlertData.SensorsWithStatus as description
 from APM_HardwareAlertData
union all
select '02' as elementorder, 'NIC' as elementtype, interfaces.nodeid as NodeID, interfaces.statusled as LED, interfaces.InterfaceName as element, interfaces.InterfaceTypeDescription as Description
 from interfaces
union all
select '03' as elementorder, 'DISK' as elementtype, volumes.nodeid as NodeID, volumes.statusled as LED, volumes.caption as element, volumes.VolumeType as description
 from volumes
) s1
on nodes.nodeid = s1.NodeID
 order by nodes.nodeid, s1.elementorder asc, s1.element

Catch all that? OK, let’s break this down a little. The basic format of this structure is:

Get device information (name, IP, etc.)

Get sub-element information (name, status, description)

The key to this process is standardizing the sub-element information so that it’s consistent across each type of element.

One thing to note is that the SQL “union all” command will let you combine results from two separate queries – such as a query to the interfaces table and another to the volumes table. BUT it requires that each query return the same number of columns. In my example, I’ve kept it simple – just the name and description, really.

The other trick I learned was to add icons rather than text wherever possible. That includes the “statusLED” and “status” columns, which display dots instead of text when rendered by the SolarWinds report engine. I find this gives a much quicker sense of what’s going on (oh, they’re monitoring this disk, but it’s currently offline).

Another addition worth noting is:

select ‘xx’ as elementorder, ‘yyy’ as elementtype,

I use elementorder to sort each of the query blocks, and elementtype to give the report reader a clue as to the source of this information (disk, application, hardware, etc.)

But what if I included data points that existed for some elements, but not for others? Well, you still have to ensure that each query connected by “union all” has the same number of columns. So let’s say that we wanted to include circuit capacity for interfaces, disk capacity for disks, but nothing for hardware, it would look like this:

select nodes.statusled, nodes.nodeid, nodes.caption,
 s1.LED, s1.elementtype, s1.element, s1.Description, s1.capacity
 from nodes
 left join (
select '01' as elementorder, 'HW' as elementtype, APM_HardwareAlertData.nodeid, APM_HardwareAlertData.HardwareStatus as LED, APM_HardwareAlertData.Model as element, APM_HardwareAlertData.SensorsWithStatus as description, '' as capacity
 from APM_HardwareAlertData
union all
select '02' as elementorder, 'NIC' as elementtype, interfaces.nodeid as NodeID, interfaces.statusled as LED, interfaces.InterfaceName as element, interfaces.InterfaceTypeDescription as Description, interfaces.InterfaceSpeed as capacity
 from interfaces
union all
select '03' as elementorder, 'DISK' as elementtype, volumes.nodeid as NodeID, volumes.statusled as LED, volumes.caption as element, volumes.VolumeType as description, volumes.VolumeSize as capacity
 from volumes
) s1
on nodes.nodeid = s1.NodeID
 order by nodes.nodeid, s1.elementorder asc, s1.element

By adding '' as capacity to the hardware block (and any other section where it’s needed), we avoid errors with the union all command.

Conspicuous by their absence in all of this are the things I listed first on the “must have” list: CPU, RAM, etc.

In this case, I used a couple of simple tricks. For CPU, I give the count of CPUs, since any other data (current processor utilization, etc.) is probably not helpful. For RAM, the solution is even simpler: I just query the nodes table again and pull out JUST the total memory.

select nodes.statusled, nodes.nodeid, nodes.caption, s1.LED, s1.elementtype, s1.element, s1.Description, s1.capacity
 from nodes
 left join (
select '01' as elementorder, 'CPU' as elementtype, c1.NodeID, 'Up.gif' as LED, 'CPU Count:' as element, CONVERT(varchar,COUNT(c1.CPUIndex)) as description, '' as capacity
 from (select DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex from CPUMultiLoad) c1
 group by c1.NodeID
union all
select '02' as elementorder, 'RAM' as elementtype, nodes.NodeID, 'Up.gif' as LED, 'Total RAM' as element, CONVERT(varchar,nodes.TotalMemory) as description, '' as capacity
 from nodes
 union all
 select '03' as elementorder, 'HW' as elementtype, APM_HardwareAlertData.nodeid, APM_HardwareAlertData.HardwareStatus as LED, APM_HardwareAlertData.Model as element, APM_HardwareAlertData.SensorsWithStatus as description, '' as capacity
 from APM_HardwareAlertData
union all
select '04' as elementorder, 'NIC' as elementtype, interfaces.nodeid as NodeID, interfaces.statusled as LED, interfaces.InterfaceName as element, interfaces.InterfaceTypeDescription as Description, interfaces.InterfaceSpeed as capacity
 from interfaces
union all
select '05' as elementorder, 'DISK' as elementtype, volumes.nodeid as NodeID, volumes.statusled as LED, volumes.caption as element, volumes.VolumeType as description, volumes.VolumeSize as capacity
 from volumes
union all
select '06' as elementorder, 'APP' as elementtype, APM_AlertsAndReportsData.nodeid as NodeID, APM_AlertsAndReportsData.ComponentStatus as LED, APM_AlertsAndReportsData.ComponentName as element, APM_AlertsAndReportsData.ApplicationName as description, '' as capacity
 from APM_AlertsAndReportsData
union all
select '07' as elementorder, 'POLLER' as elementtype, CustomPollerAssignmentView.NodeID, 'up.gif' as LED, CustomPollerAssignmentView.CustomPollerName as element, CustomPollerAssignmentView.CustomPollerDescription as description, '' as capacity
 from CustomPollerAssignmentView
) s1
on nodes.nodeid = s1.NodeID
 order by nodes.nodeid, s1.elementorder asc, s1.element

In this iteration I found that the data output from some sources was integer, some was float, and some was even text! So I started using the “CONVERT()” option to keep everything in the same format.

The result looks something like this:

[Screenshot: sample report output, one row per node with its monitored sub-elements]

I could stop here and you would have, more or less, the building blocks you need to build your own “What is monitored on my system?” report. But there is one more piece that takes this type of report to the next level.

Including the built-in thresholds for these elements adds complexity to the query, but it also adds an entirely new (and important) dimension to the information you are providing.

More than ever, the success of this type of report lies in knowing where threshold data is kept. In the case of SolarWinds, a series of “Thresholds” views (InterfacesThresholds, NodesPercentMemoryUsedThreshold, NodesCpuLoadThreshold, and so on) makes the job easier, but you still have to know where thresholds are kept for applications, and that there are NO built-in thresholds for disks or custom pollers.

With that said, the final report query would look like this:

select nodes.statusled, nodes.nodeid, nodes.caption, s1.LED, s1.elementtype, s1.element, s1.Description, s1.capacity,
 s1.threshold_value, s1.warn, s1.crit
 from nodes
 left join (
select '01' as elementorder, 'CPU' as elementtype, c1.NodeID, 'Up.gif' as LED, 'CPU Count:' as element, CONVERT(varchar,COUNT(c1.CPUIndex)) as description, '' as capacity, 'CPU Utilization' as threshold_value, convert(varchar, t1.Level1Value) as warn, convert(varchar, t1.Level2Value) as crit
 from (select DISTINCT CPUMultiLoad.NodeID, CPUMultiLoad.CPUIndex from CPUMultiLoad) c1
 join NodesCpuLoadThreshold t1 on c1.nodeID = t1.InstanceId
 group by c1.NodeID, t1.Level1Value, t1.Level2Value
union all
select '02' as elementorder, 'RAM' as elementtype, nodes.NodeID, 'Up.gif' as LED, 'Total RAM' as element, CONVERT(varchar,nodes.TotalMemory) as description, '' as capacity, 'RAM Utilization' as threshold_value, convert(varchar, NodesPercentMemoryUsedThreshold.Level1Value) as warn, convert(varchar, NodesPercentMemoryUsedThreshold.Level2Value) as crit
 from nodes
 join NodesPercentMemoryUsedThreshold on nodes.nodeid = NodesPercentMemoryUsedThreshold.InstanceId
union all
select '95' as elementorder, 'HW' as elementtype, APM_HardwareAlertData.nodeid, APM_HardwareAlertData.HardwareStatus as LED, APM_HardwareAlertData.Model as element, APM_HardwareAlertData.SensorsWithStatus as description, '' as capacity, '' as threshold_value, '' as warn, '' as crit
 from APM_HardwareAlertData
union all
select '03' as elementorder, 'NIC' as elementtype, interfaces.nodeid as NodeID, interfaces.statusled as LED, interfaces.InterfaceName as element, interfaces.InterfaceTypeDescription as Description, interfaces.InterfaceSpeed as capacity, 'bandwidth in/out' as threshold_value,
 convert(varchar, i1.Level1Value)+'/'+convert(varchar,i2.level1value) as warn, convert(varchar, i1.Level2Value)+'/'+convert(varchar,i2.level2value) as crit
 from interfaces
 join (select InterfacesThresholds.instanceid, InterfacesThresholds.level1value, InterfacesThresholds.level2value
 from InterfacesThresholds where InterfacesThresholds.name = 'NPM.Interfaces.Stats.InPercentUtilization') i1 on interfaces.interfaceid = i1.InstanceId
 join (select InterfacesThresholds.instanceid, InterfacesThresholds.Level1Value, InterfacesThresholds.level2value
 from InterfacesThresholds where InterfacesThresholds.name = 'NPM.Interfaces.Stats.OutPercentUtilization') i2 on interfaces.interfaceid = i2.InstanceId
union all
select '04' as elementorder, 'DISK' as elementtype, volumes.nodeid as NodeID, volumes.statusled as LED, volumes.caption as element, volumes.VolumeType as description, volumes.VolumeSize as capacity, '' as threshold_value, '' as warn, '' as crit
 from volumes
union all
select '05' as elementorder, 'APP' as elementtype, APM_AlertsAndReportsData.nodeid as NodeID, APM_AlertsAndReportsData.ComponentStatus as LED, APM_AlertsAndReportsData.ComponentName as element, APM_AlertsAndReportsData.ApplicationName as description, '' as capacity, 'CPU Utilization' as threshold_value, convert(varchar, APM_AlertsAndReportsData.[Threshold-CPU-Warning]) as warn, convert(varchar, APM_AlertsAndReportsData.[Threshold-CPU-Critical]) as crit
 from APM_AlertsAndReportsData
union all
select '06' as elementorder, 'POLLER' as elementtype, CustomPollerAssignmentView.NodeID, 'up.gif' as LED, CustomPollerAssignmentView.CustomPollerName as element, CustomPollerAssignmentView.CustomPollerDescription as description, '' as capacity, '' as threshold_value, '' as warn, '' as crit
 from CustomPollerAssignmentView
) s1
on nodes.nodeid = s1.NodeID
 order by nodes.nodeid, s1.elementorder asc, s1.element

[Screenshot: the same report, now with threshold, warning, and critical columns]

It’s 90% Perspiration…

While answering this question requires persistence, skill, and in-depth knowledge of your monitoring toolset, the benefits are significantly greater than for the previous two questions.

Done right, teams can use this report to validate that the correct elements on each device are monitored – nothing is left out, nothing which has been decommissioned is still there. And when an alert does trigger, it will be easier to understand where you can look for hints, instead of just clicking around screens looking for something interesting.

Stock up on your tea leaves, goat entrails, and crystal balls because in my next post we’re going to take a peek into the future by answering the question “What WILL alert on my system?”

The Four Questions: Why Didn’t I Get An Alert?

This article originally appeared here on PacketPushers.net. I’m re-posting here for posterity.


In the first part of this series, I described the four (ok, really five) questions that monitoring professionals are frequently asked: Why did I get an alert? Why didn’t I get an alert? What is being monitored on my system? What will alert on my system? and What standard monitoring do you do? My goal in this post is to give you the tools you need to answer the second question:

Why DIDN’T I get an alert?

I introduced all of the questions in greater detail here. You can find information on the first question (Why did I get this alert?) here.

Reader’s Note: While this article uses examples specific to the SolarWinds monitoring platform, my goal is to provide information and techniques which can be translated to any toolset.

**************

Good Morning, Dave…

It’s 9:45am. You finally got your first caller off the phone – the one who wanted to know why they got a particular alert.

You hear the joyous little “ping” of a new email. (Never mind that the joy of that ping sound got old 15 minutes after you set it up.)

“We had an outage on CorpSRV042, but never got a notification from your system. What exactly are we paying all that money for, anyway?”

Out of The Four Questions of monitoring, this one is possibly the most labor-intensive because proving why something didn’t trigger requires an intimate knowledge of both the monitoring in place and the on-the-ground conditions of the system in question.

Unlike the previous question, there’s very little preparation you can do to lessen your workload. For that reason, my advice to you is going to be more of a checklist of items and areas to look into.

I’m also going to be working on the assumption that an event really did happen, and that a monitor was in place, which (ostensibly) should have caught it.

So what could have failed?

What We Have Here is a Failure…

It’s a non-failure failure (it was designed to work that way)

Items in this category represent more of a lack of awareness on the part of the device owner about how monitoring works. Once you narrow down the alleged “miss” to one of these, the next thing you need to evaluate is whether you should provide additional end-user education (lunch-and-learns, documentation, a narrated interpretive dance piece in the cafeteria, etc.).

  • Alert windows
    Some alerting systems will allow you to specify that the alert should only trigger during certain times of the day. If the problem occurred and corrected itself (or was manually corrected) outside of that window, then no alert is triggered.

  • Alert duration
    The best alerts are set up to look for more than one occurrence of the issue. Nobody wants to get a page at 2:00am when a device failed one ping. But if you have an alert that is set to trigger when ping fails for 15 minutes, 13 minutes can seem like an eternity to the application support team that already knows the app is down.

  • The alert never reset
    After the last failure, the owners of the device worked on the problem, but they never actually got it into a good enough state where your monitoring solution registered that it was actually resolved. This also happens when staff decide to close the ticket (because it was probably nothing, anyway) without looking at the system. A good example of this is the disk alert that triggers when the disk is over 90% utilized, but doesn’t reset until it’s under 70% utilized. Staff may clear logs and get things down to a nice comfy 80%, but the alert never resets. Thus when the disk fills to 100% a week later, no alert is cut.

  • The device was unmanaged
    It’s surprisingly easy to overlook that cute blue dot—especially if your staff aren’t in the habit of looking at the monitoring portal at all. Nevertheless, if it’s unmanaged, no alerts will trigger.

  • Mute, Squelch, Hush, or “Please-Shut-Up-Now” functions
    I’m a big believer in using custom properties for many of the things SolarWinds doesn’t do by design, and one of the first techniques I developed was a “mute” option. This gets around the issue with UnManage, where monitoring actually stops. Read more about that here (https://thwack.solarwinds.com/message/142288). With that said, if you use this type of option, you’ll also need to check its status as part of your analysis when you get this type of question.

  • Parent-Child, part 1
    The first kind of parent-child option I want to talk about is the one added to Orion Core in version 10.3. From that point forward, SolarWinds had the ability to make one device, element, application, component, group, or “thingy” (I hope I don’t lose you with these technical terms) the parent of another device, element, application, component… etc. Thus, if the parent is deemed to be down, the child will NOT trigger an alert (even if it is also down) but will rather be listed in a state of “unreachable.”

  • Parent-Child, part 2
    The second kind of suppression is implicit but often unrecognized by many users. In its simplest terms, if a device is down, you won’t get an alert about the disk being full. That one makes sense. But frequently an application team will ask why they didn’t get an alert that their app is down, and the reason is that the device was down (i.e., ping had failed) during that same period. Because of this, SolarWinds suppressed the application-down alert.
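
The reset-threshold trap in the list above (trigger at 90%, reset below 70%) is easiest to see as a tiny state machine. Here's a minimal Python sketch; the thresholds mirror the disk example, and everything else is illustrative rather than how any particular tool implements it:

```python
def disk_alert_states(samples, trigger=90, reset=70):
    """Walk a series of disk-utilization samples and record when the
    alert trips and when it resets. The alert only resets once usage
    drops below the reset threshold, not the trigger threshold."""
    active = False
    events = []
    for pct in samples:
        if not active and pct > trigger:
            active = True
            events.append(("TRIGGER", pct))
        elif active and pct < reset:
            active = False
            events.append(("RESET", pct))
    return events

# Staff clear logs down to 80% -- above the 70% reset point -- so the
# alert never resets, and the later climb to 100% fires nothing new.
print(disk_alert_states([95, 80, 100]))  # [('TRIGGER', 95)]
print(disk_alert_states([95, 60, 100]))  # three events: trigger, reset, trigger
```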

Failure with Change Control

In this section, the issues we’re looking at are changes, either in the environment or within the monitoring system, that cause an alert to be missed. I’m calling this “change control” because if there had been a record and/or awareness of the change in question (as well as of how the alert is configured), the device owner probably wouldn’t be calling you.

  • A credential changed
    If someone changes the SNMP string or the AD account password you’re using to connect, your monitoring tool ceases to be able to collect. Usually you’ll get some type of indication that this has happened, but not always.

  • The network changed
All it takes is one firewall rule or routing change to make the difference between beautiful data and total radio silence. The problem is that the most basic monitoring—ping—is usually not impacted, so you won’t get the device-down message that everyone generally relies on to know something is amiss. But higher-level protocols like SNMP or WMI are blocked. You end up with a device which is up (ping) but where disk or CPU information is no longer being collected.

  • A custom property changed
As I said before, I love me some custom properties. Along with the previously mentioned “mute,” there are properties for the owner group, location, environment (prod, dev, QA, etc.), server status (build, testing, managed), criticality (low, normal, business-critical), and more. Alert triggers leverage these properties so that we can have escalated alerts for some devices and just warnings for others. But what happens when someone changes a server status from “PRODUCTION” to “DEV”? If an alert is configured to look specifically for production servers, that alert is now quietly missed.

  • Drive (or other element) has been removed/changed
I say drives because this seems to happen most often. If your environment doesn’t include a “disk down” alert (don’t laugh, I’ve seen them), then volumes can be unmounted, or mounted with a new name, with amazing frequency. When that happens, many monitoring tools do not automatically start monitoring the new element; and the tools that do almost never apply all the correct settings (like those custom properties). You end up with a situation where the device owner is completely aware of the update, but monitoring is the last to know.

  • The Server Vanished into the Virtualization Cloud
    The drive toward virtualization is strong. When a server goes from physical to virtual (P-to-V), it’s effectively a whole new machine. Even though the IP address and server name are the same, the drives go from a series of physical disks attached to a real storage controller to (usually fewer) volumes which appear to be running off a generic SCSI bus. Not only that, but other elements (interfaces, hardware monitors, CPU, and more) all completely change. Almost all monitoring tools require manual updating to track those changes, or else you are left with ghost elements that don’t respond to polling requests.
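
Several of these change-control misses come down to exact-match trigger conditions against custom properties. A hypothetical sketch (the node records and the `server_status` property name are invented for illustration):

```python
# Hypothetical node records with a "server_status" custom property.
nodes = [
    {"name": "WinSrvABC123", "server_status": "PROD"},
    {"name": "WinSrvXYZ789", "server_status": "DEV"},  # was PROD last week
]

def in_alert_scope(node):
    # The trigger condition only looks for the exact value "PROD".
    return node["server_status"] == "PROD"

# One property edit silently removes a server from the alert's scope.
print([n["name"] for n in nodes if in_alert_scope(n)])  # ['WinSrvABC123']
```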

Failure of the Monitoring Technology

The previous two sections speak to educational or procedural breakdowns. But loath as I am to admit it, sometimes our monitoring tools fail us too. Here are some things you need to be aware of:

  • The element or device is not actually getting polled
    Often, this is a result of disks or other elements being removed and added; or a P-to-V migration (see previous section). But it also happens that an element simply stops getting polled. You’ll see this when you dig into the detailed data and find nothing collected for a certain period of time.

  • Polling interval is throttled
    One of the first things that a polling server does (at least the good ones) when it begins to be overwhelmed is to pull back on polling cycles so that it can collect at least SOME data from each element. You’ll see this as gaps in data sets. It’s not a wholesale loss of polling, but sort of a herky-jerky collection.

  • Polling data is out of sync
    This one can be quite challenging to nail down. In some cases, a monitoring system will add data into the central data store using localized times (either from the polling engine or (horrors!) from the target device itself). If that happens, then an event that occurred at 9am in New York shows up as having happened at 8am in Chicago. This shouldn’t be a problem unless, as mentioned in the first section, your system won’t trigger alerts before 9am.
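
One defensive habit that sidesteps the out-of-sync problem is normalizing every timestamp to UTC on the way into the data store and converting to local time only for display. A sketch using Python’s standard library (the zone names are just examples):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

UTC = ZoneInfo("UTC")

def store_timestamp(local_dt):
    """Convert a poller's localized timestamp to UTC before it is written."""
    return local_dt.astimezone(UTC)

# 9:00am in New York and 8:00am in Chicago are the same instant;
# stored naively, they look like two different events.
ny = datetime(2015, 6, 1, 9, 0, tzinfo=ZoneInfo("America/New_York"))
chi = datetime(2015, 6, 1, 8, 0, tzinfo=ZoneInfo("America/Chicago"))
print(store_timestamp(ny) == store_timestamp(chi))  # True
```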

Failure Somewhere After the Monitoring Solution

As much as you might hate to admit it, monitoring isn’t the center of the universe. And it’s not even the center of the universe with regard to monitoring. If everything within your monitoring solution checks out and you are still scratching your head, here are some additional areas to look into:

  • Email (or whatever notification system you use) is down
    One of the most obvious items, but often not something IT pros think to check, is whether the system that is sending out alerts is actually alive and well. If email is down, you won’t get that email telling you the email server has crashed.

  • Event correlation rules
    Event correlation rules are wonderful, magical things. They take you beyond the simple parent-child suppression discussed earlier, and into a whole new realm of dependencies. But there are times when they inhibit a “good” alert in an unexpected way:

    • De-duplication
      The point of de-duplication is that multiple alerts won’t create multiple tickets. But if a ticket is closed without updating the event correlation system, de-dup will continue suppressing forever.

    • Blackout/Maintenance windows
      Another common feature for EC systems is the ability to look up a separate data source that lists times when a device is “out of service.” This can be a recurring schedule, or a specific one-time rule. Either way, you’ll want to check if the device in question was listed on the blackout list during the time when the error occurred.

  • Already open ticket
    Ticket systems themselves can be quite sophisticated, and many have the ability to suppress new tickets if there is already one open for the same alert/device combination. If you have a team that forgets to close their old ticket, they may never hear about the new events.
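
The de-duplication and already-open-ticket traps share one root cause: the suppression key outlives the ticket. A hedged sketch of that logic (the in-memory “ticket system” here is invented, standing in for whatever EC or ticketing tool you run):

```python
# Hypothetical stand-in for an event-correlation system that suppresses
# new tickets while one is already open for the same alert/device pair.
open_tickets = {}

def handle_alert(alert_name, device):
    key = (alert_name, device)
    if key in open_tickets:
        return "suppressed (duplicate)"
    open_tickets[key] = "OPEN"
    return "ticket created"

def close_ticket(alert_name, device):
    # If this step is skipped -- staff close the ticket only in the
    # ticketing tool, never here -- de-dup suppresses forever.
    open_tickets.pop((alert_name, device), None)

print(handle_alert("High CPU", "WinSrvABC123"))  # ticket created
print(handle_alert("High CPU", "WinSrvABC123"))  # suppressed (duplicate)
close_ticket("High CPU", "WinSrvABC123")
print(handle_alert("High CPU", "WinSrvABC123"))  # ticket created
```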

Hokey religions … are no match for a good blaster, kid

After laying out all the glorious theoretical ways in which monitoring can be missed, I thought it was only fair to give you some advice on techniques or tools you can use to either identify these problems, to resolve them, or (best of all) avoid them in the first place.

An Ounce of Prevention…

Here are some things to have in place that will let you know when all is not puppy dogs and rainbows:

  • Alerts that give you the health of the environment
    Under the heading of “who watches the watchmen,” in any moderately-sophisticated (or mission critical) monitoring environment, you should have both internal and external checks that things are working well:

    • The SolarWinds out-of-the-box (OOTB) alert that tells you SNMP has not been collected for a node in xx minutes.

    • The SolarWinds OOTB alert that tells you a poller hasn’t collected any metrics in xx minutes.

    • If you can swing it, running the OOTB Server & Application Monitor (SAM) templates for SolarWinds on a separate system is a great option. If having a second SolarWinds instance watching the first is simply not possible, look at the SAM template and mimic it using whatever other options you have available to you.

  • Have a way to test individual aspects of the alert stream
    It’s a horrible sinking feeling when you realize that no alerts have been going out because one piece of the alerting infrastructure failed on you. Start by understanding (and documenting) every step an alert takes, from the source device through to the ticket, email, page, or smoke signal that is sent to the alert recipient. From there, create and document the ways you can validate that each of those steps is independently working. This will allow you to quickly validate each subsystem and nail down where a message might have gotten lost.

  • So wait… you can test each alert subsystem?
    A test procedure is just a monitor waiting for your loving touch. Get it done. You’ll need to do it on a separate system (since, you know, if your alerting infrastructure is broken you won’t get an alert), but usually this can be done inexpensively. Just to be clear, once you can manually test each of your alert infrastructure components (monitoring, event correlation, etc), turn those manual tests into continuous monitors, and then set those monitors up with thresholds so you get an alert.

  • Create a deadman-switch
    The concept of a deadman switch is that you get an alert if something DOESN’T happen. In this case, you set yourself up to receive an alert if something doesn’t clear a trigger. You then send that clearing event through the monitoring system.

    • Example: Every 10 minutes, an internal process on your ticket system creates a ticket for you saying, “Monitoring is broken!” That ticket escalates, and alerts you, if it has been open for 7 minutes. Now you have your monitoring system send an alert whenever the current minutes are evenly divisible by 5. This alert is the reversing event your ticket system is looking for. As long as monitoring is working, the original ticket will be cleared before it escalates.
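
A deadman switch can be as small as a heartbeat value plus a check running on a separate box. A sketch under those assumptions (the 10-minute silence limit is illustrative):

```python
import time

def heartbeat(state):
    """Called by the monitoring system on every successful polling cycle."""
    state["last_beat"] = time.time()

def deadman_check(state, max_silence=600):
    """Run from a SEPARATE system: alarm if the heartbeat stops arriving."""
    return (time.time() - state["last_beat"]) > max_silence

state = {"last_beat": time.time()}
heartbeat(state)
print(deadman_check(state))  # False: monitoring is alive
state["last_beat"] -= 3600   # simulate an hour of silence
print(deadman_check(state))  # True: raise the alarm
```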

about that pound of cure, now

Inevitably, you’ll have to dig in and analyze a failed alert. Here are some specific techniques to use in those cases:

  • Re-scan the device
    The problem might be that the device has changed. It could be a new firewall rule, an SNMP string change, or even the SNMP agent on the target device dying on you. Regardless, you’ll want to go through all the same steps you used to add the device. In SolarWinds, this means using the “Test” button under node properties, and then also running the “List Resources” option to make sure all of the interfaces, disks, and yes, even CPU and RAM options are correctly selected.

    • TRAP: Remember that List Resources will never remove elements. Un-checking them doesn’t do diddly. You have to go into Manage Nodes and specifically delete them before they are really gone.

  • Test the alert
    Are you sure your logic was solid? Be prepared to copy the alert and add a statement limiting it down to the single device in question. Then re-fire that sucker and see if it flies.

  • Message Center Kung-Fu
    The Message Center remains your best friend for forensic analysis of alerts. Things to keep in mind:

    • Set the timeframe correctly – It defaults to “today.” If you got a call about something yesterday, make sure you change this.

    • Set the message count – If you are dealing with a wide range of time or a large number of devices, bump this number up. There’s no paging on this screen, so if the event you want is #201 and the count is set to 200, you’ll never see it.

    • Narrow down to a single device – To help avoid message count issues, use the Network Object dropdown to select a specific device.

    • Un-check the message categories you don’t want. Syslog and Trap are the big offenders here. But even Events can get in your way depending on what you’re trying to find.

    • Limiting Events (or Alerts) – Speaking of Events and Alerts, if you know you’re looking for something particular, use those drop-down boxes to narrow down the pile of data. This also lets you zero in on a particular alert while still seeing any event during that time period.

    • The search box. Yes, there is one. It’s on the right side (often off the right side of the page and you have to scroll for it). If you put something in there, it acts as an additional filter along with all the stuff in the grey box.

Stay tuned for our next installment where we’ll dive into the third question: “What is being monitored on my system?”

Frugal Fridays: Database Monitor

There are network guys who have to manage servers.

There are server guys who have to manage network devices.

And then there are those poor souls who are “accidental DBAs” – usually IT generalists who explained how SQL queries work once at lunch, and suddenly became “the database guy”.

And now some frantic application owner is pounding down his cube wall demanding to know whether the database is up or not.

Don’t let yourself get into this situation. The fact is that there are a lot of good (non-free) database monitoring and management tools. DBAs of all stripes keep those companies in business.

But if you’re not a DBA, in a purchasing role, or independently wealthy, you may need something in a pinch – meaning both time and money.

So I present to you the SolarWinds Database Monitor. For the low-low price of $0 you can monitor up to 20 database instances on MS-SQL, Oracle, DB2, or Sybase; and you can monitor them whether they are on physical servers or VMware instances (which cause some tools to have fits because of how storage is handled).


Nothing beats having the right tool for the job at hand. There are times in our work that we’re able to buy exactly what we need, and things go smoothly like they are supposed to. And then there’s the other 99% of the time.

Frugal Friday is a new feature I’m trying out where I feature a tool or utility which is 100% free. It may not do everything you need (heck, it may not do ANYTHING you need!) but for the price, you can’t beat it. As long as it’s not, you know, full of malware or anything. Click here to find more Frugal Friday Fun.

The Four Questions: Why Did I Get That Alert?

This article originally appeared in a scaled-down version here on PacketPushers.net. I’m posting it in its full form here as an introduction to the full series.


In the first part of this series, I described the four (ok, really five) questions that monitoring professionals are frequently asked: Why did I get an alert? Why didn’t I get an alert? What is being monitored on my system? What will alert on my system? and What standard monitoring do you do? My goal in this next post is to give you the tools you need to answer the first of those:

Why did I get an alert?

Reader’s Note: While this article uses examples specific to the SolarWinds monitoring platform, the fact is that most of the techniques can be translated to any toolset.

**************
It’s 8:45am, and you are just settling in at your desk. You notice that one email came in overnight from your company’s 24-7 operations desk:

“We got an alert for high CPU on the server WinSrvABC123 at 2:37am. We didn't notice anything when we jumped on the box. Can you explain what happened?”

Out of all of The Four Questions of monitoring, this is the easiest one to answer, as long as you have done your homework and set up your environment.

Before I dig in, I want to clarify that this is not the same question as “What WILL alert on my server?” or “What are the monitoring and alerting standards for this type of device?” (I’ll cover both of those in later parts of this series.) Here, we’re dealing strictly with a user’s reaction when they receive an alert.

I also have to stress that it’s imperative that you always take the time to answer this question. It can be annoying, tedious, and time-consuming. But if you don’t, before long all of your alerts will be dismissed as “useless.” That is the first step on a long road that leads to a CIO-mandated RFP for monitoring tools, you defending your choice of tools, and other conversations that are significantly more annoying, tedious, and time-consuming.

However, my tips below should cut down on your workload significantly. So let’s get started.

First, let’s be clear: monitoring is not alerting. Some people confuse getting a ticket, page, email, or other alert with actual monitoring. In my opinion, “Monitoring” is the ongoing collection of data about a particular element or set of elements. Alerting is a happy by-product of having monitoring, because once you have those metrics you can notify people when a specific metric is above or below a threshold. I say this because customers sometimes ask (or demand) that you fix (or even turn off) “monitoring.” What they really want is for you to change the alert they receive. Rarely do they really mean you should stop collecting metrics.
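
The distinction is easy to show in code: collection and evaluation are separate steps over the same data. A sketch (the metric values and thresholds are illustrative, and the sustained-duration check echoes the “nobody wants a page for one failed ping” rule):

```python
# Monitoring: the ongoing collection of metrics, alert or no alert.
cpu_history = [42, 45, 97, 96, 95, 44]  # one sample per polling cycle

# Alerting: a by-product that evaluates those collected metrics.
def breaches(history, threshold=90, sustained=3):
    """True if the metric stayed above threshold for `sustained`
    consecutive samples -- one way to avoid the 2:00am one-ping page."""
    run = 0
    for value in history:
        run = run + 1 if value > threshold else 0
        if run >= sustained:
            return True
    return False

# Turning this alert "off" would not stop cpu_history from being collected.
print(breaches(cpu_history))  # True: three consecutive samples over 90
```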

The bulk of your work is going to be in the way you create alert messages, because in reality, it’s the vagueness of those messages that has the recipient confused. Basically, you should ensure that every alert message contains a few key elements. Some are obvious:

  • The machine having the problem
  • The time of alert
  • Current statistic

Some are slightly less obvious but no less important:

  • Any other identifying information about the device
    • Any custom properties indicating location, owner group, etc.
    • OS type and version (the MachineType variable)
    • The IP address
    • The DNS Name and/or Sysname variables if your device names are… less than standard
  • The threshold value which breached
  • The duration – how long the alert has been in effect
  • A link or other reference to a place where the alert recipient can see this metric. Speaking in SolarWinds-specific terms, this could be:
    • The node Details page – using either the ${NodeDetailsURL} (or the equivalent for your kind of alert) or a “forged” URL (i.e.: “http://myserver/Orion/NetPerfMon/NodeDetails.aspx?NetObject=N:${NodeID}” )
    • A link to the metric details page. For example, the CPU average would be http://myserver/Orion/NetPerfMon/CustomChart.aspx?chartname=HostAvgCPULoad&NetObject=N:${NodeID}
    • Or even a report that shows this device (or a collection of devices where this is one member) and the metric involved
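
Those “forged” links are just string assembly around the NodeID. A sketch (the server name is a placeholder; the paths mirror the SolarWinds examples in the list):

```python
def node_details_url(node_id, server="myserver"):
    # Hand-built equivalent of the ${NodeDetailsURL}-style link.
    return f"http://{server}/Orion/NetPerfMon/NodeDetails.aspx?NetObject=N:{node_id}"

def cpu_chart_url(node_id, server="myserver"):
    # Link straight to the average-CPU chart for the same node.
    return (f"http://{server}/Orion/NetPerfMon/CustomChart.aspx"
            f"?chartname=HostAvgCPULoad&NetObject=N:{node_id}")

print(node_details_url(1234))
# http://myserver/Orion/NetPerfMon/NodeDetails.aspx?NetObject=N:1234
```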

Finally, one element that should always be included in each alert:

  • The name of alert

For your straightforward alerts, this should not be a difficult task and can be something you (almost) copy and paste from one alert to another. Here’s an example for CPU:

CPU Utilization on the ${MachineType} device owned by ${OwnerGroup} named ${NodeName} (IP: ${IP_Address}, DNS: ${DNS}) has been over ${CPU_Thresh} for more than 15 minutes. Current load at ${AlertTriggerTime} is ${CPULoad}.
View full device details here: ${NodeDetailsURL}.
 Click here to acknowledge the alert: ${AcknowledgeUrl}
This message was brought to you by the alert: ${AlertName}

While it means more work during alert setup, having an alert with this kind of messaging means that the recipient has several answers to the “Why did I get this alert?” at their fingertips:

  • They have everything they need to identify the machine – which team owns it, what version of OS it’s running, and the office or data center where it’s located.
  • They have what they need to connect to the device – whether by name, DNS name, IP address, etc.
  • They know what metric (CPU) triggered the alert.
  • They know when the problem was detected (because let’s face it, sometimes emails DO get delayed).
  • They have a way to quickly get to a status screen (i.e.: the Node details page) to see the history of that metric and hopefully see where the spike occurred.

Finally, by including the ${AlertName}, you’re enabling the recipient to help you help them. You now know precisely which alert to research. And that’s critical, because there are more things you should be prepared to do.

There is one more value you might want to include if you have a larger environment, and that’s the name of the SolarWinds polling engine. There are times when a device is moved to the wrong poller—wrong because of networking rules, AD membership, etc. Having the polling engine in the message is a good sanity check in this situation.

Let’s say that the owner of the device is still unclear why they received the alert. (Hey, it happens!) With the information the recipient can give you from the alert message, you can now use the following tools and techniques:

The Message Center
Some people live and die on this screen. Some never touch it. But in this case, it can be your best friend. Note two specific areas:

  • The Network Object drop-down – this lets you zero in on just the alerts from the offending device. Step one is to look at EVERYTHING coming off this box for the time period. Events, alerts, etc. See if this builds a story about what may have led up to the event.
  • The Alert Name drop-down under Triggered Alerts – this allows you to look at ALL of the instances when this alert triggered, or further zero in on the one event you are trying to find.

Side Note: The Time Period drop-down is critical here. Make sure you set it to show the correct period of time for the alert or else you’re going to constantly miss the mark.

Using these two simple controls in Message Center, you (and your users) should be able to drill into the event stream around the ticket time. Hopefully that will answer their question.

If you do it right (meaning take your time explaining what you are doing in a meeting, or using a screen share; maybe even come up with some light “how to” documentation with screen shots), users—especially those in heavy support roles—will learn over the course of time to analyze alerts on their own.

But what about the holdouts? The ones where Message Center hasn’t shown them (or you) what you hoped to see. What then?

Be prepared to test your alert. It’s something you should do every time you’re ready to release a new alert into your environment. Also remember that sometimes you get busy, and sometimes you test everything, but then the situation on the ground changes without your participation.

So, however you got here, you need to go back to the testing phase.

  • Make a copy of the alert. Never test a live normal production alert. There’s a COPY button in the alert manager for that very reason.
  • Change the alert copy by adding an alert trigger for the machine in question. JUST that machine. (i.e.: “where node caption is equal to WinSRVABC123”).
  • Set your triggering criteria (“CPULoad > 90%” or whatever) to a value so low that it’s guaranteed to trigger.

At that point, test the heck out of that bugger until both you and the recipient are satisfied that it works as expected. Copy whatever modifications you need over to the existing alert, and beware that updating the alert trigger will cause any existing alerts to re-fire. So you may need to hold off on those changes until a quieter moment.

Stay tuned for our next installment: “Why didn’t I get an alert?”

Frugal Friday: Response Time Viewer for Wireshark

Wireshark itself is a great tool (which I’ll probably talk about later). But a great tool (not to mention a great FREE tool) is not necessarily the same as an easy-to-use tool.

That’s not Wireshark’s fault. Packet capture and analysis is by its nature not easy. Which is why I’m starting with the solution to that challenge before I dig into Wireshark itself.

The Response Time Viewer for Wireshark is a free utility that takes a packet capture file from Wireshark, and parses it to show the timing of each application or protocol.

Now let me unwind that a bit.

By the time you get to the Response Time Viewer, you will have installed Wireshark on a box and captured some traffic. You save that capture session into a file.

Then you load the file into the Response Time Viewer. What this utility does is look (primarily) at two calculations – the time to first byte and the TCP/IP three-way handshake.

Time to first byte tells you how long it takes for an application server (like your database or web server or SalesForce.com) to respond with data after a request has been made.

The 3-way handshake is the standard series of packets that opens a TCP connection, and its timing measures the network latency from one device to another.

What these two measurements tell you is whether a slow user experience is due to the network being slow (3-way handshake) or the application itself (time to first byte).
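
You can see how the two measurements split in a toy timeline of packet timestamps (all of the times below are invented; in a real capture they come from Wireshark):

```python
# Toy packet timeline: (seconds, description).
packets = [
    (0.000, "SYN"),        # client opens the connection
    (0.080, "SYN-ACK"),    # server answers
    (0.160, "ACK"),        # handshake complete
    (0.165, "GET /page"),  # request sent
    (2.400, "DATA"),       # first byte of the response
]

times = dict((desc, t) for t, desc in packets)

handshake = times["ACK"] - times["SYN"]    # network round trips
ttfb = times["DATA"] - times["GET /page"]  # server "think time"

# Handshake is quick but TTFB is huge: blame the application, not the network.
print(f"handshake {handshake:.3f}s, time-to-first-byte {ttfb:.3f}s")
```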

While this isn’t the only reason you would use Wireshark, it’s one of the more challenging measurements to make – especially if you are new to the tool. So having the Response Time Viewer can make the job of analyzing packet captures significantly less painful.



Pregnant Pause

As some of you know, my daughter is pregnant with my first grandchild (yes, the child has some unimportant relevance to my daughter and son-in-law too, but that’s irrelevant!).

We’re more or less a month out from the due date, so I decided to hold a little contest: email me at babyguess@adatosystems.com with your prediction for birthday (just day – not hour, minute, second, nanosecond, etc), plus weight and length.

I’m awarding your choice of any one item from the SolarWinds Thwack Store to whoever’s guess is closest to actual.

For those people who are Thwack members and also live somewhere outside of the USA, I know that you normally can’t buy Thwack stuff  because SolarWinds can’t ship to Europe, Asia, etc.

But I can.

So this may be your one chance to score a snuggie, messenger bag, or whatever else your Thwacky heart desires.

So get guessing!

Also, if you want to help ensure my grandchild is surrounded by geekery upon his or her arrival, my daughter has a wishlist set up on ThinkGeek.com.

Value, part 3

What is noise? What is a weed? What is poison? What is garbage?

Anything that we don’t like, that we can’t tolerate, that we’ve had our fill of, that we don’t (or no longer) value.

The perfect music played on the most well-tuned instrument by the most accomplished musician can be no better than “99 bottles of beer on the wall” if you have to listen to it often enough.

Meanwhile the simplest tunes croaked by a tone-deaf parent can be simply transcendent when sung to their child to comfort them.

A dandelion can be the sweetest flower. The bacteria that causes botulism can be used to calm tremors or cure migraines. And we all know what they say about one man’s trash.

What’s the point? Frequency.

Make sure as you craft monitoring – especially alerts – that you keep an eye to how often thresholds will trigger. Too much, and all you’ve done is created a lot of noise.

Because we deal with these types of issues all the time, I wanted to share a new version of a very old story. “Diamond Island” is a tale that’s been passed down through centuries and speaks to value systems, and how someone can get caught in the middle. You can read the first few chapters, and buy the book version if you want, here: DiamondIslandBook.com.


Note: In this post I’m trying something new: short thoughts that are meant to start you thinking rather than completely explain and answer a question – an inspiration kick-start, if you will. Let me know in the comments (or on twitter, or wherever you find me) if it’s working.