“Service Unavailable?” Servers normal. Now what?

July 20th, 2009 by admin

What happens when you get a service unavailable alert but all your server monitors show normal state? How can you respond to such an alert?

When I discussed this with one of Nolio’s customers, a large online banking operation, they told me that in the past, when they got a ‘service unreachable’ alert while every server showed green, they didn’t really know where the problem was coming from. The first thing they would do when troubleshooting this type of problem was check their monitoring system for problematic servers, but in most cases the issue was in the application layer, where some services or processes had stalled.

Next, they would log into each server manually to figure out where the application failure was, a time-consuming task since they had over 350 servers to check! In each application tier, they would simulate a connection to the next relevant tier. To do that, they needed to know exactly which servers in the next tier each application tier on each server connected to. Going through dozens or hundreds of servers to find the problem was very complicated, especially with the operations team under heavy pressure to get things up and running.

These days, using Nolio, they have created an automated process that “travels” through their different application tiers and initiates connectivity to the next appropriate tier. For example: “go into all the web servers and open a web service (URL) located in the relevant application server that services that web server. Then go into all the application servers and query the relevant DB servers.” This way they test where the problems are so they can find them in minutes.
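To make the idea concrete, here is a minimal sketch in Python of that kind of tier-by-tier check. It is not Nolio's API or the customer's actual process: the server names, the `TOPOLOGY` map, `find_broken_links` and `fake_probe` are all hypothetical, and the probe is a stand-in for a real check such as opening a URL on an application server or running a test query against a DB server.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical topology: each server maps to the next-tier servers it depends on.
TOPOLOGY: Dict[str, List[str]] = {
    "web-1": ["app-1"],
    "web-2": ["app-2"],
    "app-1": ["db-1"],
    "app-2": ["db-1"],
}

# A probe answers: can `source` reach the service it needs on `target`?
Probe = Callable[[str, str], bool]


def find_broken_links(topology: Dict[str, List[str]], probe: Probe) -> List[Tuple[str, str]]:
    """Walk every server and test its connection to each next-tier dependency."""
    broken = []
    for source, targets in topology.items():
        for target in targets:
            if not probe(source, target):
                broken.append((source, target))
    return broken


def fake_probe(source: str, target: str) -> bool:
    """Stand-in for a real connectivity check; pretends app-2 is stalled."""
    return target != "app-2"


broken = find_broken_links(TOPOLOGY, fake_probe)
print(broken)  # [('web-2', 'app-2')]
```

Keeping the probe as a plain function means the same walk can be reused for every tier: only the check itself changes between a URL call and a database query.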

Nolio Automation Center enabled this customer to create an automated process that discovers where the problems are, but they still needed to go into each problematic server and reset the problematic service. They wanted to automate that as well, so they modified this process to fix this problem automatically.

The addition was very simple. All they needed to do was reset the next tier of service. For example: if the automation process has found that a web server cannot initiate a web service from its relevant application server, it resets the web service in the application server. So when the URL initiation action fails in the web server, the other action that resets the web service is triggered in the application server.
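The check-then-reset logic can be sketched in Python as follows. Again, this is only an illustration under assumptions, not the customer's Nolio process: `check_and_reset`, the server names, and the `stalled` set that simulates a hung service are all hypothetical, and `reset` stands in for whatever actually restarts the service on the target server.

```python
from typing import Callable, Dict, List, Set


def check_and_reset(
    topology: Dict[str, List[str]],
    probe: Callable[[str, str], bool],
    reset: Callable[[str], None],
) -> List[str]:
    """Probe each link; when a probe fails, reset the service on the next tier."""
    restarted: List[str] = []
    for source, targets in topology.items():
        for target in targets:
            if probe(source, target):
                continue
            # Reset each failing target only once, even if several
            # upstream servers report the same failure.
            if target not in restarted:
                reset(target)
                restarted.append(target)
    return restarted


# Simulated environment (assumption): app-1 is stalled until it is reset.
stalled: Set[str] = {"app-1"}
topology = {"web-1": ["app-1"], "web-2": ["app-1"], "app-1": ["db-1"]}

restarted = check_and_reset(
    topology,
    probe=lambda source, target: target not in stalled,
    reset=lambda target: stalled.discard(target),
)
print(restarted)  # ['app-1']
```

In a real environment it would be sensible to probe again after the reset and raise an alert if the link is still down, rather than assume the restart worked.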

The time savings for this customer from automating their problem resolution process are dramatic. Creating the process in Nolio took the customer just 20 minutes. Resolving this situation automatically now takes 5 minutes across all 350 of their servers! Before they started using Nolio, the entire operations team usually spent 1.5 hours of frantic work doing the same thing. Now this process is handled by just one person.

What’s next? They are now integrating this process into their application monitoring system. They want it to be activated automatically when the ‘service unavailable’ alert is triggered. This is also very simple to do, since the Nolio system exposes web service and command-line APIs for that purpose.

Once this is implemented, they will get an email from the Nolio system saying that the process was activated and finished successfully, and they can then close the alert ticket within a few minutes of it being triggered, without any manual work or team hysteria.

Post written by Alon Eizenman, CTO, Nolio


Nolio Application Service Automation is a software platform for designing and executing automated application service workflows across the data center, enabling reliable, effective processes for the management of application change.
