Friday, July 20, 2007

Adiós, amigo

It was Anil's last day at our office. I am really going to miss this guy. I think I complemented him very well. It's very important to have such a partner, especially when working in a product support environment.

Anil, thank you very much for everything you did for me and for all the times we worked together. I will always miss you.

All the best in your next endeavors.

PS: I will learn to play snooker, and let's play someday when I visit your place.

JVM Crash

In my view, a JVM crash is the most dreaded problem that can happen to an app server on a production machine. Unfortunately, one of our client's production machines has been crashing regularly with a certain version of the JDK at high concurrency. We looked at a few crash reports and determined that the crash appears to be happening in one particular piece of code.

Our client, being a Premium partner with Sun, was able to take it up with the Sun Support Team. They needed some info from our team as well, so a conference call was set up.

This is how it went:

Sun: We have looked into the crash reports. But would like to know if there is any more info you could provide.
Me: Sure. (I ended up describing the changes that we had made.)
Sun: Anything else?
Me: We might have a lot of things to say. But what is it that you are looking for?
Sun: I am just trying to get more info on the problem, as there seems to be nothing in the logs. How did you determine that it was a problem with the JDK?
Me: It crashed and produced a crash report, which we shared with you. You, being the developers of this JDK, should be able to say more about the crash and why it happened.
Sun: I do not see any specific info in the logs that you have sent.
Me: But we see that the crash always happens in a compiler thread.
Sun: Oh okay, that is good info. Let me forward this to my analysis team. But can you tell me where you got this info from?
Me: Did you happen to have a look at the crash logs? It says so in the logs.

After a day, the Sun team came back with the outcome of their analysis. I thought it was impressive. The outcome was that there was a stack overflow while a class was being compiled to native code (thanks to Rajiv for taking pains to explain this compilation to me).

A few parameters were suggested. One of them was to increase the compiler thread stack size (CompilerThreadStackSize). We tried a few values, 1024 and 2048, but to no avail. We had to go back to the Sun team to report our unsuccessful attempts. So one more angle was brought into the picture: there might be some recursion in the code that was causing the stack overflow. Well, that sounded logical to me. But where was this happening? Since the current stack trace in the JVM crash report pointed at jvm.dll, I was convinced that it was happening somewhere in the native code of the JVM. But the Sun team differed here. We wanted to know how we could check where this was happening. (All of this was through email correspondence.)

The next day there was an email from the Sun team in which they provided one way to check where the stack overflow was happening.

"We would like you to capture a thread dump before the crash so that we can analyze the issue. For accuracy, it would be really good if you can capture at least 3-4 thread dumps."

Whoa!!! How do I capture a thread dump before the JVM crashes? I need some real Oracle to help me predict the time of the crash so that I can capture the thread dump before it happens!!

Well said SUN!!!!
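One charitable way to read that request: take thread dumps on a timer and keep only the latest few, so that whatever was captured last before a crash is still on disk. Below is a rough sketch of that idea, not something we actually ran. It assumes the JDK's jstack tool is on the PATH and is able to attach to the server process (that may not hold for every JDK version and platform). The function names, file names and defaults are made up for illustration.

```python
import subprocess
import time
from collections import deque
from pathlib import Path


def capture_with_jstack(pid):
    """Return a thread dump as text by running the JDK's `jstack <pid>` tool."""
    result = subprocess.run(
        ["jstack", str(pid)], capture_output=True, text=True, check=True
    )
    return result.stdout


class RollingDumps:
    """Write dumps to files and keep only the newest `keep` of them."""

    def __init__(self, directory, keep=4):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.keep = keep
        self.files = deque()
        self.counter = 0

    def add(self, text):
        self.counter += 1
        path = self.directory / f"threaddump-{self.counter:06d}.txt"
        path.write_text(text)
        self.files.append(path)
        # Drop the oldest files once we hold more than `keep`.
        while len(self.files) > self.keep:
            self.files.popleft().unlink()
        return path


def watch(pid, directory, interval_seconds=10.0, keep=4,
          capture=capture_with_jstack, max_rounds=None):
    """Capture dumps until the capture fails (for example, the JVM is gone)."""
    dumps = RollingDumps(directory, keep)
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        try:
            dumps.add(capture(pid))
        except (subprocess.CalledProcessError, OSError):
            break  # process no longer reachable, or jstack is not available
        rounds += 1
        time.sleep(interval_seconds)
    return list(dumps.files)
```

Calling `watch(pid, "dumps")` with the real process id would leave the last few dumps in the `dumps` folder once the JVM goes down. Whether a dump taken some seconds before the crash shows anything useful is a different question.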

Thursday, July 19, 2007

HP Serviceguard

Been busy the last couple of weeks. Some really nice things happened during these weeks. We had an opportunity to integrate Pramati Server with HP Serviceguard. One of our customers was looking for clustering solutions, and we offered Pramati Cluster, which provides failover and load balancing. The End User had HP machines for the production environment. With these HP machines he happened to have purchased the Serviceguard framework, which manages the failover mechanism and the switching of the virtual IP address.

We have had setups with OS-level clustering such as Windows Clustering and Sun Clustering. However, these manage things at the machine level. Well, it really depends on how you configure them. Generally these run in Active-Active or Active-Passive configurations. Active-Active means that both machines are in the active state, data replication happens on both machines, and the load is balanced between the two. This is achieved by using a virtual IP address that forwards the traffic to the back-end machines. In an Active-Passive configuration, only one machine is active at any time, and all the traffic that hits the virtual IP address is routed to the active machine.
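To make the difference concrete, here is a toy model of a virtual IP address in front of two machines. It is only an illustration of the two routing behaviours described above, not how any particular clustering product implements them. The class and machine names are invented, and data replication is left out entirely.

```python
class VirtualIP:
    """Toy model of a virtual IP address that fronts a set of back-end machines."""

    def __init__(self, machines, mode):
        if mode not in ("active-active", "active-passive"):
            raise ValueError(f"unknown mode: {mode}")
        self.machines = list(machines)
        self.mode = mode
        self.up = set(self.machines)
        self._next = 0

    def mark_down(self, machine):
        self.up.discard(machine)

    def mark_up(self, machine):
        if machine in self.machines:
            self.up.add(machine)

    def route(self):
        """Return the machine that should receive the next request."""
        candidates = [m for m in self.machines if m in self.up]
        if not candidates:
            raise RuntimeError("no machine available behind the virtual IP")
        if self.mode == "active-passive":
            # All traffic goes to the first machine that is up.
            # (This sketch fails back as soon as that machine returns.)
            return candidates[0]
        # Active-active: spread requests across every machine that is up.
        machine = candidates[self._next % len(candidates)]
        self._next += 1
        return machine


if __name__ == "__main__":
    vip = VirtualIP(["node1", "node2"], "active-passive")
    print(vip.route())      # traffic goes to the active machine
    vip.mark_down("node1")  # simulate a failure of the active machine
    print(vip.route())      # traffic now goes to the standby
```

In active-active mode the same `route()` call alternates between the machines that are up, which is the load balancing part.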

However, HP Serviceguard manages things at the package level. To Serviceguard, any application registered with it is a package, and it manages the packages across the cluster nodes. That is to say, the app server can be running on one cluster machine and the database on another. These two are independent and could be running anywhere on the cluster machines. With this background, I assume it is now safe to go into the details of what happened during this integration.

The End User called up with a few queries on how Pramati Server could fit into the picture. Pramati Server has a clustering solution that works independently of OS-level clustering, and we proposed the same. However, since there is no single point of entry for traffic to the cluster nodes, we were left with either using a load balancer or leveraging the HP Serviceguard framework to manage the traffic routing. The Application vendor was in favor of leveraging the existing HP Serviceguard framework. Hence, a series of conference calls was set up with the HP implementation team, the Application vendor and us.

This is where the real fun began.

The following is a snapshot of the conversation that took place between the HP implementation guy at the client's place and me:

With all introductions done…

Me: How does this HP Serviceguard thing work? (Though I had done some groundwork on HP Serviceguard, I could not find any relevant docs on how applications should interact with it.)

HP: HP Serviceguard has to register your application, and a virtual IP address has to be configured for your application.

Me: Okay, how do we register the application with HP Service?

HP: You will have to provide us with a few scripts, using which we would register your app with HP Service.

Me: (I was happy to hear this. Good, just a few scripts and it's all done.) Okay, what would these scripts be, and what is the desired functionality of the scripts?

HP: I am not really sure, but all I know is that you will have to provide me a few scripts.

Me: (What the !!!!) Okay, if you can tell us what these scripts should be doing, we can quickly put together a few scripts for the desired functionality.

HP: (He repeats the same thing.) I am not really sure, but all I know is that you will have to provide me a few scripts.

Me: (Now I am beginning to worry. This is not going to end soon.) Okay, then who would know what kind of scripts are required?

HP: The HP Serviceguard team would know.

Me: Are you from the HP team or a reseller of the product?

HP: I am from the HP team, but from the implementation team. So I do not know what kind of scripts. All I know is that a few scripts are supposed to be provided by you.

Me: (Does any shell script do? Such as one that displays a simple hello world on the console?) Okay, can you give me the numbers of your HP Service team so that I can talk to them?

HP: You wouldn’t be able to talk to them directly without any case id.

Me: Okay, can we create a case for this and then talk to them?

HP: Sure, we should be able to do that. Shoot across an email on the info required and I will get back to you.

So I shot an email across to this guy and waited for a day. Nothing happened. So I decided to call up and check what’s happening:

Me: Looks like we haven’t got a reply from your team. Since we have logged a case, can we call them up and check with them?

HP: Yes, but I do not have the numbers for the HP Serviceguard team.

Me: Okay, how do we get this?

HP: Can you call up HP Sales team and check with them?

Me: (Sigh….) Okay, I will call them up and check.

Now I call up the HP sales team.

Me: Hi, this is ….. We have a customer who is interested in integrating our App server with one of your products, HP Serviceguard. We need a few clarifications. Can you help us?

HPSales: I can provide you with the HP Support number; they should be able to help you.

Me: Great.

I call up this number.

Me: (I ended up speaking for a few minutes about the current situation and what we are looking for.)

HP Support: Sure Mr Naveen. Before we can start with any of your queries, can I have the serial number of the machines?

Me: Sure, we have a few HP machines at our place. So will the number from any of them do?

Here comes the ace..

HP Support: No, the serial numbers should be of the machines on which HP Serviceguard framework was purchased.

Me: Okay, I will get back with these numbers.

Now I call up this HP guy at the client's place and ask him for the serial numbers. For some strange reason he was reluctant to give them to me.

Finally, with some intervention from the Application Vendor and the End User, we got hold of a sample script that had been used to integrate MySQL with HP Serviceguard. So we just mimicked that script and provided ours to this HP implementation guy. After a day we got a call from our App Vendor saying all went well and Pramati Server had been registered with the scripts we provided. One more happy customer.
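For anyone wondering what "a few scripts" can look like: a cluster framework generally needs some way to start an application, stop it, and check whether it is still alive. The sketch below shows that general shape only. It is written in Python instead of shell, it is not the script we handed over, and the commands in it are harmless placeholders. The real start, stop and check commands for a given server would go in their place.

```python
import subprocess
import sys

# Placeholder commands (assumptions for illustration only). Each one just
# prints a message; replace them with the real commands for your server.
COMMANDS = {
    "start": [sys.executable, "-c", "print('starting app server')"],
    "stop": [sys.executable, "-c", "print('stopping app server')"],
    "status": [sys.executable, "-c", "print('app server is running')"],
}


def run_action(action, commands=COMMANDS):
    """Run the command for `action` and return its exit code.

    An exit code of 0 means success; anything else tells the caller
    (here, the cluster framework) that the action failed.
    """
    if action not in commands:
        print(f"usage: control.py {{{'|'.join(sorted(commands))}}}", file=sys.stderr)
        return 2
    return subprocess.run(commands[action]).returncode


if __name__ == "__main__":
    action = sys.argv[1] if len(sys.argv) > 1 else ""
    sys.exit(run_action(action))
```

Saved as, say, `control.py`, it would be invoked as `python control.py start`, `python control.py stop` or `python control.py status`, and the exit code is what the caller looks at.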

But I really feel that, since HP Serviceguard is the one providing the clustering solution, HP should have some documentation on what is required from applications such as app servers, databases, etc. It should publish its API, if any, as part of the software that it sells. I wonder why that is not the case.