Stuck Processes


Seems like every other night our backups wind up stuck in progress, and there doesn't seem to be a pattern. I kill the processes manually (stopping the activity does nothing) and then run the orphaned-process removal tool a support rep gave me at one point, but it happens again a few days later.

We are using a centralized storage vault. Most of the time I also have to stop and restart the storage node service: the B&R console keeps saying the storage node is busy when I check on it, even though the NAS itself is still fully functional and running. I end up killing the storage node service, because it hangs when I try to stop it properly, and then I have to re-enter the username and password for our NAS to reconnect.

I need this fixed. What troubleshooting steps can I take to get this working properly? I am NOT running the latest build; I am still on 17311 because nothing in the release notes mentioned stability or other related issues, the remote install does not work, and I have around 70 machines to deal with.


Kevin,

Some more info, please:

Your management server: CPU, RAM, network speed.
Hosts being backed up: Any with total disk contents greater than 500G?
Backup tasks: Are all 70 machines backed up at the same time? Is cataloging turned on? Does validation run immediately after backup? What priority? What compression?

Try waiting. Storage nodes do lots of things without telling you (e.g. checkpointing the databases), so it's not uncommon for a storage node to take 5-6 hours to react to a shutdown request.
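If you want to give the node that breathing room without watching the console, a poll-with-timeout loop beats an immediate kill. A minimal sketch, assuming you can wrap whatever busy check you use (console status, service query, etc.) in a callable; `is_busy` here is a placeholder, not a real product API:

```python
import time

def wait_for_idle(is_busy, timeout_s=6 * 3600, poll_s=300):
    """Poll until the storage node reports idle, or give up.

    is_busy: zero-argument callable returning True while the node is busy
             (stand-in for your actual status check).
    Returns True if the node went idle within timeout_s, False if it
    never did -- only then consider forcing the service down.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not is_busy():
            return True
        time.sleep(poll_s)
    return False
```

The point of the long default timeout is exactly the 5-6 hour window above: treat a forced kill as the last resort after the poll gives up, not the first move.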

Intel Xeon 2.5GHz
3.99 GB RAM
GigE NIC
This storage server hosts about 30 of the machines; they start backing up over a one-hour interval.
Cataloging is taking place after backup.
Validation runs on weekends.
Default priority.
No compression.
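For what it's worth, spreading 30 jobs evenly over a one-hour window means a new job starts every two minutes. A quick sketch of how such a stagger schedule can be computed (the job count and window are taken from the numbers above; nothing here is a product API):

```python
def stagger_offsets(n_jobs, window_minutes=60):
    """Spread n_jobs start times evenly across a window.

    Returns each job's offset in minutes from the window start,
    e.g. 30 jobs over 60 minutes -> one start every 2 minutes.
    """
    step = window_minutes / n_jobs
    return [round(i * step, 2) for i in range(n_jobs)]

offsets = stagger_offsets(30)  # [0.0, 2.0, 4.0, ..., 58.0]
```

If the storage node is choking, widening the window (or thinning how many jobs land in it) is one of the cheaper things to try.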

I originally thought that was the problem as well; the first couple of times this happened were on the old build (211), and the issue was that it was stuck deleting old data. In this case many of the systems were stuck at 36%-90% for well over a day.