vmProtect 8 stuck tasks (what causes this?)

We seem to be running into a problem with vmProtect 8 backup tasks freezing at a particular percentage. The web GUI indicates the jobs are running and shows a remaining time, but the time does not change and the percentage never increases on its way to completion.

When we look in vCenter under Recent Tasks there is nothing there.
On 3 of these tasks, the last log line comes right after the snapshot removal and says Backup deletion. On a few others, this one for example, I have:
20:30:02 task was started
06:34:54 task changed its state from 'waiting' to 'idle'.
06:34:54 task was cancelled. (yet no one cancelled it).
Shows start time 20:30, 0% progress, remaining time N/A

Here's another
21:00:13 Backup deletion
21:00:02 task was started
21:00:02 task changed its state from 'idle' to 'running'
06:34:52 task changed its state from 'stopping' to 'idle'
06:34:52 task was cancelled.
06:21:17 task changed its state from 'running' to 'stopping' (I know this is out of order, but that is how it's displayed in the vmProtect 8 web GUI!)
Shows start time of 21:00, 0% progress, remaining time N/A

Another:
19:45:04 Backup deletion
19:45:02 task changed its state from 'idle' to 'running' (again out of order!)
19:45:02 task was started
06:34:53 task changed its state from 'waiting' to 'idle'.
06:34:53 task was cancelled.
Shows start time of 19:45, 0% progress, remaining time N/A

Two other servers show a progress of 90%. One has a remaining time of 33 min and the other 25 min, though there is actually no progress. The last thing in the log is Backup deletion, at 23:29:16 for one and 21:47:05 for the other.

Attachment Size
acronis-frozen-tasks.jpg 79.99 KB

I made this post at 9:41 AM EST. It is now 5:23 PM EST. All tasks in that screenshot are stuck in exactly the same place. Still an indefinite 33 minutes left on the first one, N/A on the next two, 25 min on the one after that, and N/A on the last one.

I am updating all ESX 4.1 hosts from build 874690 to build 988178 (31-Jan-2013). I will reboot the Acronis appliance and see if things become more stable.

OK, after updating all the ESX hosts to build 988178 and restarting the vmProtect 8 virtual appliance...
Today more backups succeeded; however, there are two outstanding issues.

1. One machine is stuck at 5%, Started 02/21/2013 21:45, Remaining time: 11 min... However, it's not moving. In the vSphere client under Recent Tasks it shows Acronis vmProtect Backup In Progress. If I edit the Acronis appliance, it shows an attached disk that belongs to the stuck machine. In the Datastores > Storage Views tab for this datastore, I'm showing 344.50 MB of snapshot space in use. Since nothing is moving, how would I cancel this job? The last line in the log for this job is Analyzing partition '1-0'...

2. Next issue, the Exchange backup of our mail server failed. Here is the log:
Task 'MAIL' failed: 'Failed to create a backup.
Additional info:
--------------------
Error code: 3
Module: 435
LineInfo: 555b5abba09501ce
Fields:
Message: Failed to create a backup.
--------------------
Error code: 32786
Module: 114
LineInfo: 28314c961de7d32e
Fields:
Message: Failed to prepare for backing up.
--------------------
Error code: 353
Module: 149
LineInfo: a71592046cb2c5f6
Fields:
Message: Failed to back up the group.
--------------------
Error code: 2
Module: 218
LineInfo: 338a407ad20e0987
Fields:
Message: Error occurred while running the backup and recovery engine.
--------------------
Error code: 368
Module: 149
LineInfo: 4db9605401c15283
Fields:
Message: Failed to prepare for the Microsoft Exchange database validation.
--------------------
Error code: 358
Module: 149
LineInfo: 8e1d384601b2ab4b
Fields:
Message: VSS metadata is missing or corrupt.
--------------------
Error code: 13
Module: 4
LineInfo: 86137b9d60c180c7
Fields:
Message: The file is corrupted.
--------------------'.
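
For anyone reading a dump like this one: the entries appear to be nested from the outermost failure down to the root cause, so the last entry is usually the one to chase. Below is a minimal sketch that splits such a dump into entries. It assumes only the layout visible in the log above (dashed separator lines, then "Error code", "Module", "LineInfo" and "Message" fields); the function name `parse_error_chain` is made up for illustration and is not part of any Acronis tool.

```python
import re

# Shortened copy of the error dump quoted above (last three entries only).
SAMPLE = """\
--------------------
Error code: 368
Module: 149
LineInfo: 4db9605401c15283
Fields:
Message: Failed to prepare for the Microsoft Exchange database validation.
--------------------
Error code: 358
Module: 149
LineInfo: 8e1d384601b2ab4b
Fields:
Message: VSS metadata is missing or corrupt.
--------------------
Error code: 13
Module: 4
LineInfo: 86137b9d60c180c7
Fields:
Message: The file is corrupted.
--------------------'.
"""

WANTED = {"Error code", "Module", "LineInfo", "Message"}


def parse_error_chain(text):
    """Return one dict per error entry, in the order they appear in the dump."""
    entries = []
    # Entries are separated by lines that start with a run of dashes.
    for chunk in re.split(r"^-{5,}.*$", text, flags=re.MULTILINE):
        fields = {}
        for line in chunk.splitlines():
            key, sep, value = line.partition(":")
            if sep and key.strip() in WANTED:
                fields[key.strip()] = value.strip()
        if "Error code" in fields:
            entries.append(fields)
    return entries


entries = parse_error_chain(SAMPLE)
for depth, entry in enumerate(entries):
    # Indent each level so the nesting is visible at a glance.
    print("  " * depth + f"[{entry['Error code']}/{entry.get('Module')}] {entry.get('Message')}")
```

In this dump the innermost entries ("VSS metadata is missing or corrupt", "The file is corrupted") are the ones that point toward a VSS problem rather than a general backup engine failure.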

I don't know why I keep having so many issues with this software. Would it be better to scrap the whole thing and install a Windows agent on a brand new Windows Server 2008 R2 machine?

Hi KJSTech,

As far as I can see, you are running multiple tasks at the same time on one appliance and it seems to be heavily loaded. Up to 5 tasks are allowed to run simultaneously; any tasks beyond that number are put into the 'waiting' state and will not start until one of those 5 tasks finishes. The stuck task itself looks like a product issue that needs to be investigated separately with our support team. To avoid such situations (which seem to be "corner" cases), it would definitely make sense to distribute the load between 2 appliances, i.e. deploy a new appliance and create backup tasks on it separately (so that you can remove those tasks from the 1st appliance). The idea is to avoid situations where more than 5 tasks are attempted at the same time.

As for the Exchange-aware task, it seems to be related to VSS issues (the vss_manifests.zip meta package is not formed properly on the datastore) and can be checked by following the instructions in the http://kb.acronis.com/content/31347 article.

Thank you.
--
Best regards,
Vasily
Acronis vmProtect Program Manager

We are not intentionally running 5 tasks at a time. What happens is one starts and then hangs. Then time goes by until another task starts and then that hangs. And this keeps happening until 5 are in the queue.

Anyway, the backups no longer run now because I thought it would be a good idea to run VMware Update Manager. I updated all 3 ESX 4.1.0 servers to build 988178, and then VMware Tools in all the guests needed to be updated. We did those updates over a few days, as the schedule permitted the guests to be rebooted. That post is here: http://forum.acronis.com/forum/40298

I posted there because a simple VMware update has now caused a new issue with Acronis.

I sent an email to customerservice@acronis.com with 16 attached job logs that Acronis e-mails us, along with the PDF of our invoice showing we paid for another year of support. For some reason acronis.com does not think I have support when in fact I do.

I want to give Acronis a chance to fix this product once and for all. We've had a lot of issues with it, more than with any other product. I hope we can resolve it in a new version (9???) before having to resort to a different solution.

I just force rebooted the virtual appliance again because there was a job in there "canceling" since 2/21 (4 days ago!!!!)

Hi KJSTech,

Thank you for the clarifications. I have notified our tier 2 support managers about your request and we'll check it with our experts (tier 1 won't be of much help here, so escalation will be required in any case).

Thank you.
--
Best regards,
Vasily
Acronis vmProtect Program Manager

Since they all ran well last night, let's keep an eye on it for a week and see how things continue.

It seems like once in a while things go haywire and you have to edit each job to kind of 'reset' it, if you will.

I have 5 stuck tasks today. Also, in the vSphere client under Recent Tasks there is one Acronis vmProtectBackup at 100%. Any idea how to clear this out without rebooting vCenter?

I did send info to the ticket that was opened, but then on Sunday what looked like an automated system sent me a message asking if I wanted to close it??? I'm still waiting for a response to the logs and everything I sent in the last communication... why would I want to close it? So of course I replied to that email.

Hi KJSTech,

Unfortunately, once the tasks are shown in the vSphere client UI, it's not possible to remove them without a vCenter reboot. I'd recommend disabling the vCenter Integration feature (Configure->ESXi hosts->disable vCenter integration) for now, until the underlying issue is resolved. I've also asked our support team managers to check the case status. Most likely its status is inconsistent in the request tracking system, which is why the wrong notification was sent.

Thank you.
--
Best regards,
Vasily
Acronis vmProtect

The tasks are still sticking. Do you need any information before I reboot the appliance again?

Hi KJSTech,

Before rebooting the appliance, it would make sense to collect the contents of the /var/lib/Acronis/vmProtect folder (packed in a .zip; instructions can be found in the troubleshooting section of the http://kb.acronis.com/content/36100 article) plus a screenshot of the "top" command output (run from the vmProtect appliance console: Ctrl+Shift+Alt+Space+F1, then Alt+F2), so that we can check the state it is in when it hangs. Please send this information to our support team (in the ticket where the problem is being investigated), so that we can analyze it and advise.

Thank you.
--
Best regards,
Vasily
Acronis vmProtect Program Manager

Thank you, I just sent it in to Igor Chelyukanov
[ ref:_00D30Zcb._50050KKmVV:ref ]

After getting this information while in the frozen state, I force rebooted the virtual appliance so backups would attempt to run tonight. Our last successful backup was 3/17 (7 succeeded, 1 failure). So 2 days without backups is as long as we like to go.

Keep in mind we did increase the virtual appliance memory to 2 GB. I can increase it by another GB if you want. I also sent customer service a list of our backup job schedule. One of the concerns was whether too many jobs were running at the same time. That is not the case. What happens is a job starts and hangs. So it's there forever, and then another job starts and hangs. This happens until all 5 available concurrent backup job slots are in use. They all hang, so they block all further backups. There is no way out, not even canceling the job (it hangs at canceling). The only way out is to force reboot the appliance.
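
To make the failure mode concrete, here is a toy model of the behavior described above. It is only a sketch under stated assumptions: jobs start one after another without overlapping (as in our schedule), a hung job never gives its slot back, and the slot limit is 5. The function name `simulate` and the state labels are invented for the illustration and do not come from vmProtect.

```python
def simulate(tasks, max_slots=5):
    """tasks: list of (name, hangs) pairs in scheduled start order.

    Assumes jobs do not overlap unless an earlier one hangs, so a job
    that completes frees its slot before the next job starts.
    """
    busy = 0          # slots held by hung jobs
    states = {}
    for name, hangs in tasks:
        if busy >= max_slots:
            states[name] = "waiting"          # no free slot, never starts
        elif hangs:
            busy += 1                         # slot taken and never released
            states[name] = "running (hung)"
        else:
            states[name] = "succeeded"        # slot released on completion
    return states


# Seven nightly jobs that all hang: the first five occupy every slot,
# the remaining two can only wait.
demo = simulate([(f"job{i}", True) for i in range(1, 8)])
for name, state in demo.items():
    print(name, state)
```

The point of the model is that no two jobs need to be scheduled at the same time for the queue to fill up; five hangs on five different evenings are enough.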

The issue is presenting itself differently now. I did force reboot the VA after I obtained and sent requested information to the ticket.

So when I came in today, at first glance it appeared that all jobs ran OK. The task summary panel on the dashboard doesn't show anything 'running'. The task statistics show all green bars. However, if I go to VIEW > TASKS and scroll down past a bunch of the succeeded ones, I end up seeing 5 tasks in the "Running" state at 0% progress.
10 jobs Succeeded followed by:
Last finish time: 03/19/2013 04:58 Next run: 03/20/2013 18:30 Status: Running, 0%
Last finish time: 03/19/2013 04:58 Next run: 03/20/2013 19:45 Status: Running, 0%
Last finish time: 03/19/2013 04:58 Next run: 03/20/2013 20:30 Status: Running, 0%
Last finish time: 03/19/2013 04:58 Next run: 03/20/2013 21:00 Status: Running, 0%
Last finish time: 03/19/2013 04:58 Next run: 03/20/2013 21:45 Status: Running, 0%
Followed by 5 tasks that say Waiting, Cancelled, then one task that says Idle, Cancelled.
Last finish time: 03/19/2013 04:58 Next run: 03/20/2013 22:00 Status: Waiting, Cancelled
Last finish time: 03/19/2013 04:58 Next run: 03/20/2013 22:00 Status: Waiting, Cancelled
Last finish time: 03/19/2013 04:58 Next run: 03/20/2013 22:45 Status: Waiting, Cancelled
Last finish time: 03/19/2013 04:58 Next run: 03/20/2013 23:30 Status: Waiting, Cancelled
Last finish time: 03/19/2013 04:58 Next run: 03/21/2013 00:15 Status: Waiting, Cancelled
Last finish time: 03/19/2013 04:58 Next run: 03/21/2013 03:00 Status: Idle, Cancelled
Then lastly 9 more Succeeded tasks.

So 19 successes and 11 that did not run.

I will also copy this to the support email.

Attachment Size
127892-107068.jpg 161.77 KB

Hi KJSTech,

Thank you for the update. If I understood correctly, there are 30 tasks running overnight, and interference between them (some of them run simultaneously) is the most likely reason for such "hangups".

There is a way to raise the limit from the default 5 tasks to any number, which should help in your case. The instructions are similar to those in http://kb.acronis.com/content/30926, where debug logging is enabled (just a different file is modified). To raise the limit, issue the following command from the appliance command prompt:

#vi /etc/Acronis/VMMS.config

Then change the MaxTaskCount value from 5 to 30 (the theoretical maximum in your case): press "i" to enter editing mode, modify the value, hit ESC to exit editing mode, then type ":wq" to save and exit the 'vi' editor. Finally, reboot the appliance with the 'reboot' command.

Note that running more than 5 tasks at the same time will require more RAM assigned to the virtual appliance. 4 GB of RAM should be enough in your case (in practice, each task requires roughly 100 MB of additional RAM).

On our side we will also test this particular scenario to see if we can reproduce similar "hangups".

Thank you.
--
Best regards,
Vasily
Acronis vmProtect Program Manager

Attachment Size
127893-107071.png 55.42 KB

Ok my /etc/Acronis/VMMS.config does not have MaxTaskCount at all. I searched for SimultaneousRunRules and MaxTaskCount and that text pattern is not found.

Can I simply type it in and where? My VMMS.config is smaller than yours. It is 47 lines

EDIT: Attached the file (renamed to .txt so the forum would accept the file).

Attachment Size
127894-107074.txt 27.02 KB

Hi KJSTech,

Yes, you are correct. It looks like I took an appliance where I had already created these keys some time ago; sorry for the confusion. On "clean" appliances the following strings (not just one) should be added under tag:

            <key name="SimultaneousRunRules">
                <value name="MaxTaskCount" type="TString">
                    "30"
                </value>
            </key>

See also 2 screen shots showing the states of the file before and after modification.

P.S. I have adjusted your vmms.config.txt to reflect the proper changes (see attachment). You can put it back onto your appliance and reboot to apply the changes.
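
If you want to confirm the edit took effect before rebooting, a read-only check along these lines may help. This is a sketch, not an official tool: it assumes the config file is well-formed XML and that the key and value elements look exactly like the snippet above. The `<config>` wrapper element in the sample is a placeholder (the real parent tag is not shown here), and the function name `max_task_count` is made up for the example.

```python
import xml.etree.ElementTree as ET

# Sample fragment: the <key> block is copied from the post above,
# the <config> wrapper is a hypothetical stand-in for the real parent tag.
SAMPLE = """
<config>
    <key name="SimultaneousRunRules">
        <value name="MaxTaskCount" type="TString">
            "30"
        </value>
    </key>
</config>
"""


def max_task_count(root):
    """Return MaxTaskCount as an int, or None if the key is not present."""
    node = root.find(
        ".//key[@name='SimultaneousRunRules']/value[@name='MaxTaskCount']"
    )
    if node is None or node.text is None:
        return None
    # The value is stored as a quoted string surrounded by whitespace.
    return int(node.text.strip().strip('"'))


limit = max_task_count(ET.fromstring(SAMPLE))
print("MaxTaskCount:", limit)

# Against a real file (path taken from this thread):
# root = ET.parse("/etc/Acronis/VMMS.config").getroot()
# print(max_task_count(root))
```

A result of None would correspond to the "clean" appliance case, where the key has not been added yet.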

Thank you.
--
Best regards,
Vasily
Acronis vmProtect Program Manager

Attachment Size
127897-107077.png 43.51 KB
127897-107080.png 40.11 KB
127897-107083.txt 27.14 KB

Vasily,

Thank you so much for all your help. I was able to back up the old vmms.config and put your edited one in its place. I also increased the RAM to 4 GB for the virtual appliance.

I will see how the backups work tonight and hopefully this will help with the issue.

Well last night only 1 task failed and it was the very common "'CreateSnapshot' has failed. Reason: fault.FilesystemQuiesceFault.summary.".

So as far as THIS original issue, no tasks stuck. Maybe the 4GB memory increase and increasing the job limit worked? Time will tell though because each night was a random result. They didn't ALWAYS freeze... it's just been lately.

As for the create snapshot thing... that is a different issue. There are usually at least one or two random machines a night that report it. We've been battling that since AVP 6. Hopefully that issue is fixed in AVP 9. If a snapshot fails, I'd like it to use some sort of agent (à la Backup Exec) to talk to the VM at another level and back up the files.

We are also debating eventually upgrading our VMware 4.1 U3 environment to VMware 5.0 U2. I'd go to 5.1, but we have a ProLiant DL380 G5 at our DR site running VMware Site Recovery Manager. ESXi 5.1 isn't certified for that server, so we are only as good as our weakest link.

Hi KJSTech,

It looks like my assumptions were correct, and that's what we need to double-check with our QA (emulating a highly loaded appliance by running and finishing more than the default 5 tasks at a time). As for the "fault.FilesystemQuiesceFault.summary" error, it's a general quiesced snapshot failure (see http://kb.acronis.com/content/4559 for links to VMware articles related to this problem). From what I have seen, VMware has been improving the snapshot creation process from version to version. In your case the tasks may be interfering with each other: several snapshots created at the same time cause load peaks at random times during the backup window, which may explain why you see this behavior randomly.

On our side, we are considering adding an option to fall back to a non-quiesced snapshot if snapshot creation fails. In version 9 we have also already added an option to retry the snapshot if it fails, so that the task retries instead of skipping the VM (quite often quiescing fails at one moment but works 5 minutes later, i.e. it fails randomly).

Thank you.
--
Best regards,
Vasily
Acronis vmProtect Program Manager

Hi KJSTech,

Is it possible to look through your ESXi performance charts for the times when snapshot creation fails? I wonder if your disk system is overloaded, perhaps by one or more large snapshots being removed, which causes your disk latency to jump until ESXi times out and gives up. You could also look in the log and see whether it attempts to create a snapshot while another snapshot is being removed. It could just be that your disk system can't handle so many concurrent snapshot operations, especially if the failures seem to be random.
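
If you pull the start and end times of the snapshot operations out of the logs by hand, a quick overlap check like the sketch below can show whether failed creations line up with removals. All timestamps here are invented for illustration, and the names `at` and `overlapping` are hypothetical helpers, not anything from vmProtect or ESXi.

```python
from datetime import datetime


def at(hhmm, day=20):
    """Build a datetime on a made-up date (2013-03-<day>) from 'HH:MM'."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2013, 3, day, hour, minute)


def overlapping(removals, creations):
    """Return (removal, creation) window pairs that overlap in time."""
    hits = []
    for r_start, r_end in removals:
        for c_start, c_end in creations:
            # Two windows overlap when each starts before the other ends.
            if r_start < c_end and c_start < r_end:
                hits.append(((r_start, r_end), (c_start, c_end)))
    return hits


# Invented example windows: one removal runs long, one creation starts inside it.
removals = [(at("21:00"), at("21:40")), (at("23:30"), at("23:35"))]
creations = [(at("21:30"), at("21:32")), (at("22:00"), at("22:02"))]

for removal, creation in overlapping(removals, creations):
    print("creation at", creation[0].time(), "overlaps removal started", removal[0].time())
```

If most of the failed creations fall inside a removal window, that would support the disk contention theory; if they do not, the cause is probably elsewhere.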

Hi, I was out of the office since 3/22 and just returned today on 4/8. I just replied to Igor Chelyukanov.

All 30 jobs are in the running state. One is at 90%, the remaining are at 0%.
The last completed job dates range from 3/21 through 3/24.

This means some machines haven't had backups in over two weeks. This product is just unacceptable. Please ensure Igor gets in touch with whatever information is needed. By 3 PM EST today I will have to reboot the virtual appliance, so if you need any data from it while it is in this state you must get in touch with me before 3 PM EST today, 4/8.

I believe the ticket number is 01860547 because that is the number in the subject of my email.

Here are some 1 month performance statistics showing the virtual appliance usage. Notice it looks like the virtual appliance took the same vacation I did.

Attachment Size
129243-107302.jpg 136.52 KB
129243-107305.jpg 155.81 KB

Hi KJSTech,

I've notified our support team about the incident and also passed along some recommendations on which info to gather. I'd actually recommend splitting the jobs between at least 2 appliances in order to avoid such problems.

On our side we're still running the environment with 30+ tasks backing up different VMs for more than 2 weeks in our QA lab and it still doesn't hang up :( Reproducing the problem is crucial for proper resolution. Hopefully the additional info captured by our support will help us here.

Thank you.
--
Best regards,
Vasily
Acronis vmProtect Program Manager