dedup hash collision avoidance
How do the dedup mechanisms in Acronis Cyber Backup make sure that no hash collision occurs in TIBX or ASN archives?
The technical paper "Deduplication in Acronis Backup Advanced" appears to have no information about this.



Peter, that is exactly what I would like to know. There are a lot of articles around discussing hash collision avoidance in dedup solutions, and the probability seems to depend largely on the algorithms and block sizes used.
Unfortunately we are not allowed to post URLs here, but e.g. I found the following:
> Because hashing is CPU intensive, some products initially use a weak hashing algorithm to identify potentially duplicate data. This data is then rehashed using a much stronger hashing algorithm to verify that the data really is duplicate.
So I would be glad to learn that Acronis performs such additional checks for dedup candidates with same "quick" hashes.
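The two-stage scheme described in the quote can be sketched roughly like this (a hypothetical illustration only, assuming CRC32 as the "quick" hash and SHA-256 as the verification hash; nothing here reflects Acronis's actual implementation):

```python
import hashlib
import zlib


def weak_hash(block: bytes) -> int:
    # Fast, cheap hash used only to flag *candidate* duplicates.
    return zlib.crc32(block)


def strong_hash(block: bytes) -> bytes:
    # Stronger hash used to confirm a candidate really is a duplicate.
    return hashlib.sha256(block).digest()


def is_duplicate(block: bytes, store: dict) -> bool:
    """Two-stage check: cheap CRC32 first, SHA-256 only on candidates.

    A real dedup store would keep a list of entries per weak hash; this
    sketch keeps only one to stay short."""
    w = weak_hash(block)
    if w not in store:
        store[w] = strong_hash(block)
        return False
    # Weak hashes collide easily, so verify with the strong hash.
    return store[w] == strong_hash(block)
```

The point of the split is that the expensive hash is only computed when the cheap hash already matches, which is rare for non-duplicate data.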
Kind regards, Thomas

Hello Thomas,
thank you for posting this question on Acronis forums!
I've found the following information in the Acronis internal database:
What if two different blocks had the same hash? (hash-collision)
Wrong data will be recovered for the block that came last, because recovery uses the first block recorded in the deduplication store. The product will not detect this problem. However, such a hash collision is very unlikely, since we use a reliable hash-calculating algorithm. The probability of a hash collision depends on the number of blocks in the vault: for 4 billion (2^32) blocks (~16 TB at a 4 KB block size), the collision probability is 1/2^128.
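As a quick sanity check of these figures (my own arithmetic, not from the internal database): 1/2^128 is the chance that one *specific pair* of blocks collides. The chance that *any* pair among 2^32 blocks collides is given by the standard birthday approximation and is larger, though still negligible:

```python
def birthday_collision_prob(n_blocks: int, hash_bits: int) -> float:
    """Approximate probability of *any* collision among n uniformly
    random hashes (birthday approximation: p ~= n*(n-1) / 2^(bits+1))."""
    return n_blocks * (n_blocks - 1) / 2.0 ** (hash_bits + 1)


# Per-pair probability for a 128-bit hash: 1 / 2^128, as quoted above.
per_pair = 1 / 2.0 ** 128

# Birthday bound across all 2^32 blocks (~16 TB at 4 KB per block).
overall = birthday_collision_prob(2 ** 32, 128)

print(f"specific pair: {per_pair:.3e}")   # ~2.9e-39
print(f"any pair:      {overall:.3e}")    # ~2.7e-20
```

Either way, the probability is far below other failure modes of a backup system.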

Maria, thanks for this detail. The Technical Whitepaper about Acronis dedupe states that
> The data block size varies from 1 byte to 256KB for disk-level and file-level backups. Each file that is less than 256KB is considered a complete data block. Files larger than 256KB are split into 256KB blocks.
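The splitting rule from the whitepaper can be sketched as follows (a simplified illustration, not Acronis's actual code; the 1-byte lower bound and variable block sizes are omitted):

```python
BLOCK_SIZE = 256 * 1024  # 256 KB, the maximum block size quoted above


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Files up to 256 KB are one complete block; larger files are cut
    into 256 KB pieces (the last piece may be shorter)."""
    if len(data) <= block_size:
        return [data]
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]
```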
As far as I understand, the probability of a hash collision rises with the block size when the same hash function is applied.
Are you changing the algorithm according to the actual block size?
What is the estimated probability of an Acronis hash collision for 256 KB data blocks?
I fear that there are no mechanisms to avoid the hash-collision issue other than deactivating it?
Best regards, Tom

Hello Tom.
> What is the estimated probability of an Acronis hash collision for 256 KB data blocks?
The probability remains the same. It depends on the number of blocks, not on their size.
Deduplication of disk blocks is not performed if the volume's allocation unit size—also known as cluster size or block size—is not divisible by 4 KB.
> Are you changing the algorithm according to the actual block size?
The algorithm remains the same: MD5 (128-bit).
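Maria's point that the probability depends on the number of blocks rather than their size follows from the digest length being fixed: MD5 always produces 128 bits, whatever the input length, so every block contributes the same collision odds. A quick illustration with Python's hashlib:

```python
import hashlib

# MD5 yields a 128-bit (16-byte) digest for any input size,
# so a 1-byte block and a 256 KB block collide with equal probability.
for size in (1, 4 * 1024, 256 * 1024):
    digest = hashlib.md5(b"\x00" * size).digest()
    print(f"{size:>7} bytes in -> {len(digest) * 8} bits out")
```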

Hello Thomas!
Is there a reason you are worried about a 1/2^128 chance of collision? I'd wager that chance is lower than all the other factors combined that can corrupt a backup.
-- Peter

Maria, thanks for providing the implementation details: MD5 (128-bit), block sizes up to 256 KB.
While the 1/2^128 chance of collision seems acceptable, I still have the feeling that the 128-bit MD5 algorithm chosen by Acronis is a rather risky choice compared to SHA-256 (e.g. ZFS) or even SHA-512 (e.g. UrBackup).
Are solutions like ZFS or UrBackup overdoing things?
Tom

Hello Thomas!
Probably overkill. Take a look at this Stack Overflow discussion about hash collisions. According to the linked answer, the chance of a collision in SHA-256 for 1 billion messages is ~10^-60, for 128-bit MD5 about ~10^-38, and the chance of an asteroid wiping out mankind in the next second is 10^-15.
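For a rough sanity check of such figures (my own arithmetic using the birthday approximation, not taken from the linked answer; the ~10^-38 MD5 figure is closer to the per-pair probability 1/2^128, while the any-pair bound over a billion messages is a bit larger):

```python
def p_collision(n: int, bits: int) -> float:
    # Birthday approximation for the chance of any collision
    # among n uniformly random `bits`-bit hashes.
    return n * (n - 1) / 2.0 ** (bits + 1)


n = 10 ** 9  # one billion messages
print(f"SHA-256: {p_collision(n, 256):.1e}")  # ~4.3e-60
print(f"MD5:     {p_collision(n, 128):.1e}")  # ~1.5e-21
```

Either way, both numbers are dwarfed by ordinary hardware failure rates.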
I'm not familiar with those backup solutions, but it's possible they need cryptographic hashes to tamper-proof the backups themselves. For similar applications, Acronis Notary looks like it uses SHA-256.
-- Peter

Our team has come to the conclusion that the chance of a dedup hash collision actually occurring is very low.
Still, with rapidly growing amounts of data, we would appreciate it if Acronis planned to switch to a stronger hash function in the long run.
Are there any plans / enhancement requests to move away from MD5?