Question

Rihan8585 on Fri, 16 Dec 2016 10:20:04


Hello Experts,

We are working on a data warehouse, loading data from staging into DW tables using stored procedures/T-SQL.

To implement an incremental load for a dimension (insert new rows and update existing rows), we planned to use the HASHBYTES function with the SHA1 algorithm, concatenating 3 columns of DimAccount.

When this approach was discussed with the client, they did not approve it, saying "THERE WILL BE a hash collision (generating the same hashed output for two different input strings of characters)".

We did not argue, and asked them to wait some time so that we can come back with a more concrete solution.

Please assist: how can we overcome this?
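The insert-or-update comparison described above can be sketched as follows. This is an illustration only: Python's hashlib stands in for T-SQL HASHBYTES, and the keys, column values, and table shapes are made up for the example.

```python
import hashlib

def row_hash(cols, sep="\x1f"):
    """Hash the concatenated dimension columns (delimiter avoids ambiguity)."""
    return hashlib.sha1(sep.join(cols).encode()).hexdigest()

# Hypothetical DimAccount rows, keyed by business key, storing the stored hash
existing = {"A001": row_hash(["Acme", "Retail", "US"])}

# Hypothetical staging rows: A001 changed (US -> DE), A002 is brand new
staging = {"A001": ["Acme", "Retail", "DE"],
           "A002": ["Bolt", "Mfg", "UK"]}

actions = {}
for key, cols in staging.items():
    h = row_hash(cols)
    if key not in existing:
        actions[key] = "insert"      # new business key -> insert
    elif existing[key] != h:
        actions[key] = "update"      # hash differs -> column(s) changed
    else:
        actions[key] = "no-op"       # unchanged row

print(actions)  # {'A001': 'update', 'A002': 'insert'}
```

In T-SQL the same comparison would typically live in a MERGE or an UPDATE/INSERT pair matching on the business key and comparing the stored hash column.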

 

Replies

Hilary Cotter on Fri, 16 Dec 2016 10:51:09


According to this link

http://dba.stackexchange.com/questions/35219/choosing-the-right-algorithm-in-hashbytes-function

It is possible, but rare. Using symmetric or asymmetric keys would definitely create uniqueness, but there is a performance penalty to pay for encryption and decryption (HASHBYTES output is not decryptable), and you will be unable to put a useful index on the encrypted column.
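To put "rare" in numbers: the birthday bound for a 160-bit hash like SHA-1 is approximately p = 1 - exp(-n^2 / 2^161) for n distinct inputs. A quick sketch of the arithmetic (Python is used only as a calculator here; nothing about the warehouse itself is assumed):

```python
import math

def collision_probability(n_rows: int, hash_bits: int) -> float:
    """Birthday-bound approximation: p = 1 - exp(-n^2 / 2^(bits+1))."""
    exponent = -(n_rows ** 2) / (2 ** (hash_bits + 1))
    # expm1 keeps precision for the tiny probabilities involved here
    return -math.expm1(exponent)

# One billion dimension rows hashed with SHA-1 (160 bits)
p = collision_probability(10**9, 160)
print(p)  # ~3.4e-31: negligible for any realistic dimension
```

Even at a billion rows, an accidental SHA-1 collision is vastly less likely than, say, undetected disk corruption.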

Khaja Moizuddin on Fri, 16 Dec 2016 11:06:01


https://msdn.microsoft.com/en-us/library/ms174415.aspx

https://technet.microsoft.com/en-us/library/ms174415(v=sql.110).aspx

https://www.mssqltips.com/sqlservertip/2988/understanding-the-sql-server-hashbytes-hashing-algorithms/

http://www.databasejournal.com/features/mssql/getting-started-with-hashing-in-sql-server.html

Refer to these blogs.

Erland Sommarskog on Fri, 16 Dec 2016 13:09:54


As Hilary says, collisions are rare.

There is, however, another issue that you should keep in mind. I don't know how you use the function, but in SQL Server 2012 HASHBYTES does not accept input longer than 8000 bytes. This means that if you are concatenating many values into one string, the input can be truncated when you have a lot of data.
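The danger of that limit is that a change landing beyond the cutoff is invisible to the hash. A sketch of the failure mode, simulating the 8000-byte truncation in Python with hashlib standing in for HASHBYTES (the limit itself is the SQL Server 2012 behavior described above; everything else is illustrative):

```python
import hashlib

LIMIT = 8000  # HASHBYTES input cap in SQL Server 2012 and earlier

def truncated_hash(concatenated: str) -> str:
    """Simulate hashing after the input is cut off at 8000 bytes."""
    return hashlib.sha1(concatenated.encode()[:LIMIT]).hexdigest()

row_old = "x" * 8000 + "old_value"
row_new = "x" * 8000 + "new_value"   # the change lands past byte 8000

print(truncated_hash(row_old) == truncated_hash(row_new))  # True: change missed
print(hashlib.sha1(row_old.encode()).hexdigest() ==
      hashlib.sha1(row_new.encode()).hexdigest())          # False without the cap
```

One common workaround is to hash each wide column separately and then hash the concatenation of the per-column hashes, so no single hash input exceeds the limit.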

Tom Phillips on Fri, 16 Dec 2016 13:45:06


Your client is probably thinking of CHECKSUM rather than HASHBYTES. CHECKSUM is a weak hash and I would not recommend it.

Although a collision is possible, it is extremely unlikely if you include a unique value in the source of the hash; the size of the hash input also matters.

It is extremely rare; however, I have actually experienced an MD5 collision when someone was hashing every field in the row, including many large columns, to detect row changes on a 2 TB database. A single bit changed from 0 to 1, the change was not detected, and the row did not get updated.

Also keep in mind: if your column(s) are case-insensitive, HASHBYTES is case-insensitive. It will NOT detect changes in case and will generate the same hash.

I would also suggest this:

http://www.backupcentral.com/mr-backup-blog-mainmenu-47/13-mr-backup-blog/145-de-dupe-hash-collisions.html
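One more pitfall worth checking before re-proposing the design: naive concatenation of the three DimAccount columns can manufacture collisions all by itself, regardless of the algorithm, because ('ab', 'c') and ('a', 'bc') concatenate to the same string. Joining with a delimiter that cannot occur in the data avoids this. Sketched in Python with SHA-256 standing in for HASHBYTES (the column values are made up):

```python
import hashlib

def naive_hash(cols):
    """Concatenate columns directly: ('ab','c') and ('a','bc') collide."""
    return hashlib.sha256("".join(cols).encode()).hexdigest()

def delimited_hash(cols, sep="\x1f"):
    """Join with a separator the data cannot contain (ASCII unit separator)."""
    return hashlib.sha256(sep.join(cols).encode()).hexdigest()

print(naive_hash(["ab", "c"]) == naive_hash(["a", "bc"]))          # True: false match
print(delimited_hash(["ab", "c"]) == delimited_hash(["a", "bc"]))  # False
```

In T-SQL the equivalent would be concatenating with an explicit separator, e.g. col1 + '|' + col2 + '|' + col3, assuming '|' can never appear in the data and NULLs are handled explicitly.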

Erland Sommarskog on Fri, 16 Dec 2016 22:56:30


> Also keep in mind, if your column(s) are case-insensitive, HASHBYTES is case-insensitive. It will NOT detect changes in case, and generate the same hash.

I ran this on a server with the collation Finnish_Swedish_CI_AS:

SELECT hashbytes('MD5', 'Camilla'), hashbytes('MD5', 'camilla')

I get this output:

0xB36104E8D6A7E47A01A5681CEFB6303D      0xB077F51FF36868F21EA52956ADCF7FF4

The two hashes differ, so HASHBYTES is case-sensitive regardless of the column collation; collation only affects string comparisons, not the bytes passed to the hash.