String to unique int algorithm

Tag: algorithm Author: zhuyonggang123000 Date: 2011-09-11

We are trying to implement the following case. We have a invoice table and there is a column which has email address. We want to somehow generate a unique int value from this email address and store that in a separate column. This will be used as a FK and indexed. So what I am looking for is an algorithm for generating ints from strings (please note that the email string should always output the same int so each email address as a unique int representation). We can use a bigint as well

This has been asked many times before, and the short answer is it's impossible to take an infinite (or relatively infinite) domain (string/varchar) and map it 1:1 with a finite domain (int, bigint). You need to make compromises on uniqueness or the output data type. My suggestion is that you just index on the e-mail address itself.
I don't see why that would be so, @MarkPeters. Any string they get will be encoded in a finite number of bytes. Just interpret the same bytes as a bigint, and voila, you have a number.
@Tom: I'm not sure of the exact semantics of bigint. In Java, BigInteger has variable bit length so that would work (I already have a deleted answer along exactly those lines). My impression is that bigint in the context of SQL is still bounded (64 bit) and so it represents a finite domain.
I think the point @MarkPeters is making is that the number of possible email addresses is always going to be > the number of integers of a given length. For example, if using a 64-bit integer, there are 2^64 possible values. There are an infinite number of possible email addresses, which is > 2^64.
@TomZych If you've got an integer type with defined precision (e.g. 32bit), it's trivial to generate 2**32 + 1 different E-Mail adresses, and there goes your uniqueness.

Best Answer

Simplest solution is to put the email address into its own table along with an identity/auto_increment type column. Then you can simply carry around that identify field (a standard int), and you don't run into any issues with potential hash collisions, and no hashing overhead.

comments:

+1 Best answer, I think. Why complicate things when the OP hasn't made constraints ?

Other Answer1

It seems a simple hashcode (MD5, SHA1, ...) should fit your needs; depending on your RDBMS, you might be able to use built-in packages (e.g. Oracle's dbms_crypto) or have to compute them externally.

Some things to keep in mind:

  • convert everything to lower/uppercase before computing the hashcode (so [email protected] gets the same hashcode as [email protected])

  • apparently, you have a denormalized schema. It would make more sense to have a separate customer table containing the E-Mail adress; invoice should then contain only a foreign key customer_fk

comments:

But two email addresses can have the same hashcode/MD5..
@Raze2dust Yes, of course. But the probability for that is very, very small (esp. if using SHA1).
@FrankSchmitt Some people care about the difference between very, very small probability and impossible...and some don't.
@Frank Schmitt: Since the question mentions invoices, the people that care are customers, accountants, lawyers, etc. Potential collisions should be enough to disqualify any hashing scheme.
@Blastfurnace: MD5 and SHA1 collisions are an accepted tradeoff for HTTPS certificates, of which there are far more than most customer-email addresses a single company uses. Usually unless somebody is attacking the system and targeting for a collision this will no happen. Also note that the possibility is influenced by the grammar for email addresses as most collisions will appear for strings outside of the grammar or very long (and hence unusable addresses).

Other Answer2

MD5 - gives you a 128-bit integer. (Admittedly, this is bigger than the int datatype in most languages, but you won't get near guaranteed uniqueness with with just 32-bits.)

comments:

OP said he wanted a unique integer; MD5 does not guarantee uniqueness.
You cannot guarantee unique integers (unless using integers of arbitrary length), since there's an infinite amount of possible E-Mail adresses.
@FrankSchmitt There's a finite number of email addresses that will actually be used, so we could just index all of them as we see them.
@Michael: daiscog does not state that MD5 guarantees uniqueness. It states that it gets you near guranteed uniqueness.
Depending on inputs vs outputs, NOTHING "guarantees" uniqueness.

Other Answer3

I don't know if you can get away with a 64-bit int: the max length of an email address is 254 characters and, in this case where you need to preserve the uniqueness of each, hashing will not do it.

So it seems you are stuck with having to get over this 254-character hurdle. My approach (always the brute force approach for me) would be to take the alphabet of allowable characters in an email address, map those to 6-bit values, and use the map to pack them into a series of words.

Take a look at rfc3696 which deals with email addresses in a way that's actually comprehensible.

Sorry to be of so little help.