Character frequency calculation in a string

Tags: algorithm, space-efficiency | Author: chenlongfei20009 | Date: 2012-08-10

I am looking for the most efficient (time and space) algorithm for character frequency calculation for a given string.

The simplest algorithm that comes to mind is to keep a flag array (size = the number of different characters you want to count) and increment the counter at the corresponding index. This works in linear time. The only problem is the space requirement of the flag array, which could go up to 256 if all ASCII characters are needed.
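A minimal sketch of that flag-array approach (in Python, assuming every character fits in one byte; the function name is illustrative):

```python
def char_frequencies(s: str) -> list[int]:
    """Count character occurrences using a fixed-size counter array."""
    counts = [0] * 256          # one slot per possible byte value
    for ch in s:
        counts[ord(ch)] += 1    # assumes ord(ch) < 256
    return counts

freq = char_frequencies("hello")
print(freq[ord("l")])  # 2
```

One pass over the string, constant work per character, so linear time; the 256-entry array is the fixed space cost the question is asking about.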

Is there a better algorithm, which could save on space/time?

How do you expect to have less than m memory, if you are counting occurrences of m different characters in a string?
A given string might not contain all 256 characters of ASCII; in that case, there might be an algorithm that stores the frequencies of only the "found" characters. Also, I am not only looking for an improvement in space; any alternative suggestion with a time improvement is also welcome.
You could build a dynamic data structure that stores only the frequencies of the found characters. It would have to support fast lookup of a character's counter so you can increment it quickly; a hash table or a binary search tree would work. However, unless you have many more than 256 possible characters, it's probably not going to help you and might actually hurt efficiency. An array of 256 integers is really negligible unless you are writing code for hardware from the '60s.
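A minimal sketch of such a dynamic structure, using a plain Python dict as the hash table (the function name is illustrative):

```python
def found_char_frequencies(s: str) -> dict[str, int]:
    """Store counts only for characters that actually appear in s."""
    counts: dict[str, int] = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1  # O(1) average per update
    return counts

print(found_char_frequencies("hello"))  # {'h': 1, 'e': 1, 'l': 2, 'o': 1}
```

Space is proportional to the number of distinct characters found, at the cost of hashing overhead on every increment.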
There are only 128 ASCII characters.

Best Answer

If you use a hash table to store the counters, you need space proportional to the number of different characters in your string and you can still run the computation in linear time. It is easy to see that you cannot get better than linear time, since you need to look at each character at least once.
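For illustration, Python's `collections.Counter` is this kind of hash-table counter out of the box: linear time, with entries only for characters that actually occur:

```python
from collections import Counter

# Counter is a dict subclass specialized for tallying.
freq = Counter("mississippi")
print(freq["s"])   # 4
print(len(freq))   # 4 distinct characters found
```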

In practice, however, if your string really uses only one byte per character (i.e., it is not Unicode), your "flag array" will only be about 1 KB (256 integer counters) and is therefore probably your best shot, since it doesn't have the (constant-factor) time and space overhead of a hash table.