Programming : find the first unique string in a file in just 1 pass

Tag: algorithm , data-structures Author: wsophy Date: 2012-12-31

Given a very long list of Product Names, find the first product name which is unique (occurred exactly once ). You can only iterate once in the file.

I am thinking of taking a hashmap and storing the (keys,count) in a doubly linked list. basically a linked hashmap can anyone optimize this or give a better approach

You only need one set of unique names, and a set of repeated names. For each name, you just need to check which set, if either it is in. For the unique names, use a list, for the repeated use a hashtable.

Best Answer

Since you can only iterate the list once, you have to store

  • each string that occurs exactly once, because it could be the output
  • their relative position within the list
  • each string that occurs more than once (or their hash, if you're not afraid)

Notably, you don't have to store the relative positions of strings that occur more than once.

You need

  • efficient storage of the set of strings. A hash set is a good candidate, but a trie could offer better compression depending on the set of strings.
  • efficient lookup by value. This rules out a bare list. A hash-set is the clear winner, but a trie also performs well. You can store the leaves of the trie in a hash set.
  • efficient lookup of the minimum. This asks for a linked list.

Conclusion:

Use a linked hash-set for the set of strings, and a flag indicating if they're unique. If you're fighting for memory, use a linked trie. If a linked trie is too slow, store the trie leaves in a hash map for look-up. Include only the unique strings in the linked list.

In total, your nodes could look like: Node:{Node[] trieEdges, Node trieParent, String inEdge, Node nextUnique, Node prevUnique}; Node firstUnique, Node[] hashMap

If you strive for ease of implementation, you can have two hash-sets instead (one linked).

comments:

not maintaining the linked structure for repeated strings is a good idea as i don't need them

Other Answer1

The following algorithm solves it in O(N+M) time. where

N=number of strings

M=total number of characters put together in all strings.

The steps are as follows:

`1. Create a hash value for each string`

`2. Xor it and find the one which didn't have a pair`

Xor has this useful property that if you do a xor a=0 and b xor 0=b.

Tips to generate the hash value for a string: Use a 27 base number system, and give a a value of 1, b a value of 2 and so on till z which gets 26, and so if string is "abc" , we compute hash value as: H=3*(27 power 0)+2*(27 power 1)+ 1(27 power 2) =786 You could use modulus operator to make hash values small enough to fit in 32-bit integers.If you do that keep an eye out for collisions, which are basically two strings which are different but get the same hash value due to the modulus operation. Mostly I guess you won't be needing it.

So compute the hash for each string, and then start from the first hash and keep xor-ing, the result will hold the hash value of the string which din't have a pair. Caution:This is useful only when strings occur in pairs.Still this is a good idea to start with, that's why I answered it.

comments:

I don't think this works - what if every string in the file is unique? In that case, not everything cancels out. Or what if something appears three times? Similarly, what do you do if you have a hash collision?
For the first question:my algorithm fails, I have mentioned the other problems in my answer, I just wanted to give a start so the person who asked the question can figure out if its useful for them.Can someone help me with fixing the first issue?

Other Answer2

Using a linked hashmap is obvious enough. Otherwise, you could use a TreeMap style data structure where the strings are ordered by count. So as soon as you are done reading the input, the root of your tree is unique if a unique string exists. Unlike a linked hash map, insertion takes at most O(log n) as opposed to O(n). You can read up on TreeMaps for insight on how to augment a basic TreeMap into what you need. Also in your linked hashmap you may have to travel O(n) to find your first unique key. With a TreeMap style data structure, your look up is O(1) -- the root. Even if more unique keys exist, the first one you encountered will be the root. The subsequent ones will be children of the root.

comments:

when a new string comes how are you calculating its count ? are you storing it somewhere or what ? If a string has come already , if it comes again after sometime how will you insert it in TReemap if you dont know its count , plus again if there are more then 1 unique strings in the file , what will treeMap's root give me in that case , Is it guranteed it will give me the first unique string