Algorithm to Find if M28K is unique

Tag: string , algorithm Author: btxiaohu Date: 2012-07-24

Today my younger brother asked me a question, the question is as follows:

Given a list of strings & string M28K, where M28K represents a string which starts
from M, ends with K and has 28chars in between . Find if M28K is unique in the 
list of strings or not?

I came upto the following algorithm to find the solution for the problem:

For each string:

find string length(L)
  if(L==30) then
      if(str[0]=='M' && str[L-1]=='K') then
          verify rest of 28 characters are matching or not

This solution doesn't seems to be efficient in terms of time complexity. Can anyone give a better algorithm to solve this problem?

define "is unique in the list of strings". Does that mean, "M28K can be found exactly once within the list of strings", or "M28K does not exist in the list of strings"? In either case, I think the best you can do is O(n) time, which is what your solution is! Nice work.
possible duplicate of efficient way to search for string in list of string? or 100 other questions on the site asking how to find a string in a list. The fact that we know what letters it starts and ends with is irrelevant.
@Kevin "is unique in the list of strings" means, the given list of strings doesn't contains M28K at all.
@Kevin Jaguars algorithm is not O(n). Please see my explanation.

Best Answer

I would go with hashing. Usually, since this sounds like an algorithms homework problem, in my experience, we were not allowed to answer with hashing beacause it really depends on your hash function. If it is not good enough, then you won't get unique values for each string.

I would build the list of strings into a binary sort tree based on the characters in the string. Maintaining an algorithm that says if the string comes before the head node in alphabetical order, place it to the left, and if it comes after the head node, place it to the right. Recursively of course. We have a tree. Now granted worst case this will be completed in O(n) time, which would just effectively be a linked list, but with a good head node, somewhere in the m or n area, this lookup can be completed in O(log n). So the whole operation would take O(n log n) time.

Your provided algorithm, worst case would take O(n^2). Let's say every string was 30 characters, and ended with K and began with L. All excpet the 2nd to last character. Effectively we would search 28 characters of all the provided strings. The n^2 comes into play with finding the size of all the strings. Each string would take O(n) time making it an n^2 algorithm. In my algorithm, we are halving the problem each time, which provides a lot quicker of a search.


But isn't O(n*log(n)) slower than the O(n) solution he originally had?
@iSelkiies you don't need to find the length at all. You can try figure the length of a string, and when you see it is more than 30, you can stop counting the length of the current string, and go for than next string.
iSelkiies, I think you're confusing your variables here. Finding the length of a null-terminated string is indeed O(n), where n is the length of the string. and iterating through a list is O(n) time, where n is the length of the list. Since n represents a different thing in both of those, you can't smoosh them together into O(n^2) -- you need to replace one of the letters, in which case the algorithm is O(m*n).
+1 to Kevin on all points. This answer seems to be using n to represent two different things (the length of the list, and the length of a string). If you properly separate them out, then the comparison of two strings is O(m), which makes your algorithm O(m*n log n). OP's algorithm is O(n*m), which is vacuously the best we can do (since the list does not start out sorted, the best we can do is compare against each string exactly once. Since all comparison-sorts must look at each element at least once, you cannot do any better by sorting the list first).
@iSelkiies you are wrong. Assuming the length of the strings is O(m). If you use my approach ,checking length until you reach 30 - if you see the 31 character you stop, you will get the checking the length of every string is O(1) while you ignore non-relevant strings. It is not quadratic run-time for sure.