Finding length of substring

Tag: string , algorithm , search , substring Author: dapengkaoyan Date: 2013-04-23

I have given n strings . I have to find a string S so that, given n strings are sub-sequence of S. For example, I have given the following 5 strings:


Then the string is "ACAGTGCT" . . Because, ACAGTGCT contains all given strings as super-sequence. To solve this problem I have to know the algorithm . But I have no idea how to solve this . Guys, can you help me by telling technique to solve this problem ?

I think there is some limit on S's size, otherwise you can simply concatenate all given strings.
I have to find optimal string S.
I don't understand your example. ACAGTGCT does NOT contain all given strings as sub-sequence. Can you better explain what you are looking for?
I have given n strings (n<=8). I have to find shortest common super-sequence of these strings . What procedure should I have to follow ?

Other Answer1

This is a NP-complete problem called multiple sequence alignment.

The wiki page describes solution methods such as dynamic programming which will work for small n, but becomes prohibitively expensive for larger n.

The basic idea is to construct an array f[a,b,c,...] representing the length of the shortest string S that generates "a" characters of the first string, "b" characters of the second, and "c" characters of the third.


Can you give the algorithm specifically here to solve this problem ?

Other Answer2

My Approach: using Trie

Building a Trie from the given words.
create empty string (S)
create empty string (prev)

for each layer in the trie
    create empty string (curr)
    for each character used in the current layer
        if the character not used in the previous layer (not in prev)
            add the character to S
            add the character to curr
    prev = curr

Hope this helps :)


What do you want to mean by the word "trie" ??
The Trie data structure ! I'll add a link to it,, see update.

Other Answer3

1 Definitions

A sequence of length n is a concatenation of n symbols taken from an alphabet . If S is a sequence of length n and T is a sequence of length m and n m then S is a subsequence of T if S can be obtained by deleting m-n symbols from T. The symbols need not be contiguous. A sequence T of length m is a supersequence of S of length n if T can be obtained by inserting m-n symbols. That is, T is a supersequence of S if and only if S is a subsequence of T. A sequence T is a common supersequence of the sequences S1 and S2 of T is a supersequence of both S1 and S2.

2 The problem

The problem is to find a shortest common supersequence (SCS), which is a common supersequence of minimal length. There could be more than one SCS for a given problem.

2.1 Example

S= {a, b, c}
S1 = bcb
S2 = baab
S3 = babc

One shortest common supersequence is babcab (babacb, baabcb, bcaabc, bacabc, baacbc).

3 Techniques

Dynamic programming Requires too much memory unless the number of input-sequences are very small. Branch and bound Requires too much time unless the alphabet is very small. Majority merge The best known heuristic when the number of sequences is large compared to the alphabet size. [1] Greedy (take two sequences and replace them by their optimal shortest common supersequence until a single string is left) Worse than majority merge. [1] Genetic algorithms Indications that it might be better than majority merge. [1]

4 Implemented heuristics

4.1 The trivial solution

The trivial solution is at most || times the optimal solution length and is obtained by concatenating the concatenation of all characters in sigma as many times as the longest sequence. That is, if = {a, b, c} and the longest input sequence is of length 4 we get abcabcabcabc. 4.2 Majority merge heuristic The Majority merge heuristic builds up a supersequence from the empty sequence (S) in the following way:

WHILE there are non-empty input sequences s <- The most frequent symbol at the start of non-empty input-sequences. Add s to the end of S. Remove s from the beginning of each input sequence that starts with s. END WHILE

Majority merge performs very well when the number of sequences is large compared to the alphabet size.

5 My approach - Local search

My approach was to apply a local search heuristic to the SCS problem and compare it to the Majority merge heuristic to see if it might do better in the case when the alphabet size is larger than the number of sequences.

Since the length of a valid supersequence may vary and any change to the supersequence may give an invalid string a direct representation of a supersequence as a feasible solution is not an option.

I chose to view a feasible solution (S) as a sequence of mappings x1...xSl where Sl is the sum of the lengths of all sequences and xi is a mapping to a sequencenumber and an index.

That means, if L={{s1,1...s1,m1}, {s2,1...s2,m2} ...{sn,1...s3,mn}} is the set of input sequences and L(i) is the ith sequence the mappings are represented like this:

xi {k, l}, where k L and l L(k)

To be sure that any solution is valid we need to introduce the following constraints:

  1. Every symbol in every sequence may only have one xi mapped to it.
  2. If xi ss,k and xj ss,l and k < l then i < j.
  3. If xi ss,k and xj ss,l and k > l then i > j.

The second constraint enforces that the order of each sequence is preserved but not its position in S. If we have two mappings xi and xj then we may only exchange mappings between them if they map to different sequences.

5.1 The initial solution

There are many ways to choose an initial solution. As long as the order of the sequences are preserved it is valid. I chose not to in some way randomize a solution but try two very different solution-types and compare them.

The first one is to create an initial solution by simply concatenating all the sequences.

The second one is to interleave the sequences one symbol at a time. That is to start with the first symbol of every sequence then, in the same order, take the second symbol of every sequence and so on.

5.2 Local change and the neighbourhood

A local change is done by exchanging two mappings in the solution. One way of doing the iteration is to go from i to Sl and do the best exchange for each mapping. Another way is to try to exchange the mappings in the order they are defined by the sequences. That is, first exchange s1,1, then s2,1. That is what we do.

There are two variants I have tried.

In the first one, if a single mapping exchange does not yield a better value I return otherwise I go on.

In the second one, I seperately for each sequence do as many exchanges as there are sequences so a symbol in each sequence will have a possibility of moving. The exchange that gives the best value I keep and if that value is worse than the value of the last step in the algorithm I return otherwise I go on.

A symbol may move any number of position to the left or to the right as long as the exchange does not change the order of the original sequences.

The neighbourhood in the first variant is the number of valid exchanges that can be made for the symbol. In the second variant it is the sum of valid exchanges of each symbol after the previous symbol has been exchanged.

5.3 Evaluation

Since the length of the solution is always constant it has to be compressed before the real length of the solution may be obtained.

The solution S, which consists of mappings is converted to a string by using the symbols each mapping points to. A new, initialy empty, solution T is created. Then this algorithm is performed:

T = {}
  FOR i = 0 TO Sl
    found = FALSE
    FOR j = 0 TO |L|
       IF first symbol in L(j) = the symbol xi maps to THEN
          Remove first symbol from L(j)
          found = TRUE
       END IF
    END FOR 
    IF found = TRUE THEN  
      Add the symbol xi maps to to the end of T            
    END IF

Sl is as before the sum of the lengths of all sequences. L is the set of all sequences and L(j) is sequence number j.

The value of the solution S is obtained as |T|.

With Many Many Thanks to : Andreas Westling