```
1 Definitions
```

A sequence of length n is a concatenation of n symbols taken from an alphabet . If S is a sequence of length n and T is a sequence of length m and n m then S is a subsequence of T if S can be obtained by deleting m-n symbols from T. The symbols need not be contiguous. A sequence T of length m is a supersequence of S of length n if T can be obtained by inserting m-n symbols. That is, T is a supersequence of S if and only if S is a subsequence of T. A sequence T is a common supersequence of the sequences S1 and S2 of T is a supersequence of both S1 and S2.

```
2 The problem
```

The problem is to find a shortest common supersequence (SCS), which is a common supersequence of minimal length. There could be more than one SCS for a given problem.

```
2.1 Example
S= {a, b, c}
S1 = bcb
S2 = baab
S3 = babc
```

One shortest common supersequence is babcab (babacb, baabcb, bcaabc, bacabc, baacbc).

```
3 Techniques
```

Dynamic programming Requires too much memory unless the number of input-sequences are very small. Branch and bound Requires too much time unless the alphabet is very small. Majority merge The best known heuristic when the number of sequences is large compared to the alphabet size. [1] Greedy (take two sequences and replace them by their optimal shortest common supersequence until a single string is left) Worse than majority merge. [1] Genetic algorithms Indications that it might be better than majority merge. [1]

```
4 Implemented heuristics
4.1 The trivial solution
```

The trivial solution is at most || times the optimal solution length and is obtained by concatenating the concatenation of all characters in sigma as many times as the longest sequence. That is, if = {a, b, c} and the longest input sequence is of length 4 we get abcabcabcabc. 4.2 Majority merge heuristic The Majority merge heuristic builds up a supersequence from the empty sequence (S) in the following way:

WHILE there are non-empty input sequences s <- The most frequent symbol at the start of non-empty input-sequences. Add s to the end of S. Remove s from the beginning of each input sequence that starts with s. END WHILE

Majority merge performs very well when the number of sequences is large compared to the alphabet size.

```
5 My approach - Local search
```

My approach was to apply a local search heuristic to the SCS problem and compare it to the Majority merge heuristic to see if it might do better in the case when the alphabet size is larger than the number of sequences.

Since the length of a valid supersequence may vary and any change to the supersequence may give an invalid string a direct representation of a supersequence as a feasible solution is not an option.

I chose to view a feasible solution (S) as a sequence of mappings x1...xSl where Sl is the sum of the lengths of all sequences and xi is a mapping to a sequencenumber and an index.

That means, if L={{s1,1...s1,m1}, {s2,1...s2,m2} ...{sn,1...s3,mn}} is the set of input sequences and L(i) is the ith sequence the mappings are represented like this:

xi {k, l}, where k L and l L(k)

To be sure that any solution is valid we need to introduce the following constraints:

- Every symbol in every sequence may only have one xi mapped to it.
- If xi ss,k and xj ss,l and k < l then i < j.
- If xi ss,k and xj ss,l and k > l then i > j.

The second constraint enforces that the order of each sequence is preserved but not its position in S. If we have two mappings xi and xj then we may only exchange mappings between them if they map to different sequences.

```
5.1 The initial solution
```

There are many ways to choose an initial solution. As long as the order of the sequences are preserved it is valid. I chose not to in some way randomize a solution but try two very different solution-types and compare them.

The first one is to create an initial solution by simply concatenating all the sequences.

The second one is to interleave the sequences one symbol at a time. That is to start with the first symbol of every sequence then, in the same order, take the second symbol of every sequence and so on.

```
5.2 Local change and the neighbourhood
```

A local change is done by exchanging two mappings in the solution. One way of doing the iteration is to go from i to Sl and do the best exchange for each mapping. Another way is to try to exchange the mappings in the order they are defined by the sequences. That is, first exchange s1,1, then s2,1. That is what we do.

There are two variants I have tried.

In the first one, if a single mapping exchange does not yield a better value I return otherwise I go on.

In the second one, I seperately for each sequence do as many exchanges as there are sequences so a symbol in each sequence will have a possibility of moving. The exchange that gives the best value I keep and if that value is worse than the value of the last step in the algorithm I return otherwise I go on.

A symbol may move any number of position to the left or to the right as long as the exchange does not change the order of the original sequences.

The neighbourhood in the first variant is the number of valid exchanges that can be made for the symbol. In the second variant it is the sum of valid exchanges of each symbol after the previous symbol has been exchanged.

```
5.3 Evaluation
```

Since the length of the solution is always constant it has to be compressed before the real length of the solution may be obtained.

The solution S, which consists of mappings is converted to a string by using the symbols each mapping points to. A new, initialy empty, solution T is created. Then this algorithm is performed:

```
T = {}
FOR i = 0 TO Sl
found = FALSE
FOR j = 0 TO |L|
IF first symbol in L(j) = the symbol xi maps to THEN
Remove first symbol from L(j)
found = TRUE
END IF
END FOR
IF found = TRUE THEN
Add the symbol xi maps to to the end of T
END IF
END FOR
```

Sl is as before the sum of the lengths of all sequences. L is the set of all sequences and L(j) is sequence number j.

The value of the solution S is obtained as |T|.

With Many Many Thanks to : Andreas Westling