Using zip to find the difference between two strings in Scala

Tag: string , scala Author: ai19941994 Date: 2011-11-08

This is a follow up to my previous question.

I noticed that if I manipulate two strings (e.g. to get the longest common prefix, calculate the difference between two strings, etc.) I tend to use zip (like in (s1 zip s2)...).

It looks nice but probably inefficient (in comparison to imperative code). Is it correct ? Perhaps, if performance does matter I should use an imperative algorithm.

Now I am trying to find if two strings differ. Would you recommend using zip to do it?

Best Answer

It is slightly more efficient to use (s1,s2).zipped than (s1 zip s2) since you don't need to create the tuples; instead you create functions that take two arguments.

Better yet for efficiency but not ease of use is to define your own custom specialized string folder:

abstract class StrFold[@specialized T](var result: T) {
  def apply(c1: Char, c2: Char): Unit
  def done: Boolean
def strfold[@specialized T](s1: String, s2: String)(folder: StrFold[T]): T = {
  var i = 0
  val L = math.min(s1.length, s2.length)
  while (i < L) {
    folder(s1.charAt(i), s2.charAt(i))
    if (folder.done) return folder.result
    i += 1

Now if you want to find the length of the longest common prefix, you

class LcpFind extends StrFold(0) {
  var done = false
  def apply(c1: Char, c2: Char) { if (c1 == c2) result += 1 else done = true }
def lcp(s1: String, s2: String) = strfold(s1,s2)(new LcpFind)

It's a fair bit of work--arguably almost as much as just writing out the imperative or recursive functional code to calculate LCP without a fold-with-escape-clause.

But it should be almost as fast as writing the low-level string stuff by hand each time. (The only penalty should be creating the (tiny) LcpFind object each time, and if you really wanted to you could reuse it by resetting result to zero between calls, or you could modify StrFold and strfold to implement and call a reset method, changing the input parameter to be named zero and having a separate result method.)

Other Answer1

Yes, it's going to be less efficient for two reasons.

  1. You are creating a list of pairs of characters the same length as the shorter string. This will take substantially more place than the original two strings. Usual ways to find longest common prefix or check if strings differ in one character don't require any additional memory.

  2. When finding LCP, you don't necessarily need to iterate over entire strings. Since zip does need to do it, it will obviously be less efficient to zip them first.

But this doesn't mean you have to use imperative algorithms: a normal tail-recursive functional algorithm will perform just as well.

Other Answer2

It looks nice but probably inefficient (in comparison to imperative code).

It copies both inputs, so it's effectively O(n) space while finding the longest common prefix can be time in O(1) memory. Also, it takes O(n) time while O(len(LCP)) time would suffice (though that's O(n) time in the worst case) and it does memory allocation; that's an expensive operation.

Now I am trying to find if two strings differ in exactly one char. Would you recommend using zip to do it?

Depends on how long the strings are and whether performance is crucial. For a first shot, it's probably ok to use zip.

Other Answer3

You can use a view to have lazy evaluation when performing your zip (so it will zip only what you take).