haskell grouping problem

Tag: haskell Author: wangcong2566 Date: 2009-10-26
group :: Ord a => [(a, [b])] -> [(a, [b])]

I want to look up all pairs that have the same fst, and merge them, by appending all the list of bs together where they have the same a and discarding the unnessecary pair and so on...

I got as far as:

group ((s, ls):(s', ls'):ps) = 
    if s == s' 
    then group ((s, ls++ls'):ps) 
    else (s, ls) : group ((s', ls'):ps)
group p = p

but obviously this ain't going to cut it, because it doesn't group everything.

Edit: example

[("a", as),("c", cs), ("c", cs3), ("b", bs),("c", cs2), ("b", bs2)]

would output

[("a", as),("c", cs++cs2++cs3),("b", bs++bs2)]
can you give an example input and output?
To be clear: one cannot assume that the tuples are ordered by their first value?
indeed it cannot
Would it not be fairly easy to make it so, just by sorting the list first? You could either use sort if both parts of the tuple derive from Ord, or use sortBy and use a comparison function.

Best Answer

Two alternative solutions to barkmadley's answer:

  • As Tirpen notes in a comment, the best way to attack this problem depends on the number m of distinct first elements in the tuples of the input list. For small values of m barkmadley's use of Data.List.partition is the way to go. For large values however, the algorithm's complexity of O(n * m) is not so nice. In that case an O(n log n) sort of the input may turn out to be faster. Thus,

    import Data.List (groupBy, sortBy)
    combine :: (Ord a) => [(a, [b])] -> [(a, [b])]
    combine = map mergeGroup . myGroup . mySort
        mySort = sortBy (\a b -> compare (fst a) (fst b))
        myGroup = groupBy (\a b -> fst a == fst b)
        mergeGroup ((a, b):xs) = (a, b ++ concatMap snd xs)

    This yields [("Dup",["2","3","1","5"]),("Non",["4"])] on barkmadley's input.

  • Alternatively, we can call in the help of Data.Map:

    import Data.Map (assocs, fromListWith)
    combine :: (Ord a) => [(a, [b])] -> [(a, [b])]
    combine = assocs . fromListWith (++)

    This will yield [("Dup",["5","1","2","3"]),("Non",["4"])], which may or may not be an issue. If it is, then there are again two solutions:

    • Reverse the input first using Data.List.reverse:

      import Data.List (reverse)
      import Data.Map (assocs, fromListWith)
      combine :: (Ord a) => [(a, [b])] -> [(a, [b])]
      combine = assocs . fromListWith (++) . reverse
    • Prepend (flip (++)) instead of append ((++)) (Thanks to barkmadley; I like this solution better):

      import Data.Map (assocs, fromListWith)
      combine :: (Ord a) => [(a, [b])] -> [(a, [b])]
      combine = assocs . fromListWith (flip (++))

    Both of these definitions will cause combine to output [("Dup",["2","3","1","5"]),("Non",["4"])].

As a last remark, note that all these definitions of combine require the first element of the tuples in the input list to be instances of class Ord. barkmadley's implementation only requires these elements to be instances of Eq. Thus there exist inputs which can be handled by his code, but not by mine.


Nice, I didn't know about 'fromListWith'. That's really the best answer provided the first element of the pairs is an instance of Ord.
combine = assocs . fromListWith (flip (++)) would also solve the reversal problem
mergeGroup = prod(head,concat) . unzip where prod (f,g) x = (f $ fst x, g $ snd x) is the "product" of a pair of functions.

Other Answer1

import Data.List hiding (group)

group :: (Eq a) => [(a, [b])] -> [(a, [b])]
group ((s,l):rest) = (s, l ++ concatMap snd matches) : group nonmatches
        (matches, nonmatches) = partition (\x-> fst x == s) rest
group x = x

this function produces the result:

group [("Dup", ["2", "3"]), ("Dup", ["1"]), ("Non", ["4"]), ("Dup", ["5"])]
    = [("Dup", ["2", "3", "1", "5"]), ("Non", ["4"])]

it works by filtering the remaining bits into two camps, the bits that match and the bits that dont. it then combines the ones that match and recurses on the ones that don't. This effectly means you will have one tuple in the output list per 'key' in the input list.


Your 'doubleFilter' function exists in Data.List as 'partition'.
Doesn't the choice to partition (doubleFilter) the list result in O(n^2) complexity? If so, then sorting the list first (see my solution) may improve the algorithm to O(n log n). Is this assertion correct or not?
you are correct. you could even increase efficiency on your solution by writing a custom mergesort that merges the groups in the merge phase, instead of merging after sorting.
I think it will be O(n*m) where m is the number of distinct values the first element in the tuples will take, so depending on the input, the sort and group version could be either faster or slower.
@barkmadley: That seems more trouble than it's worth :). By the way, you can now remove the import Data.Monoid statement, since you no longer use mappend.

Other Answer2

Another solution, using a fold to accumulate the groups in a Map. Because of the Map this does require that a is an instance of Ord (BTW your original definition requires that a is an instance of Eq, which barkmadley has incorporated in his solution).

import qualified Data.Map as M

group :: Ord a => [(a, [b])] -> [(a, [b])]
group = M.toList . foldr insert M.empty
    insert (s, l) m = M.insertWith (++) s l m

If you're a big fan of obscurity, replace the last line with:

    insert = uncurry $ M.insertWith (++)

This omits the unnecessary m and uncurry breaks the (s, l) pair out into two arguments s and l.