I recently gave a "mental challenge" in form of writing a clusterBy
function that take a list and a function that calculates some result on every element of the list; the result of the clusterBy
function should be a list of clusters. A cluster is a list of elements from the original list for which the clustering function returns the same value.
("one", "two", "three", "eleven", "twelve")
should cluster to (List(three), List(twelve, eleven), List(two, one))
. We create three clusters: one for strings of length 3, one for strings of length 5 and one for strings of length 6. We then add the values to the appropriate cluster. I had a go at implementing the code in Scala, and here's my first attempt. For historical purposes, and to show different approaches, I will keep returning to this post as I discover better implementations. def clusterBy[A, B](l: Iterable[A])(f: A => B) = l.map(e => (f(e), e))./:(new HashMap[B, List[A]])( (res, cur) => res + ((cur._1, cur._2 :: res.getOrElse(cur._1, Nil))) ).values
Calling clusterBy(List("one", "two", "three", "eleven", "twelve"))(_.length)
returns MapLike(List(three), List(twelve, eleven), List(two, one))
. Success! So, what did I do? Let's explore each step of the way.
![cluster-by-1](https://static.oschina.net/uploads/img/201707/19152824_oq9J.png)
def clusterBy[A, B](l: Iterable[A])(f: A => B)
, which defines a function called clusterBy
with two generic types A
and B
--that's the def clusterBy[A, B]
bit. The function is curried--it takes two parameter lists. The first parameter list includes just one element, l: Iterable[A]
. This is the source data, a container that holds source values of type A
. The second parameter list includes just one element, f: A => B
, which is a function that takes the element type (A
) and maps it to some other type B
. The next line starts with l.map(e => (f(e), e))
which maps the elements of the parameter l
into some other container that contains tuples made up of the mapped element and the original element. So, if l
were List("x", "xy")
, l.map(e => (f(e), e))
would return List((1, "x"), (2, "xy"))
. Next, we fold left the returned list using empty map (that's the new HashMap[B, List[A]]
--our grouped by container). The fold operation (working on the tuples) takes the first element of the tuple as the key in the group by map and the second element of the tuple is the value that should be added to the value associated with the key in the map. The line (res, cur) => res + ((cur._1, cur._2 :: res.getOrElse(cur._1, Nil)))
receives the res
running result of the fold left operation (the map) and cur
, which is every (f(element) -> element) pair. The operation returns a map that is constructed by putting (in the old Java speak) a pair where the key is cur._1
and the value is either a new list with just cur._2
in it, or a list constructed by adding cur._2
to the existing list.
values
of the constructed group by container.