Over-generalization

Sometimes when I factor out a function, I see a more general algorithm hidden in that function. Of course, that algorithm should be factored out of the new function. After that, it can be generalized to accept arbitrary types, or to use iterators instead of a container, or have different numbers of parameters and so on. And if I am not careful, I spend hours until I realize I got carried away writing a simple function that should have taken no more than 15 minutes to write.

Example

This is a slightly modified real world example at the start. I caught myself relatively early in the process of generalizing the code, so what you will see here goes beyond what I actually did before I stopped.

Assume you have a bunch of objects, associated with certain values. These associations get altered, destroyed or newly created during a process, and you want to know what happened. You usually store the information about those associations in a `std::map<ID, value>` where the `ID` is a handle to the objects. In other words, you want to have a diff of two of these maps, one saved before, and the other after the process.

First roll

My first solution was to have a function that takes the two maps by non-const reference, leaving the unique keys (the associations that hat been deleted or created) in each map, and returning a new map containing the changed values. The unchanged values were not of interest, so they were simply discarded.

typedef std::map<ID, Value> Associations;

//...

std::map<ID, std::pair<Value, Value>> 
diffAssiciations(Associtations& before, Associations& after) {
  std::map<ID, std::pair<Value, Value>> changed;

  auto itBefore = begin(before);
  auto itAfter = begin(after);
  while (itBefore != end(before) && itAfter != end(after)) {
    auto idBefore = itBefore->first;
    auto idAfter = itAfter->first;

    if (idBefore < idAfter) {
      ++itBefore;
    } 
    else if (idAfter < idBefore) {
      ++itAfter;
    } 
    else /*same ID*/ {
      auto valBefore = itBefore->second;
      auto valAfter = itAfter->second; 

      if (valBefore != valAfter) {
        changed[idBefore] = make_pair(valBefore, valAfter);
      }

      itBefore = before.erase(itBefore);
      itAfter = after.erase(itAfter);
    }
  }  
  return changed;
}

That’s it. Written in less than ten minutes, a minute or two more to check if I missed some corner cases, and I was done.

Generalizations

It is easy to see this has not to be restricted to the maps at hand, it can be used for any map. So the first little modification is simple:

template <class K, class V>
std::map<K, std::pair<V, V>> 
diffMaps(std::map<K,V>& before, std::map<K,V>& after) {
  std::map<K, std::pair<V, V>> changed;

But hey, what if there was a map with a custom comparator?

template <class K, class V, class C>
std::map<K, std::pair<V, V>, C> 
diffMaps(std::map<K,V, C>& before, std::map<K,V, C>& after) {
  std::map<K, std::pair<V, V>, C> changed;
  C comp;
  //... 
  if (comp(idBefore, idAfter)) {  
    ++itBefore; 
  } 
  else if (comp(idAfter, idBefore)) {
  //...

And what about only comparing parts of two maps, i.e. iterator ranges instead of the whole container? The erase would not work then. And `multimap`s? One would have to compare the `equal_range`s of each key. And it’s completely not clear how that should be done, so maybe a custom comparator is needed for ranges with equal keys.

Solution

I stopped thinking further about it at that point. Not because I did not want to dig deeper into possible abstractions, but because the colleague who had presented me with the problem told me he did not want to have three different containers for deleted, added and changed elements.

The solution we used in the end was returning a `map<ID, tuple<char, Value, Value>>` which contained all information together. The char in each tuple would be one of `a`, `c`, `d`, `e` – indicating whether the association had been added, changed, deleted or was equal. That way a loop over the result with a simple switch on the char was possible without having to compare the values again. We simply inserted default-constructed values (an empty string, as far as I remember) as the “after” value for deletions and as “before” value for additions.

The algorithm was pretty similar to the one shown above, except that after one iterator has reached it’s container’s `end()`, there needs to be a loop for the rest of the other container to add the remaining deletions or additions to the result.

We did use a template, because my colleague was sure he did need the same algorithm for different keys and values later. However, we did not generalize for custom comparators nor for iterator ranges or multimaps.

Unneeded generalization hurts simplicity

I hope it is pretty clear that things can get complicated easily when you go over board generalizing an algorithm like the one above. The generalization for multimaps would have been a pain to implement and maintain, and even to use unless I had invested even more time in making the interface more convenient, maybe by providing wrapper functions for common cases.

While it is only a minor addition, the generalization for custom comparators reduces simplicity. The additional template parameter for the comparator is just boilerplate that reduces readability, and `idBefore < idAfter` is way easier to read than `comp(idBefore, idAfter)`, because we are simply used to operators and don’t have to keep in mind that `comp` is normally nothing more than that.

Even making the function a template should not be taken lightly. While the code is essentially the same, it has to be put into the header instead of the source file. That has implications on the compilation time and maybe on readers that might just skim the header to see which functions are available.

Nevertheless we decided to make our final solution a template, even though we did not need it right away. We did so because we know for sure we it will be needed it in the next days.

Generalizations has a cost. Don’t generalize a function because you have a hunch that “some day in the future” the generalization might be useful.

Generalizations and TDD

In Test Driven Development the credo is that one should write only what is needed to pass the tests. That implies that in the example above, no generalization should have been made, not even the template. That generalization would be only allowed when a second use case for the function appears and the code can be reused by generalizing it instead of writing a similar second function.

While I think it is a good idea that the code should be kept as simple as possible and only as complex as needed, I also think it is OK to think a bit ahead and prepare for the immediate future instead of having to revisit a utility function every few hours because a little generalization is needed. In my opinion, a pragmatic balance between the two extremes is the right thing to strive for.

Further reading

Gernot Klingler has written a related post that covers the broader topic of doing too much.

Previous Post
Next Post

4 Comments




  1. Great post! I have referred to this phenomenon as “the seduction of generality” and every programmer can hear those Sirens calling to them off the shore as they work on code.

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *