Domain-specific languages (DSLs) can be powerful tools to simplify certain aspects of programming. While DSLs can be used in most or all programming languages, I think there are a few aspects that make the implementation and use of DSLs especially rewarding in C++.
What is a DSL?
I won’t dwell on the topic; I’ll just quote the definition from Martin Fowler’s great book Domain-Specific Languages:
A computer programming language of limited expressiveness on a particular domain.
In this definition, the term “limited expressiveness” and the domain focus set a DSL apart from a general-purpose language. The term “language” sets it apart from a mere API, so code written in a DSL reads more fluently than a few statements lined up one after another.
DSLs can be divided into two major categories: embedded and external DSLs. Embedded DSLs are written in the host language, i.e. in our case, it would be some kind of special C++ code. External DSLs are usually plain text languages that have to be parsed and interpreted or even compiled.
If you want to know more about how DSLs work and how they can be implemented, I strongly suggest you read Martin Fowler’s book. It’s a must-read.
C++ and embedded DSLs
Embedded DSLs are easier to get started with than external DSLs because you can achieve some pretty expressive stuff without having to go through all the plain text processing.
Since the bits and pieces of an embedded DSL are constructs of the host language (i.e. C++), the compiler does the main work of parsing it and translating it to function calls. All we have to do is give those functions a meaning.
A well-known example of an embedded DSL can be found in some unit test frameworks. In such a DSL you would write the preconditions, actions and postconditions that you want to test like this:
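As a purely hypothetical illustration, such a fluent chain could be sketched as follows. None of the names here (`given`, `when`, `then`, `Scenario`, `Counter`, `aCounterAt`, `incrementedBy`, `valueIs`) come from a real framework; they are assumptions made up for this sketch, and the DSL line itself is the `return` statement in `example()`.

```cpp
#include <functional>
#include <utility>

// Hypothetical value under test; not taken from any real framework.
struct Counter {
    int value;
};

// Small fluent helper that carries the subject through the chain.
template <class T>
class Scenario {
public:
    explicit Scenario(T subject) : subject_(std::move(subject)) {}

    // Apply an action to the subject and keep the chain going.
    Scenario& when(const std::function<void(T&)>& action) {
        action(subject_);
        return *this;
    }

    // Evaluate the postcondition; returns true if it holds.
    bool then(const std::function<bool(const T&)>& expectation) const {
        return expectation(subject_);
    }

private:
    T subject_;
};

// Entry point of the chain: wraps the precondition.
template <class T>
Scenario<T> given(T subject) {
    return Scenario<T>(std::move(subject));
}

// Hypothetical vocabulary for the Counter domain.
inline Counter aCounterAt(int start) { return Counter{start}; }

inline std::function<void(Counter&)> incrementedBy(int n) {
    return [n](Counter& c) { c.value += n; };
}

inline std::function<bool(const Counter&)> valueIs(int expected) {
    return [expected](const Counter& c) { return c.value == expected; };
}

// The DSL line: precondition, action, postcondition.
inline bool example() {
    return given(aCounterAt(1)).when(incrementedBy(2)).then(valueIs(3));
}
```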
This is valid C++ code, if the needed functions exist. It is readable and the fluency that qualifies those functions as a DSL is apparent.
However, that line is valid Java or C# code, too. So what is special about C++ for embedded DSLs? I think there are two features that stand out, especially when they are combined: operator overloading and templates.
If you do it right, you can overload a few operators and give them a completely new meaning, building a readable embedded DSL. You are only limited by the language syntax, and with over 40 overloadable operators there is a lot to play with.
Together with templates, overloaded operators can get very powerful: for example, you can build expression templates and then analyze them with what amounts to the interpreter for the DSL.
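To make the idea of expression templates a bit more concrete, here is a minimal sketch for a toy arithmetic domain. It is not related to the path DSL described later; all names (`Var`, `Const`, `BinExpr`, `Add`, `Mul`, `is_expr`, `eval`) are assumptions chosen for this illustration. The overloaded operators do no arithmetic themselves. They only record the structure of the expression in a type, and `eval` plays the role of the interpreter.

```cpp
#include <type_traits>

// Leaves of the expression tree.
struct Var {};                    // the single variable of the expression
struct Const { double value; };   // a constant

// Operation tags that know how to combine two evaluated operands.
struct Add { static double apply(double a, double b) { return a + b; } };
struct Mul { static double apply(double a, double b) { return a * b; } };

// Inner node: the operation is part of the type, the operands are stored.
template <class Op, class L, class R>
struct BinExpr { L lhs; R rhs; };

// Trait that keeps the overloaded operators away from unrelated types.
template <class T> struct is_expr : std::false_type {};
template <> struct is_expr<Var> : std::true_type {};
template <> struct is_expr<Const> : std::true_type {};
template <class Op, class L, class R>
struct is_expr<BinExpr<Op, L, R>> : std::true_type {};

// The operators only build the tree; nothing is computed here.
template <class L, class R,
          std::enable_if_t<is_expr<L>::value && is_expr<R>::value, int> = 0>
BinExpr<Add, L, R> operator+(L lhs, R rhs) { return {lhs, rhs}; }

template <class L, class R,
          std::enable_if_t<is_expr<L>::value && is_expr<R>::value, int> = 0>
BinExpr<Mul, L, R> operator*(L lhs, R rhs) { return {lhs, rhs}; }

// The "interpreter": walks the tree and gives it a meaning.
inline double eval(Var, double x) { return x; }
inline double eval(Const c, double) { return c.value; }

template <class Op, class L, class R>
double eval(const BinExpr<Op, L, R>& e, double x) {
    return Op::apply(eval(e.lhs, x), eval(e.rhs, x));
}

// Usage: x * x + 1, built once as a type and then evaluated for a given x.
inline double example(double x) {
    auto expr = Var{} * Var{} + Const{1.0};
    return eval(expr, x);
}
```

Because the whole structure lives in the type of `expr`, a different interpreter could walk the same tree and do something else entirely, for example print it or simplify it.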
A simple example
Here is a sketchy example of an embedded DSL that I once wrote, using only a few operators and a handful of functions:
Consider a tree, consisting of relatively simple nodes. Each node carries a node type and an ID. In our program, we frequently needed to know whether there was a top-down path in that tree with certain nodes.
If there was a matching path, we wanted to extract (save a reference to) some of the node IDs and, for some nodes, some kind of annotation. We could not simply list each node in a path, because sometimes there could be unknown nodes between two known nodes, so we had to find a notation for optional “gaps of unknown”.
Here’s an example of such a path:
Nd(X, "foo") > *Nd(Y) >> *Nd(A, "bar")[annot] > Nd(B)
The meaning of this short piece of code is:
- Find a node of type X with ID “foo” (`Nd(X, "foo")`)
- Find a directly following (`>`) node of type Y, with any ID, and extract it (`*`).
- Find a node of type A and ID “bar”, somewhat further down the tree (`>>` denotes a “gap”)
- Extract that node, and annotate (`[]`) it with a certain object (`annot`)
- This node has to be directly followed by a node of type B
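The notation described in the list above can be imitated with a handful of overloads. The following is a simplified, runtime-only sketch that collects the steps in a flat list instead of building an expression template, so it is not the implementation discussed in this post. `Annotation`, `Path`, `joinSteps`, the member names of `NodeInfo` and the `NodeType` enum are assumptions made for the illustration.

```cpp
#include <string>
#include <utility>
#include <vector>

enum NodeType { X, Y, A, B };

// Hypothetical annotation object; the real one is not shown in the text.
struct Annotation {
    std::string label;
};

// What the interpreter should look for and what to do with a found node.
struct NodeInfo {
    NodeType type;
    std::string id;                          // empty string: any ID matches
    bool extract = false;                    // set by unary *
    const Annotation* annotation = nullptr;  // set by []
    bool afterGap = false;                   // true if reached via >>

    NodeInfo(NodeType t, std::string i) : type(t), id(std::move(i)) {}

    // Unary *: mark the node for extraction.
    NodeInfo operator*() const {
        NodeInfo copy = *this;
        copy.extract = true;
        return copy;
    }

    // []: attach an annotation. Only a pointer is stored,
    // so the annotation object must outlive the path.
    NodeInfo operator[](const Annotation& a) const {
        NodeInfo copy = *this;
        copy.annotation = &a;
        return copy;
    }
};

inline NodeInfo Nd(NodeType type, std::string id = "") {
    return NodeInfo(type, std::move(id));
}

// A path is a flat list of steps; a single node converts implicitly.
struct Path {
    std::vector<NodeInfo> steps;
    Path(NodeInfo single) : steps(1, std::move(single)) {}
};

// Appends rhs to lhs and records how the first node of rhs is reached.
inline Path joinSteps(Path lhs, Path rhs, bool gap) {
    rhs.steps.front().afterGap = gap;
    lhs.steps.insert(lhs.steps.end(), rhs.steps.begin(), rhs.steps.end());
    return lhs;
}

inline Path operator>(Path lhs, Path rhs) {
    return joinSteps(std::move(lhs), std::move(rhs), false);
}

inline Path operator>>(Path lhs, Path rhs) {
    return joinSteps(std::move(lhs), std::move(rhs), true);
}

// The path from the text, written with the overloads above.
inline Path examplePath(const Annotation& annot) {
    return Nd(X, "foo") > *Nd(Y) >> *Nd(A, "bar")[annot] > Nd(B);
}
```

In this flat form the different precedence of `>` and `>>` does not matter, because the steps always end up in source order no matter how the compiler groups the operands.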
How it works
The path expression above creates an expression template object, containing four NodeInfo objects which contain what the interpreter has to look for (node types and IDs) and what it has to do with the nodes it finds (extraction and annotations).
Due to C++’s operator precedence, the compiler parses the path expression like this:
Nd(X, "foo") > ( *Nd(Y) >> *Nd(A, "bar")[annot] ) > Nd(B)
               ^--- stronger precedence of >> --^
However, in our DSL the two operators are meant to have the same precedence, and the interpreter has to evaluate the path from left to right. Some template programming hacks therefore rearrange the result into an expression template of the type
Sequence<Node, GapSequence<Node, Sequence<Node, Node>>>.
In other words, it’s the same as if there were parentheses to form a proper head-tail structure:
Nd(X, "foo") > ( *Nd(Y) >> ( *Nd(A, "bar")[annot] > Nd(B) ) )
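One possible way to get that head-tail type, no matter how the compiler groups the operands, is to let the operators push each new right-hand side down to the end of the existing chain. The following is only a sketch of that idea and not the original code: `Node` is reduced to a bare struct, and `Direct`, `Gap`, `is_path` and `joinPath` are names assumed for the illustration.

```cpp
#include <type_traits>

// Stand-in for the real node description.
struct Node { int type; };

// Head-tail chains: the head is directly followed by the tail (Sequence)
// or followed after a gap of unknown nodes (GapSequence).
template <class Head, class Tail> struct Sequence    { Head head; Tail tail; };
template <class Head, class Tail> struct GapSequence { Head head; Tail tail; };

// Tags that say which kind of link an operator creates.
struct Direct {};
struct Gap {};

// Trait that restricts the operators to path building blocks.
template <class T> struct is_path : std::false_type {};
template <> struct is_path<Node> : std::true_type {};
template <class H, class T> struct is_path<Sequence<H, T>> : std::true_type {};
template <class H, class T> struct is_path<GapSequence<H, T>> : std::true_type {};

// Base cases: a single node on the left becomes the head of a new chain.
template <class Rhs>
Sequence<Node, Rhs> joinPath(Direct, Node lhs, Rhs rhs) { return {lhs, rhs}; }

template <class Rhs>
GapSequence<Node, Rhs> joinPath(Gap, Node lhs, Rhs rhs) { return {lhs, rhs}; }

// Recursive case: keep the head and its link kind, and attach the new
// right-hand side to the end of the tail.
template <class Tag, template <class, class> class Chain,
          class H, class T, class Rhs>
auto joinPath(Tag tag, Chain<H, T> lhs, Rhs rhs) {
    auto tail = joinPath(tag, lhs.tail, rhs);
    return Chain<H, decltype(tail)>{lhs.head, tail};
}

template <class L, class R,
          std::enable_if_t<is_path<L>::value && is_path<R>::value, int> = 0>
auto operator>(L lhs, R rhs) { return joinPath(Direct{}, lhs, rhs); }

template <class L, class R,
          std::enable_if_t<is_path<L>::value && is_path<R>::value, int> = 0>
auto operator>>(L lhs, R rhs) { return joinPath(Gap{}, lhs, rhs); }

// Same operator pattern as the path in the text: a > b >> c > d.
inline auto example() { return Node{1} > Node{2} >> Node{3} > Node{4}; }

static_assert(
    std::is_same<decltype(example()),
                 Sequence<Node, GapSequence<Node, Sequence<Node, Node>>>>::value,
    "the chain should have a head-tail structure");
```

The compiler still evaluates `Node{2} >> Node{3}` first. The recursive overload then takes the partial chain apart again and rebuilds it with the new node at the very end, which gives the left-to-right head-tail order.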
Reading and writing those paths takes a bit of getting used to. That is no wonder: it is a domain-specific language of its own that one has to learn, and while it is valid C++ syntax, its semantics are completely different from the garden-variety C++ code we are used to.
But you get very concise and easy-to-maintain code compared to doing the searching, extracting and annotating by hand each time. All of that has to be implemented only once, inside the interpreter, so there is little chance of getting it wrong again.
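To give an impression of what such an interpreter has to do, here is a small sketch of the search with extraction of IDs. It works on a flat list of steps rather than on an expression template and leaves out annotations. `TreeNode`, `Step`, `matchFrom`, `matchBelow` and `containsPath` are hypothetical names, and the sketch assumes that a gap stands for zero or more unknown nodes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical tree node: a type, an ID and child nodes.
struct TreeNode {
    int type;
    std::string id;
    std::vector<TreeNode> children;
};

// One step of a path description.
struct Step {
    int type;
    std::string id;  // empty string: any ID matches
    bool afterGap;   // true: unknown nodes may sit between this and the previous step
    bool extract;    // true: remember the ID of the matching node
};

inline bool nodeMatches(const TreeNode& node, const Step& step) {
    return node.type == step.type && (step.id.empty() || node.id == step.id);
}

inline bool matchFrom(const TreeNode& node, const std::vector<Step>& steps,
                      std::size_t index, std::vector<std::string>& extracted);

// Tries to place steps[index] below parent: as a direct child, or, if the
// step may follow a gap, anywhere deeper in the subtree.
inline bool matchBelow(const TreeNode& parent, const std::vector<Step>& steps,
                       std::size_t index, std::vector<std::string>& extracted) {
    for (const TreeNode& child : parent.children) {
        if (matchFrom(child, steps, index, extracted)) return true;
        if (steps[index].afterGap && matchBelow(child, steps, index, extracted)) return true;
    }
    return false;
}

// Tries to match steps[index] at node and the remaining steps below it.
// On failure, extracted is left as it was on entry.
inline bool matchFrom(const TreeNode& node, const std::vector<Step>& steps,
                      std::size_t index, std::vector<std::string>& extracted) {
    if (!nodeMatches(node, steps[index])) return false;
    if (steps[index].extract) extracted.push_back(node.id);
    if (index + 1 == steps.size() || matchBelow(node, steps, index + 1, extracted)) return true;
    if (steps[index].extract) extracted.pop_back();  // undo when backtracking
    return false;
}

// Searches the whole tree for a top-down path that matches all steps.
inline bool containsPath(const TreeNode& root, const std::vector<Step>& steps,
                         std::vector<std::string>& extracted) {
    if (steps.empty()) return true;
    if (matchFrom(root, steps, 0, extracted)) return true;
    for (const TreeNode& child : root.children) {
        if (containsPath(child, steps, extracted)) return true;
    }
    return false;
}
```

The backtracking and the bookkeeping for extracted IDs are the parts that are easy to get subtly wrong when they are written by hand for every single path.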
So, C++ is really good for building embedded DSLs. However, it is not bad for external DSLs either, which I will write about next time.
I find using operator overloading in C++ a pretty nasty thing. Knowing the rules for how operators in C and C++ behave with the standard types is hard enough for most people. Defining an arbitrary set of rules makes program behaviour even less obvious. A while ago I used SystemC, a hardware modelling library written in C++. It overloaded nearly every operator. Development with the framework was like learning an entirely new language. Of course this language, or DSL if you wish, inherited all the pitfalls of C++ and TMP as well. IMHO, creating an actual language would have been the better choice.
Hi Martin, thanks for your comment! Operator overloading can get nasty, if you overdo it. Using operator overloading to build an embedded DSL is a compromise and has its tradeoffs like everything else in programming. If an embedded DSL does not make the use of a framework or the expressiveness in a problem domain simpler, it has failed. However, there are situations that are complex or repetitive by nature, where a DSL can help. In those cases, learning a small language can prove more viable than learning a complicated API. If an embedded DSL gets too complicated and inherits problems from the host language like in your case, an external DSL might indeed be the better alternative.
There is no general rule for when embedded DSLs are a good or a bad solution; it has to be decided on a case-by-case basis.