Extend Your Language, Don't Alter It

Hirrolot's Blog

Jul 6, 2021

r/programming · r/rust · r/ProgrammingLanguages

Sometimes your programming language lacks a specific feature that will make your life easier. Perhaps language developers look upon it with a great deal of reluctance and skepticism, or are not going to implement it at all. But you need it, you need this feature right here and right now. What are you going to do then?

Generally, you have two approaches: first, you can continue living an utterly miserable and hopeless life without the feature, and second, you can implement the feature by means of some kind of meta-abstraction.

The first approach would introduce additional technical debt into your architecture. Given that sufficiently large codebases written in such languages as C already carry a considerable amount of debt 1, this would make the situation even worse. The second approach, where you roll up your sleeves and implement a new linguistic abstraction yourself, unlocks much more potential: you no longer need to wait for a new language release with the desired functionality (which might never happen), and your codebase will suffer less because you can now express what is important for sustaining code elegance.

However, when you find yourself engineering a custom linguistic abstraction, you are in the position of a language designer. In practice, this means that things can get especially tricky, because your feature ought to fit well with all the other features your host language already has. In particular, the desired ability must look natural: you should feel as if you were still programming in the same general-purpose PL, only with the new feature added; it should not feel like an alien spacecraft that has fallen to Earth. In this post, I will elaborate on the example of three PLs supporting user extension: C, Rust, and Common Lisp 2. I will show you how to extend the language, not to alter it.

Establishing the terminology

What do I mean by that gorgeous term, linguistic abstraction, to which I have referred twice in the introduction? Basically, it is any language “feature”: a variable, an interface, a function. And guess what a metalinguistic abstraction means? Recall that the prefix “meta” simply means that the thing is used to deal with something of a similar nature: a metaprogram manipulates other programs, metadata describe other data, a metagrammar specifies other grammars, and so on. From this we conclude that a metalinguistic abstraction is a linguistic abstraction used to deal with other linguistic abstractions. I am only aware of two kinds of them: code generation (or macros, or metaprogramming, whichever term you prefer 3) and type systems.

Why are macros “meta”? Well, macros can do pretty much anything with code: they can accept it, transform it, and emit it… remember how Rust allows you to execute arbitrary instructions inside procedural macros, remember the incremental TT muncher pattern, or how you can brutally imitate type polymorphism through the C preprocessor (Patrick Pelissier, n.d.; Tyge Løvset, n.d.; Leonardo Vencovsky, n.d.). Whilst macros can manipulate other code, other code cannot manipulate macros 4 – this is why macros are “meta”.
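
To make the last point concrete, here is a minimal sketch of the idea (my own illustration, not code from the cited libraries): the preprocessor imitates type polymorphism by stamping out a separate, type-specific definition for every type it is instantiated with.

#define DEFINE_MAX(T)                   \
    static inline T max_##T(T a, T b) { \
        return a > b ? a : b;           \
    }

DEFINE_MAX(int)    // Defines max_int.
DEFINE_MAX(double) // Defines max_double.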

Why are types “meta”? What you usually accomplish with metaprogramming, you can often entrust to sufficiently expressive types 5. Returning to our poor C preprocessor: in Rust, you can simply use generics instead of instantiating type-specific code by hand. Or you can go insane and play with type lists instead of (ab)using the compile-time macro collections of Boost/Preprocessor. So types are capable of metaprogramming to some extent (Alexis King, n.d.; Wikipedia, n.d.b; Haskell Wiki, n.d.; Will Crichton, n.d.b, n.d.a; Shea Leffler, n.d.b, n.d.a; Paho Lurie-Gregg and Andre Bogus, n.d.; Szymon Mikulicz, n.d.; Edwin Brady, n.d.) – this is why they are “meta”.

The presence of either of these tools in our language lets us extend it with custom concepts in an embedded manner, i.e., without the intervention of third-party utilities. However, today I will discuss only the syntax business – macros.

Having the terminology established, let us dive into the pragmatics!

Syntactical consistency

datatype(
    BinaryTree,
    (Leaf, int),
    (Node, BinaryTree *, int, BinaryTree *)
);

int sum(const BinaryTree *tree) {
    match(*tree) {
        of(Leaf, x) return *x;
        of(Node, lhs, x, rhs) return sum(*lhs) + *x + sum(*rhs);
    }

    return -1;
}

What is this? This is how good old C looks with Datatype99, a library that provides full support for algebraic data types. Please pay attention to the pattern matching syntax. Does it feel alright? Does it feel natural, like it has always been there? Absolutely. Now gaze upon this imaginary piece of code:

int sum(const BinaryTree *tree) {
    match(
        *tree,
        {
            of(Leaf, (x), return *x),
            of(Node, (lhs, x, rhs), return sum(*lhs) + *x + sum(*rhs)),
        });

    return -1;
}

I ask you the same question: does it feel alright? Does it feel natural, like it has always been there? Absolutely NOT. While it might look fine in another language, it looks utterly weird in C. But what, actually, is the essential difference between these two code snippets, the difference that makes the former look proper and well-formed, whereas the latter looks like a deformed creature? Syntactical consistency.

By syntactical consistency, I understand the degree to which the grammar of a particular meta-abstraction (e.g., the macros match & of) conforms to or coincides with the grammar of the host language. Recall that in C-like languages, we often see constructions of the form <keyword> (...) <compound-statement> 6:
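
while (/* condition */) { /* statements */ }
if (/* condition */) { /* statements */ } else { /* statements */ }
for (/* initialisation */; /* condition */; /* step */) { /* statements */ }
switch (/* expression */) { /* statements */ }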

But we do not see constructions like these:
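
while(/* condition */, { /* statements */ });
if(/* condition */, { /* statements */ }, { /* statements */ });
for((/* initialisation */; /* condition */; /* step */), { /* statements */ });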

Got the pattern? The proper syntax of match coincides with the syntax of the host language, C in our case, whereas the latter one does not. Another example:

#define State_INTERFACE               \
    iFn(int, get, void *self);        \
    iFn(void, set, void *self, int x);

interface(State);

typedef struct {
    int x;
} Num;

int Num_State_get(void *self) {
    return ((Num *)self)->x;
}

void Num_State_set(void *self, int x) {
    ((Num *)self)->x = x;
}

impl(State, Num);

This time you see pure ISO C99 augmented with Interface99, a library that provides the software interface pattern. Notice that the function definition syntax remains the same (albeit iFn is somewhat less common), and impl just deduces these definitions (Num_State_get & Num_State_set) from the context. Now consider this:

impl(
    (State) for (Num),

    (int)(get)(void *self)({
        return ((Num *)self)->x;
    }),

    (void)(set)(void *self, int x)({
        ((Num *)self)->x = x;
    }),
);

This macro impl does not follow the syntax of C. This is why it looks so odd.

Both alternatives have the same semantics and the same functionality. The difference is only in the syntax. Always try to mimic the syntax of your host language, and you should be fine. Do not try to alter common syntactical forms like function/variable definitions. This is what I call syntactical consistency 7.

The bliss of Rust: Syntax-aware macros

While C/C++ macros work with preprocessing tokens (ISO C, n.d.), Rusty macros work with fragments of Rust syntax, and sometimes with raw token trees. This is cool because they let you imitate the syntax of Rust: you can parse function definitions, structures, enumerations, or pretty much anything! Consider tokio::select!:

tokio::select! {
    v1 = (&mut rx1), if a.is_none() => a = Some(v1.unwrap()),
    v2 = (&mut rx2), if b.is_none() => b = Some(v2.unwrap()),
}

See? The <something> => <something> syntax is much like native Rusty pattern matching. Because of this, the syntax looks very familiar, and even if you are not yet acquainted with the macro, you can already roughly understand what is happening.

Another example is the derive macros of Serde:

#[derive(Serialize, Deserialize)]
struct Person {
    name: String,
    age: u8,
    phones: Vec<String>,
}

Here, Serialize & Deserialize are indeed macros. They parse the contents of struct Person and derive the corresponding traits for it. You do not need to adjust the definition of the structure because the syntax is shared, and this is awesome. If I were designing a new language of my own and there were a need for macros, I would definitely endeavour to make them work nicely with the ordinary syntax.

The bliss of Lisp: Why S-expressions are so hot

“Are you quite sure that all those bells and whistles, all those wonderful facilities of your so-called powerful programming languages, belong to the solution set rather than the problem set?”

Edsger Dijkstra

Rust’s syntax is not simple. As quite often happens in software engineering, a programming language grammar is a trade-off: the more convenient and elaborate the surface syntax, the harder it is to manipulate programmatically.

Citing David Tolnay:

“The macro author is responsible for the placement of every single angle bracket, lifetime, type parameter, trait bound, and phantom data. There is a large amount of domain knowledge involved and very few people can reliably produce robust macros with this approach.”

David Tolnay (David Tolnay, n.d.)

As opposed to Rust, there is a solution in a completely different direction – s-expressions. Instead of oversophisticating the language grammar with each subsequent release, some people decide to keep the grammar trivial at all times. This approach has a bunch of far-reaching implications, including simplified IDE support and language analysis in general. Metaprogramming becomes more malleable too, because you only need to handle a single homogeneous structure (a list); you do not need to deal with the intimidating variety of syntactic forms your host language accommodates.

To come back to our muttons, the nature of s-expressions is to facilitate syntactical consistency. Consider this: if there are only s-expressions and nothing more, you can imitate any language item with simple macros – everything will look the same. Even with so-called “powerful” Rusty macros, we cannot do this:

delegate!(self.inner) {
    pub fn is_empty(&self) -> bool;
    pub fn push(&mut self, value: T);
    pub fn pop(&mut self) -> Option<T>;
    pub fn clear(&mut self);
}

The only way is to write it like this:

delegate! {
    to self.inner {
        pub fn is_empty(&self) -> bool;
        pub fn push(&mut self, value: T);
        pub fn pop(&mut self) -> Option<T>;
        pub fn clear(&mut self);
    }
}

Adapted from Kobzol/rust-delegate, a library for automatic method delegation in Rust.

Clearly less nifty. The match control flow operator can do that, so why cannot your “powerful” macros? Look:

let x = Some(5);
let y = 10;

// match!(x) { ...} ? Hmm...
match x {
    Some(50) => println!("Got 50"),
    Some(y) => println!("Matched, y = {:?}", y),
    _ => println!("Default case, x = {:?}", x),
}

Adapted from the chapter of TRPL about pattern matching.

Even if it could be fixed, this example still demonstrates well the rift between the main syntax and user-defined macros in Rust: sometimes, due to its multifaceted grammar, the language just does not allow us to express things naturally. One possible solution is leveraging an adaptive grammar:

“An adaptive grammar is a formal grammar that explicitly provides mechanisms within the formalism to allow its own production rules to be manipulated.”

Wikipedia (Wikipedia, n.d.a)

Basically, what this means is that you can specify your own syntactic forms (like match or if) right inside a source file, and a built-in parser will do the trick. Idris supports a feature called syntax extensions, which is, to the best of my understanding, pretty much like an adaptive grammar; believe it or not, the if ... then ... else syntax is not built into the Idris compiler but is rather defined via the ifThenElse function:

ifThenElse : (x:Bool) -> Lazy a -> Lazy a -> a;
ifThenElse True  t e = t;
ifThenElse False t e = e;

Which is invoked by the following syntactic rule:

syntax if [test] then [t] else [e] = ifThenElse test t e;

Similar syntactical constructions can be defined in the same way. There is no need to wait a couple of years until language designers decide to ship a new release – do it right here and right now. Yes, you would be right to say that Rust is extensible, but the thing is that its extensibility is still quite limited 8, and sometimes unpleasant 9.

Extending good old C

This is all exciting and fun, but how do you apply this knowledge in practice? I have an answer – rather a long one, full of peculiar details and techniques.

I suggest you start with the popular article by Simon Tatham about metaprogramming custom control structures in C 10. If you are only interested in a working solution, consider Metalang99 11 with its statement chaining macros. Seeing how pattern matching works in Datatype99 (Hirrolot, n.d.) can also give you some insight.
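
To give a taste of the technique, here is a minimal sketch in the spirit of Tatham’s statement prefixes (my own illustration, not code from Metalang99 or Datatype99): a custom repeat(n) construct that accepts a regular compound statement, just like a native keyword would.

#include <stdio.h>

// repeat(n) { ... } executes the following statement n times. It expands to
// a plain for loop, so it is a statement prefix: whatever statement comes
// after it becomes the loop body.
#define repeat(n) \
    for (int i_ = 0, n_ = (n); i_ < n_; i_++)

int main(void) {
    repeat(3) {
        puts("Hello!");
    }
    return 0;
}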

Final words

Some languages are more amenable to user extension than others. Some employ adaptive grammars (Idris), some employ syntax-aware macros (Rust), some employ Lisp-style s-expressions. Surely, there are a lot of alternatives in the design space, each with its own benefits and downsides.

The intent of this blog post was to advocate the principle of syntactical consistency.

I encourage you to mimic the syntax of a host language when writing macros, to make your code look more eye-pleasing, less like a malevolent beast.

I encourage you to extend, not to alter.

References

Alexis King. n.d. “An Introduction to Typeclass Metaprogramming.” https://lexi-lambda.github.io/blog/2021/03/25/an-introduction-to-typeclass-metaprogramming/.
David Tolnay. n.d. “Compile-Time Reflection API for Developing Robust Procedural Macros (Proof of Concept).” https://github.com/dtolnay/reflect.
Edwin Brady. n.d. Type-Driven Development with Idris. https://www.manning.com/books/type-driven-development-with-idris.
Haskell Wiki. n.d. “GADTs for Dummies.” https://wiki.haskell.org/GADTs_for_dummies.
Hirrolot. n.d. “Compiling Algebraic Data Types in Pure C99.” https://hirrolot.github.io/posts/compiling-algebraic-data-types-in-pure-c99.html.
ISO C. n.d. “C99 | 6.4 Lexical Elements.” http://www.open-std.org/JTC1/SC22/WG14/www/docs/n1256.pdf.
Leonardo Vencovsky. n.d. “Easy to Use, Header Only, Macro Generated, Generic and Type-Safe Data Structures in C.” https://github.com/LeoVen/C-Macro-Collections.
Paho Lurie-Gregg, and Andre Bogus. n.d. “Compile Time Numbers in Rust.” https://github.com/paholg/typenum.
Patrick Pelissier. n.d. “Generic Type-Safe Container Library for C Language.” https://github.com/P-p-H-d/mlib.
Shea Leffler. n.d.a. “A Macro for Defining Type Operators in Rust.” https://github.com/sdleffler/type-operators-rs.
———. n.d.b. “Rust’s Type System Is Turing-Complete.” https://sdleffler.github.io/RustTypeSystemTuringComplete/.
Szymon Mikulicz. n.d. “Compile-Time Compiler That Compiles Forth to Compile-Time Trait Expressions.” https://github.com/Ashymad/fortraith.
Tyge Løvset. n.d. “Standard Template Containers for C.” https://github.com/tylov/STC.
Wikipedia. n.d.a. “Adaptive Grammar.” https://en.wikipedia.org/wiki/Adaptive_grammar.
———. n.d.b. “Higher-Order Abstract Syntax Through GADTs.” https://en.wikipedia.org/wiki/Generalized_algebraic_data_type#Higher-order_abstract_syntax.
Will Crichton. n.d.a. “Implementing a Type-Safe Printf in Rust.” https://willcrichton.net/notes/type-safe-printf/.
———. n.d.b. “Type-Level Programming in Rust.” https://willcrichton.net/notes/type-level-programming/.

  1. “Any sufficiently complicated C or Fortran program contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.” – Greenspun’s tenth rule, an amusing quote with which I do agree a lot.↩︎

  2. I wish I had more experience with limitless Lisp-style user extension, as in Racket or Common Lisp. Maybe my post would have been more profound then.↩︎

  3. There is also runtime reflection, though I am not sure whether it is a special kind of metaprogramming or not. Maybe JIT macros could outperform Java-style runtime reflection? Oh my god, that is insane…↩︎

  4. Unless macros generate other macros! C/C++ macros cannot do that, while Rusty and Lispy macros can.↩︎

  5. It might not be formally correct… if your metaprogramming system is Turing-complete, how would you reduce everything to a logically consistent type system, such as that of Idris? Surely, this is out of the scope of this blog post.↩︎

  6. A compound statement is a sequence of statements put into curly braces.↩︎

  7. C99-Lambda is yet another terrifying example of abusing the preprocessor. It attempts to alter the native function definition syntax, and therefore it looks so odd.↩︎

  8. On the other hand, limitless extensibility tied to a complicated syntax would mean developers struggling to produce reliable macros once again. What a disappointment! Back to s-expressions…↩︎

  9. For example, the syntax of match is gradually evolving over time. Not so long ago, the core team announced “or” patterns. With an adaptive grammar, this feature could be implemented in libraries.↩︎

  10. To the best of my knowledge, Simon Tatham was the first to formulate the term statement prefix. He described precisely how to build custom control flow operators via regular #define macros and make them look natural.↩︎

  11. Metalang99 is an advanced metaprogramming system for C99. It is implemented as a purely functional programming language, with partial applications, recursion, algebraic data types, cons-lists, and all the stuff.↩︎