Formatting with first-class types

It all started with an innocuous snippet of Idris posted on Twitter. But before we get into details, we need to talk about format strings and first-class types.

Format strings are an archetypal example demonstrating dependent typing. It’s a simple but telling use case. In most languages we program in we don’t have access to first-class types. As a result, we have to rely on various tools to help us program with format strings. Those tools may be embedded in our compilers, IDEs, or other analyzers. In any case, they are not intrinsic to the language.

Here, for instance, a C compiler is warning us of an invalid call to printf.

$ cat printf.c
#include <stdio.h>

int main() {
  return printf("%i\n");
}

$ gcc printf.c
printf.c: In function ‘main’:
printf.c:4:19: warning: format ‘%i’ expects a
               matching ‘int’ argument [-Wformat=]
   return printf("%i\n");
                  ~^

That error got caught at compile time. Unfortunately, sometimes the feedback loop is not as tight. Let’s consider this piece of Java interacting with a SQL database.

import java.sql.*;

PreparedStatement statement =
  conn.prepareStatement("update rockets \
                         set launched_at = ? \
                         where id = ?");
statement.setInt(4, 12345);
statement.execute();

The code is invalid: there’s no fourth argument, and the two which remain are never set. In spite of errors, the compiler accepts the file without complaints. Unless our IDE warns us earlier, we have to count on feedback from our test suite or static code analysers like SonarJava.

In a language with first-class types the situation changes. Types and the number of arguments to a function like printf can be determined at compile time based on its first parameter: the format string.

Idris is a language with first-class types. In fact, the format string problem appears as an example in Edwin Brady’s Type-Driven Development with Idris. I can’t recommend this book enough; get a copy if you have a chance.

The first five chapters of the book offer a gentle and unassuming introduction to the language. In chapter six the author leads us through a process of defining a printf function which will fail to compile unless arguments match the specification in the format string. We can experiment with the printf in the REPL:

Idris> :t printf "%d is a whole number"
printf "%d is a number" : Int -> String
Idris> printf "%d is a whole number" 3
"3 is a whole number" : String
Idris> printf "%d is a whole number" 0.1
builtin:Type mismatch between
        Double (Type of 0.1)
and
        Int (Expected type)

What’s important to note is that the rules you see above are not embedded into the language, the standard library, or the compiler. We specify them ourselves by using Idris' type system. How? Let’s take a look, loosely following the solution from the book.

The implementation begins with a definition of a type representing our format string. We will limit ourselves to formats with %d and %s specifications.

data Format = Number Format
            | Str Format
            | Lit String Format
            | End

We represent a format string "%d is a %s" as a value Number (Lit " is a " (Str End)). Specifically:

it begins with a number passed as an argument to printf,
then it continues with a literal constant string " is a ",
then it expects a string passed as another argument to printf, and
it ends.

Given a Format like the one above we want to compute the type of our printf function. We need a function for this. We’ll call it PrintfType, and its implementation might look as follows.

PrintfType : Format -> Type
PrintfType (Number fmt)  = Int -> PrintfType fmt
PrintfType (Str fmt)     = String -> PrintfType fmt
PrintfType (Lit str fmt) = PrintfType fmt
PrintfType End           = String

If the format is a Number, expect an Int argument and continue computing the type. If it’s a Str, expect a String argument. If it’s a literal, no argument is necessary but we still need to continue parsing; the rest of the format might need some arguments. Finally, at the end of the format we end up with a regular String.

Using the REPL, we can verify that the type for our format is correct. It’s a type describing a function, taking an Int and a String and returning a String.

Idris> PrintfType (Number (Lit " is a " (Str End)))
Int -> String -> String : Type

Now we need a way of parsing our format string into its Format representation.

stringToFormat : String -> Format
stringToFormat str = toFormat (unpack str)

Using unpack we transform the given string into a list of characters. Such a list is much easier to process and pattern-match against than a normal string. After unpacking we parse the list using the following toFormat function.

toFormat : List Char -> Format
toFormat [] = End
toFormat ('%' :: 'd' :: rest) = Number (toFormat rest)
toFormat ('%' :: 's' :: rest) = Str (toFormat rest)
toFormat (char :: rest) = lit rest [char]
  where
    lit : List Char -> List Char -> Format
    lit [] seen = Lit (pack seen) End
    lit ('%' :: cs) seen = Lit (pack seen)
                               (toFormat ('%' :: cs))
    lit (c :: cs) seen = lit cs (seen ++ [c])

Let’s go through all the top-level cases.

If no characters are left, we’re done and return End.
If the list begins with "%d", we parse it as a Number.
If it begins with "%s" — it’s a Str.
Otherwise, we read a literal string using the lit helper function.

lit keeps reading from its first argument and accumulates read characters in the second one. Upon encountering a '%' we know we have a complete literal. Using pack — the inverse of unpack — we transform what we have back into a string. We return the string in a Lit, with the rest of parsing being handled by a call to toFormat.

Let’s try it out by parsing a string into a Format and then computing its PrintfType.

Idris> stringToFormat "%d is a %s"
Number (Lit " is a " (Str End)) : Format
Idris> PrintfType (stringToFormat "%d is a %s")
Int -> String -> String : Type

Great! Now it’s time to wrap it up by declaring the type of printf.

printf : (fmt : String) -> (PrintfType (stringToFormat fmt))

Can you see? printf takes a String as its first argument and the rest of its signature is computed from the first argument. Let’s inspect it in the REPL!

Idris> :t printf "hello"
printf "hello" : String
Idris> :t printf "hello, %s"
printf "hello, %s" : String -> String
Idris> :t printf "%s costs %d€"
printf "%s costs %d\8364" : String -> Int -> String

The implementation of the printf function remains an open topic, but we’ve solved the most interesting part of the problem. Code using our printf will fail to compile if our arguments don’t match the format string. Using exactly the same approach we could bullet-proof our functions for building SQL statements.

At this point, I can’t recommend Type-Driven Development with Idris enough. In chapter six you will find a far more precise discussion of printf’s implementation, including an alternative definition of toFormat.

Having an understanding of how printf works, we are well-equipped to attack the next problem: format strings with precision specifications, such as "%0.3f".

We’ll also feature the aforementioned snippet from Twitter. But for now, let’s take a break, have a walk outside, and let what we saw sink in. We’ll be back soon. Until then!

Blog-Post

Formatting with first-class types

Stringly-typed APIs under strict supervision.

TAGS