Contributing an mdoc reader to Pandoc
Pandoc is my favorite piece of
software. Unix manual pages are my favorite form of software
documentation. My favorite manual pages are from OpenBSD, and they’re
written in the mdoc
macro language. So I contributed
support for mdoc
input to Pandoc. I sat down one day in
front of a blank page, typed a few things, and decided it was probably
beyond my capabilities to work on. Then the next day I started over, and I pretty much didn’t stop until I finished. The mdoc
reader shipped with Pandoc 3.6 on December 7, 2024. This is how I got
there.
The nature of this essay is that I have to talk about three distinct things with names that end in “doc”, so here is a quick reference you can jump back to if you aren’t already familiar with them all:
- Pandoc is a program that can convert between several dozen markup languages and document formats.
- mdoc is a markup language designed for marking up manual pages, with semantic markup specific to that domain.
- mandoc is a program that formats manual pages written in the mdoc and man languages for display in various output formats.
Nobody’s confused, right?
(btw there’s a feed for these articles now if you want that)
There are a few good reasons to add mdoc
support to
Pandoc, beyond it being an interesting thing for me to work on. People
like to single-source their software documentation to avoid duplication
and drift, and Pandoc is a helpful tool for that given its wide range of
supported input and output formats. I believe that the manual page is
the best documentation format for software on Unix-like systems and that
mdoc
is the superior macro language for manual pages. Even
when newly-written software does provide manual pages (which is less
common than I would like), they’re often still written in the
man
language, despite mdoc
being supported
everywhere, semantic rather than presentational, and much easier to
write with. My hope is that by shipping the mdoc
reader in
Pandoc, I’ll help motivate some people to write manual pages for their
software that they can transform into other formats with the power of
Pandoc.
My first big
contribution to Pandoc was a writer (output format) for terminal
output formatted with ANSI escapes, including OSC 8 for clickable links.
Working on the ANSI writer required me to do some interesting stuff in
Pandoc’s doclayout
dependency, modifying it to support text ranges with styles applied and
resolving some performance and correctness regressions that cropped up
in my first attempts. Once I did that, though, the writer itself was
straightforward. Pandoc represents documents as a tree of block-level
and inline-level elements, and all a Pandoc writer really has to do is
recursively output the elements, making aesthetic decisions along the
way. A Pandoc writer doesn’t have to exploit every capability of the
target format, it just has to represent a Pandoc document well. My
mdoc
reader had to cover much more surface area: my job was
to parse the entire mdoc
language and coerce it into
Pandoc’s document model.
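For anyone who hasn’t poked at that document model: it’s defined in the pandoc-types package, and a document is literally a small tree of values. Here’s a trimmed-down illustration using just a few of the constructors:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Text.Pandoc.Definition

-- A tiny document built directly from the pandoc-types constructors:
-- a level-1 header followed by a paragraph containing some emphasis.
tinyDoc :: Pandoc
tinyDoc =
  Pandoc nullMeta
    [ Header 1 nullAttr [Str "NAME"]
    , Para [Str "hello,", Space, Emph [Str "world"]]
    ]
```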
This of course poses the question of what the mdoc
language even is, ontologically speaking. mdoc
is
one of the two roff macro packages commonly used to write Unix manuals,
alongside the earlier man. The mdoc troff macros first appeared in 4.4BSD, whence they were absorbed into various
divergent projects. The GNU roff (groff) project imported
the mdoc
macros from BSD in version 1.02; they were eventually
rewritten, with a couple of new features. In 2008, the mandoc (née mdocml) project began
reimplementing mdoc
parsing and rendering from scratch in
C—mandoc doesn’t rely on roff macros at all to interpret
mdoc.
So, today, mandoc and groff are the main relevant implementations of
mdoc.1 These two implementations try to
stay broadly compatible with each other, and the mdoc
language itself evolves minimally. Both projects document that language,
as mdoc(7) and groff_mdoc(7).
While it would’ve been conscientious of me to refer to both mandoc and
groff during my development process, I selfishly decided to use mandoc
as the exclusive reference implementation. Tracking against both
implementations might’ve turned up edge cases where they disagree on how
to render something, but I was determined not to expend excesses of
energy on edge cases to begin with.
Even after punting consensus with groff to the “someday” list, I ran into lacunae in mandoc’s documentation that I struggled to interpret. In some of these cases, and for some borderline-nonsensical markup found in mandoc’s excellent regression tests, I made judgment calls and accepted minor deltas from what mandoc does.
When you parse a markup language that humans write documents in,
there’s a lot of pressure on you to adhere to the robustness principle.
Browser engines devote (I assume) tens of thousands of lines of code to
dealing with malformed HTML. A browser will attempt to recover from
pretty much any nesting issue and render something. Mandoc,
likewise, will hardly ever (perhaps never) give up and die when you give
it malformed input. For instance, you can have bad nesting in quote-like
macros and mandoc will tell you it’s wrong (with mandoc -Tlint), but it will also do something more or less
reasonable with what you wrote and close out the runaway argument. I
decided early on in my work that I didn’t want to be that permissive, at
least not off the bat. If mandoc would issue a warning for a particular
combination of macros, I gave myself the privilege of declaring that
undefined behavior. This decision probably saved me a lot of time
scrutinizing how my reader handles markup unlikely to appear in
nature.
Pandoc has been around for what is starting to count as “a long time”
in the software world, and has evolved in various directions and grown
by accretion in that time. Of the readers (supported input formats)
created by its principal author, John MacFarlane, a couple of the most
recent ones (such as for Djot and Typst) delegate a lot of the work to
separate Haskell packages that are responsible for parsing the source
text into a format-native data type, and then the Pandoc reader “just”
translates that data into Pandoc’s document model. I found that
appealing because I thought it would be nice to have reusable code that
can represent the full richness of mdoc
and have that be
available for applications outside of Pandoc.
I ran into trouble with this plan pretty much instantaneously. I
hadn’t really programmed with parser combinators before, I didn’t know
what kind of structure I wanted to parse documents into, and I just had no
obvious way to get started. This was pretty demoralizing, and after
writing several dozen lines of code that did nothing and amounted to
nothing I concluded I wasn’t capable of doing it. The next day I decided
to focus on my actual end goal of adding an mdoc
reader to
Pandoc and work on it in that context, parsing directly to Pandoc’s
document model. This proved infinitely more approachable.
Most of Pandoc’s readers don’t have a separate tokenization step.
Lexing generally seems to be less useful to the programmer with Parsec-style parsing, and the same seems true of the slightly more formalized parsing expression grammars. It wound up being convenient to
tokenize mdoc before parsing it, though it’s possible I only did so
because I started my work by duplicating code from Pandoc’s
man
reader, which also has a lexer. I think lexing mdoc into MdocTokens did make my parsers a bit easier to write, because it took care of escaping and of distinguishing macro calls from literal arguments.
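To give a rough sense of what the lexer produces, the token type looks something like the sketch below; the actual MdocToken type in Pandoc’s source differs in its details, but the idea is that escapes are already resolved and macro names are already distinguished from everything else by the time the parser sees a token.

```haskell
import Data.Text (Text)

-- A simplified sketch of a token type for mdoc lexing, not the real
-- MdocToken from Pandoc's source: by this point escape sequences have been
-- resolved and macro names are distinct from literal arguments.
data MdocToken
  = Macro Text   -- a macro name on a macro line, e.g. "Nm" or "Fl"
  | Lit Text     -- a literal word or argument, already unescaped
  | Delim Text   -- an opening or closing delimiter such as "(" or ","
  | Eol          -- end of an input line
  deriving (Show, Eq)
```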
One definite advantage of having a lexing step was that I could
process roff
escape sequences before parsing any structure. Escaping for
roff-based languages was part of Pandoc’s existing roff lexer, which is
used by the man
reader. I didn’t want to reuse the roff
lexer wholesale, because of Reasons, but I knew it would be preposterous
to duplicate the escape handling, so I figured out a way to reuse the
escaping functions with my mdoc
token type as the output
and with some features effectively disabled. Polymorphism in Haskell
means you write a typeclass, and this typeclass had to come with some
associated type families so that the escaping functions could
return different token types depending on what lexer they were used
with, and those type families had to be injective type
families, and I don’t really remember what that means or how I
figured out that that was the Haskell thingy that I needed. It works
though! Let’s move on!
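(Before actually moving on: the general mechanism looks roughly like the following minimal sketch, with invented names rather than the actual typeclass from Pandoc’s roff lexer. The point of the injectivity annotation is that a function whose result type is Token lexer can still pin down which lexer, and therefore which instance, is meant.)

```haskell
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeFamilyDependencies #-}

import Data.Text (Text)

-- Each lexer type determines its token type; the "tok -> lexer" annotation
-- makes the family injective, so the token type also determines the lexer.
-- Without it, a call to emit would be ambiguous, because the class variable
-- appears only under the type family in emit's signature.
class RoffLikeLexer lexer where
  type Token lexer = tok | tok -> lexer
  emit :: Text -> Token lexer
```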
Pandoc’s built-in readers use Parsec, a monadic parser combinator library. This is a fancy way of saying it lets you write recursive descent parsers while keeping track of the input for you and making it easy to do arbitrary backtracking and lookahead. And “recursive descent” is a fancy way of saying “just going for it”. Parsec-style parsers don’t require you to formally state the grammar of your language, you can use as much extra state as you want, and code reuse is straightforward.
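As a flavor of what “just going for it” looks like in practice, here is a toy Parsec parser for a macro line; it’s illustrative only, and not how the reader itself is written.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Toy parser: match a line like ".Nm grep" and return the macro name and
-- its space-separated arguments.
macroLine :: Parser (String, [String])
macroLine = do
  _    <- char '.'
  name <- many1 letter
  args <- many (try (many1 (char ' ') *> many1 (noneOf " \n")))
  _    <- optional newline
  return (name, args)

-- parse macroLine "" ".Nm grep\n"  ==  Right ("Nm", ["grep"])
```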
Honestly I don’t think there’s anything of particular interest for me
to say about actually implementing the mdoc
parser in
Pandoc. A certain amount of wizard-contemplating-an-orb-type mystique
surrounds parsing. At the stage where I was considering writing a
standalone mdoc
module in Haskell, I spent some time
looking at the docs for alex and happy, which
are lex and yacc analogues for Haskell. While I roughly understood what
I was looking at, I still had the issue that they seemed like tools I
couldn’t easily use without knowing upfront what I was doing.
Parsec’s virtue is that I didn’t really have to know what I was doing
to get good results. I had to parse a lot of macros, and I got there by
starting with what was obvious (match a macro, then consume its
arguments, then output a Pandoc element) and gradually supporting
non-obvious things as I went along. One thing that I struggled to get in
my head, and had to debug from scratch several times in different
places, was that alternative parsers don’t backtrack by default if they
fail after consuming some input, and that you have to insert try if a parser and its alternatives potentially share a prefix.
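A minimal illustration of the pitfall, with invented alternatives rather than code from the reader:

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Parsing ".Nm" with `without` fails: string ".Nd" consumes '.' and 'N'
-- before failing, so <|> never tries the second alternative. Wrapping the
-- first alternative in `try` restores the backtracking.
without, withTry :: Parser String
without = string ".Nd" <|> string ".Nm"
withTry = try (string ".Nd") <|> string ".Nm"
```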
Another reason I could get away without exactly knowing what I was
doing is that I wasn’t parsing into a structure of my own design, but
into Pandoc’s AST, so there was a whole design task I didn’t have to
iterate on. I didn’t have to decide on or design a parsed representation of mdoc that made sense for a variety of applications; I just had to represent the mdoc source adequately for Pandoc.
Pandoc’s restricted AST for representing documents is one of its
virtues: it’s only because of the “narrow waist” of the Pandoc document
model that it can produce reasonable conversions between any source and
target format. The manual says “[b]ecause pandoc’s intermediate
representation of a document is less expressive than many of the formats
it converts between, one should not expect perfect conversions between
every format and every other.” This is very much the case with mdoc, which has rich semantics specialized to the task of
writing technical documentation for Unix utilities and C libraries. But
it’s possible (and common) for Pandoc readers to preserve a little extra richness from the source language using element attributes and
generic Div
and Span
elements. In the
mdoc
reader, I take a cue from mandoc’s HTML output mode,
as seen for example on OpenBSD’s
and Void Linux’s man page
sites. There are a lot of mdoc macros (e.g. Li, Cm, Ic) that get mapped to Pandoc Code elements, but with the original macro name added to the Code element’s class attribute. This is not presently
documented, but a lot of the specifics of how Pandoc decorates its
syntax tree are more or less fun bonuses for the attentive user.
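For example, a fragment like “.Cm install” comes out as roughly the following AST value; this is an illustration of the pattern rather than a guarantee of the exact attributes, and pandoc -f mdoc -t native will show you precisely what the reader emits.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Text.Pandoc.Definition

-- ".Cm install" as an inline Code element whose class list records the
-- originating macro; the identifier and key/value attributes are assumed
-- empty here.
cmInstall :: Inline
cmInstall = Code ("", ["Cm"], []) "install"
```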
Despite my long(ish) experience as a professional computer-toucher,
I’m not exceptionally confident with automated testing. In the
database-backed web application domain, I’ve never been able to use
things like mocks in a way that left me feeling like I was actually
testing something, as opposed to constructing an elaborate topology.
Testing pure code is much easier, and Pandoc has a large and varied test suite covering its various components. Relatively early in development I
found it easy to start populating a test suite for the mdoc
reader that compared the result of running the reader on a snippet of
code to the intended Pandoc AST. I snagged some test cases from mandoc’s
regression tests, others I contrived from my imagination or based on
particular bugs I had caught. Pandoc also has a bunch of tests called
“command tests” that compare the intended and actual output of running a
particular Pandoc command line; I wrote some of these for test inputs
that were more than a few lines.
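The shape of the snippet-to-AST checks can be sketched against Pandoc’s public API; the test suite itself uses its own helpers, so this is illustrative rather than the real test code.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import Text.Pandoc

-- Run the mdoc reader on a snippet and print the AST it produces; a test
-- would compare that AST against the one the snippet was intended to yield.
main :: IO ()
main = do
  let snippet :: Text
      snippet = ".Dd January 1, 2024\n.Dt EXAMPLE 1\n.Os\n\
                \.Sh NAME\n.Nm example\n.Nd example manual page\n"
  ast <- runIOorExplode $ readMdoc def snippet
  print ast
```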
Testing how macros parse in isolated cases is good, but this is a document language, and the real evidence of success is whether we can parse actual documents with it. In keeping with my scoping decisions, I
wanted a test corpus that would exercise my parser with as much
reasonable markup as possible and as little unreasonable markup as
possible. Naturally I decided to collect all the man pages written in
mdoc shipped in OpenBSD’s base system and use them as my test corpus.
Once I had implemented enough of mdoc
to parse simple
manual pages, I started running my reader on increasingly large subsets
of the corpus and redirecting stderr to files. Then I could grep
across all these outputs looking for common issues. Mostly this was
useful as an ad hoc way to work down the list of unimplemented macros,
but I also caught subtler issues this way, like parsing opening
delimiters at the end of certain macro lines. As I made progress, I had
the satisfaction of seeing the total number of skipped macros and parse
failures decrease. Ultimately, I decided I had covered enough ground
once I was seeing parse errors on just over a dozen of the 3500 man
pages I tested against.
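My actual workflow was shell redirection plus grep, but the same idea can be sketched against Pandoc’s public API; the corpus directory name below is made up.

```haskell
import Control.Monad (forM)
import qualified Data.Text.IO as TIO
import System.Directory (listDirectory)
import System.FilePath ((</>))
import Text.Pandoc

-- Run the reader over a directory of mdoc pages and list the files that
-- fail to parse outright. The "corpus" directory is hypothetical.
main :: IO ()
main = do
  let dir = "corpus"
  files <- listDirectory dir
  results <- forM files $ \f -> do
    txt <- TIO.readFile (dir </> f)
    res <- runIO $ readMdoc def txt
    return (f, res)
  mapM_ putStrLn [f | (f, Left _) <- results]
```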
When I shared my work on this in a group chat, a friend said “this is so much above my pay grade” and called it “real engineering shit”.2 And I don’t think that’s true! It took a lot of hours and for non-essential reasons I had to smear some less-elementary Haskell goo on things, but at no point did I have to ascend to a higher plane of parsing consciousness. My code probably has some special cases that could be folded back into general ones, and some general cases that are wrong, and some backtracking that could be eliminated, and certainly isn’t ideally organized for reading and maintenance. Which is not to say the code is bad or anything, or to bag on my own skills, but I didn’t bring any special expertise to the work. I just wanted to do it, had time to do it, and managed not to give up.
But saying “just” in that sentence obscures that motivation, time, and persistence are exactly the things that frequently elude me when I want to do a personal project. I only had the time to do this project because I was at most semi-employed during that time, and had no urgent need to get paying work. Persistence has been a problem for my whole adulthood; I hate to overburden the “ex-gifted-kid” cliché but I never had a need to “work hard” until college, when I eventually learned that in the end it’s my own standards that are hardest to meet. And a neat idea only suffices for a little bit of motivation, because it can be pretty satisfying to imagine what it would be like if I did something. I couldn’t say where the motivation to actually finish comes from, some occult psychological process, but I did finish, and the thing exists outside my imagination, and that’s pretty good.
If the ANSI writer and mdoc reader are useful to you and you want to give me a financial incentive to keep hacking on Pandoc, you can support me on GitHub Sponsors!
1. The heirloom doctools are a third line of development.
2. This same friend has been working on a synthesizer project running on an ESP32 with a lot of input and output hardware, and to me that is “real engineering shit” that I wouldn’t have the patience for. I lack the fine motor control required for soldering, to be honest.