Contributing an mdoc reader to Pandoc
Pandoc is my favorite piece of
software. Unix manual pages are my favorite form of software
documentation. My favorite manual pages are from OpenBSD, and they’re
written in the mdoc
macro language. So I contributed
support for mdoc
input to Pandoc. I sat down one day in
front of a blank page, typed a few things, and decided it was probably
beyond my capabilities to work on. Then the next day I started over, and I pretty much didn’t stop until I finished. The mdoc
reader shipped with Pandoc 3.6 on December 7, 2024. This is how I got
there.
The nature of this essay is that I have to talk about three distinct things with names that end in “doc”, so here is a quick reference you can jump back to if you aren’t already familiar with them all:
- Pandoc is a program that can convert between several dozen markup languages and document formats.
- mdoc is a markup language designed for marking up manual pages, with semantic markup specific to that domain.
- mandoc is a program that formats manual pages written in the mdoc and man languages for display in various output formats.
Nobody’s confused, right?
(btw there’s a feed for these articles now if you want that)
There are a few good reasons to add mdoc
support to
Pandoc, beyond it being an interesting thing for me to work on. People
like to single-source their software documentation to avoid duplication
and drift, and Pandoc is a helpful tool for that given its wide range of
supported input and output formats. I believe that the manual page is
the best documentation format for software on Unix-like systems and that
mdoc
is the superior macro language for manual pages. Even
when newly-written software does provide manual pages (which is less
common than I would like), they’re often still written in the
man
language, despite mdoc
being supported
everywhere, semantic rather than presentational, and much easier to
write with. My hope is that by shipping the mdoc
reader in
Pandoc, I’ll help motivate some people to write manual pages for their
software that they can transform into other formats with the power of
Pandoc.
My first big
contribution to Pandoc was a writer (output format) for terminal
output formatted with ANSI escapes, including OSC 8 for clickable links.
Working on the ANSI writer required me to do some interesting stuff in
Pandoc’s doclayout
dependency, modifying it to support text ranges with styles applied and
resolving some performance and correctness regressions that cropped up
in my first attempts. Once I did that, though, the writer itself was
straightforward. Pandoc represents documents as a tree of block-level
and inline-level elements, and all a Pandoc writer really has to do is
recursively output the elements, making aesthetic decisions along the
way. A Pandoc writer doesn’t have to exploit every capability of the
target format, it just has to represent a Pandoc document well. My
mdoc
reader had to cover much more surface area: my job was
to parse the entire mdoc
language and coerce it into
Pandoc’s document model.
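For anyone who hasn’t poked at that document model: it’s defined in the pandoc-types package, and a document is literally a small tree of values. Here’s a trimmed-down illustration using just a few of the constructors:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Text.Pandoc.Definition

-- A tiny document built directly from the pandoc-types constructors:
-- a level-1 header followed by a paragraph containing some emphasis.
tinyDoc :: Pandoc
tinyDoc =
  Pandoc nullMeta
    [ Header 1 nullAttr [Str "NAME"]
    , Para [Str "hello,", Space, Emph [Str "world"]]
    ]
```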
This of course poses the question of what the mdoc
language even is, ontologically speaking. mdoc
is
one of the two roff macro packages commonly used to write Unix manuals,
alongside the earlier man. The mdoc troff macros first appeared in 4.4BSD, whence they were absorbed into various
divergent projects. The GNU roff (groff) project imported
the mdoc
macros from BSD in version 1.02; they were eventually
rewritten, with a couple of new features. In 2008, the mandoc (née mdocml) project began
reimplementing mdoc
parsing and rendering from scratch in
C—mandoc doesn’t rely on roff macros at all to interpret
mdoc.
So, today, mandoc and groff are the main relevant implementations of
mdoc.1 These two implementations try to
stay broadly compatible with each other, and the mdoc
language itself evolves minimally. Both projects document that language,
as mdoc(7) and groff_mdoc(7).
While it would’ve been conscientious of me to refer to both mandoc and
groff during my development process, I selfishly decided to use mandoc
as the exclusive reference implementation. Tracking against both
implementations might’ve turned up edge cases where they disagree on how
to render something, but I was determined not to expend excesses of
energy on edge cases to begin with.
Even after punting consensus with groff to the “someday” list, I ran into lacunae in mandoc’s documentation that I struggled to interpret. In some of these cases, and for some borderline-nonsensical markup found in mandoc’s excellent regression tests, I made judgment calls and accepted minor deltas from what mandoc does.
When you parse a markup language that humans write documents in,
there’s a lot of pressure on you to adhere to the robustness principle.
Browser engines devote (I assume) tens of thousands of lines of code to
dealing with malformed HTML. A browser will attempt to recover from
pretty much any nesting issue and render something. Mandoc,
likewise, will hardly ever (perhaps never) give up and die when you give
it malformed input. For instance, you can have bad nesting in quote-like
macros and mandoc will tell you it’s wrong (with mandoc -Tlint), but it will also do something more or less
reasonable with what you wrote and close out the runaway argument. I
decided early on in my work that I didn’t want to be that permissive, at
least not off the bat. If mandoc would issue a warning for a particular
combination of macros, I gave myself the privilege of declaring that
undefined behavior. This decision probably saved me a lot of time
scrutinizing how my reader handles markup unlikely to appear in
nature.
Pandoc has been around for what is starting to count as “a long time”
in the software world, and has evolved in various directions and grown
by accretion in that time. Of the readers (supported input formats)
created by its principal author, John MacFarlane, a couple of the most
recent ones (such as for Djot and Typst) delegate a lot of the work to
separate Haskell packages that are responsible for parsing the source
text into a format-native data type, and then the Pandoc reader “just”
translates that data into Pandoc’s document model. I found that
appealing because I thought it would be nice to have reusable code that
can represent the full richness of mdoc
and have that be
available for applications outside of Pandoc.
I ran into trouble with this plan pretty much instantaneously. I
hadn’t really programmed with parser combinators before, I didn’t know
what kind of structure I wanted to parse documents into, and I just had no
obvious way to get started. This was pretty demoralizing, and after
writing several dozen lines of code that did nothing and amounted to
nothing I concluded I wasn’t capable of doing it. The next day I decided
to focus on my actual end goal of adding an mdoc
reader to
Pandoc and work on it in that context, parsing directly to Pandoc’s
document model. This proved infinitely more approachable.
Most of Pandoc’s readers don’t have a separate tokenization step.
Lexing generally seems to be less useful to the programmer with Parsec-style parsing, and the same seems true of the slightly more formalized parsing expression grammars. It wound up being convenient to
tokenize mdoc before parsing it, though it’s possible I only did so
because I started my work by duplicating code from Pandoc’s
man
reader, which also has a lexer. I think lexing mdoc into MdocTokens did make my parsers a bit easier to write, because it took care of escaping and of distinguishing macro calls from literal arguments.
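To give a rough sense of what the lexer produces, the token type looks something like the sketch below; the actual MdocToken type in Pandoc’s source differs in its details, but the idea is that escapes are already resolved and macro names are already distinguished from everything else by the time the parser sees a token.

```haskell
import Data.Text (Text)

-- A simplified sketch of a token type for mdoc lexing, not the real
-- MdocToken from Pandoc's source: by this point escape sequences have been
-- resolved and macro names are distinct from literal arguments.
data MdocToken
  = Macro Text   -- a macro name on a macro line, e.g. "Nm" or "Fl"
  | Lit Text     -- a literal word or argument, already unescaped
  | Delim Text   -- an opening or closing delimiter such as "(" or ","
  | Eol          -- end of an input line
  deriving (Show, Eq)
```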
One definite advantage of having a lexing step was that I could
process roff
escape sequences before parsing any structure. Escaping for
roff-based languages was part of Pandoc’s existing roff lexer, which is
used by the man
reader. I didn’t want to reuse the roff
lexer wholesale, because of Reasons, but I knew it would be preposterous
to duplicate the escape handling, so I figured out a way to reuse the
escaping functions with my mdoc
token type as the output
and with some features effectively disabled. Polymorphism in Haskell
means you write a typeclass, and this typeclass had to come with some
associated type families so that the escaping functions could
return different token types depending on what lexer they were used
with, and those type families had to be injective type
families, and I don’t really remember what that means or how I
figured out that that was the Haskell thingy that I needed. It works
though! Let’s move on!
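(Before actually moving on: the general mechanism looks roughly like the following minimal sketch, with invented names rather than the actual typeclass from Pandoc’s roff lexer. The point of the injectivity annotation is that a function whose result type is Token lexer can still pin down which lexer, and therefore which instance, is meant.)

```haskell
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeFamilyDependencies #-}

import Data.Text (Text)

-- Each lexer type determines its token type; the "tok -> lexer" annotation
-- makes the family injective, so the token type also determines the lexer.
-- Without it, a call to emit would be ambiguous, because the class variable
-- appears only under the type family in emit's signature.
class RoffLikeLexer lexer where
  type Token lexer = tok | tok -> lexer
  emit :: Text -> Token lexer
```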
Pandoc’s built-in readers use Parsec, a monadic parser combinator library. This is a fancy way of saying it lets you write recursive descent parsers while keeping track of the input for you and making it easy to do arbitrary backtracking and lookahead. And “recursive descent” is a fancy way of saying “just going for it”. Parsec-style parsers don’t require you to formally state the grammar of your language, you can use as much extra state as you want, and code reuse is straightforward.
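As a flavor of what “just going for it” looks like in practice, here is a toy Parsec parser for a macro line; it’s illustrative only, and not how the reader itself is written.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Toy parser: match a line like ".Nm grep" and return the macro name and
-- its space-separated arguments.
macroLine :: Parser (String, [String])
macroLine = do
  _    <- char '.'
  name <- many1 letter
  args <- many (try (many1 (char ' ') *> many1 (noneOf " \n")))
  _    <- optional newline
  return (name, args)

-- parse macroLine "" ".Nm grep\n"  ==  Right ("Nm", ["grep"])
```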
Honestly I don’t think there’s anything of particular interest for me
to say about actually implementing the mdoc
parser in
Pandoc. A certain amount of wizard-contemplating-an-orb-type mystique
surrounds parsing. At the stage where I was considering writing a
standalone mdoc
module in Haskell, I spent some time
looking at the docs for alex and happy, which
are lex and yacc analogues for Haskell. While I roughly understood what
I was looking at, I still had the issue that they seemed like tools I
couldn’t easily use without knowing upfront what I was doing.
Parsec’s virtue is that I didn’t really have to know what I was doing
to get good results. I had to parse a lot of macros, and I got there by
starting with what was obvious (match a macro, then consume its
arguments, then output a Pandoc element) and gradually supporting
non-obvious things as I went along. One thing that I struggled to get in
my head, and had to debug from scratch several times in different
places, was that alternative parsers don’t backtrack by default if they
fail after consuming some input, and that you have to insert try if a parser and its alternatives potentially share a prefix.
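A minimal illustration of the pitfall, with invented alternatives rather than code from the reader:

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- Parsing ".Nm" with `without` fails: string ".Nd" consumes '.' and 'N'
-- before failing, so <|> never tries the second alternative. Wrapping the
-- first alternative in `try` restores the backtracking.
without, withTry :: Parser String
without = string ".Nd" <|> string ".Nm"
withTry = try (string ".Nd") <|> string ".Nm"
```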
Another reason I could get away without exactly knowing what I was
doing is that I wasn’t parsing into a structure of my own design, but
into Pandoc’s AST, so there was a whole design task I didn’t have to
iterate on. I didn’t have to decide on or design a parsed representation of mdoc that made sense for a variety of applications; I just had to represent the mdoc source adequately for Pandoc.
Pandoc’s restricted AST for representing documents is one of its
virtues: it’s only because of the “narrow waist” of the Pandoc document
model that it can produce reasonable conversions between any source and
target format. The manual says “[b]ecause pandoc’s intermediate
representation of a document is less expressive than many of the formats
it converts between, one should not expect perfect conversions between
every format and every other.” This is very much the case with mdoc, which has rich semantics specialized to the task of
writing technical documentation for Unix utilities and C libraries. But
it’s possible (and common) for Pandoc readers to preserve a little extra richness from the source language using element attributes and
generic Div
and Span
elements. In the
mdoc
reader, I take a cue from mandoc’s HTML output mode,
as seen for example on OpenBSD’s
and Void Linux’s man page
sites. There are a lot of mdoc macros (e.g. Li, Cm, Ic) that get mapped to Pandoc Code elements, but with the original macro name added to the Code element’s class attribute. This is not presently
documented, but a lot of the specifics of how Pandoc decorates its
syntax tree are more or less fun bonuses for the attentive user.
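For example, a fragment like “.Cm install” comes out as roughly the following AST value; this is an illustration of the pattern rather than a guarantee of the exact attributes, and pandoc -f mdoc -t native will show you precisely what the reader emits.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Text.Pandoc.Definition

-- ".Cm install" as an inline Code element whose class list records the
-- originating macro; the identifier and key/value attributes are assumed
-- empty here.
cmInstall :: Inline
cmInstall = Code ("", ["Cm"], []) "install"
```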
Despite my long(ish) experience as a professional computer-toucher,
I’m not exceptionally confident with automated testing. In the
database-backed web application domain, I’ve never been able to use
things like mocks in a way that left me feeling like I was actually
testing something, as opposed to constructing an elaborate topology.
Testing pure code is much easier, and Pandoc has a large and varied test suite covering its various components. Relatively early in development I
found it easy to start populating a test suite for the mdoc
reader that compared the result of running the reader on a snippet of
code to the intended Pandoc AST. I snagged some test cases from mandoc’s
regression tests, others I contrived from my imagination or based on
particular bugs I had caught. Pandoc also has a bunch of tests called
“command tests” that compare the intended and actual output of running a
particular Pandoc command line; I wrote some of these for test inputs
that were more than a few lines.
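The shape of the snippet-to-AST checks can be sketched against Pandoc’s public API; the test suite itself uses its own helpers, so this is illustrative rather than the real test code.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import Text.Pandoc

-- Run the mdoc reader on a snippet and print the AST it produces; a test
-- would compare that AST against the one the snippet was intended to yield.
main :: IO ()
main = do
  let snippet :: Text
      snippet = ".Dd January 1, 2024\n.Dt EXAMPLE 1\n.Os\n\
                \.Sh NAME\n.Nm example\n.Nd example manual page\n"
  ast <- runIOorExplode $ readMdoc def snippet
  print ast
```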
Testing how macros parse in isolated cases is good, but this is a document language, and the real evidence of success is whether we can parse actual documents with it. In keeping with my scoping decisions, I
wanted a test corpus that would exercise my parser with as much
reasonable markup as possible and as little unreasonable markup as
possible. Naturally I decided to collect all the man pages written in
mdoc shipped in OpenBSD’s base system and use them as my test corpus.
Once I had implemented enough of mdoc
to parse simple
manual pages, I started running my reader on increasingly large subsets
of the corpus and redirecting stderr to files. Then I could grep
across all these outputs looking for common issues. Mostly this was
useful as an ad hoc way to work down the list of unimplemented macros,
but I also caught subtler issues this way, like parsing opening
delimiters at the end of certain macro lines. As I made progress, I had
the satisfaction of seeing the total number of skipped macros and parse
failures decrease. Ultimately, I decided I had covered enough ground
once I was seeing parse errors on just over a dozen of the 3500 man
pages I tested against.
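My actual workflow was shell redirection plus grep, but the same idea can be sketched against Pandoc’s public API; the corpus directory name below is made up.

```haskell
import Control.Monad (forM)
import qualified Data.Text.IO as TIO
import System.Directory (listDirectory)
import System.FilePath ((</>))
import Text.Pandoc

-- Run the reader over a directory of mdoc pages and list the files that
-- fail to parse outright. The "corpus" directory is hypothetical.
main :: IO ()
main = do
  let dir = "corpus"
  files <- listDirectory dir
  results <- forM files $ \f -> do
    txt <- TIO.readFile (dir </> f)
    res <- runIO $ readMdoc def txt
    return (f, res)
  mapM_ putStrLn [f | (f, Left _) <- results]
```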
When I shared my work on this in a group chat, a friend said “this is so much above my pay grade” and called it “real engineering shit”.2 And I don’t think that’s true! It took a lot of hours and for non-essential reasons I had to smear some less-elementary Haskell goo on things, but at no point did I have to ascend to a higher plane of parsing consciousness. My code probably has some special cases that could be folded back into general ones, and some general cases that are wrong, and some backtracking that could be eliminated, and certainly isn’t ideally organized for reading and maintenance. Which is not to say the code is bad or anything, or to bag on my own skills, but I didn’t bring any special expertise to the work. I just wanted to do it, had time to do it, and managed not to give up.
But saying “just” in that sentence obscures that motivation, time, and persistence are exactly the things that frequently elude me when I want to do a personal project. I only had the time to do this project because I was at most semi-employed during that time, and had no urgent need to get paying work. Persistence has been a problem for my whole adulthood; I hate to overburden the “ex-gifted-kid” cliché but I never had a need to “work hard” until college, when I eventually learned that in the end it’s my own standards that are hardest to meet. And a neat idea only suffices for a little bit of motivation, because it can be pretty satisfying to imagine what it would be like if I did something. I couldn’t say where the motivation to actually finish comes from, some occult psychological process, but I did finish, and the thing exists outside my imagination, and that’s pretty good.
If the ANSI writer and mdoc reader are useful to you and you want to give me a financial incentive to keep hacking on Pandoc, you can support me on GitHub Sponsors!
1. The heirloom doctools are a third line of development.
2. This same friend has been working on a synthesizer project running on an ESP32 with a lot of input and output hardware, and to me that is “real engineering shit” that I wouldn’t have the patience for. I lack the fine motor control required for soldering, to be honest.