Building a CEDICT parser in Rust with Nom

The CEDICT format is a simple, Creative Commons-licensed file format for Chinese/English dictionaries. While Mandarin-only CEDICT parsers abound, there is basically no support for Cantonese jyutping in the English-speaking programming world. As someone who wanted to use Cantonese pronunciations in my programs, I was stuck at first. I considered adding jyutping support to an existing parser, but the example parser I found seemed difficult to extend in that way. I had also never hand-written a parser before, beyond some very simple tutorials.

I forget how I stumbled across it, but that’s when I found nom, a “parser combinator” library whose motto is “eating data byte by byte.” Reading the documentation, I found that I could actually understand the example code for parsing a hexadecimal color code.

nom, eating data byte by byte

What are parser combinators? I had never really looked into the term before this, assuming, as usual with topics in programming, that it was above me (this is the same cursed reason I avoided doing computer science in college). It turns out they’re a way to build parsers like Lego blocks, piece by piece, until the pieces carry the semantic meanings you want and can be manipulated as you please.

The basic building blocks of parser combinators are as simple as telling the computer to recognize (and separate out) alphabetic characters, whitespace, or numeric characters. They get more complicated when you combine them – e.g. in English you would say, “take as many alphabetic characters as possible, then a number, and we’ll call that a syllable. In this spot between the brackets there may be many syllables, so take as many syllables as you can until you encounter the end bracket.”

After nom faithfully eats and sorts bytes for me, I get to store them into meaningful data structures.

Enough pseudocode – here’s the Rust code (using nom 6.1.2) I used to parse CEDICT entries. We’re looking to parse the following format:

你好嗎 你好吗 [ni3 hao3 ma5] {nei5 hou2 maa1} /how are you?/ # CEDICT also supports comments

There are five sections delimited by spaces (plus comments):

  • Traditional Chinese characters
  • Simplified Chinese characters
  • Pinyin, itself delimited by square brackets
  • Jyutping, delimited by squiggly brackets, an addition to the CEDICT format by Cantonese.org
  • English language definitions delimited by forward slashes
  • …and comments, though I’m unsure if they’re supported at the end of the line as written. I’ll skip the comments for now.

While parser combinators are easy to write from bottom to top, it may be easiest to understand them top down – I’ll start at the highest level of abstraction and gradually drill down into the details. Another bonus of parser combinators – it’s a joy to read them afterward, even after a few months!
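Before the parser itself, here’s roughly what the target data structures look like. The field names come from the code below, but the String-based types are a sketch for illustration (owned Strings for simplicity, with both pronunciation fields optional since not every entry carries them):

```rust
// A single pronunciation unit, e.g. "ni" with tone "3".
#[derive(Debug, PartialEq)]
pub struct Syllable {
    pub pronunciation: String,
    pub tone: String,
}

impl Syllable {
    pub fn new(pronunciation: &str, tone: &str) -> Self {
        Syllable {
            pronunciation: pronunciation.into(),
            tone: tone.into(),
        }
    }
}

// One full dictionary entry, mirroring the five sections above.
#[derive(Debug, PartialEq)]
pub struct CedictEntry {
    pub traditional: String,
    pub simplified: String,
    pub pinyin: Option<Vec<Syllable>>,
    pub jyutping: Option<Vec<Syllable>>,
    pub definitions: Vec<String>,
}
```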

So here’s the parser for a CedictEntry, the Rust struct I’m using to store data about a single entry’s various parts:

fn cedict_entry(i: &str) -> IResult<&str, CedictEntry> {
    let (i, traditional) = not_whitespace(i)?;
    let (i, _) = character::complete::space1(i)?;
    let (i, simplified) = not_whitespace(i)?;
    let (i, _) = character::complete::space1(i)?;
    let (i, pinyin) = pinyin(i)?;
    let (i, _) = character::complete::space1(i)?;
    let (i, jyutping) = combinator::opt(jyutping)(i)?;
    let (i, _) = character::complete::space0(i)?;
    let (i, definitions) = definitions(i)?;
    Ok((
        i,
        CedictEntry {
            traditional: traditional.into(),
            simplified: simplified.into(),
            pinyin,
            jyutping,
            definitions,
        },
    ))
}

Let’s start with the function signature:

fn cedict_entry(i: &str) -> IResult<&str, CedictEntry>

This is a function definition in Rust declaring a function cedict_entry that takes an input i of type &str and returns nom’s IResult type, which itself contains a &str and a CedictEntry if all goes well. If all doesn’t go well, it will return an error, as you can see in IResult's type definition. The &str types here matter because nom aims to be zero-copy for performance reasons – a &str is an immutable borrowed slice that doesn’t allocate, serving only as a read-only window into data already loaded into memory.

The main body of the function is almost readable as pseudo-code:

let (i, traditional) = not_whitespace(i)?;
let (i, _) = character::complete::space1(i)?;
let (i, simplified) = not_whitespace(i)?;
let (i, _) = character::complete::space1(i)?;
let (i, pinyin) = pinyin(i)?;
let (i, _) = character::complete::space1(i)?;
let (i, jyutping) = combinator::opt(jyutping)(i)?;
let (i, _) = character::complete::space0(i)?;
let (i, definitions) = definitions(i)?;

This takes advantage of a few Rust features that come together nicely. First, let statements can do destructuring assignment. So if a function returns a tuple (1, 2), you can let (a, b) = (1, 2) and a will be equal to 1 and b will be equal to 2.

Second, Rust emphasizes safety, and these parsers may fail, so we have to handle that. In Rust, expected failures are captured in the type system, specifically the Result type, which allows you to either extract the value or handle the error without the program blowing up and panicking. Parsers in nom return an IResult, which is a specialized Result. When a parser fails, nom defaults to returning its own error type, which carries a nom::error::ErrorKind. When it succeeds, a nom parser returns a tuple of two elements: the first is the remaining input and the second is the result of the parse. So if you were parsing "abc" with a parser combinator that looked for the character "a", for example, it might look something like this:

let letter_a_parser = bytes::complete::tag("a");
let result: IResult<&str, &str> = letter_a_parser("abc");
// result == Ok(("bc", "a"))

But what happens in this code when one of these function calls returns an error? Is it just stored in the variables declared in the let statement? That’s where the question mark at the end of the line comes into play, Rust’s third salient feature. The question mark is a shortcut for saying, “if the Result of an expression is a valid value, the expression evaluates as that value. Otherwise if the Result is an error, then have not just the expression, but the entire calling function return the error.”

That way we can have our let tuples and our errors, too. 🍰

Finally, we have the code that actually stores the captured data in a struct:

Ok((
    i,
    CedictEntry {
        traditional: traditional.into(),
        simplified: simplified.into(),
        pinyin,
        jyutping,
        definitions,
    },
))

Ok is a variant of the Result enum, indicating that the function executed properly. The function returns an Ok containing the rest of the input, i, so that the next parser can have a go at the input, and a CedictEntry representing, well, the parsed Cedict entry. If anything does go wrong, it will return an Err, which will need to be handled at the site of the function call.

Now, a few of the parsers I used in the body of the function were provided by nom: character::complete::space1 and character::complete::space0, for instance. space1 matches one or more whitespace characters (spaces or tabs), while space0 matches zero or more. not_whitespace, meanwhile, is a small helper of my own that matches a run of characters containing no whitespace.

But a few of these are custom parsers that I wrote from nom’s more basic building blocks: pinyin, jyutping, and definitions. Let’s take a look at those.

The pinyin / jyutping parsers

1️⃣ 你好嗎 2️⃣ 你好吗 3️⃣ [ni3 hao3 ma5] 4️⃣ {nei5 hou2 maa1} 5️⃣ /how are you?/

We managed to get by with the not_whitespace parser for 1️⃣ and 2️⃣ (as labeled above). Basically, we expect the start of the line to be characters, then a space, then more characters of a different sort. However, we can’t simply limit the match to Han Unicode characters, because some entries in the dictionary contain alphanumeric characters. In the interest of simplicity, I opted for a parser that simply matches any run of non-whitespace characters for the first two sections.

For number 3️⃣, however, we’re going to need to do things a little differently. I’d like to be able to retain specific information about each syllable instead of just grabbing the entire string like I did with the Chinese characters, so the parser is more complicated. Here’s the parser for pinyin:

fn pinyin(i: &str) -> IResult<&str, Option<Vec<Syllable>>> {
    // Here we split the pronunciations out from their square bracket delimiters:
    let (rest, (_, pronunciations, _)) = sequence::tuple((
        bytes::complete::tag("["),
        combinator::opt(bytes::complete::is_not("]")),
        bytes::complete::tag("]"),
    ))(i)?;
    // Here we handle the case where there's nothing between the brackets
    if let Some(pronunciations) = pronunciations {
        let (_, syllables) = syllables(pronunciations)?;

        Ok((rest, Some(syllables)))
    } else {
        Ok((rest, None))
    }
}

Let’s talk about this in detail, starting again with the function signature:

fn pinyin(i: &str) -> IResult<&str, Option<Vec<Syllable>>>

This takes a &str and returns an IResult<&str, Option<Vec<Syllable>>>. On success, the IResult contains a tuple of (&str, Option<Vec<Syllable>>). In the case where we have parsed a valid line such as the one above, the &str is a reference to the remainder of the input – in this case, parts 4️⃣ and 5️⃣ as shown below:

" {nei5 hou2 maa1} /how are you?/"

And the Option<Vec<Syllable>> resulting from the pinyin parser would look like this:

Some([
    Syllable {
        pronunciation: "ni",
        tone: "3",
    },
    Syllable {
        pronunciation: "hao",
        tone: "3",
    },
    Syllable {
        pronunciation: "ma",
        tone: "5",
    },
])

(Note that if there were no valid syllables between the brackets, the parser would return None instead of a Some containing a Vec. Such is the nature of the Option enum.)

First, we need to separate the pronunciations from their brackets, which is what this chunk of code does:

let 0️⃣ (rest, (_, pronunciations, _)) = sequence::tuple((
    1️⃣ bytes::complete::tag("["),
    2️⃣ combinator::opt(bytes::complete::is_not("]")),
    3️⃣ bytes::complete::tag("]"),
))(i)?;

I’m using sequence::tuple, a built-in nom parser that takes a tuple of parser combinators and returns the rest of the input alongside an inner tuple of all of the components we asked for. The tuple we supplied included:

  • 1️⃣ a parser for the opening bracket of the pinyin,
  • 2️⃣ a parser for any bytes that are not equal to "]", made optional via nom’s built-in parser, combinator::opt, and
  • 3️⃣ the ending bracket.

We expect the output in the case of a successful parse to contain the rest of the input string, 0️⃣, followed by a tuple of (1️⃣, 2️⃣, 3️⃣), of which we only retain 2️⃣. In Rust, prefixing a variable with an underscore tells the compiler we don’t intend to use it, which suppresses the unused-variable warning that a non-underscored, unused variable would otherwise trigger at compile time.

By the end of this, we should have two variables declared: rest, which is the rest of the input, and pronunciations, which is whatever was contained between the two square brackets – in this case "ni3 hao3 ma5". Now to extract the individual syllables.

if let Some(pronunciations) = pronunciations {
    let (_, syllables) = syllables(pronunciations)?;

    Ok((rest, Some(syllables)))
} else {
    Ok((rest, None))
}

We then check whether pronunciations is None, indicating that the optional inner parser didn’t match anything, or Some, which contains the matched text. If the result is Some(pronunciations), then we punt it down the line to the syllables parser combinator:

/// takes a series of possibly undelimited syllables such as "ni3hao3" and returns a Vec of Syllables
fn syllables(i: &str) -> IResult<&str, Vec<Syllable>> {
    multi::many0(syllable)(i)
}
fn syllable(i: &str) -> IResult<&str, Syllable> {
    let (rest, (_, pronunciation, tone)) = sequence::tuple((
        character::complete::space0,
        character::complete::alpha1,
        character::complete::digit0,
    ))(i)?;
    Ok((rest, Syllable::new(pronunciation, tone)))
}

Here, I’ve broken syllables down into two separate functions, but we could easily inline the syllable parser. We start with multi::many0, which applies the syllable parser zero or more times. Within the syllable parser we see another sequence::tuple parser, which we supply with space0, alpha1, and digit0. Translated into English, this matches zero or more spaces, then one or more alphabetic ASCII characters, then zero or more ASCII digits – representing any leading whitespace, the pronunciation, and finally the tone (see more about tones in Mandarin here).

If that succeeds, we return the rest of the input to the multi::many0 parser, along with a new Syllable struct populated with the pronunciation and the tone; many0 then repeats the process until the syllable parser stops matching. Finally, we’re left with our Vec of Syllable structs, as captured in this unit test:

#[test]
fn test_parse_syllables() {
    assert_eq!(
        syllables("ni3hao3"),
        Ok((
            "",
            vec![
                Syllable::new("ni", "3"),
                Syllable::new("hao", "3")
            ]
        ))
    );
}

Thankfully, we can do much the same thing to handle the Cantonese jyutping pronunciations, just by changing the delimiters:

fn jyutping(i: &str) -> IResult<&str, Vec<Syllable>> {
    let (rest, pronunciations) = sequence::delimited(
        bytes::complete::tag("{"),
        bytes::complete::is_not("}"),
        bytes::complete::tag("}"),
    )(i)?;
    let (_, syllables) = syllables(pronunciations)?;
    Ok((rest, syllables))
}

Hopefully that helps explain this particular use case of nom in Rust. If I’ve made any mistakes in my explanations, please let me know – I’m new both to parsers and to Rust. Even so, I found nom very approachable, though it was a bit mystifying at first which parsers to use.

On that note, nom divides its parser combinators into a few major categories. There’s more to nom, and I’ve only just begun to explore it, but my mental map of the combinators is: bytes versus characters, complete versus streaming, collections, and control flow. I’ve mapped them out below:

  • bytes vs characters – literally represented by nom::bytes vs nom::character. For when you need to parse at the byte level vs characters. The distinction can be a little fuzzy to me.
  • complete vs streaming – the above modules each split into complete and streaming versions for when you have to handle streaming input versus when you have the entire input data at your disposal.
  • collections – the nom::multi module will give you parsers for handling repetitive data. We used this module to handle multiple syllables in pinyin and jyutping.
  • control flow – I think of nom::branch and nom::combinator as control flow combinators.

Sometimes the hardest thing is finding the right parser combinator. For that, there is now a handy guide: https://github.com/Geal/nom/blob/main/doc/choosing_a_combinator.md

Note that there are many more modules I haven’t used that look like they would come in handy, such as nom::number for parsing numbers – and the nom recipes document, which might have saved me a good amount of time!

Happy parsing!
