Writing a Tokenizer with Iterator and Peekable

In this tutorial we will write a low-level tokenizer for tokenizing simple math expressions like “12+34*56” using Rust’s Iterator and Peekable traits.

First, we need to declare some data structures for storing the results. Rust only has two types of data structure – struct and enum. If you’re coming from a Java or Scala background you would possibly be wanting classes with an object-oriented hierarchy, but enums are the way to go here.

Rust enumerations are a great way of modeling data structures and can be nested and recursive with the help of Box and Vec. However, in this example we will keep things simple and avoid recursion. We’ll tackle that in part two of this tutorial where we will build a parser for math expressions.

We will use the #[derive(…)] syntax to implement the Debug and PartialEq traits for these enumerations to make it easy to write unit tests.

We now need a tokenize function that can take a String as input and produce a Vec as output. There are two main approaches for doing this in Rust – we can write a standalone function that accepts a String parameter or we can define a trait and implement the trait directly on the String class.

The latter is generally the better approach since it ties functionality to a particular type, and if we want to be able tokenize other types in the future we can implement the trait for those types too.

Here is our Tokenizer trait definition:

We now use the following syntax to implement the trait for the String type:

We can get an iterator over the chars in the string by calling the chars() method, but we want the ability to peek at the next character without consuming it, so we call peekable() on the iterator to give us a Peekable trait.

We also need to prepare a vector to store the tokens:

Next, we need to loop until all characters have been processed. Here’s the starting code for this.

There are some important things to explain here. Firstly, the loop keyword is equivalent to while(true) in other languages and will loop until hitting a break or return statement (or a panic).

Next, we’re performing pattern matching on the result of it.peek(). Because this method returns an Option<&T> (an optional reference, rather than an optional value) we need to match with Some(&ch) rather than Some(ch) otherwise we start running into lifetime issues. If peek() returns Some(&ch) then we can start building a specific token and if it peek() returns None we can quit the loop since there are no characters left to process.

Here is the full implementation of the tokenize() function:

A design goal here is to keep the tokenizing code as functional as possible, so we make use of a function named my_take_while to iterate over the string collecting characters for as long as the supplied closure returns true. This is very similar to the take_while() method on Iterator with one major difference – our version only consumes characters that match, whereas the Iterator implementation consumes one extra character that does not match, making it unsuitable for our purposes.

When I started writing this tutorial I had hoped it would be much simpler and we’re already getting into closures and functional programming. However, if you stumbled across this because you are trying to implement a tokenizer in Rust, hopefully it will give you some good ideas.

The full source code for this example is available in github here:

https://github.com/KeepCalmAndLearnRust/pratt-parser

Part two of this tutorial is now available: http://keepcalmandlearnrust.com/2016/08/pratt-parser-in-rust/

Leave a Reply

Your email address will not be published. Required fields are marked *