I’ve been reading and playing with VMs for a couple months now. I’ve silently created my own Ruby VM in C, running YARV bytecode. But that was an excuse to better understand the internals of Ruby, since it doesn’t have a parser, emitter and GC yet. I might blog about some of my findings later but today I want to write about another experiment I’ve been working on.

LLVM is an awesome tool and I’ve been wanting to build something with it for a while. So I’ve set myself to create a little toy language and it turned out being a lot easier then I though it would be. Luckily for me Tom Bagby created Ruby bindings for LLVM so I could code in Ruby instead of some damn ugly and painful C++ (I hate C++).

Here’s how I created my lil toy compiler called Orange.

The Parser

I used Treetop to create the parser. Writing the Treetop grammar was the most time consuming task. Also, because Treetop can’t produce context sensitive grammar, you’re limited in the kind of syntax you provide.

My goal was to have a syntax as close to Ruby as possible, but without OO. Something like this:

def test()
  x = 1
  y = "ohaie"
  printf("%d, %s", x, y)
end

Through trial an error I ended up with this: grammar.tt.

The Nodes

The parser takes some text as input and output AST nodes, one for each grammar rule matching your input. Parsing a = 1 would produce those nodes:

Expression
  Assign
    Var (a)
    Lit (1)

The idea is to generate some machine code from those nodes. If all node knows how to generate code for itself, then generating code for the whole tree is just a matter of generating code for the top level object, Expression in this case.

Treetop provides a nice mechanism to subclass those nodes.

rule call
  func:var arglist <Call>
end

The <Call> part tells Treetop to instantiate a new Call class for this type of node.

class Call < Node
  def codegen(g)
    # ...
  end
end

Here I added my own codegen method that will do all the magic of generating the machine code.

class Expression < Node
  def codegen(g)
    statements.map { |s| s.codegen(g) }
  end
end

The Expression top level node will just call the codegen of each nodes it contain recursively. This is a very simple way to handle complex trees of nodes.

The Code Generator

The only thing missing is the part that does the actual work of generating machine code that can be run.

Since the Orange language is very close to the actual LLVM intermediate language (on purpose, to keep things simple), the generator serve as a way to abstract calls to LLVM-Ruby API. With a much more advanced language, the generator would have a lot more things to take care of.

For example, the Generator will translate a function call to this:

# Creates a new module to hold the code
module = LLVM::Module.new("orange")
# Creates the main function that will be called
main = module.get_or_insert_function("main", Type.function(INT, [INT, Type.pointer(PCHAR)]))
# Create a block of code to build machine code within
block = main.create_block.builder

# Find the function we're calling in the module
func = module.get_function("myfunc")
# Call the function
block.call(func, arg)
block.return(0.llvm)

This is somehow equivalent to the following C code:

int main (int argc, char const *argv[]) {
  myfunc(arg);
  return 0;
}

This will then generate magic machine code for you.

(If you want more info on how LLVM-Ruby works, go read the “blog”:http://llvmruby.org/wordpress-llvmruby/)

Putting All the Pieces Together

Now we got all the pieces: the parser takes your text and translates it to nodes, you pass the code generator to the nodes that generates machine code and finally you run that machine code.

generator = Orange::Generator.new
parser    = OrangeParser.new

# Parse some code and get a top level node
node = parser.parse(code)
# Pass the generator to get the machine code
node.codegen(generator)

# Magic!
generator.run

The cool thing with LLVM is that it can run as a JIT compiler, running stuff on the fly, but it can also compile and optimize code a-la GCC.

# JIT compile & run
orange example/simple.or

# Inspect LLVM bytecode
orange -i example/simple.or

# Compile to native code
orange -c=simple.o example/simple.or
llvm-ld simple.o -o simple
./simple