Mike Schaeffer's Blog

Articles with tag: lisp
April 16, 2009

I recently was asked why I like Lisp. For me, it boils down to the fact that Lisp makes it easy to control when your code is evaluated. Most languages only let you evaluate code at runtime. There are ways around this (C++ templates, cpp, code generation, etc.), but they all have severe limitations. In contrast, Lisp makes it easy to run actual Lisp code at compile time (macros) or even read time (reader macros). Combine that with the fact that Lisp code is pretty easy to manipulate with Lisp itself, and it becomes much easier to do things that are usually restricted to language developers, which can be quite a force multiplier. Just to illustrate, most of the language features that C# has over Java (LINQ, properties, closures, lambda syntax, yield return etc.) could be added to Java by 'ordinary developers' if Java somehow had these things I like so much about Lisp.

But it doesn't...
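
To make the 'run code at compile time' point a bit more concrete, here's a minimal sketch written in the defmacro style used later on this page (the macro name is made up purely for illustration). The body of a macro runs when the call site is expanded, so the arithmetic below is done by the compiler, and the compiled code only ever sees a constant:

;; Illustrative sketch: the macro body runs at expansion time, so the
;; compiler evaluates (+ 1 2) and splices the result into its output.
(defmacro (three)
  (+ 1 2))

(define (f)
  (three))   ; compiled as if it had been written (define (f) 3)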

April 10, 2008

A few months ago, I ran into a problem with a macro that seriously changed my opinion on how macros should be used. It all comes down to the fact that macro expansions are incorporated into compiler output. Two pieces of code that look nicely decoupled in the source text can end up very entwined with each other once they are compiled.

To illustrate, I'll use the macro in question: something I once used to accept a sort of simulated 'multiple return value' in a dialect of Scheme. This is a low-level example from my hobby work, but the principle applies equally well to other uses of macros.

(defmacro (values-bind form vars . body)
  (with-gensyms (form-rv-sym)
    `(let ((,form-rv-sym ,form))
       (list-let ,vars (if (%values-tuple? ,form-rv-sym)
                           (slot-ref ,form-rv-sym 'v)
                           (list ,form-rv-sym))
         ,@body))))

This macro expands code like this:

(values-bind (returns-2-args 'foo) (arg-1 arg-2)
   (+ arg-1 arg-2))

Into code that looks like this:

(let ((#:form-rv-sym-69@00beeec4 (returns-2-args 'foo)))
   (list-let (arg-1 arg-2) (if (%values-tuple? #:form-rv-sym-69@00beeec4)
                              (slot-ref #:form-rv-sym-69@00beeec4 'v)
                              (list #:form-rv-sym-69@00beeec4))
     (+ arg-1 arg-2)))

And then, the compiler compiles that form and drops the result into the output file, which now contains several pretty deep assumptions about the simulated multiple value protocol it needs to honor:

  • Values are returned in a single value that satisfies %values-tuple?.
  • Values are extracted from a tuple with a call to slot-ref for slot v.
  • Values are stored within the slot as a list.

While the source text that uses values-bind doesn't need to know any of these details, the compiler output does. This results in compiler output that is very closely tied to the value protocol; output that is likely to be incompatible with any changes to that protocol.

In many development scenarios, this doesn't matter. Within a single project, if compiled file A comes to depend on assumptions embedded in macros from file B, it's less of an issue: both files are usually compiled at the same time. It's when both files can't be compiled simultaneously that things start to go wrong. I ran into this issue myself when trying to change the multiple value protocol I was using in my compiler. My core library was built with the old protocol, my new library was to be built with the new protocol, and the two could not interoperate for the brief period of time necessary to produce a compiled version of the new library. There are several possible approaches to solving this, but the one I took was a two-step process: building a new 'old' library that could handle both protocols, using it to compile a version that worked only with the new protocol, and then switching over completely. It was a mess, and a mess I created myself with a macro that expanded into something that assumed way too much. The better approach, the approach I switched to, is this:

(define (call-with-values proc vals)
  (apply proc (%values->list vals)))

(defmacro (values-bind form vars . body)
  `(call-with-values (lambda ,vars ,@body) ,form))

This expands the above code to something more palatable:

(call-with-values (lambda (arg-1 arg-2)
                     (+ arg-1 arg-2))
                  (returns-2-args 'foo))

The only assumption this makes in the compiled output is that there's a function call-with-values that applies its first argument to the values carried by its second argument. All of the gory details, which could easily be the same three from my list, are hidden behind function calls and dynamic linkage. This is the representation that made the two-step cutover approach plausible: switching to this version of the values-bind macro removed the assumptions about the value protocol from every call site, which is what made the change easy.

The upshot of this is something that's, I'm sure, pretty common knowledge in Lisp/Scheme circles: macros are best when limited to syntax, with the underlying functionality implemented behind a more functional interface. The functional interface keeps things more decoupled, even when compiled, and leaves your software more manageable. It also provides a second way to 'get at' the functionality provided by the underlying code: with the function/macro split, the macro expansion can be avoided entirely in cases where you already have a closure that contains the code you need to run.
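
For instance, a call site that already has its body wrapped up in a closure can skip values-bind entirely and go straight to the function. A quick sketch, using the returns-2-args example from above (add-both is just an illustrative name):

;; Going through the macro...
(values-bind (returns-2-args 'foo) (arg-1 arg-2)
  (+ arg-1 arg-2))

;; ...or, with a closure already in hand, calling the function directly:
(define add-both
  (lambda (arg-1 arg-2) (+ arg-1 arg-2)))

(call-with-values add-both (returns-2-args 'foo))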

One more brief example, a bit higher up the 'stack' in the language environment, is the transformation of this macro:

(defmacro (with-output-to-string . code)
  (with-gensyms (saved-output-port-sym output-string-sym)
    `(let ((,saved-output-port-sym (current-output-port))
           (,output-string-sym (open-output-string)))
       (unwind-protect (lambda ()
                         (set-current-output-port ,output-string-sym)
                         ,@code
                         (get-output-string ,output-string-sym))
                       (lambda ()
                         (set-current-output-port ,saved-output-port-sym))))))

Into this macro/function pair:

(define (call-with-output-to-string fn)
  (let ((saved-output-port (current-output-port))
        (output-string (open-output-string)))
    (unwind-protect (lambda ()
                      (set-current-output-port output-string)
                      (fn)
                      (get-output-string output-string))
                    (lambda ()
                      (set-current-output-port saved-output-port)))))

(defmacro (with-output-to-string . code)
  `(call-with-output-to-string (lambda () ,@code)))
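
As with values-bind, either entry point works at the call site. A quick sketch (greet is just an illustrative name):

;; The macro, for inline code...
(with-output-to-string
  (display "hello, ")
  (display "world"))

;; ...and the function, for code that already lives in a closure:
(define (greet)
  (display "hello, world"))

(call-with-output-to-string greet)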

February 14, 2008

A couple weeks ago, I got into a brief reddit discussion on the relative merits of Lisp's car and cdr functions. Given a Lisp list, applying car to the list returns the first element of the list and applying cdr to the list returns a list of every element excluding the first. For someone new to Lisp (as we all were once), these names can be a bit awkward. However, like many other aspects of the language, there is more to car and cdr than meets the eye.
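
A quick sketch of the two in action:

(car '(1 2 3))        ; => 1
(cdr '(1 2 3))        ; => (2 3)
(car (cdr '(1 2 3)))  ; => 2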

The first implementation of Lisp was done by Steve Russell on an IBM 704. The 704 was a 36-bit vacuum tube machine that IBM started selling in 1954. By the time it was discontinued in 1960, they had sold a total of 123 of the machines, each capable of a whopping 40,000 calculations per second. Russell's original 1959 implementation of Lisp on this machine took advantage of the fact that the 704's instruction set had special capabilities for accessing two distinct 15-bit fields of a 36-bit value loaded into a machine register: the address and decrement fields. In Russell's own words: "Because of an unfortunate temporary lapse of inspiration, we couldn't think of any other names for the 2 pointers in a list node than 'address' and 'decrement', so we called the functions CAR for 'Contents of Address of Register' and CDR for 'Contents of Decrement of Register'." Interestingly enough, he continues with this: "After several months and giving a few classes in LISP, we realized that 'first' and 'rest' were better names, and we (John McCarthy, I and some of the rest of the AI Project) tried to get people to use them instead. ... Alas, it was too late! We couldn't make it stick at all. So we have CAR and CDR." So there you have it: car and cdr, two of the most famous and widely used functions of Lisp and its descendants, owe their names to a bizarre quirk of a computer architecture that's been obsolete for close to fifty years. For the record, here's a source listing for the original 704 implementation of car, taken from MIT AI Lab Memo 6:

LXD JLOC, 4
CLA 0,4
PAX 0,4
PSX 0,4

But the story of car and cdr doesn't stop there. In the 1960's and early 70's, the University of Texas at Austin ran a computer called a CDC 6600. The 6600 was one of Seymour Cray's first big supercomputer designs, and was the fastest computer in the world for a time. It had a 60-bit machine word and an 18-bit address space, so you can probably see where this is going. The designers of UT's Lisp for the CDC 6600 added a third field to cons cells, giving them each three pointers: the car, cdr, and csr. I'm sure the third pointer was useful for implementing things like trees of nodes and lists of key/value pairs, although apparently not useful enough to stick around. Two-pointer cons cells are a better fit for modern hardware, and two two-pointer cons cells can represent everything that a single three-pointer cons cell can represent.
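
To see why, here's an illustrative sketch (the node3 names are mine): a three-field node can be modeled with a pair of ordinary two-pointer cons cells.

;; One cons cell holds the 'car'; a second holds the 'cdr' and 'csr'.
(define (make-node3 a d s) (cons a (cons d s)))
(define (node3-car node) (car node))
(define (node3-cdr node) (car (cdr node)))
(define (node3-csr node) (cdr (cdr node)))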

Back at MIT in the 70's, and before things like defstruct, Maclisp took the idea of multi-pointer cons cells to what must be its logical extreme: hunks. A Maclisp hunk was a structure like a cons cell that could hold an arbitrary number of pointers, up to a total of 512. Each of these slots in a hunk was referred to as a numbered cxr, with a numbering scheme that went like this: `(cxr-1 cxr-2 cxr-3 ... cxr-n cxr-0)`. No matter how many slots were in the hunk, car was equivalent to `(cxr 1 hunk)` and cdr was equivalent to `(cxr 0 hunk)`. This is a nice generalization of the basic idea of a cons cell, but modern Lisps offer other ways to structure data that are arguably both more useful and more readable: structures for a fixed collection of named slots, hash tables for a variable collection of named slots, and vectors for a collection of numbered slots.

After these historical blind alleys, it's interesting to think about why car and cdr still persist fifty years after McCarthy and Russell roamed the halls of MIT evangelising first and rest. Common Lisp does at least have first and rest as part of the standard. However, when I was taught Lisp in the mid-90's, I was encouraged to favor the older car and cdr. I remember two primary reasons for this. The first was that existing code favored car and cdr, so it was important to be able to read code written in that style. The second reason was that first and rest impose a particular meaning on the fields of a cons cell that may or may not be appropriate. In the common case of a linear list, first and rest work rather well: if you call first on the list, you get the first element; if you call rest, you get the rest. In the case of a cons cell used as a node of an association list, they work less well, unless, that is, you can figure out a reason why first makes sense as 'key' and rest makes sense as 'value'.
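
For example (a quick illustrative sketch), in an association list entry, car and cdr stay neutral, while first and rest suggest a structure that isn't really there:

(define color-table '((red . "#f00") (green . "#0f0")))

(define entry (assq 'red color-table))  ; => (red . "#f00")

(car entry)  ; => red      -- the 'key'
(cdr entry)  ; => "#f00"   -- the 'value'; calling this the 'rest' is a stretch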

Some of this confusion stems from the fact that most Lisps, despite the name List Processing, don't have an official list data type. What they have instead is a two-element cons cell and a set of conventions in the library, reader, and writer for using them to build linear lists. In a sense, this is a lot like strings in C. C doesn't have a string type; what it has instead is a pointer to character data (char *) and a set of library conventions for using blocks of memory as strings of characters. This laxness on the part of both languages comes with the advantages and disadvantages you'd expect from letting the details of an underlying implementation leak through. In the case of C 'strings', the representation lends itself both to things like Rob Pike's beautiful regex implementation in The Practice of Programming and a seemingly never-ending series of buffer overrun attacks. In the case of Lisp cons cells, it provides both an incredibly flexible data structure and confusion over such basic notions as the 'first', 'second', and 'rest' of a list.

If something as baroque as 'car' actually makes more sense than 'first' because 'first' doesn't match up well with the underlying abstraction, it might then make sense to reconsider the underlying implementation. Some modern Lisps like Clojure do just that; a Clojure 'list' isn't a chain of cons cells, but rather an instance of a JVM object that implements the interface ISeq:

public interface ISeq extends IPersistentCollection{
   Object first();
   ISeq rest();
   ISeq cons(Object o);
}

Clojure's user-visible first and rest functions ultimately call into their like-named methods in ISeq:

static public Object first(Object x){
   ISeq seq = seq(x);
   if(seq == null)
      return null;
   return seq.first();
}

// ...

static public ISeq rest(Object x){
   ISeq seq = seq(x);
   if(seq == null)
      return null;
   return seq.rest();
}

A noteworthy difference between this and a 'conventional' Lisp is the return type of rest: it's another ISeq, rather than an Object. Because of this, the 'CDR' of a Clojure cons cell has a new constraint: it must be another sequence, which greatly increases the likelihood that 'rest' really is 'the rest'. While this could be done even if rest returned an Object, constraining rest to be a sequence eliminates a number of edge cases in the language that arise when you allow the rest of a list to be something other than a list itself. This altered representation also fits in nicely with Clojure's host JVM: there's nothing that says ISeq has to be implemented by a two-pointer cons cell. Indeed, Clojure "also implements first and rest for vectors, strings, arrays, maps, Java Iterables, lazily calculated and infinite sequences etc." All these implementations of ISeq make it easier for Clojure sequences to interoperate with Java, and easier to build a sequence library that works on all kinds of sequences. What's lost with this choice is the ability to use a cons cell as an informal two-element structure. Even then, this style of first and rest could co-exist with implementations of car and cdr that work the 'old way'.

So in the end, maybe car and cdr are all right. They aren't the best names in the world, but they fit the semantics of an unrestricted two-pointer cons cell nicely. For those cases where you really are finding the first and rest of a list of cons cells, it is easy to use the first and rest functions in lieu of car and cdr. Then, for dialects like Clojure, your code is automatically portable to other sequence types. If you're using cons cells as an ad hoc structure, you can either use car and cdr and accept those names as historical baggage of the second-oldest major programming language, or investigate some of the more modern ways of structuring data in Lisp programs.

January 8, 2008

Lately, I've been thinking a bit about the way language design influences library design. My line of thought started out inspired by some of the recent conversations about closures in Java, but it ended up also touching on dynamic typing and a few other 'modern' language features. This will end up being more than one post, but I thought I'd record some of it in my blog, with the hope that it might shed some light for somebody, somewhere.

To motivate this discussion, I'll use as an example a simple C implementation of a string-interning function, intern_string. If you're not familiar with the concept of interning, the premise is that interning two objects ensures that if they have the same value, they also have the same identity. In the case of C strings, interning ensures that if `strcmp(intern_string(a), intern_string(b)) == 0` holds true, then `intern_string(a) == intern_string(b)` also holds true. Since it effectively means that each string value is only stored one time, this technique can reduce memory requirements. It also gives you a cheap string equality comparison: checking two interned strings for equality reduces to a pointer comparison, which is about as fast as it gets.

Given a hash table that compares keys by value, implementing the function intern_string is fairly simple. In the following code, intern_table is a hash table that maps strings to themselves. hash_ref, hash_set, and hash_has are functions that manipulate the hash table:

  • int hash_has(hash_table_t ht, char *key) - Returns TRUE or FALSE, depending on whether or not the key key is found in the hash table ht.
  • char *hash_ref(hash_table_t ht, char *key) - Returns the value bound to the key key by the hash table ht. If the key is not found, behavior is undefined.
  • char *hash_set(hash_table_t ht, char *key, char *value) - Binds the value value to the key key in the hash table ht. If the key is already present, the existing value is overwritten. This function returns value.

Note the critical assumption that the hash_* accessors implement key comparison with value semantics (strcmp), rather than identity semantics (==).

hash_table_t intern_table; // assume this is initialized somewhere else.

char *intern_string(char *str)
{
  if (hash_has(intern_table, str))
     return hash_ref(intern_table, str);

  char *interned_str = strdup(str);

  hash_set(intern_table, interned_str, interned_str);

  return interned_str;
}

The first step of intern_string is to check whether the intern table already contains a string with the value of the new string. If the new string is already in the intern table, the function returns the copy that's in the table. Otherwise, a new copy of the incoming string is created and stored in the hash table. In either case, the string returned is in the intern table. This logic ensures that every time intern_string is called with a str of the same value, it returns the exact same string.

If you haven't guessed already, the problem with this implementation of intern_string lies in the dual calls to hash_has and hash_ref. Both calls involve searching the hash table for the key: hash_has to determine if the key exists, and hash_ref to retrieve the key's value. This means that in the common case, interning a string that's already been interned, this implementation searches the hash table twice. Double work.

Fixing this problem involves changing the calling conventions for hash_ref. One of the simplest ways to do this involves defining a specific return value that hash_ref can return in the 'key not found' case. For strings in C, NULL is a logical choice. This change to hash_ref makes it possible to remove the double search by eliminating the explicit hash_has check:

hash_table_t intern_table;

char *intern_string(char *str)
{
  char *interned_str = hash_ref(intern_table, str);

  if (interned_str == NULL)
  {
     interned_str = strdup(str);

     hash_set(intern_table, interned_str, interned_str);
  }

  return interned_str;
}

For string interning, this change to the hash_ref interface works fairly well. We know that we'll never store a hash key with a NULL value, so we know that NULL is safe to use for signaling that a key was not found. Were this ever to change, this version of hash_ref doesn't return enough information to distinguish between the 'key not found' case and the 'NULL value' case: both would return NULL. To fix this, hash_ref needs to be extended to also return a separate value that indicates whether the key was found. This can be done in C by having hash_ref return the 'key found' flag as its return value, and also accept a pointer to a buffer that will receive the key's value, if it's found:

hash_table_t intern_table;

char *intern_string(char *str)
{
  char *interned_str;

  if (!hash_ref(intern_table, str, &interned_str))
  {
     interned_str = strdup(str);

     hash_set(intern_table, interned_str, interned_str);
  }

  return interned_str;
}

This is probably about as good as you can get in straight C. It easily distinguishes between the 'no-value' and 'no-key' cases, it's relatively clear to read, and it uses the common idioms of the language. However, C is a relatively sparse language. If you're willing to switch to something a bit more expressive, you have other choices.

One example of this is a choice that's particularly well supported by dynamically typed languages. With a language that can identify types at runtime, it becomes possible for hash_ref to return values of a different type if the key is not found. This value can be distinguished from other return values by virtue of the run time type identification supported by the language. In one such language, Scheme, this lets us implement intern-string like this:

(define *intern-table* (make-hash :equal))

(define (intern-string str)
  (let ((interned-str (hash-ref *intern-table* str 42)))
    (cond ((eqv? interned-str 42)
           (hash-set! *intern-table* str str)
           str)
          (#t
           interned-str))))

If you prefer C/JavaScript-style syntax, it looks like this:

var intern_table = make_hash(EQUAL);

function intern_string(str)
{
   var interned_str = hash_ref(intern_table, str, 42);

   if (interned_str == 42)
   {
       hash_set(intern_table, str, str);
       return str;
   }

   return interned_str;
}

In this case, hash_ref has been extended with a third argument: a default return value to use if the key is not found. The above code uses this to have hash_ref return a number in the 'no value' case; it's the type of that return value that ensures its distinctness. This is a common dynamic language idiom, but for a moment, consider what it would look like in C:

hash_table_t intern_table;

char *intern_string(char *str)
{
  char *interned_str = hash_ref(intern_table, str, (char *)42);

  if (interned_str == (char *)42)
  {
     interned_str = strdup(str);

     hash_set(intern_table, interned_str, interned_str);
  }

  return interned_str;
}

At first, this actually seems like it might be a plausible implementation of intern_string. My guess is that it might even work most of the time. Where this implementation gets into trouble is the case when an interned string might reasonably be located at address 42. Because C is statically typed, when hash_ref returns, all it returns is a pointer. The caller cannot distinguish between the 'valid string at address 42' return value and the 'no-key' return value. This is basically the same problem as the case where we overloaded NULL to signal 'no-key'.

The way the dynamically typed language solved this problem is worth considering. When a dynamically typed language passes a value, what it's really passing is a pointer along with a few extra bits describing the type of the object being pointed to. (Runtime implementations vary, but that's the gist of many.) Using dynamic typing to distinguish between those two possible cases really amounts to using those few extra type bits to carry 'another' return value, one holding information on whether or not the key was found. This is exactly what our 'best' C implementation does explicitly, with a return value and a reference parameter. The dynamic typing isn't necessarily adding any expressive power, but it does give us another, more concise means of expressing what we're trying to say.

Older Articles...