Mike Schaeffer's Weblog
Wed, 29 Jun 2011
Symbol Identity in Lisp and Clojure
One typical property of Lisp systems is that they they intern symbols. Roughly speaking, when symbols are interned, two symbols with the same print name will also have the same identity. This design choice has several significant implications elsewhere in the Lisp implementation. It is also one of the places where Clojure differs from Lisp tradition.

In code, the most basic version of the intern algorithm is easy to express:
(define (intern! symbol-name)
   (unless (hash-has? *symbol-table* symbol-name)
      (hash-set! *symbol-table* symbol-name (make-symbol symbol-name)))
   (hash-ref *symbol-table* symbol-name))
This code returns the symbol with the given name in the global symbol table. If there's not already a symbol under that name in the global table, it creates a symbol with that name and stores it in the hash prior to returning it.. This ensures that make-symbol is only called once for each symbol-name, and the symbol stored in *symbol-table* is always the symbol returned for a given name. Any two calls to intern! with the same name are therefore guaranteed to return the exact, same, eq? symbol object. At a vCalc REPL, this looks like so (The fact that both symbols are printed with ##0 implies that they have the same identity.):
user> (intern! "test-symbol")
; ##0 = test-symbol
user> (intern! "test-symbol")
; ##0 = test-symbol
This design has several properties that have historially been useful when implementing a Lisp. First, by sharing the internal representation of symbols with the same print name, interning can reduce memory consumption. A careful programmer can write an implementation of interned symbols that doesn't allocate any memory on the heap unless it sees a new, distinct symbol. Interning also gives a (theoretically) cheaper mechanism for comparing two symbols for equality. Enforcing symbol identity equality for symbol name equality implies that symbol name equality can be reduced to a single machine instruction. In the early days of Lisp, these were very significant advantages. With modern hardware, they are less important. However, the semantics of interned symbols do still differ in important ways.

One example of this is that interned symbols make it easy to provide a global environment 'for free'. To see what I mean by this, here is the vCalc declaration of a symbol:
...
          struct
          {
               ...
               lref_t vcell;  // Current Global Variable Binding
               ...
          } symbol;
...
Each symbol carries with it three fields that are specific to each symbol, and are created and initialized at the time the symbol is created. Because vcell for the symbol is created at the same time as the symbol, the global variable named by the symbol is created at the same time as the symbol itself. Accessing the value of that global variable is done through a field stored at an offset relative to the beginning of the symbol. These benefits also accrue to property lists, as they can also be stored in a field of a symbol. This is a cheap implementation strategy for global variables and property lists, but it comes at the cost of imposing a tight coupling between two distinct concepts: symbols and the global environment.

The upside of this coupling is that it encourages the use of global symbol attributes (bindings and properties). During interactive programming at a REPL, global bindings turn out to be useful because they make it easy to 'say the name' of the bindings to the environment. For bindingsthat directly map to symbols, the symbol itself is sufficient to name the binding and use it during debugging. Consider this definition:
(define *current-counter-value* 0)

(define (next-counter-value)
   (incr! *current-counter-value*)
   *current-counter-value*)
This definition of next-counter-value makes it easy to inspect the current counter value. It's stored in a global variable binding, so it can be inspected and modified during debugging using its name: *current-counter-value*. A more modular style of programming would store the current counter value in a binding local to the definition of next-counter-value:
(let ((current-counter-value 0))
  (define (next-counter-value)
    (incr! current-counter-value)
    current-counter-value))
This is 'better' from a stylistic point of view, because it reduces the scope of the binding of current-counter-value entirely to the scope of the next-counter-value function. This eliminates the chance that 'somebody else' will break encapsulation and modify the counter value in a harmful fashion. Unfortunately, 'somebody else' also includes the REPL itself. The 'better' design imposes the cost that it's no longer as easy to inspect or modify the current-counter-value from the REPL. (Doing so requires the ability to inspect or name the local bindings captured in the next-counter-value closure.)

The tight coupling between interned symbols and global variable bindings should not come as a suprise, because interning a symbol necessarily makes the symbol itself global. In a Lisp that interns symbols, the following code fragment creates two distinct local variable bindings, despite the fact that the bindings are named by the same, eq? symbol: local-variable.
(let ((local-variable 0))
   (let ((local-variable 0))
      local-variable))
The mismatch between globally interned symbols and local bindings implies that symbols cannot as directly be involved in talking about local bindings. A Common Lisp type declaration is an S-expression that says something about the variable named by a symbol.
...
        (declare (fixnum el))
...
In contrast, a Clojure type declaration is a reader expression that attaches metadata to the symbol itself:
...
        ^String x
...
The ^ syntax in Clojure gathers up metadata and then applies it using withMeta to the next expression in the input stream. In the case of a type declaration, the metadata gets applied to the symbol naming the binding. This can be done in one of two ways. The first is to destructively update metadata attached to an interned symbol. If Clojure had done this, then each occurrance of symbol metadata would overwrite whatever metadata was there before, and that one copy of the metadata would apply to every occurance of the symbol in the source text. Every variable with the same name would have to have the same type declarations.

Clojure took the other approach, and avoids the problem by not interning symbols. This allows metadata to be bound to a symbol locally. In the Clojure equivalent of the local-variable example, there are two local variables and each are named by two distinct symbols (but with the same name).
(let [local-variable 0]
   (let [local-variable 0]
      local-variable))
The flexibility of this approch is useful, but it comes at the cost of losing the ability to store values in the symbols themselves. Global symbol property lists in Lisp have to be modeled using some other means. Global variable bindings are not stored in symbols, but rather in Vars. (This is something that compiled Lisps tend to do anyway.) These changes result in symbols in Clojure becoming slightly 'smaller' than in Lisp, and more well aligned with how they are used in moodern, lexically scoped Lisps. Because there are still global variable bindings availble in the language, the naming benefits of globals are still available for use in the REPL. It's taken a while for me to get there, but the overall effect of un-interned symbols on the design of a Lisp seems generally positive.

reddit this! Digg Me!

[/tech/lisp] permanent link

Thu, 16 Apr 2009
Why Lisp?
I recently was asked why I like Lisp. For me, it boils down to the fact that Lisp makes it easy to control when your code is evaluated. Most languages only let you evaluate code at runtime. There are ways around this (C++ templates, cpp, code generation, etc.), but they all have severe limitations. In contrast, Lisp makes it easy to run actual Lisp code at compile time (macros) or even read time (reader macros). Combine that with the fact that Lisp code is pretty easy to manipulate with Lisp itself, and it becomes much easier to do things that are usually restricted to language developers, which can be quite a force multiplier. Just to illustrate, most of the language features that C# has over Java (LINQ, properties, closures, lambda syntax, yield return etc.) could be added to Java by 'ordinary developers' if Java somehow had these things I like so much about Lisp.

But it doesn't...

reddit this! Digg Me!

[/tech/lisp] permanent link

Thu, 10 Apr 2008
defmacro and coupling.
A few months ago, I ran into a problem with a macro that seriously changed my opinions on how they should be used. It all comes down to the fact that macro are incorporated into compiler output. Two pieces of code that look nicely decoupled in the source text can end up very entwined with each other, once they are compiled.

To illustrate, I'll use the macro in question, something I once used to accept a sort of simulated 'multiple return value' in a dialect of Scheme. This is a low level example, something from my hobby work, but it can apply equally well to other uses of macros.
(defmacro (values-bind form vars . body)
  (with-gensyms (form-rv-sym)
    `(let ((,form-rv-sym ,form))
       (list-let ,vars (if (%values-tuple? ,form-rv-sym)
                           (slot-ref ,form-rv-sym 'v)
                           (list ,form-rv-sym))
         ,@body))))
This macro expands code like this:
(values-bind (returns-2-args 'foo) (arg-1 arg-2)
   (+ arg-1 arg-2))
Into code that looks like this:
(let ((#:form-rv-sym-69@00beeec4 (returns-2-args 'foo)))
   (list-let (arg-1 arg-2) (if (%values-tuple? #:form-rv-sym-69@00beeec4)
                              (slot-ref #:form-rv-sym-69@00beeec4 'v)
                              (list #:form-rv-sym-69@00beeec4))
     (+ arg-1 arg-2)))
And then, the compiler compiles that form and drops the result into the output file, which now contains several pretty deep assumptions about the simulated multiple value protocol it needs to honor:

  • Values are returned in a single value that satifies %values-tuple?.
  • Values are extracted from a tuple with a call to slot-ref for slot v.
  • Values are stored within slot as a list.
    While the source text that uses values-bind doesn't need to know any of these details, the compiler output does. This results in compiler output that is very closely tied to the value protocol; Compiler output that is likely to be incompatible with any changes to that protocol.

    In many development scenarios, this doesn't matter. Within a single project, if compiled file A comes to depend on assumptions embedded in macros from file B, it's less of an issue: both files are usually compiled at the same time. If both files can't be simultaneously compiled, things start to go wrong. I ran into this issue myself when trying to change the multiple value protocol I was using in my compiler. My core library was built with the old protocol, my new library was to be built with the new protocol, and the two could not interoperate for the brief period of time necessary to produce a compiled version of the new library. There are several possible approaches to solving this, but but one I took was the two step of building a new 'old' library that can handle both protocols, using it to compile a version that works only with the new protocol, and then switching over completely. It was a mess, and a mess I created myself with a macro that expanded into something that assumed way too much. The better approach, the approach that I switched to, is this:
    (define (call-with-values proc vals)
      (apply proc (%values->list vals)))
    
    (defmacro (values-bind form vars . body)
      `(call-with-values (lambda ,vars ,@body) ,form))
    
    This expands the above code to something more palatable:
    (call-with-values (lambda (arg-1 arg-2)
                         (+ arg-1 arg-2))
                      (returns-2-args 'foo))
    
    The only assumption this makes in the compiled output is that there's a function call-with-values that calls its first argument with values passed in as its second argument. All of the gory details, which could easily be the same three from my list, are hidden behind function calls and dynamic linkage. This is actually the representation that made the two-step cutover approach plausible. Switching to this version of the values-bind macro removed assumptions about the value protocol from every call site, and made it easy to switch.

    The upshot of this is something that's, I'm sure, pretty common knowledge in Lisp/Scheme circles: macros are best when limited to syntax, with the underlying functionality implemented in a more functional interface. The functional interface keeps things more decoupled, even when compiled, and leaves your software more managable. It also provides a second way to 'get at' the functionality provided by the underlying code. With the function/macro split, the macro expansionn can be avoided entirely, in the case when you already have a closure that contains the code you need to run.

    One more brief example, a bit higher up the 'stack' in the language environment is the transformation of this macro:
    (defmacro (with-output-to-string . code)
      (with-gensyms (saved-output-port-sym output-string-sym)
        `(let ((,saved-output-port-sym (current-output-port))
               (,output-string-sym (open-output-string)))
           (unwind-protect (lambda ()
                             (set-current-output-port ,output-string-sym)
                             ,@code
                             (get-output-string ,output-string-sym))
                           (lambda ()
                             (set-current-output-port ,saved-output-port-sym))))))
    
    Into this macro/function pair:
    (define (call-with-output-to-string fn)
      (let ((saved-output-port (current-output-port))
            (output-string (open-output-string)))
        (unwind-protect (lambda ()
                          (set-current-output-port output-string)
                          (fn)
                          (get-output-string output-string))
                        (lambda ()
                          (set-current-output-port saved-output-port)))))
    
    (defmacro (with-output-to-string . code)
      `(call-with-output-to-string (lambda () ,@code)))
    


    reddit this! Digg Me!
  • [/tech/lisp] permanent link

    Thu, 14 Feb 2008
    CAR, CDR, and Lisp...
    A couple weeks ago, I got into a brief reddit discussion on the relative merits of Lisp's car and cdr functions. Given a Lisp list, applying car to the list returns the first element of the list and applying cdr to the list returns a list of every element excluding the first. For someone new to Lisp (as we all were once), these names can be a bit awkward. However, like many other aspects of the language, there is more to car and cdr than meets the eye.

    The first implementation of Lisp was done by Steve Russell on an IBM 704. The 704 was a 36-bit vacuum tube machine that IBM started selling in 1954. By the time it was discontinued in 1960, they had sold a total of 123 of the machines, each capable of a whopping 40,000 calculations per second. Russell's original 1959 implementation of Lisp on this machine took advantage of the fact that the 704's instruction set had special capabilities for accessing two distinct 15 bit fields of a 36 bit value loaded into a machine register: the address and decrement fields. In Russell's own words: "Because of an unfortunate temporary lapse of inspiration, we couldn't think of any other names for the 2 pointers in a list node than 'address' and 'decrement', so we called the functions CAR for 'Contents of Address of Register' and CDR for 'Contents of Decrement of Register'." Interestingly enough, he continues with this: "After several months and giving a few classes in LISP, we realized that 'first' and 'rest' were better names, and we (John McCarthy, I and some of the rest of the AI Project) tried to get people to use them instead. ... Alas, it was too late! We couldn't make it stick at all. So we have CAR and CDR. " So there you have it: car and cdr, two of the most famous and widely used functions of Lisp and its descendents, owe their names to a bizarre quirk of a computer architecture that's been obsolete for close to fifty years. For the record, here's a source listing for the original 704 implementataion of car taken from MIT AI Lab Memo 6:
    LXD JLOC, 4
    CLA 0,4
    PAX 0,4
    PSX 0,4
    
    But the story of car and cdr doesn't stop there. In the 1960's and early 70's, the University of Texas at Austin ran a computer called a CDC 6600. The 6600 was one of Seymour Cray's first big supercomputer designs, and was the fastest computer in the world for a time. It had a 60-bit machine word and an 18-bit address space, so you can probably see where this is going. The designers of UT's Lisp for the CDC 6600 added a third field to cons cells, giving them each three pointers, the car, cdr, and csr. I'm sure the third pointer was useful for implementing things like trees of nodes and lists of key/value pairs, although apparantly not useful enough to stick around. Two pointer cons cells are a better fit for modern hardware, and two two pointer cons cells can represent everything that a single three pointer cons cell can represent.

    Back at MIT in the 70's, and before things like defstruct, Maclisp took the idea of multi-pointer cons cells to what must be its logical extreme: hunks. A Maclisp hunk was a structure like a cons cell that could hold an arbitrary number of pointers, up to total of 512. Each of these slots in a hunk was referred to as a numbered cxr, with a numbering scheme that went like this: ( cxr-1 cxr-2 cxr-3 ... cxr-n cxr-0 ). No matter how many slots were in the hunk, car was equivalent to (cxr 1 hunk) and cdr was equivalent to (cxr 0 hunk). This is a nice generalization of the basic idea of a cons cell, but modern Lisps offer other ways to structure data that are both possibly more useful and more readable: structures for a fixed collection of named slots, hash tables for a variable collection of named slots, and vectors for a collection of numbered slots.

    After these historical blind alleys, it's interesting to think about why car and cdr still persist fifty years after McCarthy and Russell roamed the halls of MIT evangelising first and rest. Common Lisp does at least have first and rest as part of the standard. However, when I was taught Lisp in the mid-90's, I was encouraged to primarily favor the older car and cdr. I remember two primary reasons for this. The first was that existing code favored car and cdr, so it was important to be able to read code written in that style. The second reason was that first and rest impose a particular meaning on the fields of a cons cell that may or may not be appropriate. In the common case of a linear list, first and rest work rather well. If you call first on the list, you get the first element, if you call rest, you get the rest. In the case of a cons cell as a node of an association list, they work less well, unless, that is, you can figure out a reason why first makes sense as 'key', and rest makes sense as 'value'.

    Some of this confusion stems from the fact that most Lisps, despite the name List Processing, don't have an official list data type. What they have instead is a two element cons cell and a set of conventions in the library, reader, and writer for using the to make linear lists. In a sense, this is a lot like strings in C. C doesn't have a string type, what it has instead is a pointer to character data (char *) and a set of library conventions for using blocks of memory as strings of characters. This laxness on the part of both languages comes with the advantages and disadvantages you'd expect from letting the deatils of an underlying implementation leak through. In the case of C 'strings', the representation lends itself both to things like Rob Pike's beautiful regex implementation in The Practice of Programming and a seemingly never ending series of buffer overrun attacks. In the case of Lisp cons cells, it provides both an incredibly flexible data structure, and confusion over such basic notions as the 'first', 'second', and 'rest' of a list.

    If something as baroque as 'car' actually makes more sense than 'first' because 'first' doesn't match up well to underlying abstraction, it might them make sense to reconsider the underlying implementation. Some modern Lisps like Clojure do just that; A Clojure 'list' isn't a string of cons cells, but rather an instance of a JVM object that implements the interface ISeq:
    public interface ISeq extends IPersistentCollection{
       Object first();
       ISeq rest();
       ISeq cons(Object o);
    }
    
    Clojure's user-visible first and rest functions ultimately call into their like-named methods in ISeq:
    static public Object first(Object x){
    	ISeq seq = seq(x);
    	if(seq == null)
    		return null;
    	return seq.first();
    }
    
    // ...
    
    static public ISeq rest(Object x){
    	ISeq seq = seq(x);
    	if(seq == null)
    		return null;
    	return seq.rest();
    }
    
    A noteworthy difference between this and a 'conventional' Lisp is the return type of rest: it's another ISeq, rather than a Object. Because of this, the 'CDR' of a Clojure cons cell has a new constraint: it is constrained to be another sequence, increasing greatly the likelihood that 'rest' really is 'the rest'. While this could be done even if rest returned an Object, constraining rest to be a sequence eliminates a number of edge cases in the language that arise when you allow the rest of a list to be something other than a list itself. This altered representation also fits in nicely with Clojure's host JVM: there's nothing that says ISeq has to be implemented by a two-element pointer. Indeed, Closure "also implements first and rest for vectors, strings, arrays, maps, Java Iterables, lazily calculated and infinite sequences etc." All these implementations of ISeq make it easier for Clojure sequences to interoperate with Java, and it makes it easier to build a sequence library that works on all kinds of sequences. What's lost with this choice is the ability to use a cons cell as an informal two-element structure. Even then, this style of first and rest could co-exist with implementations of car and cdr that work the 'old way'.

    So in the end, maybe car and cdr are all right. They aren't the best names in the world, but they fit nicely the semantics of a unrestricted two-pointer cons cell. For those cases where you really are finding the first and rest of a list of cons cells, it is easy to use the first and rest functions in lieu of car and cdr. Then, for dialects like Clojure, your code is automatically portable to other sequence types. If you're using cons cells as a ad hoc structure, then you can either use car and cdr and accept those names as historical baggage of the second oldest major programming language, or investigate some of the other more modern ways of structuring data in Lisp programs.

    reddit this! Digg Me!

    [/tech/lisp] permanent link

    Fri, 04 Nov 2005
    SRFI-74: Octet-Addressed Binary Blocks
    Michael Sperver has written an SRFI that documents "Octet-Addressed Binary Blocks". Basically these things are like BLOBs in SQL: blocks of memory, opaque to the data model of the language, that can be used to store arbitrary binary data. I can think of a bunch of applications for this:
    • An internal representation for compiled byte code functions.
    • A way to interoperate with C code that expects binary data formats. (Like the Win32 API, for example. )
    • A way to represent binary data longer than a byte that's written to and read from binary ports.


    reddit this! Digg Me!

    [/tech/lisp] permanent link

    Mon, 08 Aug 2005
    Arc Hub
    Paul Graham solicited comments on his Arc programming language a few years ago. These comments are online, and are very interesting reading. Lots of good comments.

    reddit this! Digg Me!

    [/tech/lisp] permanent link

    Thu, 30 Jun 2005
    Mapping XML to S-Expressions
    I've been playing around with how to map XML to S-Expressions For a while, I had been considering a mapping like the following:

    From:<phone_book name="Jenny">867-5309</phone_book>
    To:(phone_book ((name . "Jenny")) "867-5309")

    In other words, a symbol for the tag name in the car of the list, an association list of attribute values in the cadr, and then the subelements in the cddr. This seems reasonable, aside from the fact that attributes and tag values are still wierdly disjoint.

    On the way to lunch today, I came up with another mapping that might be more reasonable:

    From:<phone_book name="Jenny">867-5309</phone_book>
    To:(phone_book (name "Jenny") :end-of-attribute-marker "867-5309")

    This is simpler in that a tag is modeled as a list containing the tag symbol and then all of the sub-items, attributes or not. Data stored as an attribute doesn't get special treatment relative to data stored as a tag value. The symbol :end-of-attribute-marker makes it possible to still distinguish between attributes and tags. If you don't care, a simple call to remove can remove the marker symbol.

    It's a subtle design point, but this'll probably end up in vCalc in the XML support... I've had XML for vCalc on the back-burner for a while now, but due to some real work obligations, I might have to make it a higher priority.

    reddit this! Digg Me!

    [/tech/lisp] permanent link

    Fri, 04 Mar 2005
    A few good Lisp and Scheme (and Smalltalk) Related Links
    It never ceases to amaze me how much good material there is online. Here's some more:

  • Olin Shiver's History of the T implemenatation of the Scheme programming language.
  • Aubrey Jaffer has some interesting material on interpreter performance issues at his SCM site.
  • Alan Kay's Early History of Smalltalk
  • In 1994, Richard Stallman started a debate (flamewar?) on comp.lang.tcl with a post entitled Why you should not use Tcl. The Guile scripting language was the logical outgrowth of this.
  • Kent Pitman has posted a bunch of his writings relating to Lisp and SCheme. Among other things, he edited the Common Lisp Hyperspec, which is an on-line version of the Common Lisp specification.
  • Peter Siebel has written a book, Practical Common Lisp, and has gotten permission to put it online indefinately.

    Ps: Be sure to check out Olin Shiver's philosophy of undergraduate advising. It's an example to be followed. ;-)

    reddit this! Digg Me!
  • [/tech/lisp] permanent link

    Wed, 02 Mar 2005
    I guess I had forgotten how slow I/O was, particularly bad I/O.
    I'm in the middle of developing a Scheme compiler for a future release of vCalc. While I've been developing the code, I've peppered it full of debugging print statements that look something like this:

    (format #t "compiling ~l, tail?=~l, value?=~l" form tail? value?)

    with the output statements in place, the compiler takes about 250-300ms to compile relatively small functions. Not great, particularly considering that there's no optimization being done at all. Anyway, on a hunch I removed the format statements, and execution time improved by a couple orders of magnitude to a millisecond or two per function. That's a lot closer to what I was hoping for at this stage of development.

    On the other hand, I hadn't realized that my (ad hoc, slapped together in an hour) format function was running quite that slowly. I think it'll end up being an interesting optimnization problem sooner or later.

    reddit this! Digg Me!

    [/tech/lisp] permanent link

    Tue, 08 Feb 2005
    A couple Lisp/Programming Language Blogs
    One interest of mine is programming languages, and more specifically, Lisp and Scheme. Lately, the blogosphere has produced a couple interesting blogs that tie to this interest:

  • Lambda the Ultimate
  • Planet Lisp

    Planet Lisp aggregates a bunch of Lisp-related blogs, while LtU is more general and more of a discussion site.

    Related, I came to LtU via Eric Lippert's Blog over on Weblogs @ASP.Net. Eric is a developer at Microsoft who's done a lot of work on Windows scripting and the Windows script languages.

    reddit this! Digg Me!

  • [/tech/lisp] permanent link