Mike Schaeffer's Blog

Articles with tag: lisp
March 26, 2014

Update 2019-01-17: KSM recently redesigned their website in a way that removes the original blog. Because of this, I've taken some of what I wrote then for KSM and re-hosted it here. Thanks are due both to KSM Technology Partners for allowing me to do this and to the Wayback Machine for retaining the content. All the links below are updated to reflect the articles' new locations.


Sorry for the radio silence, but recently I've been focusing my writing time on the KSM Technology Partners Blog. My writing there is still technical in nature, but it tends to be more heavily focused on the JVM. If you're interested, here are a few of what I consider to be the highlights.

In mid-2013, I started out writing about how to use Runnable to explicitly enforce dynamic extent in Java. In a nutshell, this is a way to implement try-with-resources in versions of Java that don't have it built in to the language. I then used the dynamic extent technique to build a ThreadLocal that plays nicely with thread pools. This is useful because thread-local storage requires an understanding of which thread you're running on, something that thread pooling techniques deliberately abstract away.

Later in the year, I focused more on Clojure, starting off with a quick bit on the relationship of lexical closures to Java inner classes. I also wrote about a particular kind of stack overflow exception that can happen with lazy sequences. Lazy sequences can nicely remove the need to use recursion while traversing their length, but each time two unrealized lazy sequences are combined, it adds to the recursive depth required to compute the first element. For me, this stack overflow was a difficult error to diagnose, because it seemed so counter-intuitive.
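
If you'd like to see that failure mode concretely, here's a minimal Clojure sketch (my illustration, not code from the original post): each concat wraps another unrealized lazy sequence, so realizing the first element must recurse through every layer.

;; Builds 10,000 nested, unrealized lazy sequences.
(def deeply-nested (reduce concat (repeat 10000 [1])))

;; Realizing the first element recurses through all of the nesting.
(first deeply-nested) ;; throws StackOverflowError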

I'm also in the middle of a series of posts that relate the GoF command pattern to functional programming. The posts start off with Java, but will ultimately describe a Clojure implementation that compiles a stack-based expression language into optimized Java bytecode. If you'd like to play with the code, it's on GitHub.

May 30, 2012

In my Lisp programming, I find myself using Anaphoric Macros quite a bit. My first exposure to this type of macro (and deliberate variable capture) was in Paul Graham's On Lisp. Since I haven't been able to find Emacs Lisp implementations of these macros, I wrote my own.

The first of the two macros is an anaphoric version of the standard if special form:

(defmacro aif (test if-expr &optional else-expr)
  "An anaphoric variant of (if ...). The value of the test
expression is locally bound to 'it' during execution of the
consequent clauses. The binding is present in both consequent
branches."
  (declare (indent 1))
  `(let ((it ,test))
     (if it ,if-expr ,else-expr)))

The second macro is an anaphoric version of while:

(defmacro awhile (test &rest body)
  "An anaphoric varient of (while ...). The value of the test
expression is locally bound to 'it' during execution of the body
of the loop."
  (declare (indent 1))
  (let ((escape (gensym "awhile-escape-")))
    `(catch ',escape
       (while t
         (let ((it ,test))
           (if it
               (progn ,@body)
             (throw ',escape ())))))))

What both of these macros have in common is that they emulate an existing conditional special form, while adding a local binding that makes it possible to access the result of the condition. This is particularly useful in scenarios where a predicate function returns a true value that contains useful information beyond t or nil.
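
For example, usage looks like this (a hypothetical sketch; the queue variable is purely for illustration):

;; 'it' is bound to the value of the test expression.
(aif (getenv "EDITOR")
    (message "Using editor: %s" it)
  (message "EDITOR is not set"))

;; Loops until (pop queue) returns nil, binding each popped value to 'it'.
(setq queue '(a b c))
(awhile (pop queue)
  (message "Processing %S" it))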

June 29, 2011

One typical property of Lisp systems is that they intern symbols. Roughly speaking, when symbols are interned, two symbols with the same print name will also have the same identity. This design choice has several significant implications elsewhere in the Lisp implementation. It is also one of the places where Clojure differs from Lisp tradition.

In code, the most basic version of the intern algorithm is easy to express:

(define (intern! symbol-name)
   (unless (hash-has? *symbol-table* symbol-name)
      (hash-set! *symbol-table* symbol-name (make-symbol symbol-name)))
   (hash-ref *symbol-table* symbol-name))

This code returns the symbol with the given name in the global symbol table. If there's not already a symbol under that name in the global table, it creates a symbol with that name and stores it in the hash prior to returning it. This ensures that make-symbol is only called once for each symbol-name, and the symbol stored in *symbol-table* is always the symbol returned for a given name. Any two calls to intern! with the same name are therefore guaranteed to return the exact same, eq? symbol object. At a vCalc REPL, this looks like so (the fact that both symbols are printed with ##0 implies that they have the same identity):

user> (intern! "test-symbol")
; ##0 = test-symbol
user> (intern! "test-symbol")
; ##0 = test-symbol

This design has several properties that have historically been useful when implementing a Lisp. First, by sharing the internal representation of symbols with the same print name, interning can reduce memory consumption. A careful programmer can write an implementation of interned symbols that doesn't allocate any memory on the heap unless it sees a new, distinct symbol. Interning also gives a (theoretically) cheaper mechanism for comparing two symbols for equality. Because two symbols with the same name are guaranteed to have the same identity, comparing symbol names reduces to comparing pointers: a single machine instruction. In the early days of Lisp, these were very significant advantages. With modern hardware, they are less important. However, the semantics of interned symbols still differ from those of uninterned symbols in important ways.
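
In code, the cheap comparison looks like this (a small sketch in the same dialect as the intern! example above):

;; Both calls return the same object, so a pointer comparison suffices:
(eq? (intern! "test-symbol") (intern! "test-symbol")) ; => #t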

One example of this is that interned symbols make it easy to provide a global environment 'for free'. To see what I mean by this, here is the vCalc declaration of a symbol:

struct
{
     ...
     lref_t vcell;  // Current Global Variable Binding
     ...
} symbol;

Each symbol carries three fields that are created and initialized at the time the symbol itself is created. Because the symbol's vcell is created along with the symbol, the global variable named by the symbol comes into existence at the same time as the symbol itself. Accessing the value of that global variable is done through a field stored at a fixed offset from the beginning of the symbol. The same benefits accrue to property lists, as they can also be stored in a field of a symbol. This is a cheap implementation strategy for global variables and property lists, but it comes at the cost of imposing a tight coupling between two distinct concepts: symbols and the global environment.

The upside of this coupling is that it encourages the use of global symbol attributes (bindings and properties). During interactive programming at a REPL, global bindings turn out to be useful because they make it easy to 'say the name' of the bindings to the environment. For bindings that directly map to symbols, the symbol itself is sufficient to name the binding and use it during debugging. Consider this definition:

(define *current-counter-value* 0)

(define (next-counter-value)
   (incr! *current-counter-value*)
   *current-counter-value*)

This definition of next-counter-value makes it easy to inspect the current counter value. It's stored in a global variable binding, so it can be inspected and modified during debugging using its name: *current-counter-value*. A more modular style of programming would store the current counter value in a binding local to the definition of next-counter-value:

(let ((current-counter-value 0))
  (define (next-counter-value)
    (incr! current-counter-value)
    current-counter-value))

This is 'better' from a stylistic point of view, because it reduces the scope of the binding of current-counter-value entirely to the scope of the next-counter-value function. This eliminates the chance that 'somebody else' will break encapsulation and modify the counter value in a harmful fashion. Unfortunately, 'somebody else' also includes the REPL itself. The 'better' design imposes the cost that it's no longer as easy to inspect or modify the current-counter-value from the REPL. (Doing so requires the ability to inspect or name the local bindings captured in the next-counter-value closure.)

The tight coupling between interned symbols and global variable bindings should not come as a surprise, because interning a symbol necessarily makes the symbol itself global. In a Lisp that interns symbols, the following code fragment creates two distinct local variable bindings, despite the fact that the bindings are named by the same, eq? symbol: local-variable.

(let ((local-variable 0))
   (let ((local-variable 0))
      local-variable))

The mismatch between globally interned symbols and local bindings implies that symbols cannot be as directly involved in talking about local bindings. A Common Lisp type declaration, for example, is an S-expression that says something about the variable named by a symbol:

(declare (fixnum el))

In contrast, a Clojure type declaration is a reader expression that attaches metadata to the symbol itself:

^String x

The ^ syntax in Clojure gathers up metadata and then applies it using withMeta to the next expression in the input stream. In the case of a type declaration, the metadata gets applied to the symbol naming the binding. This can be done in one of two ways. The first is to destructively update metadata attached to an interned symbol. If Clojure had done this, then each occurrence of symbol metadata would overwrite whatever metadata was there before, and that one copy of the metadata would apply to every occurrence of the symbol in the source text. Every variable with the same name would have to have the same type declarations.

Clojure took the other approach, and avoids the problem by not interning symbols. This allows metadata to be bound to a symbol locally. In the Clojure equivalent of the local-variable example, there are two local variables, and they are named by two distinct symbols that happen to share the same name.

(let [local-variable 0]
   (let [local-variable 0]
      local-variable))
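
At a Clojure REPL, the non-identity of equally named symbols is easy to see (a quick sketch):

user=> (identical? 'local-variable 'local-variable)
false
user=> (= 'local-variable 'local-variable)
true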

The flexibility of this approach is useful, but it comes at the cost of losing the ability to store values in the symbols themselves. Global symbol property lists in Lisp have to be modeled using some other means. Global variable bindings are not stored in symbols, but rather in Vars. (This is something that compiled Lisps tend to do anyway.) These changes make symbols in Clojure slightly 'smaller' than in Lisp, and better aligned with how they are used in modern, lexically scoped Lisps. Because global variable bindings are still available in the language, the naming benefits of globals remain available for use in the REPL. It's taken a while for me to get there, but the overall effect of un-interned symbols on the design of a Lisp seems generally positive.
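
As a quick REPL sketch of that separation: the global binding lives in a Var, a container object distinct from the symbol that names it.

user=> (def counter 0)
#'user/counter
user=> (var counter)
#'user/counter
user=> @(var counter)
0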

April 16, 2009

I recently was asked why I like Lisp. For me, it boils down to the fact that Lisp makes it easy to control when your code is evaluated. Most languages only let you evaluate code at runtime. There are ways around this (C++ templates, cpp, code generation, etc.), but they all have severe limitations. In contrast, Lisp makes it easy to run actual Lisp code at compile time (macros) or even read time (reader macros). Combine that with the fact that Lisp code is pretty easy to manipulate with Lisp itself, and it becomes much easier to do things that are usually restricted to language developers, which can be quite a force multiplier. Just to illustrate, most of the language features that C# has over Java (LINQ, properties, closures, lambda syntax, yield return, etc.) could be added to Java by 'ordinary developers' if Java somehow had these things I like so much about Lisp.

But it doesn't...

April 10, 2008

A few months ago, I ran into a problem with a macro that seriously changed my opinions on how they should be used. It all comes down to the fact that macro expansions are incorporated into compiler output. Two pieces of code that look nicely decoupled in the source text can end up tightly entwined with each other once they are compiled.

To illustrate, I'll use the macro in question, something I once used to support a sort of simulated 'multiple return value' in a dialect of Scheme. This is a low-level example from my hobby work, but it applies equally well to other uses of macros.

(defmacro (values-bind form vars . body)
  (with-gensyms (form-rv-sym)
    `(let ((,form-rv-sym ,form))
       (list-let ,vars (if (%values-tuple? ,form-rv-sym)
                           (slot-ref ,form-rv-sym 'v)
                           (list ,form-rv-sym))
         ,@body))))

This macro expands code like this:

(values-bind (returns-2-args 'foo) (arg-1 arg-2)
   (+ arg-1 arg-2))

Into code that looks like this:

(let ((#:form-rv-sym-69@00beeec4 (returns-2-args 'foo)))
   (list-let (arg-1 arg-2) (if (%values-tuple? #:form-rv-sym-69@00beeec4)
                              (slot-ref #:form-rv-sym-69@00beeec4 'v)
                              (list #:form-rv-sym-69@00beeec4))
     (+ arg-1 arg-2)))

And then, the compiler compiles that form and drops the result into the output file, which now contains several pretty deep assumptions about the simulated multiple value protocol it needs to honor:

  • Values are returned in a single value that satisfies %values-tuple?.
  • Values are extracted from a tuple with a call to slot-ref for slot v.
  • Values are stored within the slot as a list.

While the source text that uses values-bind doesn't need to know any of these details, the compiler output does. This results in compiler output that is very closely tied to the value protocol: output that is likely to be incompatible with any changes to that protocol.

In many development scenarios, this doesn't matter. Within a single project, if compiled file A comes to depend on assumptions embedded in macros from file B, it's less of an issue: both files are usually compiled at the same time. If both files can't be simultaneously compiled, things start to go wrong. I ran into this issue myself when trying to change the multiple value protocol I was using in my compiler. My core library was built with the old protocol, my new library was to be built with the new protocol, and the two could not interoperate for the brief period of time necessary to produce a compiled version of the new library. There are several possible approaches to solving this, but the one I took was a two-step process: build a new 'old' library that can handle both protocols, use it to compile a version that works only with the new protocol, and then switch over completely. It was a mess, and a mess I created myself with a macro that expanded into something that assumed way too much. The better approach, the approach that I switched to, is this:

(define (call-with-values proc vals)
  (apply proc (%values->list vals)))

(defmacro (values-bind form vars . body)
  `(call-with-values (lambda ,vars ,@body) ,form))

This expands the above code to something more palatable:

(call-with-values (lambda (arg-1 arg-2)
                     (+ arg-1 arg-2))
                  (returns-2-args 'foo))

The only assumption this makes in the compiled output is that there's a function call-with-values that calls its first argument with values passed in as its second argument. All of the gory details, which could easily be the same three from my list, are hidden behind function calls and dynamic linkage. This is actually the representation that made the two-step cutover approach plausible. Switching to this version of the values-bind macro removed assumptions about the value protocol from every call site, and made it easy to switch.

The upshot of this is something that's, I'm sure, pretty common knowledge in Lisp/Scheme circles: macros are best when limited to syntax, with the underlying functionality implemented in a more functional interface. The functional interface keeps things more decoupled, even when compiled, and leaves your software more manageable. It also provides a second way to 'get at' the functionality provided by the underlying code. With the function/macro split, the macro expansion can be avoided entirely in the case where you already have a closure that contains the code you need to run.
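
For instance, when the code to run already exists as a function or closure, the functional interface can be called directly and the values-bind macro skipped entirely:

(define (sum-two arg-1 arg-2)
  (+ arg-1 arg-2))

(call-with-values sum-two (returns-2-args 'foo))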

One more brief example, a bit higher up the 'stack' in the language environment, is the transformation of this macro:

(defmacro (with-output-to-string . code)
  (with-gensyms (saved-output-port-sym output-string-sym)
    `(let ((,saved-output-port-sym (current-output-port))
           (,output-string-sym (open-output-string)))
       (unwind-protect (lambda ()
                         (set-current-output-port ,output-string-sym)
                         ,@code
                         (get-output-string ,output-string-sym))
                       (lambda ()
                         (set-current-output-port ,saved-output-port-sym))))))

Into this macro/function pair:

(define (call-with-output-to-string fn)
  (let ((saved-output-port (current-output-port))
        (output-string (open-output-string)))
    (unwind-protect (lambda ()
                      (set-current-output-port output-string)
                      (fn)
                      (get-output-string output-string))
                    (lambda ()
                      (set-current-output-port saved-output-port)))))

(defmacro (with-output-to-string . code)
  `(call-with-output-to-string (lambda () ,@code)))
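
Either way, usage looks the same (a small sketch, assuming display writes to the current output port):

(with-output-to-string
  (display "Hello, ")
  (display "world"))
; => "Hello, world"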

February 14, 2008

A couple weeks ago, I got into a brief reddit discussion on the relative merits of Lisp's car and cdr functions. Given a Lisp list, applying car to the list returns the first element of the list and applying cdr to the list returns a list of every element excluding the first. For someone new to Lisp (as we all were once), these names can be a bit awkward. However, like many other aspects of the language, there is more to car and cdr than meets the eye.

The first implementation of Lisp was done by Steve Russell on an IBM 704. The 704 was a 36-bit vacuum tube machine that IBM started selling in 1954. By the time it was discontinued in 1960, they had sold a total of 123 of the machines, each capable of a whopping 40,000 calculations per second. Russell's original 1959 implementation of Lisp on this machine took advantage of the fact that the 704's instruction set had special capabilities for accessing two distinct 15-bit fields of a 36-bit value loaded into a machine register: the address and decrement fields. In Russell's own words: "Because of an unfortunate temporary lapse of inspiration, we couldn't think of any other names for the 2 pointers in a list node than 'address' and 'decrement', so we called the functions CAR for 'Contents of Address of Register' and CDR for 'Contents of Decrement of Register'." Interestingly enough, he continues with this: "After several months and giving a few classes in LISP, we realized that 'first' and 'rest' were better names, and we (John McCarthy, I and some of the rest of the AI Project) tried to get people to use them instead. ... Alas, it was too late! We couldn't make it stick at all. So we have CAR and CDR." So there you have it: car and cdr, two of the most famous and widely used functions of Lisp and its descendants, owe their names to a bizarre quirk of a computer architecture that's been obsolete for close to fifty years. For the record, here's a source listing for the original 704 implementation of car, taken from MIT AI Lab Memo 6:

LXD JLOC, 4
CLA 0,4
PAX 0,4
PXD 0,4

But the story of car and cdr doesn't stop there. In the 1960's and early 70's, the University of Texas at Austin ran a computer called a CDC 6600. The 6600 was one of Seymour Cray's first big supercomputer designs, and was the fastest computer in the world for a time. It had a 60-bit machine word and an 18-bit address space, so you can probably see where this is going. The designers of UT's Lisp for the CDC 6600 added a third field to cons cells, giving them each three pointers: the car, cdr, and csr. I'm sure the third pointer was useful for implementing things like trees of nodes and lists of key/value pairs, although apparently not useful enough to stick around. Two-pointer cons cells are a better fit for modern hardware, and two two-pointer cons cells can represent everything that a single three-pointer cons cell can represent.

Back at MIT in the 70's, and before things like defstruct, Maclisp took the idea of multi-pointer cons cells to what must be its logical extreme: hunks. A Maclisp hunk was a structure like a cons cell that could hold an arbitrary number of pointers, up to a total of 512. Each of these slots in a hunk was referred to as a numbered cxr, with a numbering scheme that went like this: (cxr-1 cxr-2 cxr-3 ... cxr-n cxr-0). No matter how many slots were in the hunk, car was equivalent to (cxr 1 hunk) and cdr was equivalent to (cxr 0 hunk). This is a nice generalization of the basic idea of a cons cell, but modern Lisps offer other ways to structure data that are both possibly more useful and more readable: structures for a fixed collection of named slots, hash tables for a variable collection of named slots, and vectors for a collection of numbered slots.

After these historical blind alleys, it's interesting to think about why car and cdr still persist fifty years after McCarthy and Russell roamed the halls of MIT evangelizing first and rest. Common Lisp does at least have first and rest as part of the standard. However, when I was taught Lisp in the mid-90's, I was encouraged to primarily favor the older car and cdr. I remember two primary reasons for this. The first was that existing code favored car and cdr, so it was important to be able to read code written in that style. The second reason was that first and rest impose a particular meaning on the fields of a cons cell that may or may not be appropriate. In the common case of a linear list, first and rest work rather well: if you call first on the list, you get the first element; if you call rest, you get the rest. In the case of a cons cell as a node of an association list, they work less well, unless, that is, you can figure out a reason why first makes sense as 'key' and rest makes sense as 'value'.

Some of this confusion stems from the fact that most Lisps, despite the name List Processing, don't have an official list data type. What they have instead is a two-element cons cell and a set of conventions in the library, reader, and writer for using them to make linear lists. In a sense, this is a lot like strings in C. C doesn't have a string type; what it has instead is a pointer to character data (char *) and a set of library conventions for using blocks of memory as strings of characters. This laxness on the part of both languages comes with the advantages and disadvantages you'd expect from letting the details of an underlying implementation leak through. In the case of C 'strings', the representation lends itself both to things like Rob Pike's beautiful regex implementation in The Practice of Programming and a seemingly never-ending series of buffer overrun attacks. In the case of Lisp cons cells, it provides both an incredibly flexible data structure, and confusion over such basic notions as the 'first', 'second', and 'rest' of a list.

If something as baroque as 'car' actually makes more sense than 'first' because 'first' doesn't match up well to the underlying abstraction, it might then make sense to reconsider the underlying implementation. Some modern Lisps like Clojure do just that: a Clojure 'list' isn't a chain of cons cells, but rather an instance of a JVM object that implements the interface ISeq:

public interface ISeq extends IPersistentCollection{
   Object first();
   ISeq rest();
   ISeq cons(Object o);
}

Clojure's user-visible first and rest functions ultimately call into their like-named methods in ISeq:

static public Object first(Object x){
   ISeq seq = seq(x);
   if(seq == null)
      return null;
   return seq.first();
}

// ...

static public ISeq rest(Object x){
   ISeq seq = seq(x);
   if(seq == null)
      return null;
   return seq.rest();
}

A noteworthy difference between this and a 'conventional' Lisp is the return type of rest: it's another ISeq, rather than an Object. Because of this, the 'CDR' of a Clojure cons cell has a new constraint: it is constrained to be another sequence, greatly increasing the likelihood that 'rest' really is 'the rest'. While this could be done even if rest returned an Object, constraining rest to be a sequence eliminates a number of edge cases in the language that arise when you allow the rest of a list to be something other than a list itself. This altered representation also fits in nicely with Clojure's host JVM: there's nothing that says ISeq has to be implemented by a two-pointer cons cell. Indeed, Clojure "also implements first and rest for vectors, strings, arrays, maps, Java Iterables, lazily calculated and infinite sequences etc." All these implementations of ISeq make it easier for Clojure sequences to interoperate with Java, and it makes it easier to build a sequence library that works on all kinds of sequences. What's lost with this choice is the ability to use a cons cell as an informal two-element structure. Even then, this style of first and rest could co-exist with implementations of car and cdr that work the 'old way'.
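
A few REPL examples of those other implementations (a quick sketch):

user=> (first "abc")
\a
user=> (rest [1 2 3])
(2 3)
user=> (first {:a 1})
[:a 1]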

So in the end, maybe car and cdr are all right. They aren't the best names in the world, but they fit the semantics of an unrestricted two-pointer cons cell nicely. For those cases where you really are finding the first and rest of a list of cons cells, it is easy to use the first and rest functions in lieu of car and cdr. Then, for dialects like Clojure, your code is automatically portable to other sequence types. If you're using cons cells as an ad hoc structure, then you can either use car and cdr and accept those names as historical baggage of the second oldest major programming language, or investigate some of the other more modern ways of structuring data in Lisp programs.

January 8, 2008

Lately, I've been thinking a bit about the way language design influences library design. My line of thought started out inspired by some of the recent conversations about closures in Java, but it ended up also touching on dynamic typing and a few other 'modern' language features. This will end up being more than one post, but I thought I'd record some of it in my blog, with the hope that it might shed some light for somebody, somewhere.

To motivate this discussion, I'll use as an example a simple C implementation of a string-interning function, intern_string. If you're not familiar with the concept of interning, the premise is that interning two objects ensures that if they have the same value, they also have the same identity. In the case of C strings, interning ensures that if strcmp(intern_string(a), intern_string(b)) == 0 holds true, then intern_string(a) == intern_string(b) also holds true. Since it effectively means that each string value is only stored one time, this technique can reduce memory requirements. It also gives you a cheap string equality comparison: checking two interned strings for equality reduces to a pointer comparison, which is about as fast as it gets.

Given a hash table that compares keys by value, implementing the function intern_string is fairly simple. In the following code, intern_table is a hash table that maps strings to themselves. hash_ref, hash_set, and hash_has are functions that manipulate the hash table:

  • int hash_has(hash_table_t ht, char *key) - Returns TRUE or FALSE, depending on whether or not the key key is found in the hash table ht.
  • char *hash_ref(hash_table_t ht, char *key) - Returns the value bound to the key key by the hash table ht. If the key is not found, behavior is undefined.
  • char *hash_set(hash_table_t ht, char *key, char *value) - Binds the value value to the key key in the hash table ht. If the key is already present, the existing value is overwritten. This function returns value.

Note the critical assumption that the hash_* accessors implement key comparison with value semantics (strcmp), rather than identity semantics (==).

hash_table_t intern_table; // assume this is initialized somewhere else.

char *intern_string(char *str) 
{
  if (hash_has(intern_table, str))
     return hash_ref(intern_table, str);

  char *interned_str = strdup(str);

  hash_set(intern_table, interned_str, interned_str);

  return interned_str;
}

The first step of intern_string is to check to see if the intern table already contains a string with the value of the new string. If the new string is already in the intern table, then the function returns the copy that's in the table. Otherwise, a new copy of the incoming string is created and stored in the hash table. In either case, the string returned is in the intern table. This logic ensures that every time intern_string is called with a str of the same value, it returns the same exact string.

If you haven't guessed already, the problem with this implementation of intern_string lies in the dual calls to hash_has and hash_ref. Both calls involve searching the hash table for the key: hash_has to determine if the key exists, and hash_ref to retrieve the key's value. This means that in the common case, interning a string that's already been interned, this implementation searches the hash table twice. Double work.

Fixing this problem involves changing the calling conventions for hash_ref. One of the simplest ways to do this is to define a specific return value that hash_ref can return in the 'key not found' case. For strings in C, NULL is a logical choice. This change to hash_ref makes it possible to remove the double search by eliminating the explicit hash_has check:

hash_table_t intern_table;

char *intern_string(char *str) 
{
  char *interned_str = hash_ref(intern_table, str);

  if (interned_str == NULL) 
  {   
     interned_str = strdup(str);

     hash_set(intern_table, interned_str, interned_str);
  }

  return interned_str;
}

For string interning, this change to the hash_ref interface works fairly well. We know that we'll never store a hash key with a NULL value, so we know that NULL is safe to use for signaling that a key was not found. Were this ever to change, this version of hash_ref doesn't return enough information to distinguish between the 'key not found' case and the 'NULL value' case. Both would return NULL. To fix this, hash_ref needs to be extended to also return a separate value that indicates if the key was found. This can be done in C by having hash_ref return the 'key found' flag as a return value, and also accept a pointer to a buffer that will receive the key's value, if it's found:

hash_table_t intern_table;

char *intern_string(char *str) 
{
  char *interned_str;  

  if (!hash_ref(intern_table, str, &interned_str))
  {   
     interned_str = strdup(str);

     hash_set(intern_table, interned_str, interned_str);
  }

  return interned_str;
}

This is probably about as good as you can get in straight C. It easily distinguishes between the 'no-value' and 'no-key' cases, it's relatively clear to read, and it uses the common idioms of the language. However, C is a relatively sparse language. If you're willing to switch to something a bit more expressive, you have other choices.

One example of this is a choice that's particularly well supported by dynamically typed languages. With a language that can identify types at runtime, it becomes possible for hash_ref to return values of a different type if the key is not found. This value can be distinguished from other return values by virtue of the run time type identification supported by the language. In one such language, Scheme, this lets us implement intern-string like this:

(define *intern-table* (make-hash :equal))

(define (intern-string str)
  (let ((interned-str (hash-ref *intern-table* str 42)))
    (cond ((eqv? interned-str 42)
           (hash-set! *intern-table* str str)
           str)
          (#t
           interned-str))))

If you prefer C/JavaScript-style syntax, it looks like this:

var intern_table = make_hash(EQUAL);

function intern_string(str)
{
   var interned_str = hash_ref(intern_table, str, 42);

   if (interned_str == 42)
   {
       hash_set(intern_table, str, str);
       return str;
   }

   return interned_str;
}

In this case, hash_ref has been extended with a third argument: a default return value used if the key is not found. The above code uses this to have hash_ref return a number in the 'no value' case, and it's the type of this return value itself that ensures its distinctness. This is a common dynamic language idiom, but for a moment, consider what it would look like in C:

hash_table_t intern_table;

char *intern_string(char *str) 
{
  char *interned_str = hash_ref(intern_table, str, (char *)42);

  if (interned_str == (char *)42) 
  {   
     interned_str = strdup(str);

     hash_set(intern_table, interned_str, interned_str);
  }

  return interned_str;
}

At first, this actually seems like it might be a plausible implementation of intern_string. My guess is that it might even work most of the time. Where this implementation gets into trouble is the case when an interned string might reasonably be located at address 42. Because C is statically typed, when hash_ref returns, all it returns is a pointer. The caller cannot distinguish between the 'valid string at address 42' return value and the 'no-key' return value. This is basically the same problem as the case where we overloaded NULL to signal 'no-key'.

The way the dynamically typed language solved this problem is worth considering. When a dynamically typed language passes a value, what it's really doing is passing a pointer along with a few extra bits describing the type of the object being pointed to. (Runtime implementations vary, but that's the gist of many.) Using dynamic typing to distinguish between those two possible cases really amounts to using those few extra type bits to carry 'another' return value, one holding information on whether or not the key was found. This is exactly what our 'best' C implementation does explicitly with a return value and a reference parameter. The dynamic typing isn't necessarily adding any expressive power, but it is giving another, concise means of expressing what we're trying to say.
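
To make that concrete, here's what those extra type bits might look like if spelled out explicitly in C (a hypothetical sketch; tagged_value_t and hash_ref_tagged are inventions for illustration, not part of the hash table API above):

typedef struct {
   enum { TYPE_STRING, TYPE_NOT_FOUND } type; /* the 'few extra bits' */
   char *ptr;                                 /* the pointer itself   */
} tagged_value_t;

/* Hypothetical variant of hash_ref that returns a tagged value. */
tagged_value_t hash_ref_tagged(hash_table_t ht, char *key);

char *intern_string(char *str)
{
  tagged_value_t found = hash_ref_tagged(intern_table, str);

  /* TYPE_NOT_FOUND is distinct from every pointer value, including
     NULL and (char *)42, so no string address is overloaded. */
  if (found.type == TYPE_NOT_FOUND)
  {
     char *interned_str = strdup(str);

     hash_set(intern_table, interned_str, interned_str);

     return interned_str;
  }

  return found.ptr;
}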

January 3, 2006

Despite outward appearances, there is (really!) another release of vCalc in the works. I'm not going to be so silly as to speak to a timeline (probably 2006Q2), but here's a brief list of planned features for the next release or two:

  • Constant Library - A library of a few hundred constants.
  • Interface improvements - The current UI is functional but rather plain, both in appearance and in the interactivity it supports. The next version of vCalc will dress up the UI a bit and start the process of making it more interactive.
  • Macro Recorder - To aid programming, there's a macro recorder that records sequences of commands as programs written in the user programming language described below.
  • New Data Types - There are more first-class data types, including complex numbers, lists, tagged numbers, and programs.
  • User Programmability - There's a user programming language including conditional branches, loops, and higher-order functions. This language looks a lot like a lexically scoped variant of RPL, the language used by HP in its more modern calculators.
  • Better interoperability with other data sources - This means import/export of CSV, through both the clipboard and by file.
  • Financial Math - This is mainly planned to be Time Value of Money. There's actually an interface in vCalc 1.0 to support this functionality, but I released vCalc before getting it to work reliably and disabled the code that implements it. This is going to be an ongoing area for development.
  • Infix notation - There needs to be a way to enter an expression like 'sin(x)'. This is both a programmability feature and the core of things like symbolic algebra and calculus.
  • Graphics - Function plotting.

In a more general sense, there are a few other issues that are important but have a slightly lower priority. These are issues too big to be 'fixed' in one release, but they are nonetheless important areas for work. The first of these is performance, and the second is openness.

Performance is the easier of the two issues to describe: I want vCalc to be usable for interactively performing simple analysis (mean, max, min, linear regression, histogram, etc.) of datasets with 100K-1,000K observations of 10-20 variables each. The worst case scenario means that vCalc needs to be able to manage an in-memory image around 500-600MB in size and compute 20-30M floating point operations within 5-10 seconds. That's a stretch for vCalc, but I think it's doable within a year or two. Right now, development copies of vCalc can reasonably manage 100K observations of 50 variables each. The biggest weakness is the CSV file importer, which is glacially slow: it reads CSV files at around 30K/second. I'll speak to these issues in more detail later on, but the fix for this will be a staged rewrite of the Lisp engine and garbage collector at the core of vCalc.

The other issue that will have to be fixed over time is the issue of openness. One of the things I'd like this blog to be is a way to communicate with the audience of vCalc users. That means example code, demonstrations of how to use vCalc to solve specific problems, and descriptions of the guts of vCalc, at the very least. For that to be useful, there needs to be an audience, and for there to be an audience the vCalc bits need to be available for people to use and good enough for them to care about using. There's a lot to be done between here and there, but I've come to believe that open sourcing parts of vCalc and releasing more frequent development builds of the closed source parts will end up being key. We'll see.

November 4, 2005

Michael Sperber has written an SRFI that documents "Octet-Addressed Binary Blocks". Basically, these things are like BLOBs in SQL: blocks of memory, opaque to the data model of the language, that can be used to store arbitrary binary data. I can think of a bunch of applications for this:

  • An internal representation for compiled byte code functions.
  • A way to interoperate with C code that expects binary data formats. (Like the Win32 API, for example.)
  • A way to represent binary data longer than a byte that's written to and read from binary ports.

September 7, 2005

I've found a couple interesting websites related to computer history. The first is Dusty Decks, a blog related to some efforts to reconstruct Lisp and FORTRAN history. A highlight of this is a discussion on the Birth of the FORTRAN subroutine. Also via Dusty Decks is a website on the early history of the Lisp Programming Language.

That leads me to a couple of books I've been reading lately. The first is Lisp in Small Pieces, by Christian Queinnec. I'm only a couple chapters in (stuck on continuations right now), but it's already been pretty profound. So far, the aspect of the book that's been the most useful is that it goes through several core design choices Lisp implementors have to make (Lisp-1 vs. Lisp-2, lexical scope vs. dynamic scope, types of continuations to support), and goes into depth regarding the implications and history of the choices involved. I think I'm finally starting to understand more of the significance of funcall and function in Common Lisp, not to mention throw/catch and block/return-from.

Book two is The First Computers–History and Architectures, edited by Raul Rojas. This book is a collection of papers discussing the architecture of significant early computers from the late 30's and 40's. The thing that's so unique about the book is that it focuses on the architectural issues surrounding these machines: the kinds of hardware they were built with, how they processed information, and how they were programmed. Just as an example, it has a detailed description of many of ENIAC's functional units, even going into descriptions of how problems were set up on the machine. Another highlight of the book for me (so far) has been a description of Konrad Zuse's relay-based Z3, down to the level of a system architectural diagram, schematics of a few key circuits, and coverage of its microprogramming (!).

Older Articles...