Mike Schaeffer's Weblog
Mon, 21 Jan 2008
Still not tested... still not working... sort of...
Another one along the lines of My last post. I tried to compile this source file today, using the compiler in my little Lisp:
(define (values . args) (%panic "roh roh"))

(define (test x) (+ x 1))
I got the following result:
d:\test>vcsh -c test.scm
;;;; VCSH, Debug Build (SCAN 0.99 - Dec 17 2007 16:47:30)

; Info: Loading Internal File: fasl-compiler
; Info: Package 'fasl-compiler' created
; Info: Loading Internal File: fasl-write
; Info: Package 'fasl-write' created
; Info: Loading Internal File: fasl-compiler-run
; Info: Package 'fasl-compiler-run' created
; Info: stack limit disabled!
Fatal Error: roh roh @ (error.cpp:168)
Needless to say, fatal errors still aren't any good. However, this one is a bit more interesting than a simple type checking problem. The function %panic is the internal function used to signal fatal errors from Lisp code. The first definition above redefines values, the function to return multiple return values, so that it always panics with a fatal error. This is the kind of thing that, if done in a running environment, would break things almost immediately.

But, the compiler is slightly different.... it isolates the program being compiled from the compiler itself. This is done to keep redefinitions that might break the currently running compiler from doing just that. Redefinitions by the compiled program are only supposed to be visible to the compiled program. Since the above program never itself invokes values, it should never hit the call to %panic... except that it does.

What's happening here lies in the processing of the second definition. The definition itself is transformed a couple times by macroexpansion, first to this:
(%define test (named-lambda test (x) (+ x 1)))
And then, basically, to this:
(%define test (%lambda ((name . test) (lambda-list x)) (x) (+ x 1)))
The second macroexpansion step is the step that looks for optional arguments, and the internal function that parses lambda lists for optional arguments returns three values using values. This invocation of values happens in the environment of the program being compiled, so it hits the new %panic-invoking definition and the whole show grinds to a halt. The 'easy' fix, ensuring that macro expansion is isolated from potentially harmful redefinitions, won't work. Macro expansion has to happen in the user environment, so that macros can see function definitions that they might rely upon.

I don't have a unit test for the user/compiler seperation logic, so I thought when I started this blog post I was going to say something like: 'look, something else fundamentally broken, and without a test case'. That's interesting, but if you need convincing to write unit tests, you're probably already lost. What I actually learned while researching this post is a bit more subtle: it's a fundamental problem, but it's more about the design than the code itself. While the design I have for user/compiler seperation seems to work most of the time, it's not adequate to solve this kind of problem. I'm not yet exactly sure what the solution is, but it won't necessarily involve a missing unit test.

reddit this! Digg Me!

[/tech/programming] permanent link

Sun, 20 Jan 2008
Not tested? Then it doesn't work.
The other day, I had the following (abbreviated) dialog with my little Scheme interpreter:

scheme> (intern! 'xyzzy2 (find-package "keyword"))
; Fatal Error: Assertation Failed: STRINGP(pname) @ (oblist.cpp:451)
c:\vcalc>vcsh.exe

scheme> (intern! 12)
; Fatal Error: Assertation Failed: STRINGP(sym_name) @ (oblist.cpp:269)
c:\vcalc>

Needless to say, 'Fatal errors' aren't good things, and fatal errors in intern!, a core function, are even worse. Without going into too many details, the first call should be returning successfully, and the second should be throwing a runtime type check error. However, the implementation of intern! wasn't checking argument types and passing invalid arguments into lower layers of the interpreter's oblist (symbol table) code, which died with an assertation failure.

To put this in perspective, my implentation of intern! is about five years old, and something that I thought to be a fairly reliable piece of code. At the very least, I didn't think it was susceptable to something as simple as a type checking error that would crash the entire interpreter. Of course, when I looked at my test suite, there wasn't a set of tests for intern!. That might have something to do with it, don't you think?

Here are the morals I'm taking from this little story:

  • Do not assume something works, unless you have a complete test suite for it. (Even then be wary, because your test suite is probably not complete.)
  • Shoot for more than 60% code coverage on your test cases.
  • Don't write your own interpreter, because there are probably hundreds of other bugs just like this one. :-)

    reddit this! Digg Me!
  • [/tech/programming] permanent link

    Thu, 17 Jan 2008
    The programmer's 'food' pyramid.
    I don't usually write posts for the sole purpose of linking to other posts, but this is an exception. This is brilliant. What it is is the USDA's Food Pyramidbut adapted to how programmers should spend their time. My one complaint is that it's way too focused on coding. My experience has been that it really pays to spend time on design work and learning to how to better interact with others, be they clients or team-mates. If you can design your way out of a rewrite, or work with your client to recast requirements to save complexity, it can be far more cost effective than even the best raw code.

    reddit this! Digg Me!

    [/tech/programming] permanent link

    Sat, 12 Jan 2008
    Cingular 2125 Followup
    Last June, I wrote a bit on my experiences with the Cingular 2125 Windows Smartphone. After more than a year, the phone has been a good choice, but there have been several suprises, for both the good and the bad.

  • This is the first phone I've used with a web browser that's usable for general web surfing. Most sites render reasonably correctly, and the display is large enough to contain a useful amount of content. It's still not perfect, the browser crashes too often and it is difficult to log into reddit, but this is a vast improvement over conventional phones.
  • I installed a 1GB SD Card that is borderline useless. This might be different if I'd been more aggressively installing software or music, but as it is, the primary benefit of having a card like this is that I can now take 40,000 pictures before I run out of space.
  • Outlook integration is still incredibly useful, but it's been harder to keep the calendar in sync than I thought. This is probably due to the fact I get lots of meeting invites that change, but it's made it difficult to rely on the phone as the 'authoritative' source for my scheduling information I hoped it would be.
  • J2ME is a non-starter on this phone. There is a JVM, but it's buried under a submenu and the applications running on it look more like 'steerage class' than 'first class' citizens of the phone. They aren't integrated with the main application launcher, and their interfaces look like something out of 1988. I really get the impression that the phone has J2ME solely for the purpose of selling into corporate clients with a requirement to run custom J2ME code.
  • Given the power of the underlying hardware and the quality of the display, I was hoping to find more games for the phone. My previous two phones both had small collections of J2ME games purchased through my service provider's web site. AT&T has finally started adding games for this phone to their site, but the selection is limited, expensive, and not that great. I did at least find a few games elsewhere that are pretty fun, Atomic Cannon and Nethack. These were both pretty easy to install. Atomic Cannon, in particular, demonstrates the graphics of the phone quite well.
  • I don't use the 'Phone as Modem' capability at all. I don't have as many places where I need to use it as I thought. That said, it does work, and would be a nice way to check mail in a pinch.

    reddit this! Digg Me!
  • [/tech/products] permanent link

    Wed, 09 Jan 2008
    Ant-up
    In my career, I've done a bit of switching back and forth between Emacs and various IDE's. One of the IDE features I've come to depend on is quick access to the compiler. Typically, IDE's make it possible to compile your project with a keystroke, and then navigate from error to error at the press of a key. It's easy to recreate this in Emacs. The following two expressions make Emacs work a lot like Visual Studio in this regard.
    (global-set-key [(shift f5)] 'compile)
    (global-set-key [f12] 'next-error)
    
    After these forms are evaluated, pressing Shift-F5 invokes the compile command, which asks for a command to be run in an inferior shell, typically make, ant, or some other build utility. The catch is that it runs the command in the directory of the current buffer, which implies that the build script can be found in the same directory as the current source file. For a Java project with a per-package directory hierarchy, this is often not true. There are probably a bunch of ways to fix this, but I've solved it with a Windows NT batch file, ant-up.bat, that repeatedly searches up the directory hierarchy for build.xml. I just compile with ant-up, rather than a direct invocation of ant. This is not the most elegant solution, I'm sure, but it took about five minutes to implement and works well.
    @echo off
    
    setlocal
    
    
    :retry
    
    set last_path=%CD%
    
    echo Searching %CD% ...
    
    if exist build.xml goto compile-now
    
    cd ..
    
    if "%last_path%"=="%CD%" goto abort
    
    goto retry
    
    :compile-now
    
    call ant -k %1 %2 %3 %4 %5
    
    if errorlevel 1 goto failure
    
    goto success
    
    :abort
    
    echo build.xml not found... compile failed
    
    :failure
    
    exit /b 1
    
    :success
    
    exit /b 0
    


    reddit this! Digg Me!

    [/tech/programming] permanent link

    Tue, 08 Jan 2008
    Function Call Interfaces and Dynamic Typing
    Lately, I've been thinking a bit about the way language design influences library design. My line of thought started out inspired by some of the recent conversations about closures in Java, but it ended up also touching on dynamic typing and a few other 'modern' language features. This will end up being more than one post, but I thought I'd record some of it in my blog, with the hope that it might shed some light for somebody, somewhere.

    To motivate this discussion, I'll use as an example a simple C implementation of a string-interning function, intern_string. If you're not familiar with the concept of interning, the premise is that interning two objects ensures that if they have the same value, they also have the same identity. In the case of C strings, interning ensures that if strcmp(intern_string(a), intern_string(b)) == 0 holds true, then intern_string(a) == intern_string(b) also holds true. Since it effectively means that each string value is only stored one time, this technique can reduce memory requirements. It also gives you a cheap string equality comparison: checking two interned strings for equality reduces to a pointer comparison, which is about as fast as it gets.

    Given a hash table that compares keys by value, implementing the function string_intern is fairly simple. In the following code code, intern_table is a hash table that maps strings to themselves. hash_ref, hash_set, and hash_has are functions that manipulate the hash table:

  • int hash_has(hash_table_t ht, char *key) - Returns TRUE or FALSE, depending on whether or not the key key is found in the hash table ht.
  • char *hash_ref(hash_table_t ht, char *key) - Returns the value bound to the key key by the hash table ht. If the key is not found, behavior is undefined.
  • char *hash_set(hash_table_t ht, char *key, char *value) - Binds the value value to the key key in the hash table ht. If the key is already present, the existing value is overwritten. This function returns value.
    Note the critical assumption that the hash_* accessors implement key comparison by value sementics, strcmp, rather than identity semantics, ==.
       hash_table_t intern_table; // assume this is initialized somewhere else.
    
       char *intern_string(char *str) 
       {
         if (hash_has(intern_table, str))
            return hash_ref(intern_table, str);
        
         char *interned_str = strdup(str);
      
         hash_set(intern_table, interned_str, interned_str);
    
         return interned_str;
       }
    
    The first step of intern_string is to check to see if the intern table already contains a string with the value of the new string. If the new string is already in the intern table, then the function returns the copy that's in the table. Otherwise, a new copy of the incoming string is created and stored in the hash table. In either case, the string returned is in the the intern table. This logic ensures that every time intern_string is called with a str of the same value, it returns the same exact string.

    If you haven't guessed already, the problem with this implementation of intern_string lies in the dual calls to hash_has and hash_ref. Both calls involve searching the hash table for the key: hash_has to determine if the key exists, and hash_ref to retrieve the key's value. This means that in the common case, interning a string that's already been interned, this implementaion searches the hash table twice. Double work.

    Fixing this problem involves changing the calling conventions for hash_ref. One of the simplest ways to do this involves defining a specific return value that hash_ref can return in the 'key not found' case. For strings in C, NULL is a logical choice. This change to hash_ref makes it possible to remove the double search by eliminating the explicit hash_has check:
       hash_table_t intern_table;
    
       char *intern_string(char *str) 
       {
         char *interned_str = hash_ref(intern_table, str);
    
         if (interned_str == NULL) 
         {   
            interned_str = strdup(str);
      
            hash_set(intern_table, interned_str, interned_str);
         }
    
         return interned_str;
       }
    
    For this string interning, this change to hash_ref interface works fairly well. We know that we'll never store a hash key with a NULL value, so we know that NULL is safe to use for signaling that a key was not found. Were this ever to change, this version of hash_ref doesn't return enough information to distinguish between the 'key not found' case and the 'NULL value' case. Both would return NULL. To fix this, hash_ref needs to be extended to also return a seperate value that indicates if the key was found. This can be done in C by having hash_ref return the 'key found' flag as a return value, and also accept a pointer to a buffer that will contain the key's value, if it's found:
       hash_table_t intern_table;
    
       char *intern_string(char *str) 
       {
         char *interned_str;  
    
         if (!hash_ref(intern_table, str, &interned_str))
         {   
            interned_str = strdup(str);
      
            hash_set(intern_table, interned_str, interned_str);
         }
    
         return interned_str;
       }
    
    This is probably about as good as you can get in straight C. It easily distinguishes between the 'no-value' and 'no-key' cases, it's relatively clear to read, and it uses the common idioms of the language. However, C is a relatively sparse language. If you're willing to switch to something a bit more expressive, you have other choices.

    One example of this is a choice that's particularly well supported by dynamically typed languages. With a language that can identify types at runtime, it becomes possible for hash_ref to return values of a different type if the key is not found. This value can be distinguished from other return values by virtue of the run time type identification supported by the language. In one such language, Scheme, this lets us implement intern-string like this:
     
       (define *intern-table* (make-hash :equal))
    
       (define (intern-string str)
        (let ((interned-str (hash-ref *intern-table* str 42)))
         (cond ((= interned-str 42)
                 (hash-set! *intern-table* str str)
                  str)
               (#t
                 interned-str)))))
    
    If you prefer C/JavaScript-style syntax, it looks like this:
     
       var intern_table = make_hash(EQUAL);
    
       function intern_string(str)
    
       {
          var interned_str = hash_ref(intern_table, str, 42);
    
          if (interned_str == 42)
          {
              hash_set(intern_table, str, str);
              return str;
          }
    
          return interned_str;
       }
    
    In this case, hash_ref has been extended with a third argument: a default return value if the key is not found. The above code uses this to have hash_ref return a number in 'no value' case, and it's the type itself of this return value that ensures its distinctness. This is a common dynamic language idiom, but for a moment, consider what it would look like in C:
       hash_table_t intern_table;
    
       char *intern_string(char *str) 
       {
         char *interned_str = hash_ref(intern_table, str, (char *)42);
    
         if (interned_str == (char *)42) 
         {   
            interned_str = strdup(str);
      
            hash_set(intern_table, interned_str, interned_str);
         }
    
         return interned_str;
       }
    
    At first, this actually seems like it might a plausible implementation of intern_string. My guess is that it might even work most of the time. Where this implementation gets into trouble is the case when an interned string might reasonably be located at address 42. Because C is statically typed, When hash_ref returns, all it returns is a pointer. The caller cannot distinguish between the 'valid string at address 42' return value and the 'no-key' return value. This is basically the same problem as the case where we overloaded NULL to signal 'no-key'.

    The way the dynamically typed language solved this problem is worth considering. When a dynamically typed language passes a value, what it's really doing is returning a pointer along with a few extra bits describing the type of the object being pointed to. (Runtime implementations might vary, but that's the gist of many.) Using dynamic typing to distinguish between those two possible cases really amounts to using those few extra type bits to contain 'another' return value, one holding information on whether or not the key was found. This is exactly what our 'best' C implementation does explicitly with a return value and a reference value. The dynamic typing isn't necessarily adding any expressive power, but it is giving another, concise means of expressing what we're trying to say.

    reddit this! Digg Me!
  • [/tech/programming] permanent link

    Long Time, No See.
    It seems that it's been a while since I've posted: about four months. That's longer then I meant, but isn't that always the case?

    In the four months since I've not been posting, Ryan has crawled, learn to walk, learned to talk a little, and learned to respond to simple questions. Personally speaking, I've worked a bit on vCalc, not to mention the more important bill-paying work of my full time day job. Personally, I think Ryan is making me look like a slacker, but I suppose that's a matter of judgement. :-)

    Anyway, I hope your holiday season was all you wanted it to be, and Happy New Year. I have a few ideas for new posts, so with some luck, the next gap won't be four months long.

    reddit this! Digg Me!

    [/personal] permanent link