Mike Schaeffer's Blog

Articles with tag: programming
February 12, 2009

In the recent debate surrounding the SOLID Principles of Object Oriented Design, the following two quotes stood out.

Last week I was listening to a podcast on Hanselminutes, with Robert Martin talking about the SOLID principles. ... And, when I was listening to them, they all sounded to me like extremely bureaucratic programming that came from the mind of somebody that has not written a lot of code, frankly.

Joel Spolsky

"Reading The Ferengi Programmer by Jeff Atwood really made me quite concerned. Here.s clearly an opinion which to me seems not grounded in sustained experience..."

Dhananjay Nene

Both of these are speculative slights on someone else's experience level, either generally or with a particular bit of technology. Bad rhetorical technique aside, my guess is that these are rooted in a fundamental lack of trust that the other side might actually have a well thought out reason for their point of view. This is an easy trap to fall into, particularly in a field as subjective as software design. Take the 'editor wars' as an example: which is better, Emacs, vi, or a full-featured IDE? I don't know the answer to this question, but I do know that I can find people that will tell me I'm wrong for preferring Emacs. Change the debate to something a bit more relevant, something like the design of a large piece of software, and people get even more vitriolic.

At least part of the solution to this problem is plain, old trust. Think about a good developer that's moved into a lead role: it's easy to see how they might care enough about a particular design point to impose that on their team, either by implementing it themselves or by dictate. Where the trust comes in is in avoiding that trap. If I impose a choice on my team, I limit their ability to explore the design space themselves, take their own risks, and then potentially fail. I also limit their ability to correct my own misconceptions... if I think I'm right enough to mandate a design, I also probably think I'm right enough to ignore you and do my own thing anyway. Ironically enough, this makes the combination of conviction and risk aversion its own risk, and potentially a big one without a counterbalance. (On a personal level, if you go around imposing your will and/or ignoring points of view, you also lose the opportunity to learn from those around you.)

And this is where the bit about rhetorical technique comes into play. As satisfying as it can be to say that somebody you disagree with "...has not written a lot of code, frankly.", it's really beside the point. It doesn't matter, even were it true. What matters more to reasonable discussions about engineering technique are specific and testable statements: something like "Interface Segregation will help keep defect rates down by promoting better unit tests." You may or may not agree with this statement, but it's more likely to lead to a relevant conversation than slights on experience or dogmatic declarations of opinion as fact. Several years ago I was told in no uncertain terms that I had made a design choice that 'wasn't scalable'. I ran some tests and came back with some numbers that showed my choice satisfied our requirements. Who do you think won that debate, the buzzword or the numbers? Specifics and testability can count for a lot. Dogma, not so much.

To be fair, most of the two blog posts I mention above are focused on meatier material than my quotes imply. I particularly liked Atwood's conclusion that "Rules, guidelines, and principles are gems of distilled experience that should be studied and respected. But they're never a substute [sic] for thinking critically about your work." For experienced developers, I expect that Nene would also agree. After all, he writes that "you have the experience on your side to generally make the right judgement calls and you are likely to anyway apply them under most of the cases." In a sense, both are arguing the same thing, namely that judgement ultimately drives the design process over strict adherence to a set of rules. The difference is that Nene takes it a few steps further and draws the conclusion that good developers produce good code, good code is SOLID, and Atwood's blog post is either useless or harmful. Maybe there are some valid points here, but they're obscured by a dogmatism that is more of a distraction than a productive way to think about software design.

February 22, 2008

The instructions I gave earlier on Renaming SVN Users work only when the SVN repository is hosted on a machine that can run SVN hooks written in Unix style shell script. On a conventional Windows machine, one without Cygwin, MSYS, or similar, you have to switch to writing hooks in something like Windows batch language.

If all you want to do is temporarily rename users, then you can just create an empty file named pre-revprop-change.cmd in your repository under hooks\. The default return code from a batch file is success, which SVN interprets as approval of all revision property changes, all the time, by anybody. If you want to implement an actual policy, Philibert Perusse has posted a template script online.

February 11, 2008

I've been keeping track of the vCalc source code in an SVN repository since May of 2005. While I'm the only person who has ever committed code into the repository, I've developed vCalc on three or four machines, with different usernames on each machine. Since SVN records usernames with each commit, these historical usernames show up in each svn log or svn blame. svn blame is particularly bad because it displays a code listing with the username prepended to each line in a fixed-width gutter. Since some usernames are longer than others, the longest ones can exceed the width of the gutter and push the code over to the right. Fortunately, changing historical usernames isn't that hard, if you have administrator rights on your SVN repository.

SVN stores the name of a revision's committer in a revision property named svn:author. If you're not familiar with the term, a revision property is a blob of out-of-band data that SVN attaches to the revision. In addition to the author of a commit, they're also used to store the commit log message, and, via SVN's propset and propget commands, user-provided custom metadata for the revision. Changing the name of a user associated with a commit basically amounts to using propset to update the svn:author property for a revision. The command to do this is structured like so:

svn propset svn:author --revprop -rrev-number new-username repository-URL

If this works, you are all set, but what is more likely to happen is the following error:

svn: Repository has not been enabled to accept revision propchanges; ask the administrator to create a pre-revprop-change hook

By default, revision property changes are disabled. This makes sense if you are at all interested in using your source code control system to satisfy an auditing requirement. Changing the author of a commit would be a great way for a developer to cover their tracks, if they were interested in doing something underhanded. Also, unlike most other aspects of a project managed in SVN, revision properties have no change tracking: They are the change tracking mechanism for everything else. Because of the security risks, enabling changes to revision properties requires establishment of a guard hook: an external procedure that is consulted whenever someone requests that a revision property be changed. Any policy decisions about who can change what revision property when are implemented in the hook procedure.

Hooks in SVN are stored in the hooks/ directory under the repository toplevel. Conveniently, SVN provides a sample implementation of the hook we need in the shell script pre-revprop-change.tmpl, but the sample implementation also has strict defaults about what can be changed, allowing only the log message to be set:

if [ "$ACTION" = "M" -a "$PROPNAME" = "svn:log" ]; then exit 0; fi

echo "Changing revision properties other than svn:log is prohibited" > &2
exit 1

The sample script can be enabled by renaming it to pre-revprop-change. It can be made considerably more lax by adding an exit 0 before the lines I list above. At this point, the property update command should work, although if you're at all interested in the security of your repository, it is best to restore whatever revision property policy was in place as soon as possible.
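Putting it all together, a typical rename command looks something like this (the revision number, username, and repository URL here are made up for illustration):

svn propset svn:author --revprop -r1234 mschaef https://svn.example.com/repos/vcalc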

January 21, 2008

Another one along the lines of my last post. I tried to compile this source file today, using the compiler in my little Lisp:

(define (values . args) (%panic "roh roh"))

(define (test x) (+ x 1))

I got the following result:

d:\test>vcsh -c test.scm
;;;; VCSH, Debug Build (SCAN 0.99 - Dec 17 2007 16:47:30)

; Info: Loading Internal File: fasl-compiler
; Info: Package 'fasl-compiler' created
; Info: Loading Internal File: fasl-write
; Info: Package 'fasl-write' created
; Info: Loading Internal File: fasl-compiler-run
; Info: Package 'fasl-compiler-run' created
; Info: stack limit disabled!
Fatal Error: roh roh @ (error.cpp:168)

Needless to say, fatal errors still aren't any good. However, this one is a bit more interesting than a simple type checking problem. The function %panic is the internal function used to signal fatal errors from Lisp code. The first definition above redefines values, the function to return multiple return values, so that it always panics with a fatal error. This is the kind of thing that, if done in a running environment, would break things almost immediately.

But, the compiler is slightly different.... it isolates the program being compiled from the compiler itself. This is done to keep redefinitions that might break the currently running compiler from doing just that. Redefinitions by the compiled program are only supposed to be visible to the compiled program. Since the above program never itself invokes values, it should never hit the call to %panic... except that it does.

What's happening here lies in the processing of the second definition. The definition itself is transformed a couple times by macroexpansion, first to this:

(%define test (named-lambda test (x) (+ x 1)))

And then, basically, to this:

(%define test (%lambda ((name . test) (lambda-list x)) (x) (+ x 1)))

The second macroexpansion step is the step that looks for optional arguments, and the internal function that parses lambda lists for optional arguments returns three values using values. This invocation of values happens in the environment of the program being compiled, so it hits the new %panic-invoking definition and the whole show grinds to a halt. The 'easy' fix, ensuring that macro expansion is isolated from potentially harmful redefinitions, won't work. Macro expansion has to happen in the user environment, so that macros can see function definitions that they might rely upon.

I don't have a unit test for the user/compiler separation logic, so I thought when I started this blog post I was going to say something like: 'look, something else fundamentally broken, and without a test case'. That's interesting, but if you need convincing to write unit tests, you're probably already lost. What I actually learned while researching this post is a bit more subtle: it's a fundamental problem, but it's more about the design than the code itself. While the design I have for user/compiler separation seems to work most of the time, it's not adequate to solve this kind of problem. I'm not yet exactly sure what the solution is, but it won't necessarily involve a missing unit test.

January 20, 2008

The other day, I had the following (abbreviated) dialog with my little Scheme interpreter:

scheme> (intern! 'xyzzy2 (find-package "keyword"))
; Fatal Error: Assertation Failed: STRINGP(pname) @ (oblist.cpp:451)
c:\vcalc>vcsh.exe

scheme> (intern! 12)
; Fatal Error: Assertation Failed: STRINGP(sym_name) @ (oblist.cpp:269)
c:\vcalc>

Needless to say, 'Fatal errors' aren't good things, and fatal errors in intern!, a core function, are even worse. Without going into too many details, the first call should be returning successfully, and the second should be throwing a runtime type check error. However, the implementation of intern! wasn't checking argument types and was passing invalid arguments into lower layers of the interpreter's oblist (symbol table) code, which died with an assertion failure.

To put this in perspective, my implementation of intern! is about five years old, and something that I thought to be a fairly reliable piece of code. At the very least, I didn't think it was susceptible to something as simple as a type checking error that would crash the entire interpreter. Of course, when I looked at my test suite, there wasn't a set of tests for intern!. That might have something to do with it, don't you think?
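In case you're wondering, the fix for the second failure is just a type check at the primitive boundary, so that a bad argument raises an ordinary runtime type error instead of tripping a fatal assert deep in the oblist code. The sketch below only shows the shape of that check; lintern and vmerror are stand-in names, not the interpreter's actual API:

/* Sketch only: reject a non-string symbol name up front, rather than
   letting it reach the oblist layer and its fatal asserts. */
LRef lintern(LRef name)
{
     if (!STRINGP(name))
          return vmerror("intern!: symbol name must be a string", name);

     /* ... continue into the oblist code with a known-good string ... */
}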

Here are the morals I'm taking from this little story:

  • Do not assume something works, unless you have a complete test suite for it. (Even then be wary, because your test suite is probably not complete.)
  • Shoot for more than 60% code coverage on your test cases.
  • Don't write your own interpreter, because there are probably hundreds of other bugs just like this one. :-)
January 17, 2008

I don't usually write posts for the sole purpose of linking to other posts, but this is an exception. This is brilliant. It's the USDA's Food Pyramid, adapted to show how programmers should spend their time. My one complaint is that it's way too focused on coding. My experience has been that it really pays to spend time on design work and learning how to better interact with others, be they clients or team-mates. If you can design your way out of a rewrite, or work with your client to recast requirements to save complexity, it can be far more cost effective than even the best raw code.

January 8, 2008

Lately, I've been thinking a bit about the way language design influences library design. My line of thought started out inspired by some of the recent conversations about closures in Java, but it ended up also touching on dynamic typing and a few other 'modern' language features. This will end up being more than one post, but I thought I'd record some of it in my blog, with the hope that it might shed some light for somebody, somewhere.

To motivate this discussion, I'll use as an example a simple C implementation of a string-interning function, intern_string. If you're not familiar with the concept of interning, the premise is that interning two objects ensures that if they have the same value, they also have the same identity. In the case of C strings, interning ensures that if `strcmp(intern_string(a), intern_string(b)) == 0` holds true, then `intern_string(a) == intern_string(b)` also holds true. Since it effectively means that each string value is only stored one time, this technique can reduce memory requirements. It also gives you a cheap string equality comparison: checking two interned strings for equality reduces to a pointer comparison, which is about as fast as it gets.

Given a hash table that compares keys by value, implementing the function intern_string is fairly simple. In the following code, intern_table is a hash table that maps strings to themselves. hash_ref, hash_set, and hash_has are functions that manipulate the hash table:

  • int hash_has(hash_table_t ht, char *key) - Returns TRUE or FALSE, depending on whether or not the key key is found in the hash table ht.
  • char *hash_ref(hash_table_t ht, char *key) - Returns the value bound to the key key by the hash table ht. If the key is not found, behavior is undefined.
  • char *hash_set(hash_table_t ht, char *key, char *value) - Binds the value value to the key key in the hash table ht. If the key is already present, the existing value is overwritten. This function returns value.

Note the critical assumption that the hash_* accessors implement key comparison with value semantics (strcmp), rather than identity semantics (==).

hash_table_t intern_table; // assume this is initialized somewhere else.

char *intern_string(char *str) 
{
  if (hash_has(intern_table, str))
     return hash_ref(intern_table, str);

  char *interned_str = strdup(str);

  hash_set(intern_table, interned_str, interned_str);

  return interned_str;
}

The first step of intern_string is to check to see if the intern table already contains a string with the value of the new string. If the new string is already in the intern table, then the function returns the copy that's in the table. Otherwise, a new copy of the incoming string is created and stored in the hash table. In either case, the string returned is in the intern table. This logic ensures that every time intern_string is called with a str of the same value, it returns the same exact string.

If you haven't guessed already, the problem with this implementation of intern_string lies in the dual calls to hash_has and hash_ref. Both calls involve searching the hash table for the key: hash_has to determine if the key exists, and hash_ref to retrieve the key's value. This means that in the common case, interning a string that's already been interned, this implementation searches the hash table twice. Double work.

Fixing this problem involves changing the calling conventions for hash_ref. One of the simplest ways to do this involves defining a specific return value that hash_ref can return in the 'key not found' case. For strings in C, NULL is a logical choice. This change to hash_ref makes it possible to remove the double search by eliminating the explicit hash_has check:

hash_table_t intern_table;

char *intern_string(char *str) 
{
  char *interned_str = hash_ref(intern_table, str);

  if (interned_str == NULL) 
  {   
     interned_str = strdup(str);

     hash_set(intern_table, interned_str, interned_str);
  }

  return interned_str;
}

For string interning, this change to the hash_ref interface works fairly well. We know that we'll never store a hash key with a NULL value, so we know that NULL is safe to use for signaling that a key was not found. Were this ever to change, this version of hash_ref doesn't return enough information to distinguish between the 'key not found' case and the 'NULL value' case. Both would return NULL. To fix this, hash_ref needs to be extended to also return a separate value that indicates if the key was found. This can be done in C by having hash_ref return the 'key found' flag as a return value, and also accept a pointer to a buffer that will receive the key's value, if it's found.
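The extended accessor's declaration would look something like this (the original code above doesn't show it, so the exact signature is an assumption):

/* Extended hash_ref (sketch): the return value reports whether key was
 * found; when it is, the bound value is written through value_out. */
int hash_ref(hash_table_t ht, char *key, char **value_out);

With that in hand, intern_string can distinguish the two cases in a single lookup: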

hash_table_t intern_table;

char *intern_string(char *str) 
{
  char *interned_str;  

  if (!hash_ref(intern_table, str, &interned_str))
  {   
     interned_str = strdup(str);

     hash_set(intern_table, interned_str, interned_str);
  }

  return interned_str;
}

This is probably about as good as you can get in straight C. It easily distinguishes between the 'no-value' and 'no-key' cases, it's relatively clear to read, and it uses the common idioms of the language. However, C is a relatively sparse language. If you're willing to switch to something a bit more expressive, you have other choices.

One example of this is a choice that's particularly well supported by dynamically typed languages. With a language that can identify types at runtime, it becomes possible for hash_ref to return values of a different type if the key is not found. This value can be distinguished from other return values by virtue of the run time type identification supported by the language. In one such language, Scheme, this lets us implement intern-string like this:

(define *intern-table* (make-hash :equal))

(define (intern-string str)
  (let ((interned-str (hash-ref *intern-table* str 42)))
    (cond ((= interned-str 42)
           (hash-set! *intern-table* str str)
           str)
          (#t
           interned-str))))

If you prefer C/JavaScript-style syntax, it looks like this:

var intern_table = make_hash(EQUAL);

function intern_string(str)
{
   var interned_str = hash_ref(intern_table, str, 42);

   if (interned_str == 42)
   {
       hash_set(intern_table, str, str);
       return str;
   }

   return interned_str;
}

In this case, hash_ref has been extended with a third argument: a default value to return if the key is not found. The above code uses this to have hash_ref return a number in the 'no value' case, and it's the type of that return value itself that ensures its distinctness. This is a common dynamic language idiom, but for a moment, consider what it would look like in C:

hash_table_t intern_table;

char *intern_string(char *str) 
{
  char *interned_str = hash_ref(intern_table, str, (char *)42);

  if (interned_str == (char *)42) 
  {   
     interned_str = strdup(str);

     hash_set(intern_table, interned_str, interned_str);
  }

  return interned_str;
}

At first, this actually seems like it might be a plausible implementation of intern_string. My guess is that it might even work most of the time. Where this implementation gets into trouble is the case when an interned string might reasonably be located at address 42. Because C is statically typed, when hash_ref returns, all it returns is a pointer. The caller cannot distinguish between the 'valid string at address 42' return value and the 'no-key' return value. This is basically the same problem as the case where we overloaded NULL to signal 'no-key'.

The way the dynamically typed language solved this problem is worth considering. When a dynamically typed language passes a value, what it's really doing is returning a pointer along with a few extra bits describing the type of the object being pointed to. (Runtime implementations might vary, but that's the gist of many.) Using dynamic typing to distinguish between those two possible cases really amounts to using those few extra type bits to contain 'another' return value, one holding information on whether or not the key was found. This is exactly what our 'best' C implementation does explicitly with a return value and a reference value. The dynamic typing isn't necessarily adding any expressive power, but it is giving another, concise means of expressing what we're trying to say.
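To make that concrete, here is a rough sketch (not from the original code above) that spells out in C what a dynamic runtime does implicitly: the tag travels with the value, so 'key not found' can be a value of a distinct type. hash_ref_tagged is a hypothetical variant of the earlier accessor:

typedef enum { TAG_STRING, TAG_NOT_FOUND } value_tag_t;

typedef struct {
   value_tag_t tag;   /* the 'few extra bits' that describe the value */
   char *ptr;         /* meaningful only when tag == TAG_STRING */
} tagged_value_t;

/* Hypothetical accessor: never ambiguous, because 'not found' has its
 * own tag rather than a reserved pointer value like NULL or 42. */
tagged_value_t hash_ref_tagged(hash_table_t ht, char *key);

char *intern_string(char *str)
{
   tagged_value_t val = hash_ref_tagged(intern_table, str);

   if (val.tag == TAG_NOT_FOUND)
   {
      char *interned_str = strdup(str);

      hash_set(intern_table, interned_str, interned_str);

      return interned_str;
   }

   return val.ptr;
}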

March 20, 2006

If you ever manually work with Visual Studio 2003 projects (*.vcproj files), be aware that projects have both names and GUIDs, which are usually assigned by the IDE. If you try to duplicate a project in a solution by copying it to a new directory, renaming it, and adding the copy to your solution, the MSVC IDE will get confused since you didn't give your new project a new GUID. Other things might be affected by this confusion too, but the inter-project dependency editor definitely can't distinguish between the two same-GUID projects. This ends up making it impossible to correctly sequence the two project builds, and there's no clue in the UI about the cause of the problem, since the dependency editor lists the two same-GUID projects under separate names.

I don't know if MSBuild, in VS2005, is any better, but they claim to have made it more friendly to non-IDE use cases. The strange thing about this is that I'm not sure what purpose the GUIDs serve: I'd think that having multiple projects of the same name would create a host of other problems that the GUIDs wouldn't solve by themselves. Combine that with the outright user hostility of strings like the one below, and it's easy to wonder why the GUIDs are used at all.

{4051A65D-4718-41AE-8C94-6B1906EB4D77} = {4051A65D-4718-41AE-8C94-6B1906EB4D77}
March 20, 2006

I recently converted a bunch of accessor macros in vCalc over to inline functions. These macros are a legacy of vCalc's siod heritage, and generally looked something like this:

#define FLONM(x) ((*x).storage_as.flonum.data)

In this macro, x is expected to be a pointer to an LObject, which is basically a struct consisting of a union, storage_as, and a type field that specifies how to interpret the union. vCalc has a set of these macros to refer to fields in each possible interpretation of the storage_as union. These little jewels remove a lot of redundant storage_as's from the code, and generally make the code much easier to work with. However, like all C/C++ macros, these macros have finicky syntax, bypass type checking, and have limited ability to be extended to do other things (like verifying type fields, etc.). Fortunately, they are a perfect candidate to be replaced by C++ inline functions and references:

inline flonum_t &FLONM(LRef x) 
{
   return ((*x).storage_as.flonum.data);
}

Even better, with function inlining turned on, there's no apparent performance penalty in making this transformation; even with inlining off, the penalty seems pretty modest, at around 20%. In other words, inline functions work exactly as advertised. They work well enough, in fact, that I made the 'logical extension' and added some type checking logic.

inline flonum_t &FLONM(LRef x)
{
   assert(FLONUMP(x));
   return ((*x).storage_as.flonum.data);
}

This version of the code verifies that x is actually a flonum before trying to interpret it as such. Normally, the code using these accessor functions explicitly checks the type of an object before making a reference, but sometimes, due to coding errors, invalid references can slip through the cracks. With the old style macros, these invalid references could result in data corruption with no warning. With checks, there's at least an attempt to check for bad references before they are made.

Adding these checks proved profitable: they revealed three logic errors in about 5 minutes of testing, two related to reading complex numbers, and one related to macroexpansion of a specific class of form. Adding these type checks also killed performance, but that was pretty easy to solve by making the additional checks independently switchable.
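Making the checks switchable is a small exercise in preprocessor conditionals. The sketch below shows the general idea; CHECKED_ACCESSORS is a made-up name, not vCalc's actual configuration symbol:

#ifdef CHECKED_ACCESSORS
#  define ACCESSOR_ASSERT(cond) assert(cond)
#else
#  define ACCESSOR_ASSERT(cond) ((void)0)
#endif

inline flonum_t &FLONM(LRef x)
{
   ACCESSOR_ASSERT(FLONUMP(x));
   return ((*x).storage_as.flonum.data);
}

With the symbol undefined, the accessor compiles back down to the original unchecked version; with it defined, every reference pays for the FLONUMP test.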

January 20, 2006

Look, the tech industry is and always will be fucked up. They still
somehow manage to make a semi-usable product every once in a while. My Mac
is slow as a dog, even though it has two CPUs and cost $5000, but I use it
anyway because it's prettier and slightly more fun than the crap Microsoft
and Dell ship. But give me a reason to switch, even a small one, and I'm
outta here.
Dave Winer.

If you don't know who Dave Winer is, he pioneered the development of outlining software back in the 80's, developed one of the first system scripting tools for the Macintosh, invented XML-RPC, and invented the RSS specification you're probably using to read this blog post. I'm not trying to belittle this guy's point of view, but he's been responsible for developing several major pieces of consumer application software, designed a couple of hugely significant internet protocols, and made some significant money in the process. Most people should be so lucky.

Older Articles...