Mike Schaeffer's Blog

Articles with tag: programming
December 14, 2005

I've been doing a lot of analysis of feeds and reports lately, and have come up with a couple of suggestions for file design that can make feeds easier to work with. None of this is earth-shattering advice, but collectively it can mean the difference between a file that's easy to work with and a complete pain in the ...well, you know.

  • Prefer machine readable formats - "Pretty printers" for reports have a lot of utility: they can make it easy for users to view and understand results. However, they also have disadvantages: it's harder to use "pretty" reports for the further downstream processing that someone will inevitably want to do. This is something that needs to be considered carefully, keeping your audience in mind, but if possible, pick a format that a machine can easily work with.
  • Use a standard file format - There are lots of standard formats available for reports and feeds: XML, CSV, Tab Delimited, S-Expression, INI File, etc. Use one of these: tools already exist to process and manipulate files in these formats, and at least one of them will be able to contain your data.
  • Prefer the simplest format that will work - The simpler the format, the easier it will be to parse and handle. CSV is a good example: XML is sexier and much more powerful, but CSV has been around forever and has far more tooling. Consider XML support in Excel: Excel has gained XML support only in the most recent versions, but it's had CSV support since the beginning. Also, from a conceptual standpoint, anybody who can understand a spreadsheet can understand a tabular file, but hierarchical data is a considerably more complex concept. (In business settings, there's a very good chance your feed/report audience will be business analysts who know Excel backwards and forwards but have no formal computer science training.)
  • Prefer delimited formats to formats based on field widths - The problem with columns based on field widths (column 1 is 10 characters wide, column 2 is 20, etc.) is that you have to remember and specify the field widths when you want to extract the tabular data. In the worst case, without the column widths you can't read your file at all. In the best case, it's just one more thing you have to do when you load a file.
  • If you specify column names, ensure they are unique - This isn't necessary for a lot of data analysis tools, but some tools (cough... MS Access) get confused when importing a table with multiple columns of the same name.
  • Include a header that describes the feed - To fully understand a file, you really have to know what it contains and where it came from. This is useful both in testing (did this report come from build 28 or build 29?) and in production (when was this file generated?). My suggestions for header contents include (see the sample header after this list):
    • The version of the report specification
    • Name of the source application
    • Version of the source application (This version number should be updated with every build.)
    • Environment in which the source application was running to produce the report.
    • The date on which the report was run
    • If the report has an effective date, include it too.
  • Document your report - Without good, precise documentation of your file format, it'll be very hard to reliably consume files in that format. Similarly, have as many people as possible peer review your file format. Even if your system's code is complete garbage, the file format represents an interface to your system that will possibly live much longer than the system itself.
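
To make the header suggestion concrete, here's what the first few lines of a feed following these guidelines might look like. The field names and values are invented for this example, and prefixing header lines with a comment character is just one simple convention for keeping them out of the way of tabular parsers:

# report-spec-version: 1.2
# source-application: OrderReporter
# source-version: 2.3.1452
# environment: production
# run-date: 2005-12-14
# effective-date: 2005-11-30
order_id,customer_id,order_total
10001,512,149.95
10002,873,23.50
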
November 4, 2005

I don't know who this person is, but they have a good collection of programming tips online. A lot of this stuff looks pretty relevant.

Related to that is this deck of slides written by Kent Pitman and Peter Norvig. It's an excellent discussion of good programming style in Lisp.

September 7, 2005

One of the first functions I like to write when creating a new data structure is a human-readable dumper. This is a simple function that takes the data you're working with and dumps it to an output stream in a readable way. I've found that these things can save huge amounts of debugging time: rather than paging through debugger watch windows, you can assess your program's state by calling a function and reading it out by eye.

A few tips for dump functions (a minimal example follows the list):

  • The more use this kind of scaffolding code gets, the more cost effective it becomes to write. Every day spent without dumpers in place reduces the amount of use they can get, making them progressively less cost effective. Implement them early, if you can.
  • Look for cheap alternatives already in your toolkit: Lisp can already print most of its structures, and .Net includes object serialization to XML. The standard solution might not be perfect, but it is 'free'.
  • Make sure your dumpers are correct from the outset. The whole point of this is to save debugging time later on; if you can't trust your view into your data structures during debugging, it will cost you time.
  • Dump into standard formats. If you can, dump into something like CSV, XML, S-expressions, or Dotty. If you have a considerable amount of data to analyze, this'll make it easier to use other tools to do some of the work.
  • Maintain your dumpers. Your software isn't going to go away, and neither are your data structures. If it's useful during initial development, it's likely to be useful during maintenance.
  • For structures that might be shared, or exist on the system heap, printing object addresses and reference counts can be very useful.
  • For big structures, it can be useful to elide specific content. For example: a list of 1000 items can be printed as (item0, item1, item2, ..., item999 ).
  • This stuff works for disk files too. For binary save formats, specific tooling to examine files can save time compared to a raw hex editor/viewer. (Since you have code to read your disk format into a data structure in memory, if you also have code to dump your in-memory structure, this does not have to be much more work. Sharing code between the dump utility and the actual application also makes it more likely the dumper will show you the same view your application will see.)
  • Reading dumped structures back in can also be useful.
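
To make a few of these tips concrete, here is a minimal sketch of a dumper for a simple linked list in C. The Node structure and names are invented for the example; the point is the shape of the function: it writes to a caller-supplied stream, prints node addresses, and elides long lists.

#include <stdio.h>

// A hypothetical singly-linked list node, invented for this example.
typedef struct Node
{
    int          _value;
    struct Node *_next;
} Node;

// Dump a list to a stream, printing node addresses and eliding
// lists longer than maxItems.
void dumpList(FILE *out, Node *list, int maxItems)
{
    int count = 0;
    Node *n;

    fprintf(out, "(");

    for (n = list; n != NULL; n = n->_next, count++)
    {
        if (count >= maxItems)
        {
            fprintf(out, " ...");
            break;
        }

        fprintf(out, " %d@%p", n->_value, (void *)n);
    }

    fprintf(out, " )\n");
}
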
April 12, 2005

There's been some 'controversy' in the blog world about a petition that's circulating to ask Microsoft to continue supporting "Classic" Visual BASIC in addition to the replacement VB.Net. A month ago, I had a pretty long post dedicated to the topic, but due to technical problems I wasn't able to get it online. Therefore, I'll keep this short and to the point.

The core problem VB6 developers are facing is that they sank lots of development money into a closed, one-vendor language. Choosing VB6 basically amounted to a gamble that Microsoft would continue to support and develop the language for the duration of a project's active life. That gamble hasn't paid off for some developers, and companies with sizable investments in VB6 code now need to figure out how to make the most of that investment while still evolving their software.

With standardized languages like C, languages with multiple tool vendors, the risk is significantly lower. If one vendor drops their version of a language, switching to another implementation is going to be a lot easier than porting to an entirely different platform (particularly if you've avoided or isolated vendor-specific features).

So... what's the moral of this story? Before you base your business on a particular language or tool, make sure you know what happens if that platform ever loses support. Pick something standardized, with multiple viable vendors. Alternatively, pick something open source, where you can take over platform development yourself (if you absolutely need to). Whatever you do, don't pick a one-vendor tool and complain when the vendor decides to drop it. Commercial vendors, particularly, have no legal obligation to their customers.

April 12, 2005

Global variables tend to get a bad rap, kind of like goto and pointers. Personally, I think they can be pretty useful if you are careful. Here are the guidelines I use to determine if a global variable is an appropriate solution:

  • Will there ever be a need for more than one instance of the variable?
  • How much complexity does passing the variable to all its accessors entail?
  • Does the variable represent global state? (A heap free list, configuration information, a pool of threads, a global mutex, etc.)
  • Can the data be more effectively modeled as a static variable in a function or private member variable in a singleton object? (Both of these are other forms of global storage, but they wrap the variable accesses in accessor functions.)
  • Can you support the lifecycle you need for the variable any other way? Global variables exist for the duration of your program's run-time. Local variables exist for the duration of a function. If you don't have heap allocated variables, or if your heap allocator sucks, then a global variable might be the best way to get to storage that lasts longer than any one function invocation.
  • Do you need to use environment features that are specific to globals? In MSVC++, this can mean things like specifying the segment in which a global is stored or declaring a variable as thread-local.

If all that leads you to the decision that a global variable is the best choice, you can then take steps to mitigate some of the risks involved. The first thing I'd do is prefix global variable names with a unique qualifier, maybe something like g_. This lowers the risk of namespace collisions and clearly denotes which variables are global when you have to read or alter your code. If you have multiple global variables, I'd also be tempted to wrap them all up in a structure, for some of the same reasons.
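
For instance, here's a minimal sketch of both mitigations at once; the names are invented for this example.

#include <stdio.h>

// Wrapping related globals in one structure with a g_ prefix keeps
// the namespace clean and makes global accesses easy to spot.
typedef struct GlobalState
{
    int         _verboseLogging;
    const char *_configFileName;
} GlobalState;

GlobalState g_state = { 0, "app.cfg" };

void logMessage(const char *msg)
{
    if (g_state._verboseLogging)      // every global access visibly
        fprintf(stderr, "%s\n", msg); // goes through g_state
}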

March 4, 2005

I'm writing some content for a future post on the offshoring of jobs overseas, but I want to clear something up before it gets posted: Outsourcing and offshoring are two different and orthogonal concepts. This seems to be something that gets misunderstood a great deal, but simply put, outsourcing is the movement of jobs to a different company and offshoring is the movement of jobs to a different country. Either one can be done without the other.

The scenarios that people tend to get upset about (at least in the United States) are the scenarios involving offshoring, the movement of work overseas. Outsourcing, however, does not necessarily imply that the work gets moved to a different country: it's very common for work to be outsourced to another American business employing American workers. An example of this is hiring a Madison Avenue firm to put together an ad campaign. Sure, it'd be possible to develop the talent in house to do this yourself, but there are many advantages in outsourcing the work to a more specialized vendor.

March 2, 2005

I'm in the middle of developing a Scheme compiler for a future release of vCalc. While I've been developing the code, I've peppered it full of debugging print statements that look something like this:

(format #t "compiling ~l, tail?=~l, value?=~l" form tail? value?)

With the output statements in place, the compiler takes about 250-300ms to compile relatively small functions. Not great, particularly considering that there's no optimization being done at all. Anyway, on a hunch I removed the format statements, and execution time improved by a couple orders of magnitude to a millisecond or two per function. That's a lot closer to what I was hoping for at this stage of development.

On the other hand, I hadn't realized that my (ad hoc, slapped together in an hour) format function was running quite that slowly. I think it'll end up being an interesting optimization problem sooner or later.
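
A common way to keep debug output like this around without paying its cost in every run is to guard it behind a compile-time switch. Here's a minimal sketch of the idea in C; this is not vCalc's actual mechanism, and the names are invented.

#include <stdio.h>

// Compile with -DDEBUG_TRACE to enable the output; without it, the
// trace statements compile away entirely and cost nothing at run-time.
#ifdef DEBUG_TRACE
#define TRACE(...) fprintf(stderr, __VA_ARGS__)
#else
#define TRACE(...) ((void)0)
#endif

void compileForm(const char *form)
{
    TRACE("compiling %s\n", form);

    // ... actual compilation work goes here ...
}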

March 1, 2005

Idempotence has benefits at a program's run-time, as well as at build time. To illustrate, consider the case of a reference counted string. For the sake of example, it might be declared like this (In case you're wondering, no, I don't think this is a production-ready counted string library...):

#include <stdlib.h>
#include <string.h>

typedef struct CountedString
{
    int  _references;
    char *_data;
} CountedString;

// Allocate a new string with a single reference.
CountedString *makeString(char *data)
{
    CountedString *cs = (CountedString *)malloc(sizeof(CountedString));

    cs->_references = 1;
    cs->_data = strdup(data);

    return cs;
}

// Take an additional reference to an existing string.
CountedString *referToString(CountedString *cs)
{
    cs->_references++;
    return cs;
}

// Release a reference, freeing the string when the last one is gone.
void doneWithString(CountedString *cs)
{
    cs->_references--;

    if (cs->_references == 0)
    {
        free(cs->_data);
        free(cs);
    }
}

// ... useful library functions go here...

The reference counting mechanism buys you two things. It gives you the ability to delete strings when they're no longer accessible; it also gives you the ability to avoid string copies by deferring them to the last possible moment. This second benefit, known as copy-on-write, is where idempotence can play a role. Copy-on-write entails ensuring that whenever you write to a resource, you have a copy unique to yourself. If the copy you have isn't unique, copy-on-write requires that you duplicate the resource and modify the copy instead of the original. If you never modify the string, you never make the copy.

This means that the beginning of every string function that alters a string has to look something like this:

CountedString *alterString(CountedString *cs)
{
    if (cs->_references > 1)
    {
        CountedString *uniqueString = makeString(cs->_data);
        doneWithString(cs);
        cs = uniqueString;
    }

    // ... now, cs can be modified at will

    return cs;
}

Apply a little refactoring, and you get this...

CountedString *ensureUniqueInstance(CountedString *cs)
{
    if (cs->_references > 1)
    {
        CountedString *uniqueString = makeString(cs->_data);
        doneWithString(cs);
        cs = uniqueString;
    }

    return cs;
}

CountedString *alterString(CountedString *cs)
{
    cs = ensureUniqueInstance(cs);

    // ... now, cs can be modified at will

    return cs;
}

Of course, ensureUniqueInstance ends up being idempotent: it gets you into a known state from an unknown state, and it doesn't (semantically) matter if you call it too often. That's the key insight into why idempotence can be useful. Because idempotent processes don't rely on foreknowledge of your system's state to work reliably, they can be a predictable means to get into a known state. Also, if you hide idempotent processes behind the appropriate abstractions, they allow you to write code that's more self-documenting. A function that begins with a line like cs = ensureUniqueInstance(cs); more clearly says to the reader that it needs a unique instance of cs than lines of code that check the reference count of cs and potentially duplicate it.
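
As a quick illustration of that property in action, here's a hypothetical usage sketch built on the code above:

CountedString *s = makeString("hello");
CountedString *t = referToString(s); // s and t now share one instance

t = ensureUniqueInstance(t);         // first call: t becomes a private copy
t = ensureUniqueInstance(t);         // second call: already unique, a no-op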

Next up are a few more examples of idempotence, as well as a look into some of the pitfalls.

February 22, 2005

Larry Osterman has been running a nice series of posts on issues related to thread synchronization and concurrency. It's been full of useful tips and tricks; I particularly like part 2, Avoiding the Issue. That's a technique that's worked well for me in the multithreaded systems I've worked on. Of course, if you're writing SQL Server, etc., I'm sure you can't take nearly as simple an approach.

February 22, 2005

There's a good definition of the word idempotent over on Dictionary.com. In a nutshell, the word is used to describe mathematical functions that satisfy the relationship f(x)=f(f(x)): functions for which repeated applications produce the same result as the first. For functions that satisfy this condition, you can rest assured that you can apply the function as many times as you like, get the expected result, and not screw anything up if you apply it more times than you absolutely need. This turns out to be a useful concept for people developing software systems.

One of the most common examples of this is in C-style include files. It's common practice to write code like this, to guard against multiple inclusions:

#ifndef __HEADER_FILE_GUARD
#define __HEADER_FILE_GUARD

// ... declarations go here...

#endif // __HEADER_FILE_GUARD

This idiomatic C code protects the include file against multiple inclusions. Include files with this style of guard can be included as many times as you like with no ill effect.

The benefit to this is that it basically changes the meaning of the code #include <foo.h> from "Include these declarations" to "Ensure that these declarations have been made". That's a much safer kind of statement to make, since it delegates the whole issue of multiple inclusions to a simple piece of automated logic.

Of course, this is pretty commonplace. More is to come...

Older Articles...