Mike Schaeffer's Blog

Articles with tag: programming
December 14, 2005

I've been doing a lot of analysis of feeds and reports lately, and have come up with a couple of suggestions for file design that can make feeds easier to work with. None of this should be earth-shattering advice, but collectively it can mean the difference between a file that's easy to work with and a complete pain in the... well, you know.

  • Prefer machine readable formats - "Pretty printers" for reports have a lot of utility: they can make it easy for users to view and understand results. However, they also have disadvantages: it's harder to use "pretty" reports for the further downstream processing that someone will inevitably want to do. This is something that needs to be considered carefully, keeping your audience in mind, but if possible, pick a format that a machine can easily work with.
  • Use a standard file format - There are lots of standard formats available for reports and feeds: XML, CSV, Tab Delimited, S-Expression, INI File, etc. Use one of these. Tools already exist to process and manipulate these kinds of files, and one of these formats will be able to contain your data.
  • Prefer the simplest format that will work - The simpler the format, the easier it will be to parse and handle. CSV is a good example: XML is sexier and much more powerful, but CSV has been around forever and has far more tools. Consider XML support in Excel: Excel has only gained XML support in recent versions, but it has had CSV support since the beginning. Also, from a conceptual standpoint, anybody who can understand a spreadsheet can understand a tabular file, while hierarchical data is a considerably more complex concept. (In business settings, there's a very good chance your feed/report audience will be business analysts who know Excel backwards and forwards but have no formal computer science training.)
  • Prefer delimited formats to formats based on field widths - The problem with columns based on field widths (column 1 is 10 characters wide, column 2 is 20, etc.) is that you have to remember and specify those widths whenever you want to extract the tabular data. In the worst case, without the column widths you can't read your file at all; in the best case, it's just one more thing you have to do when you load a file.
  • If you specify column names, ensure they are unique. - This isn't necessary for a lot of data analysis tools, but some tools (cough... MS Access) get confused when importing a table with multiple columns of the same name.
  • Include a header that describes the feed. - To fully understand a file, you have to know what it contains and where it came from. This is useful both in testing (did this report come from build 28 or build 29?) and in production (when was this file generated?). A small sketch of such a header follows this list. My suggestions for header contents include:
    • The version of the report specification
    • Name of the source application
    • Version of the source application (This version number should be updated with every build.)
    • Environment in which the source application was running to produce the report.
    • The date on which the report was run
    • If the report has an effective date, include it too.
  • Document your report - Without good, precise documentation of your file format, it will be very hard to reliably consume files in that format. Similarly, have as many people as possible peer review your file format. Even if your system's code is complete garbage, the file format represents an interface to your system that may well live much longer than the system itself.
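
To make some of this concrete, here's a minimal sketch in C++ of a program writing a delimited feed with a descriptive header. The file name, field names, version numbers, and the '#'-comment convention for the header lines are all hypothetical; use whatever convention your consumers can reliably parse and skip.

    #include <fstream>

    int main()
    {
        // Hypothetical output file name.
        std::ofstream feed("daily_positions.csv");

        // Descriptive header: '#'-prefixed comment lines are one common
        // convention, not a standard; pick something your consumers can skip.
        feed << "# report-spec-version: 1.2\n";
        feed << "# source-application: PositionReporter\n";
        feed << "# source-version: 3.4.0.128\n";
        feed << "# environment: production\n";
        feed << "# run-date: 2005-12-14\n";
        feed << "# effective-date: 2005-12-13\n";

        // Unique column names, delimited rather than fixed-width.
        feed << "account_id,symbol,quantity,price\n";
        feed << "10021,MSFT,500,26.91\n";
        feed << "10022,IBM,200,83.17\n";

        return 0;
    }

A consumer can strip the header lines with a single filter on the leading '#', which keeps the file machine readable while still carrying its own provenance.
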
September 7, 2005

One of the first functions I like to write when creating a new data structure is a human-readable dumper. This is a simple function that takes the data you're working with and dumps it to an output stream in a readable way. I've found that these things can save huge amounts of debugging time: rather than paging through debugger watch windows, you can assess your program's state by calling a function and reading it out by eye.

A few tips for dump functions (a small sketch follows the list):

  • The more use this kind of scaffolding code gets, the more cost effective it is to write. Time spent before dumpers are in place reduces the amount of use they can get, and therefore their payoff, so implement them early if you can.
  • Look for cheap alternatives already in your toolkit: Lisp can already print most of its structures, and .Net includes object serialization to XML. The standard solution might not be perfect, but it is 'free'.
  • Make sure your dumpers are correct from the outset. The whole point of this is to save debugging time later on; if you can't trust your view into your data structures while debugging, it will cost you time.
  • Dump into standard formats. If you can, dump into something like CSV, XML, S-expressions, or Dotty. If you have a considerable amount of data to analyze, this'll make it easier to use other tools to do some of the work.
  • Maintain your dumpers. Your software isn't going to go away, and neither are your data structures. If it's useful during initial development, it's likely to be useful during maintenance.
  • For structures that might be shared, or exist on the system heap, printing object addresses and reference counts can be very useful.
  • For big structures, it can be useful to elide specific content. For example: a list of 1000 items can be printed as (item0, item1, item2, ..., item999 ).
  • This stuff works for disk files too. For binary save formats, specific tooling to examine files can save time compared to a hex editor/viewer. (Since you already have code to read your disk format into an in-memory data structure, and code to dump that structure, a file dumper doesn't have to be much more work. Sharing code between the dump utility and the actual application also makes it more likely the dumper will show you the same view your application sees.)
  • Reading dumped structures back in can also be useful.
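
Here's a minimal sketch of the idea in C++, illustrating a couple of the tips above (printing object addresses and eliding long structures). The Node structure, the dump_list function, and the eight-item cutoff are all hypothetical.

    #include <cstddef>
    #include <iostream>
    #include <ostream>

    // A node in a hypothetical singly-linked list.
    struct Node
    {
        int value;
        Node* next;
    };

    // Dump the list to an output stream, printing each node's address and
    // value, and eliding everything past max_items.
    void dump_list(std::ostream& os, const Node* head, std::size_t max_items = 8)
    {
        os << "(";
        std::size_t count = 0;
        for (const Node* n = head; n != 0; n = n->next, ++count)
        {
            if (count == max_items)
            {
                os << " ...";   // elide the rest of a long list
                break;
            }
            os << " [" << static_cast<const void*>(n) << "]=" << n->value;
        }
        os << " )" << std::endl;
    }

    int main()
    {
        Node c = { 3, 0 };
        Node b = { 2, &c };
        Node a = { 1, &b };

        // Prints something like: ( [0012FF68]=1 [0012FF60]=2 [0012FF58]=3 )
        dump_list(std::cout, &a);
        return 0;
    }

Because the output is plain text on a stream, the same function works from a debugger's immediate window, in a log file, or redirected to disk for later comparison.
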
April 12, 2005

There's been some 'controversy' in the blog world about a petition that's circulating to ask Microsoft to continue supporting "Classic" Visual BASIC in addition to the replacement VB.Net. A month ago, I had a pretty long post dedicated to the topic, but due to technical problems I wasn't able to get it online. Therefore, I'll keep this short and to the point.

The core problem VB6 developers are facing is that they sank lots of development money into a closed, one-vendor language. Choosing VB6 basically amounted to a gamble that Microsoft would continue to support and develop the language for the duration of a project's active life. That gamble hasn't paid off for some developers, and companies with sizable investments in VB6 code now need to figure out how to make the most of that investment while still evolving their software.

With standardized languages like C that have multiple tool vendors, the risk is significantly lower. If one vendor drops its version of a language, switching to another implementation is going to be a lot easier than porting to an entirely different platform (particularly if you've avoided or isolated vendor-specific features).

So... what's the moral of this story? Before you base your business on a particular language or tool, make sure you know what happens if that platform ever loses support. Pick something standardized, with multiple viable vendors. Alternatively, pick something open source, where you can take over platform development yourself if you absolutely need to. Whatever you do, don't pick a one-vendor tool and then complain when the vendor decides to drop it. Commercial vendors, in particular, have no legal obligation to their customers.

April 12, 2005

Global variables tend to get a bad rap, kind of like goto and pointers. Personally, I think they can be pretty useful if you're careful. Here are the guidelines I use to determine whether a global variable is an appropriate solution:

  • Will there ever be a need for more than one instance of the variable?
  • How much complexity does passing the variable to all its accessors entail?
  • Does the variable represent global state? (A heap free list, configuration information, a pool of threads, a global mutex, etc.)
  • Can the data be more effectively modeled as a static variable in a function or private member variable in a singleton object? (Both of these are other forms of global storage, but they wrap the variable accesses in accessor functions.)
  • Can you support the lifecycle you need for the variable any other way? Global variables exist for the duration of your program's run-time. Local variables exist for the duration of a function. If you don't have heap allocated variables, or if your heap allocator sucks, then a global variable might be the best way to get to storage that lasts longer than any one function invocation.
  • Do you need to use environment features that are specific to globals? In MSVC++, this can mean things like specifying the segment in which a global is stored or declaring a variable as thread-local.

If all that leads you to the decision that a global variable is the best choice, you can then take steps to mitigate some of the risks involved. The first thing I'd do is prefix global variable names with a unique qualifier, maybe something like g_. This lowers the risk of namespace collisions and clearly denotes which variables are global when you have to read or alter your code. If you have multiple global variables, I'd also be tempted to wrap them all up in a structure, for some of the same reasons.
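
Here's a minimal sketch of that in C++. The GlobalState structure, its fields, and the thread-local worker id are hypothetical; __declspec(thread) is the MSVC-specific form mentioned above, with standard thread_local as the fallback.

    #include <string>

    // Related global state gathered into a single structure, so there is one
    // clearly-named global rather than several scattered ones.
    struct GlobalState
    {
        std::string config_path;    // configuration information
        bool        verbose_mode;   // a global flag
        long        request_count;  // a global counter
    };

    GlobalState g_state;            // the g_ prefix marks every use site as global

    // A per-thread global.
    #ifdef _MSC_VER
    __declspec(thread) int g_worker_id = 0;
    #else
    thread_local int g_worker_id = 0;
    #endif

    void handle_request()
    {
        ++g_state.request_count;    // the prefix makes the global access obvious
        if (g_state.verbose_mode)
        {
            // ... logging, etc.
        }
    }

Grouping the globals this way also makes it easier, later on, to migrate them into a context object that gets passed around explicitly, if the design ever calls for it.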
