Mike Schaeffer's Blog

Articles with tag: programming
March 20, 2006

I recently converted a bunch of accessor macros in vCalc over to inline functions. These macros are a legacy of vCalc's siod heritage, and generally looked something like this:

#define FLONM(x) ((*x).storage_as.flonum.data)

In this macro, x is expected to be a pointer to an LObject, which is basically a struct consisting of a union, storage_as, and a type field that specifies how to interpret the union. vCalc has a set of these macros to refer to fields in each possible interpretation of the storage_as union. These little jewels remove a lot of redundant storage_as's from the code, and generally make the code much easier to work with. However, like all C/C++ macros, they have finicky syntax, bypass typechecking, and have limited ability to be extended to do other things (like verifying type fields, etc.). Fortunately, they are perfect candidates to be replaced by C++ inline functions and references:

inline flonum_t &FLONM(LRef x) 
{
   return ((*x).storage_as.flonum.data);
}

Even better, with function inlining turned on, there's no apparent performance penalty in making this transformation; even with inlining off, the penalty seems pretty modest, at around 20%. In other words, inline functions work exactly as advertised. They work well enough, in fact, that I made the 'logical extension' and added some type checking logic.

inline flonum_t &FLONM(LRef x)
{
   assert(FLONUMP(x));
   return ((*x).storage_as.flonum.data);
}

This version of the code verifies that x is actually a flonum before trying to interpret it as such. Normally, the code using these accessor functions explicitly checks the type of an object before making a reference, but sometimes, due to coding errors, invalid references can slip through the cracks. With the old style macros, these invalid references could result in data corruption with no warning. With the new checks, there's at least an attempt to catch bad references before they are made.
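
For context, the FLONUMP predicate used above checks the object's type tag against the flonum tag. Here's a rough sketch of the sort of layout and predicate involved; the field, type, and tag names below are assumptions based on the macros above, not vCalc's actual definitions:

struct LObject
{
   LType type;                     // tag saying which storage_as member is valid
   union
   {
      struct { flonum_t data; } flonum;
      // ... other representations (cons, symbol, string, etc.)
   } storage_as;
};

typedef LObject *LRef;             // accessors take a pointer to the object

inline bool FLONUMP(LRef x)
{
   return x->type == LTYPE_FLONUM; // hypothetical tag constant
}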

Adding these checks proved profitable: they revealed three logic errors in about 5 minutes of testing, two related to reading complex numbers, and one related to macroexpansion of a specific class of form. Adding these type checks also killed performance, but that was pretty easy to solve by making the additional checks independently switchable.
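
One way to make the checks switchable is a compile-time flag; the EXTENDED_TYPE_CHECKS flag and TYPE_CHECK macro below are hypothetical names for illustration, not vCalc's actual ones:

#include <cassert>

#ifdef EXTENDED_TYPE_CHECKS
#  define TYPE_CHECK(pred, obj) assert(pred(obj))
#else
#  define TYPE_CHECK(pred, obj) ((void)0)
#endif

inline flonum_t &FLONM(LRef x)
{
   TYPE_CHECK(FLONUMP, x);
   return ((*x).storage_as.flonum.data);
}

With the flag turned off, the accessor compiles down to the same code as the unchecked version.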

January 20, 2006

Look, the tech industry is and always will be fucked up. They still
somehow manage to make a semi-usable product every once in a while. My Mac
is slow as a dog, even though it has two CPUs and cost $5000, but I use it
anyway because it's prettier and slightly more fun than the crap Microsoft
and Dell ship. But give me a reason to switch, even a small one, and I'm
outta here.
Dave Winer.

If you don't know who Dave Winer is, he pioneered the development of outlining software back in the 80's, developed one of the first system scripting tools for the Macintosh, invented XML-RPC, and invented the RSS specification you're probably using to read this blog post. I'm not trying to belittle this guy's point of view, but he's been responsible for developing several major pieces of consumer application software, designed a couple of hugely significant internet protocols, and made some significant money in the process. Most people should be so lucky.

December 14, 2005

I've been doing a lot of analysis of feeds and reports lately, and have come up with a couple of suggestions for file design that can make feeds easier to work with. None of this should be earth-shattering advice, but collectively it can mean the difference between an easy file to work with and a complete pain in the... well, you know.

  • Prefer machine-readable formats - "Pretty printers" for reports have a lot of utility: they can make it easy for users to view and understand results. However, they also have disadvantages: it's harder to use "pretty" reports for the further downstream processing that someone will inevitably want to do. This is something that needs to be considered carefully, keeping your audience in mind, but if possible, pick a format that a machine can easily work with.
  • Use a standard file format - There are lots of standard formats available for reports and feeds: XML, CSV, tab-delimited, S-expression, INI file, etc. Use one of these: tools already exist to process and manipulate these kinds of files, and at least one of these formats will be able to contain your data.
  • Prefer the simplest format that will work - The simpler the format, the easier it will be to parse and handle. CSV is a good example of this: XML is sexier and much more powerful, but CSV has been around forever and has many more tools. A case in point is XML support in Excel: Excel has only gained XML support in its most recent versions, but it's had CSV support since the beginning. Also, from a conceptual standpoint, anybody who can understand a spreadsheet can understand a tabular file, but hierarchical data is a considerably more complex concept. (In business settings, there's a very good chance your feed/report audience will be business analysts who know Excel backwards and forwards but have no formal computer science training.)
  • Prefer delimited formats to formats based on field widths - The thing about having columns based on field widths (column 1 is 10 characters wide, column 2 is 20, etc.) is that you have to remember and specify the field widths when you want to extract the tabular data. In the worst case, without the column widths you can't read your file at all. In the best case, it's just one more thing you have to do when you load the file.
  • If you specify column names, ensure they are unique - A lot of data analysis tools don't care, but some (cough... MS Access) get confused when importing a table with multiple columns of the same name.
  • Include a header that describes the feed - To fully understand the contents of a file, you really have to understand what it contains and where it came from. This is useful both in testing (did this report come from build 28 or build 29?) and in production (when was this file generated?). A rough sketch of such a header follows this list. My suggestions for header contents include:
    • The version of the report specification
    • Name of the source application
    • Version of the source application (This version number should be updated with every build.)
    • Environment in which the source application was running to produce the report.
    • The date on which the report was run
    • If the report has an effective date, include it too.
  • Document your report - Without good, precise documentation of your file format, it'll be very hard to reliably consume files in that format. Also, have as many people as possible peer review your file format. Even if your system's code is complete garbage, the file format represents an interface to your system that will possibly live much longer than the system itself.
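
To make the header suggestion concrete, here's a rough sketch of what a small self-describing CSV feed might look like. The field names, values, and the leading '#' comment convention are purely illustrative; pick whatever convention your downstream tools can actually handle:

# report-spec-version: 1.2
# source-application: TradeReporter
# source-version: 2.3.0.1128
# environment: production
# run-date: 2005-12-14T06:00:00Z
# effective-date: 2005-12-13
account_id,trade_date,symbol,quantity,price
10023,2005-12-13,IBM,100,82.45
10024,2005-12-13,MSFT,250,26.91
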
September 7, 2005

One of the first functions I like to write when creating a new data structure is a human-readable dumper. This is a simple function that takes the data you're working with and dumps it to an output stream in a readable way. I've found that these things can save huge amounts of debugging time: rather than paging through debugger watch windows, you can assess your program's state by calling a function and reading it out by eye.
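
As an illustration, here's roughly what a minimal dumper might look like for a hypothetical binary tree node (the structure and names are made up for the example, not taken from any particular project):

#include <iostream>

struct TreeNode
{
   int value;
   TreeNode *left;
   TreeNode *right;
};

// Recursively dump a tree to an output stream in a readable, indented form.
void dump_tree(std::ostream &os, const TreeNode *node, int depth = 0)
{
   for (int i = 0; i < depth; i++)
      os << "  ";

   if (node == NULL)
   {
      os << "<null>\n";
      return;
   }

   os << "node@" << node << " value=" << node->value << "\n";

   dump_tree(os, node->left, depth + 1);
   dump_tree(os, node->right, depth + 1);
}

A call like dump_tree(std::cerr, root) from a scratch test harness or a debugger's function-call facility is usually all it takes to get a readable picture of the structure.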

A few tips for dump functions:

  • The more use this kind of scaffolding code gets, the more cost-effective it becomes to write. Every bit of debugging done before the dumpers are in place is use they'll never get, so they become progressively less cost-effective the longer you wait. Implement them early, if you can.
  • Look for cheap alternatives already in your toolkit: Lisp can already print most of its structures, and .Net includes object serialization to XML. The standard solution might not be perfect, but it is 'free'.
  • Make sure your dumpers are correct from the outset. The whole point of this is to save debugging time later on; if you can't trust your view into your data structures during debugging, it will cost you time.
  • Dump into standard formats. If you can, dump into something like CSV, XML, S-expressions, or Dotty. If you have a considerable amount of data to analyze, this'll make it easier to use other tools to do some of the work.
  • Maintain your dumpers. Your software isn't going to go away, and neither are your data structures. If it's useful during initial development, it's likely to be useful during maintenance.
  • For structures that might be shared, or exist on the system heap, printing object addresses and reference counts can be very useful.
  • For big structures, it can be useful to elide specific content. For example, a list of 1000 items can be printed as (item0, item1, item2, ..., item999); a sketch of one way to do this follows this list.
  • This stuff works for disk files too. For binary save formats, specific tooling to examine files can save time compared to an on-disk hex-editor/viewer. (Since you have code to read your disk format into a data structure in memory, if you also have code to dump your in-memory structure, this does not have to be much more work. Sharing code between the dump utility and the actual application also makes it more likely the dumper will show you the same view your application will see.)
  • Reading dumped structures back in can also be useful.
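
For the elision point above, here's a sketch of one way to cap how much of a large container gets printed; the container type and cutoff are arbitrary choices for the example:

#include <iostream>
#include <vector>

// Print at most max_items leading elements, then an ellipsis and the final element.
void dump_vector(std::ostream &os, const std::vector<int> &items, size_t max_items = 3)
{
   os << "(";

   for (size_t i = 0; i < items.size(); i++)
   {
      if (i >= max_items && i < items.size() - 1)
      {
         os << "..., ";
         i = items.size() - 1;      // skip ahead to the last element
      }

      os << items[i];

      if (i < items.size() - 1)
         os << ", ";
   }

   os << ")";
}

A 1000-element vector dumped this way prints as (0, 1, 2, ..., 999) instead of flooding the output.
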
Older Articles...