Mike Schaeffer's Weblog
Fri, 22 Feb 2008
The instructions I gave earlier on Renaming
SVN Users work only when the SVN repository is hosted on a machine that can run
SVN hooks written in Unix style shell script. On a conventional Windows machine, one
without Cygwin, MSYS, or similar, you have to switch to writing hooks in something
like Windows batch language.
If all you want to do is temporarily rename users, then you can just create an empty file named pre-revprop-change.cmd in your repository under hooks\. The default return code from a batch file is success, which SVN interprets as allowing all revision property changes, all the time, by anybody. If you want to implement an actual policy, Philibert Pérusse has posted a template script online.
[/tech/programming] permanent link
Mon, 11 Feb 2008
I've been keeping track of the vCalc source code in an SVN
repository since May of 2005. While I'm the only person who has
ever committed code into the repository, I've developed vCalc on
three or four machines, with different usernames on each
machine. Since SVN records usernames with each commit, these
historical usernames show up in each svn log or svn blame.
svn blame is particularly bad because it
displays a code listing with the username prepended to each line
in a fixed-width gutter. Usernames that are too long
exceed the width of the gutter
and push the code over to the right. Fortunately, changing
historical usernames isn't that hard, if you have administrator
rights on your SVN repository.
SVN stores the name of a revision's committer in a revision property named svn:author. If you're not familiar with the term, a revision property is a blob of out of band data that SVN attaches to the revision. In addition to the author of a commit, they're also used to store the commit log message, and, via SVN's propset and propget commands, user-provided custom metadata for the revision. Changing the name of a user associated with a commit basically amounts to using propset to update the svn:author property for a revision. The command to do this is structured like so:
svn propset svn:author --revprop -rrev-number new-username repository-URL
If this works, you are all set, but what is more likely to happen is the following error:
svn: Repository has not been enabled to accept revision propchanges; ask the administrator to create a pre-revprop-change hook
By default, revision property changes are disabled. This makes sense if you are at all interested in using your source code control system to satisfy an auditing requirement. Changing the author of a commit would be a great way for a developer to cover their tracks, if they were interested in doing something underhanded. Also, unlike most other aspects of a project managed in SVN, revision properties have no change tracking: They are the change tracking mechanism for everything else. Because of the security risks, enabling changes to revision properties requires establishment of a guard hook: an external procedure that is consulted whenever someone requests that a revision property be changed. Any policy decisions about who can change what revision property when are implemented in the hook procedure.
Hooks in SVN are stored in the hooks/ directory under the repository toplevel. Conveniently, SVN provides a sample implementation of the hook we need to implement in the shell script pre-revprop-change.tmpl, but the sample implementation also has strict defaults about what can be changed, allowing only the log message to be set:
if [ "$ACTION" = "M" -a "$PROPNAME" = "svn:log" ]; then exit 0; fi
echo "Changing revision properties other than svn:log is prohibited" >&2
exit 1
The sample script can be enabled by renaming it to pre-revprop-change. It can be made considerably more lax by adding an exit 0 before the lines I list above. At this point, the property update command should work, although if you're at all interested in the security of your repository, it is best to restore whatever revision property policy was in place as soon as possible.
[/tech/programming] permanent link
Mon, 21 Jan 2008
Another one along the lines of my
last post. I tried to compile this source file today, using the
compiler in my little Lisp:
(define (values . args) (%panic "roh roh"))
(define (test x) (+ x 1))
I got the following result:
d:\test>vcsh -c test.scm
;;;; VCSH, Debug Build (SCAN 0.99 - Dec 17 2007 16:47:30)
; Info: Loading Internal File: fasl-compiler
; Info: Package 'fasl-compiler' created
; Info: Loading Internal File: fasl-write
; Info: Package 'fasl-write' created
; Info: Loading Internal File: fasl-compiler-run
; Info: Package 'fasl-compiler-run' created
; Info: stack limit disabled!
Fatal Error: roh roh @ (error.cpp:168)
Needless to say, fatal errors still aren't any good. However, this one is a bit more interesting than a simple type checking problem. The function %panic is the internal function used to signal fatal errors from Lisp code. The first definition above redefines values, the function to return multiple return values, so that it always panics with a fatal error. This is the kind of thing that, if done in a running environment, would break things almost immediately.
But, the compiler is slightly different.... it isolates the program being compiled from the compiler itself. This is done to keep redefinitions that might break the currently running compiler from doing just that. Redefinitions by the compiled program are only supposed to be visible to the compiled program. Since the above program never itself invokes values, it should never hit the call to %panic... except that it does.
What's happening here lies in the processing of the second definition. The definition itself is transformed a couple times by macroexpansion, first to this:
(%define test (named-lambda test (x) (+ x 1)))
And then, basically, to this:
(%define test (%lambda ((name . test) (lambda-list x)) (x) (+ x 1)))
The second macroexpansion step is the step that looks for optional arguments, and the internal function that parses lambda lists for optional arguments returns three values using values. This invocation of values happens in the environment of the program being compiled, so it hits the new %panic-invoking definition and the whole show grinds to a halt. The 'easy' fix, ensuring that macro expansion is isolated from potentially harmful redefinitions, won't work. Macro expansion has to happen in the user environment, so that macros can see function definitions that they might rely upon.
I don't have a unit test for the user/compiler separation logic, so I thought when I started this blog post I was going to say something like: 'look, something else fundamentally broken, and without a test case'. That's interesting, but if you need convincing to write unit tests, you're probably already lost. What I actually learned while researching this post is a bit more subtle: it's a fundamental problem, but it's more about the design than the code itself. While the design I have for user/compiler separation seems to work most of the time, it's not adequate to solve this kind of problem. I'm not yet exactly sure what the solution is, but it won't necessarily involve a missing unit test.
[/tech/programming] permanent link
Sun, 20 Jan 2008
The other day, I had the following (abbreviated) dialog with my little
Scheme interpreter:
scheme> (intern! 'xyzzy2 (find-package "keyword"))
; Fatal Error: Assertation Failed: STRINGP(pname) @ (oblist.cpp:451)
c:\vcalc>vcsh.exe
scheme> (intern! 12)
; Fatal Error: Assertation Failed: STRINGP(sym_name) @ (oblist.cpp:269)
c:\vcalc>
Needless to say, 'Fatal errors' aren't good things, and fatal errors in intern!, a core function, are even worse. Without going into too many details, the first call should be returning successfully, and the second should be throwing a runtime type check error. However, the implementation of intern! wasn't checking argument types and was passing invalid arguments into lower layers of the interpreter's oblist (symbol table) code, which died with an assertion failure.
To put this in perspective, my implementation of intern! is about five years old, and something that I thought to be a fairly reliable piece of code. At the very least, I didn't think it was susceptible to something as simple as a type checking error that would crash the entire interpreter. Of course, when I looked at my test suite, there wasn't a set of tests for intern!. That might have something to do with it, don't you think?
Here are the morals I'm taking from this little story:
[/tech/programming] permanent link
Thu, 17 Jan 2008
I don't usually write posts for the sole purpose of linking to other
posts, but this is an exception. This is
brilliant. What it is is the USDA's
Food Pyramid, but adapted to how programmers should spend their time.
My one complaint is that it's way too focused on coding. My experience
has been that it really pays to spend time on design work and learning
how to better interact with others, be they clients or team-mates. If you
can design your way out of a rewrite, or work with your client to recast
requirements to save complexity, it can be far more cost effective than
even the best raw code.
[/tech/programming] permanent link
Wed, 09 Jan 2008
In my career, I've done a bit of switching back and forth between Emacs and various
IDEs. One of the IDE features I've come to depend on is quick access
to the compiler. Typically, IDEs make it possible to compile your
project with a keystroke, and then navigate from error to error at the
press of a key. It's easy to recreate this in Emacs. The following two
expressions make Emacs work a lot like Visual Studio in this regard.
(global-set-key [(shift f5)] 'compile)
(global-set-key [f12] 'next-error)
After these forms are evaluated, pressing Shift-F5 invokes the compile command, which asks for a command to be run in an inferior shell, typically make, ant, or some other build utility. The catch is that it runs the command in the directory of the current buffer, which implies that the build script can be found in the same directory as the current source file. For a Java project with a per-package directory hierarchy, this is often not true. There are probably a bunch of ways to fix this, but I've solved it with a Windows NT batch file, ant-up.bat, that repeatedly searches up the directory hierarchy for build.xml. I just compile with ant-up, rather than a direct invocation of ant. This is not the most elegant solution, I'm sure, but it took about five minutes to implement and works well.
@echo off
setlocal
:retry
set last_path=%CD%
echo Searching %CD% ...
if exist build.xml goto compile-now
cd ..
if "%last_path%"=="%CD%" goto abort
goto retry
:compile-now
call ant -k %1 %2 %3 %4 %5
if errorlevel 1 goto failure
goto success
:abort
echo build.xml not found... compile failed
:failure
exit /b 1
:success
exit /b 0
[/tech/programming] permanent link
Tue, 08 Jan 2008
Lately, I've been thinking a bit about the way language design
influences library design. My line of thought started out inspired by
some of the recent conversations about closures in Java, but it ended
up also touching on dynamic typing and a few other 'modern' language
features. This will end up being more than one post, but I thought
I'd record some of it in my blog, with the hope that it might shed
some light for somebody, somewhere.
To motivate this discussion, I'll use as an example a simple C implementation of a string-interning function, intern_string. If you're not familiar with the concept of interning, the premise is that interning two objects ensures that if they have the same value, they also have the same identity. In the case of C strings, interning ensures that if strcmp(intern_string(a), intern_string(b)) == 0 holds true, then intern_string(a) == intern_string(b) also holds true. Since it effectively means that each string value is only stored one time, this technique can reduce memory requirements. It also gives you a cheap string equality comparison: checking two interned strings for equality reduces to a pointer comparison, which is about as fast as it gets.
Given a hash table that compares keys by value, implementing the function intern_string is fairly simple. In the following code, intern_table is a hash table that maps strings to themselves. hash_ref, hash_set, and hash_has are functions that manipulate the hash table:
Note the critical assumption that the hash_* accessors implement key comparison by value semantics, strcmp, rather than identity semantics, ==.
hash_table_t intern_table; // assume this is initialized somewhere else.

char *intern_string(char *str)
{
    if (hash_has(intern_table, str))
        return hash_ref(intern_table, str);

    char *interned_str = strdup(str);
    hash_set(intern_table, interned_str, interned_str);
    return interned_str;
}
The first step of intern_string is to check to see if the
intern table already contains a string with the value of the new
string. If the new string is already in the intern table, then the
function returns the copy that's in the table. Otherwise, a new copy
of the incoming string is created and stored in the hash table. In
either case, the string returned is in the intern table. This
logic ensures that every time intern_string is called with a
str of the same value, it returns the same exact string.
If you haven't guessed already, the problem with this implementation of intern_string lies in the dual calls to hash_has and hash_ref. Both calls involve searching the hash table for the key: hash_has to determine if the key exists, and hash_ref to retrieve the key's value. This means that in the common case, interning a string that's already been interned, this implementation searches the hash table twice. Double work.
Fixing this problem involves changing the calling conventions for hash_ref. One of the simplest ways to do this involves defining a specific return value that hash_ref can return in the 'key not found' case. For strings in C, NULL is a logical choice. This change to hash_ref makes it possible to remove the double search by eliminating the explicit hash_has check:
hash_table_t intern_table;

char *intern_string(char *str)
{
    char *interned_str = hash_ref(intern_table, str);

    if (interned_str == NULL)
    {
        interned_str = strdup(str);
        hash_set(intern_table, interned_str, interned_str);
    }
    return interned_str;
}
For string interning, this change to the hash_ref interface
works fairly well. We know that we'll never store a hash key with a
NULL value, so we know that NULL is safe to use for
signaling that a key was not found. Were this ever to change, this
version of hash_ref doesn't return enough information to
distinguish between the 'key not found' case and the 'NULL
value' case. Both would return NULL. To fix this,
hash_ref needs to be extended to also return a separate value
that indicates if the key was found. This can be done in C by having
hash_ref return the 'key found' flag as a return value, and
also accept a pointer to a buffer that will contain the key's value,
if it's found:
hash_table_t intern_table;

char *intern_string(char *str)
{
    char *interned_str;

    if (!hash_ref(intern_table, str, &interned_str))
    {
        interned_str = strdup(str);
        hash_set(intern_table, interned_str, interned_str);
    }
    return interned_str;
}
This is probably about as good as you can get in straight C. It
easily distinguishes between the 'no-value' and 'no-key' cases, it's
relatively clear to read, and it uses the common idioms of the
language. However, C is a relatively sparse language. If you're
willing to switch to something a bit more expressive, you have other
choices.
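To make the calling convention concrete, here is a minimal sketch of a table that supports this last version of hash_ref. Everything in it is an assumption made for illustration: the fixed bucket count, the chained-bucket layout, the bucket_of hash function, and the dup_string helper (a stand-in for strdup) are hypothetical, and are not the hash table the code above was written against.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 64 /* hypothetical fixed size; a real table would resize */

typedef struct entry {
    char *key;
    char *value;
    struct entry *next;
} entry_t;

struct hash_table {
    entry_t *buckets[NBUCKETS];
};
typedef struct hash_table *hash_table_t;

/* Simple multiplicative string hash, reduced to a bucket index. */
static size_t bucket_of(const char *s)
{
    size_t h = 5381;
    while (*s != '\0')
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Returns true and stores the value in *value_out if the key is present.
   Keys are compared by value (strcmp), not by identity (==). */
bool hash_ref(hash_table_t table, const char *key, char **value_out)
{
    for (entry_t *e = table->buckets[bucket_of(key)]; e != NULL; e = e->next) {
        if (strcmp(e->key, key) == 0) {
            *value_out = e->value;
            return true;
        }
    }
    return false;
}

/* Prepends a new entry without checking for an existing key;
   intern_string only calls it after a failed lookup. */
void hash_set(hash_table_t table, char *key, char *value)
{
    size_t b = bucket_of(key);
    entry_t *e = malloc(sizeof *e);
    assert(e != NULL);
    e->key = key;
    e->value = value;
    e->next = table->buckets[b];
    table->buckets[b] = e;
}

/* Stand-in for strdup, which is POSIX rather than ISO C. */
static char *dup_string(const char *s)
{
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    assert(copy != NULL);
    memcpy(copy, s, n);
    return copy;
}

static struct hash_table intern_storage; /* zero-initialized: every bucket empty */
static hash_table_t intern_table = &intern_storage;

char *intern_string(const char *str)
{
    char *interned_str;

    /* One search only: the flag says whether the key was found. */
    if (!hash_ref(intern_table, str, &interned_str)) {
        interned_str = dup_string(str);
        hash_set(intern_table, interned_str, interned_str);
    }
    return interned_str;
}
```

With this in place, two separately stored strings with the same contents intern to the same pointer, so comparing them afterwards is a single pointer comparison.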
One example of this is a choice that's particularly well supported by dynamically typed languages. With a language that can identify types at runtime, it becomes possible for hash_ref to return values of a different type if the key is not found. This value can be distinguished from other return values by virtue of the run time type identification supported by the language. In one such language, Scheme, this lets us implement intern-string like this:
(define *intern-table* (make-hash :equal))

(define (intern-string str)
  (let ((interned-str (hash-ref *intern-table* str 42)))
    (cond ((eqv? interned-str 42)
           (hash-set! *intern-table* str str)
           str)
          (#t
           interned-str))))
If you prefer C/JavaScript-style syntax, it looks like this:
var intern_table = make_hash(EQUAL);

function intern_string(str)
{
    var interned_str = hash_ref(intern_table, str, 42);

    if (interned_str === 42)
    {
        hash_set(intern_table, str, str);
        return str;
    }
    return interned_str;
}
In this case, hash_ref has been extended with a third
argument: a default return value if the key is not found. The above
code uses this to have hash_ref return a number in the 'key not found'
case, and it's the type itself of this return value that ensures its
distinctness. This is a common dynamic language idiom, but for a
moment, consider what it would look like in C:
hash_table_t intern_table;

char *intern_string(char *str)
{
    char *interned_str = hash_ref(intern_table, str, (char *)42);

    if (interned_str == (char *)42)
    {
        interned_str = strdup(str);
        hash_set(intern_table, interned_str, interned_str);
    }
    return interned_str;
}
At first, this actually seems like it might be a plausible implementation
of intern_string. My guess is that it might even work most of
the time. Where this implementation gets into trouble is the case when
an interned string might reasonably be located at address 42. Because
C is statically typed, when hash_ref returns, all it returns
is a pointer. The caller cannot distinguish between the 'valid string
at address 42' return value and the 'no-key' return value. This is
basically the same problem as the case where we overloaded NULL
to signal 'no-key'.
The way the dynamically typed language solved this problem is worth considering. When a dynamically typed language passes a value, what it's really doing is returning a pointer along with a few extra bits describing the type of the object being pointed to. (Runtime implementations might vary, but that's the gist of many.) Using dynamic typing to distinguish between those two possible cases really amounts to using those few extra type bits to contain 'another' return value, one holding information on whether or not the key was found. This is exactly what our 'best' C implementation does explicitly with a return value and a reference value. The dynamic typing isn't necessarily adding any expressive power, but it is giving another, concise means of expressing what we're trying to say.
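The "few extra type bits" can be spelled out by hand in C. The following is only a sketch under assumptions: the value_t struct, its two type tags, and the tiny linear table standing in for a hash table are all hypothetical, and real runtimes usually pack the tag far more compactly. What it shows is that a tagged hash_ref with a default argument can return a number for a missing key without any risk of confusing it with a string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A hand-rolled tagged value: the tag plays the role of the type bits. */
typedef enum { TYPE_FIXNUM, TYPE_STRING } value_type_t;

typedef struct {
    value_type_t type;
    union {
        long fixnum;
        const char *string;
    } as;
} value_t;

static value_t make_fixnum(long n)
{
    value_t v;
    v.type = TYPE_FIXNUM;
    v.as.fixnum = n;
    return v;
}

static value_t make_string(const char *s)
{
    value_t v;
    v.type = TYPE_STRING;
    v.as.string = s;
    return v;
}

/* Hypothetical stand-in for a hash table: a small linear array of keys. */
#define TABLE_MAX 16
static const char *table_keys[TABLE_MAX];
static size_t table_count;

/* Returns the stored string for key, or default_value if key is absent. */
value_t hash_ref(const char *key, value_t default_value)
{
    for (size_t i = 0; i < table_count; i++)
        if (strcmp(table_keys[i], key) == 0)
            return make_string(table_keys[i]);
    return default_value;
}

const char *intern_string(const char *str)
{
    value_t found = hash_ref(str, make_fixnum(42));

    /* A stored string never carries the fixnum tag, so the tag alone
       tells us the key was missing, whatever address the string has. */
    if (found.type == TYPE_FIXNUM) {
        assert(table_count < TABLE_MAX);
        /* Like the Scheme version, this stores the caller's string itself,
           so the caller must keep it alive. */
        table_keys[table_count++] = str;
        return str;
    }
    return found.as.string;
}
```

The tag check here does the same job as the 'key found' flag in the out-parameter version of hash_ref; the difference is only in where the extra bit of information travels.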
[/tech/programming] permanent link
Wed, 14 Dec 2005
I've been doing a lot of analysis of feeds and reports lately, and
have come up with a couple suggestions for file design that can make
feeds easier to work with. None of this should be earth shattering
advice, but collectively it can mean the difference between an
easy file to work with and a complete pain in the ...well you know.
- Prefer machine readable formats - "Pretty printers" for reports have a lot of utility: they can make it easy for users to view and understand results. However, they also have disadvantages: it's harder to use "pretty" reports for the further downstream processing that someone will inevitably want to do. This is something that needs to be considered carefully, keeping your audience in mind, but if possible, pick a format that a machine can easily work with.
- Use a standard file format - There are lots of standard formats available for reports and feeds: XML, CSV, Tab Delimited, S-Expression, INI File, etc. Use one of these. Tools already exist to process and manipulate these kinds of files, and one of these formats will be able to contain your data.
- Prefer the simplest format that will work - The simpler the format, the easier it will be to parse/handle. CSV is a good example of this: XML is sexier and much more powerful, but CSV has been around forever and has many more tools. A good example of what I mean is XML support in Excel. Excel has been getting XML support in the most recent versions, but it's had CSV support since the beginning. Also, from a conceptual standpoint, anybody who can understand a spreadsheet can understand a tabular file, but hierarchical data is considerably more complex a concept. (In business settings, there's a very good chance your feed/report audience will be business analysts that know Excel backwards and forwards but have no technical computer science training.)
- Prefer delimited formats to formats based on field widths - The thing about having columns based on field widths (column 1 is 10 characters wide, column 2 is 20, etc.) is that you have to remember and specify the field widths when you want to extract out the tabular data. In the worst case, without the column widths you can't read your file at all. In the best case, it's just something else you have to do when you load a file.
- If you specify column names, ensure they are unique. - This isn't necessary for a lot of data analysis tools, but some tools (cough... MS Access) get confused when importing a table with multiple columns of the same name.
- Include a header that describes the feed. - To fully understand the contents of a
file, you really have to understand what it contains and where it came from. This is useful
both in testing (did this report come from build 28 or build 29?) and in production (when was
this file generated?) My suggestions for header contents include:
- The version of the report specification
- Name of the source application
- Version of the source application (This version number should be updated with every build.)
- Environment in which the source application was running to produce the report.
- The date on which the report was run
- If the report has an effective date, include it too.
- Document your report - Without good, precise documentation of your file format, it'll be very hard to reliably consume files in the format. Similarly, have as many people as possible peer review your file format. Even if your system's code is complete garbage, the file format represents an interface to your system that will possibly live much longer than the system itself.
[/tech/programming] permanent link
Fri, 04 Nov 2005
I don't know who this person is, but they have a good
collection of programming
tips online. A lot of this stuff looks pretty relevant.
Related to that is this deck of slides written by Kent Pitman and Peter Norvig. It's an excellent discussion of good programming style in Lisp.
[/tech/programming] permanent link
Wed, 07 Sep 2005
One of the first functions I like to write when creating
a new data structure is a human-readable dumper. This
is a simple function that takes the data you're working
with and dumps it to an output stream in a readable way.
I've found that these things can save huge amounts of
debugging time: rather than paging through debugger watch
windows, you can assess your program's state by calling
a function and reading it out by eye.
A few tips for dump functions:
- The more use this kind of scaffolding code gets, the more cost effective it is to write. Time spent before dumpers are in place reduces the amount of use they can get and makes them progressively less cost effective. Implement them early, if you can.
- Look for cheap alternatives already in your toolkit: Lisp can already print most of its structures, and .Net includes object serialization to XML. The standard solution might not be perfect, but it is 'free'.
- Make sure your dumpers are correct from the outset. The whole point of this is to save debugging time later on; if you can't trust your view into your data structures during debugging, it will cost you time.
- Dump into standard formats. If you can, dump into something like CSV, XML, S-expressions, or Dotty. If you have a considerable amount of data to analyze, this'll make it easier to use other tools to do some of the work.
- Maintain your dumpers. Your software isn't going to go away, and neither are your data structures. If it's useful during initial development, it's likely to be useful during maintenance.
- For structures that might be shared, or exist on the system heap, printing object addresses and reference counts can be very useful.
- For big structures, it can be useful to elide specific content. For example: a list of 1000 items can be printed as (item_0, item_1, item_2, ..., item_999 ).
- This stuff works for disk files too. For binary save formats, specific tooling to examine files can save time compared to an on-disk hex-editor/viewer. (Since you have code to read your disk format into a data structure in memory, if you also have code to dump your in-memory structure, this does not have to be much more work. Sharing code between the dump utility and the actual application also makes it more likely the dumper will show you the same view your application will see.)
- Reading dumped structures back in can also be useful.
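As a rough sketch of a few of these tips, here is what a dumper might look like in C. The IntList structure and the dumpIntList function are hypothetical, invented only for illustration. The dumper prints the object address and reference count, writes an S-expression style line, and elides the middle of long lists:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical reference-counted list, invented for this example. */
typedef struct IntList
{
    int        references;  /* reference count */
    size_t     count;       /* number of items */
    const int *items;       /* the items themselves */
} IntList;

/* Number of leading items shown before the dump starts eliding. */
#define DUMP_HEAD_ITEMS 3

/* Writes a one-line, S-expression style dump of list to out.
 * Returns the number of items actually printed (elided items
 * are not counted). */
size_t dumpIntList(FILE *out, const IntList *list)
{
    size_t printed = 0;
    size_t i;

    assert(out != NULL);

    if (list == NULL)
    {
        fprintf(out, "(int-list null)\n");
        return 0;
    }

    /* Address and reference count help when structures are shared. */
    fprintf(out, "(int-list @%p refs=%d count=%lu (",
            (const void *)list, list->references,
            (unsigned long)list->count);

    for (i = 0; i < list->count; i++)
    {
        int isHead = (i < DUMP_HEAD_ITEMS);
        int isLast = (i == list->count - 1);

        if (!isHead && !isLast)
        {
            /* Print the ellipsis once, at the first elided item. */
            if (i == DUMP_HEAD_ITEMS)
                fprintf(out, " ...");
            continue;
        }

        fprintf(out, "%s%d", (printed > 0) ? " " : "", list->items[i]);
        printed++;
    }

    fprintf(out, "))\n");
    return printed;
}
```

Called on a 1000 item list holding 0 through 999, the item portion of the output reads (0 1 2 ... 999).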
[/tech/programming] permanent link
Tue, 12 Apr 2005
Global variables tend to get a bad rap, kind of like goto and pointers.
Personally, I think they can be pretty useful if you are careful.
Here are the guidelines I use to determine if a global variable
is an appropriate solution:
- Will there ever be a need for more than one instance of the variable?
- How much complexity does passing the variable to all its accessors
entail?
- Does the variable represent global state? (A heap free list, configuration
information, a pool of threads, a global mutex, etc.)
- Can the data be more effectively modeled as a static variable in a
function or private member variable in a singleton object? (Both
of these are other forms of global storage, but they wrap the variable
accesses in accessor functions.)
- Can you support the lifecycle you need for the variable any other way? Global
variables exist for the duration of your program's run-time. Local variables
exist for the duration of a function. If you don't have heap allocated
variables, or if your heap allocator sucks, then a global variable might
be the best way to get to storage that lasts longer than any one function
invocation.
- Do you need to use environment features that are specific to globals?
In MSVC++, this can mean things like
specifying the segment in which a global is stored or declaring a variable as
thread-local.
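To illustrate the static-variable-in-a-function option from the list above, here is a minimal sketch in C. The Config structure and the accessor names are hypothetical, made up for this example. The storage still lasts for the whole run of the program, but every access goes through a function:

```c
#include <assert.h>

/* Hypothetical configuration record, invented for this example. */
typedef struct Config
{
    int verbosity;
    int maxThreads;
} Config;

/* The only way to reach the configuration. The static local has
 * the same lifetime as a global, but its name is not visible
 * outside this function. */
Config *getConfig(void)
{
    static Config config = { 0, 4 };
    return &config;
}

/* Accessors give you one place to add validation, logging,
 * or locking later on. */
void setVerbosity(int level)
{
    assert(level >= 0);
    getConfig()->verbosity = level;
}

int getVerbosity(void)
{
    return getConfig()->verbosity;
}
```

If it later turns out that more than one instance is needed, the accessors are the only code that has to learn where the instances live.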
[/tech/programming] permanent link
Tue, 01 Mar 2005
Idempotence has benefits at a program's run-time, as well as at
build time. To illustrate, consider the case of a reference
counted string. For the sake of example, it might be declared like
this (In case you're wondering, no, I don't think this is a
production-ready counted string library...):
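#include <stdlib.h>
#include <string.h>

typedef struct CountedString
{
    int _references;   // number of live references to this string
    char *_data;       // heap-allocated copy of the characters
} CountedString;

// Create a new string holding one reference. (Allocation failures
// are not checked here; strdup is POSIX, not ISO C before C23.)
CountedString *makeString(const char *data)
{
    CountedString *cs = (CountedString *)malloc(sizeof(CountedString));
    cs->_references = 1;
    cs->_data = strdup(data);
    return cs;
}

// Take out another reference to an existing string.
CountedString *referToString(CountedString *cs)
{
    cs->_references++;
    return cs;
}

// Give up a reference, freeing the string when the last one goes.
void doneWithString(CountedString *cs)
{
    cs->_references--;
    if (cs->_references == 0)
    {
        free(cs->_data);
        free(cs);
    }
}

// ... useful library functions go here...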
The reference counting mechanism buys you two things. It gives you the ability to delete strings when they're no longer accessible; it also gives you the ability to avoid string copies by deferring them to the last possible moment. This second benefit, known as copy-on-write, is where idempotence can play a role. Copy-on-write means that whenever you write to a resource, you first ensure that you have a copy unique to yourself. If the copy you have isn't unique, copy-on-write requires that you duplicate the resource and modify the copy instead of the original. If you never modify the string, you never make the copy.
This means that the beginning of every string function that alters a string has to look something like this:
CountedString *alterString(CountedString *cs)
{
    if (cs->_references > 1)
    {
        // Someone else shares this string: make a private copy
        // and give up our reference to the shared one.
        CountedString *uniqueString = makeString(cs->_data);
        doneWithString(cs);
        cs = uniqueString;
    }
    // ... now, cs can be modified at will
    return cs;
}
Apply a little refactoring, and you get this...
CountedString *ensureUniqueInstance(CountedString *cs)
{
    if (cs->_references > 1)
    {
        CountedString *uniqueString = makeString(cs->_data);
        doneWithString(cs);
        cs = uniqueString;
    }
    return cs;
}

CountedString *alterString(CountedString *cs)
{
    cs = ensureUniqueInstance(cs);
    // ... now, cs can be modified at will
    return cs;
}
Of course,
ensureUniqueInstance ends up being idempotent: it gets you
into a known state from an unknown state, and it doesn't (semantically) matter if
you call it too often. That's the key insight into why idempotence can be useful.
Because idempotent processes don't rely on foreknowledge of your system's state to work reliably,
they can be a predictable means to get into a known state. Also, if you hide
idempotent processes behind the appropriate abstractions, they allow you to write
code that's more self-documenting. A function that begins with a line like
cs = ensureUniqueInstance(cs); more clearly says to the reader that it
needs a unique instance of cs than lines of code that check the reference count of
cs and potentially duplicate it.
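To see the idempotence in action, here is a self-contained sketch. It repeats the definitions from this post (with a typedef so it compiles as plain C) and adds two helpers that are assumptions of this example and not part of the library above: copyChars, a local stand-in for strdup, and upcaseFirst, a made-up mutating function.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

typedef struct CountedString
{
    int _references;
    char *_data;
} CountedString;

/* Hypothetical helper: a local stand-in for strdup. */
static char *copyChars(const char *data)
{
    size_t size = strlen(data) + 1;
    char *copy = (char *)malloc(size);
    assert(copy != NULL);
    memcpy(copy, data, size);
    return copy;
}

CountedString *makeString(const char *data)
{
    CountedString *cs = (CountedString *)malloc(sizeof(CountedString));
    assert(cs != NULL);
    cs->_references = 1;
    cs->_data = copyChars(data);
    return cs;
}

CountedString *referToString(CountedString *cs)
{
    cs->_references++;
    return cs;
}

void doneWithString(CountedString *cs)
{
    cs->_references--;
    if (cs->_references == 0)
    {
        free(cs->_data);
        free(cs);
    }
}

/* Idempotent: the first call may copy, and any further calls
 * on the result return it unchanged. */
CountedString *ensureUniqueInstance(CountedString *cs)
{
    if (cs->_references > 1)
    {
        CountedString *uniqueString = makeString(cs->_data);
        doneWithString(cs);
        cs = uniqueString;
    }
    return cs;
}

/* Hypothetical mutator: upper-cases the first character. */
CountedString *upcaseFirst(CountedString *cs)
{
    cs = ensureUniqueInstance(cs);
    cs->_data[0] = (char)toupper((unsigned char)cs->_data[0]);
    return cs;
}
```

With two references to the same string, the first call to ensureUniqueInstance on one of them returns a fresh copy and leaves the other holder alone. Calling it again on that copy returns the same pointer, so a function like upcaseFirst never has to know whether one of its callers already made the string unique.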
Next up are a few more examples of idempotence, as well as a look into some of the pitfalls.
[/tech/programming/idempotence] permanent link
Tue, 22 Feb 2005
There's a good
definition of the word idempotent over on Dictionary.com. In a nutshell, the
word is used to describe mathematical functions that satisfy the relationship
f(x)=f(f(x)): functions for which repeated applications produce the same result
as the first. For functions that satisfy this condition, you can rest assured
that you can apply the function as many times as you like, get the expected
result, and not screw anything up if you apply it more times than you
absolutely need. This turns out to be a useful concept for people developing
software systems.
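As a quick illustration of the f(x)=f(f(x)) relationship, here is a small sketch in C. All three function names are made up for this example. Clamping a value into a range is idempotent; incrementing is not:

```c
#include <assert.h>
#include <stddef.h>

/* Idempotent: once a value is clamped into 0..100, clamping it
 * again changes nothing. */
int clampToPercent(int x)
{
    if (x < 0)
        return 0;
    if (x > 100)
        return 100;
    return x;
}

/* Not idempotent: every application changes the result. */
int increment(int x)
{
    return x + 1;
}

/* Returns nonzero if f(f(x)) == f(x) for this particular x. */
int isIdempotentAt(int (*f)(int), int x)
{
    assert(f != NULL);
    return f(f(x)) == f(x);
}
```

Checking a handful of inputs this way doesn't prove a function is idempotent, but a single failing input is enough to show that it isn't.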
One of the most common examples of this is in C-style include files. It's common practice to write code like this, to guard against multiple inclusions:
#ifndef HEADER_FILE_GUARD
#define HEADER_FILE_GUARD
// ... declarations go here...
#endif // HEADER_FILE_GUARD
This idiomatic C code protects the include file against multiple inclusions. Include files with this style of guard can be included as many times as you like with no ill effect.
The benefit to this is that it basically changes the meaning of the code #include <foo.h> from "Include these declarations" to "Ensure that these declarations have been made". That's a much safer kind of statement to make since it delegates the whole issue of multiple inclusions to a simple piece of automated logic.
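One way to watch the guard work without juggling several files is to paste the contents of a header into a single source file twice, which is roughly what the preprocessor does when a header gets included twice. The header contents, the POINT_H_GUARD macro, and the manhattanLength function below are all hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* First "inclusion" of a hypothetical point.h, pasted inline. */
#ifndef POINT_H_GUARD
#define POINT_H_GUARD
typedef struct Point
{
    int x;
    int y;
} Point;
#endif /* POINT_H_GUARD */

/* Second "inclusion" of the same text. The guard macro is already
 * defined, so the preprocessor skips everything up to the #endif.
 * Without the guard, this would be a redefinition of struct Point
 * and the file would not compile. */
#ifndef POINT_H_GUARD
#define POINT_H_GUARD
typedef struct Point
{
    int x;
    int y;
} Point;
#endif /* POINT_H_GUARD */

/* Code after both inclusions sees exactly one declaration. */
int manhattanLength(const Point *p)
{
    assert(p != NULL);
    return abs(p->x) + abs(p->y);
}
```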
Of course, this is pretty commonplace. More is to come...
[/tech/programming/idempotence] permanent link
Thu, 12 Aug 2004
I'm hardly an authority on programming-related issues, but over the
years I've developed a small bag of tricks I use to make it easier
to quickly develop reliable (and debuggable/testable) software.
This post is the beginning of a topic I'm going to use to describe some
of what works for me.
Most of what I'll be talking about is pretty obvious, I hope. I've just seen enough people not do this stuff that it's worth getting it down in writing.