Mike Schaeffer's Weblog
Fri, 01 Feb 2008
Excel, CR/LF, and CSV
I've spent a fair amount of time lately working with code that generates Comma Seperated Value files for loading into Excel. You'd think the format would be trivial, but not quite. One additional subtlety, one not covered in that 'specification', is Excel's inconsistent handling of end of line markers. As it turns out, if Excel loads a CSV file that contains a quoted, multi-line value, it expects a different line feed convention within the quoted value than the usual CR/LF. A CR embedded in a quoted field renders as a box, rather than as part of a newline. To suppress the box, CSV files for Excel need to be written with a LF-only convention within quoted values. Even then, Excel will not automatically expand rows containing a multi-line value. That has to be done manually.

Internally, Excel seems to follow the same LF-only convention that this issue with CSV files seems to imply. Taking the CODE(...) of each character in a manually entered multi-line cell value, shows only one charater, a LF, at each line break. My guess is that the quotes in a CSV file just act as a signal to turn off all special character handling, not just handling that signals new rows and cells. Either way, it's more than a little irritating that Excel compatible CSV files with multi-line values have to have two seperate end of line conventions.

reddit this! Digg Me!

[/tech/excel] permanent link

Fri, 23 Jun 2006
What's in *your* clipboard...
The clipboard is one of the oldest and most often used data exchange mechanisms in Microsoft Windows. It's been around since Windows 1.0 and basically all applications use it to support operations like copy, cut, and paste. One of the more interesting aspects of the way the clipboard works is that it allows applications to copy data to the clipboard in multiple formats. For example, if you copy text in Microsoft Word to the clipboard, it's not copied in one format, it's copied in seventeen. One way to see this is to open the Clipbook viewer application (start>>run, clipbrd.exe), open the Clipboard window, and look at the list of data formats in the view sub menu. This is how different applications negotiate data formats when copying and pasting data; this is why you can copy a spreadsheet from super-smart Excel, paste it into super-dumb Notepad, and still get reasonable results.

In the process of adding better clipboard support to vCalc, I wrote a small tool for dumping clipboard contents to a console window. This tool, cbdump, runs from the command line and shows a list of all the data formats currently on the clipboard. It can also dump out the data in hexadecimal format, so you can see the actual data, unaltered by applications. To show what it looks like, here's sample output after copying a small Excel table to the clipboard:
c009: "DataObject"  (4 bytes)
000e: CF_ENHMETAFILE  (0 bytes)
0003: CF_METAFILEPICT  (16 bytes)
0002: CF_BITMAP  (0 bytes)
c2e2: "Biff8"  (5120 bytes)
c2e6: "Biff5"  (4608 bytes)
c1f3: "BIFF4"  (1904 bytes)
c2e3: "Biff3"  (1773 bytes)
c2f7: "Biff"  (913 bytes)
0004: CF_SYLK  (1131 bytes)
c2ec: "Wk1"  (197 bytes)
0005: CF_DIF  (137 bytes)
c298: "XML Spreadsheet"  (943 bytes)
c0f2: "HTML Format"  (2344 bytes)
000d: CF_UNICODETEXT  (22 bytes)
0001: CF_TEXT  (11 bytes)
c295: "Csv"  (11 bytes)
c0a5: "Rich Text Format"  (3071 bytes)
c00b: "Embed Source"  (6144 bytes)
c004: "Native"  (6144 bytes)
c003: "OwnerLink" ERROR in GetClipboardData
c00e: "Object Descriptor"  (152 bytes)
c00d: "Link Source"  (135 bytes)
c00f: "Link Source Descriptor"  (152 bytes)
c1f2: "Link"  (31 bytes)
0081: CF_DSPTEXT  (13 bytes)
c002: "ObjectLink"  (39 bytes)
c013: "Ole Private Data"  (792 bytes)
0010: CF_LOCALE  (4 bytes)
0007: CF_OEMTEXT  (11 bytes)
0008: CF_DIB  (57368 bytes)
0011: CF_DIBV5  (57452 bytes)
150642 bytes
The leftmost column is the integer ID of the clipboard format. This is the ID used by Windows to identify the format used by a particular chunk of data. Following the ID is the clipboard format's name, of which there are two kinds. By default, Windows knows about a few predefined types of clipboard data: these are identified by constants in the header file and are things like CF_TEXT (text data) and CF_DIB (a device indepentant bitmap). However, to handle the case where one of the Windows default formats will not work, Windows also allows applicatons to register additional formats and give them useful names. In the list above, that includes formats like "XML Spreadsheet", "Csv", and "Biff8". This allows applications like Excel to communicate rich, specialized data to applications that support it (Excel itself being a good example).

A couple paragraphs ago, I used the word 'negotiate' when referring to the way that two applications use the clipboard to exchange data. As currently presented, that negotiation looks a lot like the source application making everything available and the sink application cherry picking the one format it wants. This is not a very balanced or efficient negotiation. However, the clipboard is actually more intelligent than that. To see what I mean, go to Excel, copy a huge range of data to the clipboard and run cbdump. The output list of formats will basically be the same as before, the difference is in the way the list appears. You'll notice that the list scrolls past unevenly, jerking along with some formats taking longer to list than others.

The reason for the uneven scrolling is that Windows does not force an application to always provide data in every clipboard format it supports. When Excel copies data to the clipboard, what it's really doing is telling Windows that it has the ability to provide data in a format, it is not necessarily rendering the data in every format at the time of the copy. Then, when a sink application requests data from the clipboard, Windows can see if it has an actual copy of the requested format. If not, it then requests that Excel render the clipboard contents in the requested format and passes that result to the requesting application. The way cbdump works, to show the size of each data format on the clipboard, it requests a copy of each available data format. This forces Excel to render every supported format, some of which take longer than others. If you happened to have a Windows message viewer looking at the message stream to the Excel window, you'd see a series of window messages requesting rendered copies of unrendered formats.

reddit this! Digg Me!

[/tech/excel] permanent link

Tue, 13 Jun 2006
Excel links...
Recently, I was asked for advice on how to learn Excel macro programming. I'm not the expert, but I read sites and books written by folks who are. Here are some useful links.

Websites: Books:
The last book is particularly interesting: it focuses on the C/C++ API to Microsoft Excel. Since Excel 5, when Microsoft introduced VBA and the Excel COM object model, this programming mechanism seems to have fallen out of vogue, but it is still being maintained, and represents a way to write really fast and secure Excel add-ins.

reddit this! Digg Me!

[/tech/excel] permanent link

Wed, 09 Nov 2005
Thirty days hath September...

"Thirty days hath September,
All the rest I can't remember.
The calendar hangs on the wall;
Why bother me with this at all?"

http://leapyearday.com/30Days.htm



Here's an Excel one liner that computes the number of days in a particular month. Cell A2 contains the year of the month you're looking for, Cell B2 contains the months' ordinal (1=January, 2=February, etc.):
=DAY(DATE(A2,B2+1,1)-1)
This is mainly useful to illustrate what can be done with Excel's internal representation of dates. Dates and times in Windows versions of Excel are normally stored as the number of days from January 1st, 1900. You can see this by entering a date in a cell, and then reformatting the cell to display as a number rather than a date. For example, this reveals April 1st, 2004 to be represented internally as the number 38078. This is because there are 38,078 days between January 1st, 1900 and April 1st, 2004.

The formula above relies on this in its computation of the number of days in a month. The sub-expression DATE(A2,B2+1,1) computes the date number for the first day of the month immediately following the month we're interested in. We then subtract one from that number, which gives us the date number for the last day of the month that we are interested in. The call to DAY then returns the number of the day within the month, which happens to be the number of days in the month.

reddit this! Digg Me!

[/tech/excel] permanent link

Fri, 04 Nov 2005
"Apropos of..." Better Autofilter Results.
At my job, we use Excel extensively to keep track of software testing progress. One typical use is to maintain a list of features to be tested, along with their current pass/fail statuses and an attempt at a rough subdivision into functional areas. Excel's AutoFilter then makes it easy to ask questions like "show me all failed tests relating to function block scheduling."

This works really well as long as "function block scheduling" is one of the categories into which you've subdivided your features list. If it's not, you have to get a little creative to filter your list. One approach to this problem I've found useful is filtering based on columns populated with a formula similar to this:
=IF(ISERROR(SEARCH($K$5,K6)),"No","Yes")
If column K contains feature descriptions, this formula returns "Yes" is the description matches the search string in K5 and "No", otherwise. Filtering based on this formula makes it possible to display every list item whose description matches a word. If there is more than one column to search, you can use string concatenation to aggregate the columns together:
=IF(ISERROR(SEARCH($K$5,K6&L6&M6)),"No","Yes")
So, why the name apropos? Follow this link.

reddit this! Digg Me!

[/tech/excel] permanent link

Tue, 11 Oct 2005
Excel 12's Conditional Formatting Rules
David Gainer has Summarized a a number of new conditional formatting rules in Excel 12, over on the Excel 12 blog. These rules were designed to "make a greater number of scenarios possible without needing to write formulas." In other words, all these scenarios have simple solutions directly visible in the Excel 12 UI.

Well, if you can't wait for Excel 12, Excel is pretty darned powerful as it is, and as Mr. Gainer states: most of these scenarios have formula-based approaches that work right now. Here are some of the approaches for current versions of Excel:
  • With data bars, color scales, or icons based on the numeric value in the cell, percentages, percentiles, or a formula. See the posts on data bars, color scales, and icon sets for more information on each of these. - This approach to 'databars' generalizes to formula-based scaling, although it's not as pretty, not a color scale, and not an icon set.
  • Containing, not containing, beginning with, or ending with specific text. For example, highlighting parts containing certain characters in a parts catalog. - Use a formula: a lot of these conditions can be tested using FIND: =FIND(string, A1)=1, checks for parts that begin with string, for example.
  • Containing dates that match dynamic conditions like yesterday, today, tomorrow, in the last 7 days, last week, this week, next week, last month, this month, next month. For example, highlight all items dated yesterday. The great part about these conditions is that Excel handles calculating the date based on the system clock, so the user doesn.t need to worry about updating the condition. - Use a formula: the system date is available via NOW(), and Excel offers plenty of date arithmetic functions to check for specific conditions.
  • That are blank or that are non-blank. - Use a formula: =ISBLANK(A1) or =NOT(ISBLANK(A1))
  • That have errors or that do not have errors. - Use a formula: =ISERROR(A1) or =NOT(ISERROR(A1))
  • That are in the top n of a selected range (where n is whatever number you want) OR that are in the top n percent of a selected range (again, where n is adjustable). For example, highlighting the top 10 investment returns in a table of 1,000 investments. - Use a formula: =RANK(A1, range)>n.
  • Cells that have the bottom n values OR cells that are the bottom n percent of a selected range. - Use a formula: =RANK(A1, range)<ROWS(range)-n.
  • Cells that are above average, below average, equal to or above average, equal to or below average, 1 standard deviation above, 1 standard deviation below, 2 standard deviations above, 2 standard deviations below, 3 standard deviations above, 3 standard deviations below a selected range. - This type of thing can be solved using a particular form of formula: =A1<(AVERAGE(range)-n*STDEV(range)) or =A1>(AVERAGE(range)+n*STDEV(range)). For large ranges, it probably makes sense to move the computation of AVERAGE and STDEV into a cell, and have the conditional format reference (with an absolute reference) that cell.
  • Cells that are duplicate values or, conversely, cells that are unique values. - Use a formula: =COUNTIF(range, A1)=1 or =COUNTIF(range, A1)>1. Ensure that the range you use in the formula has an absolute address. If your range is sorted on the 'key' field, you can use this style of formula: =A1<>A2. This can be much, much faster, particularly for large tables. (For the Comp. Sci. types it's O(N), rather than O(N^2), once you have sorted data.)
  • Based on comparisons between two columns in tables. For example, highlight values where values in the .Actual Sales. column are less than in the .Sales Target. column. - Use a conditional format formula: =A1<B1. Apply it to the entire column you want shaded, and Excel will evaluate the seperately for each cell. The cell references in the format formula are relative to the current cell in the selected range. The current cell is the cell in the range that is not highlighted (but is surrounded by a selection border), and can be moved around the four corners of the range with Control+. (period).
  • When working with tables, we have also made it easy to format the entire row based on the results of a condition. - Relative formulas can be made to do this: select an entire range, and define a conditional formula using absolute column addresses (ie: =$a1). Excel evaluates the format formula for each cell in the range, and since the column addresses are absolute, each cell in a row will pull from the came columns. Therefore, each cell in a row will share the same conditional format, which is what we want.
Based on this, you don't have to wait for Excel 12 to get a lot of these features, you just have to wait for Excel 12 if you want Excel to do it for you automatically. My suggestion would be to learn how to use conditional formatting formulas, but I tend to be "here's how to fish" kind of guy more than a "here's a fish" kind of guy.

reddit this! Digg Me!

[/tech/excel] permanent link

Fri, 07 Oct 2005
Excel 12 Databars, Without VBA.
I suspected as much, but Excel has a way to duplicate my UDF using Excel formulas.

=REPT("█",A1)&REPT("▌",ROUND(FLOOR(A1,1),0))

That formula evaluates to a bar of length A1 units, rounded to the nearest 0.5. Rescaling can be done in another cell. If you're interested in a bar that can be right-justified, you can use this:

=REPT("▐",ROUND(A1-FLOOR(A1,1),0))&REPT("█",A1)

The trickiest part about this is getting the block characters into the formula. For that, I reccomend using the Windows Character Map.

Qualitatively compared to VBA, this method requires more logic to be represented in the spreadsheet: that adds compelxity for readers and makes it tricker to set up than the VBA. On the other hand, it avoids the performance hit of calling UDF and the requirement that the spreadsheet contain a macro. I honestly don't know which is better style, but can say that this would be a perfect time to use a paramaterized range name (if Excel had such a thing).

reddit this! Digg Me!

[/tech/excel] permanent link

Excel 12 Databars, Now. (Sort of)
Microsoft has just announced a cool new feature on the Excel 12 blog: the databar. I think a picture (linked from Microsoft's Excel 12 Blog) can explain it better than I can:



This will be a nice way to look for trends/outliers, but I can also see it being useful for tracking parallel completion percentages in status reports, etc. Of the Excel 12 features announced so far, this is the one that I'm the most excited about. Of course, it's also the one that's easiest to approximate in Excel <12. Andrew has an approach using Autoshapes on his blog, and I'm going to present a slightly different approach.

IMO, his approach looks a lot better, this approach has the benefit of updating automatically. Pick your poison.

It all centers around this little UDF:
Option Explicit

Function GraphBar(x As Double, _
                  Low As Double, _
                  High As Double, _
                  ScaleTo As Double) As String

    x = ((x - Low) / (High - Low)) * ScaleTo
    
    Dim i As Integer
    
    Dim blockFull As String
    Dim blockHalf As String
    
    blockFull = ChrW(9608)
    blockHalf = ChrW(9612)
    
    GraphBar = ""
    
    For i = 1 To Fix(x)
        GraphBar = GraphBar + blockFull
    Next
    
    If x - Fix(x) > 0.5 Then
        GraphBar = GraphBar + blockHalf
    End If
End Function
This isn't rocket science: all it does is rescale x from the range [Low, High] to the range [0.0, ScaleTo]. Then, it strings together that many Chrw(9608)'s, followed by a Chrw(9612), if the scaled value's fractional part is >0.5. The trick in this is that Chrw(9608) and Chrw(9612) are VBA expressions that produce the the Unicode equivalent of the old line drawing characters IBM put in the original PC [1]. 9608 is a full box ("█"), 9612 is a half box on the left ("▌"). The result of this function ends up being a string that (when displayed as Arial) looks like a horizontal bar. ("████▌"). Put a few of those in adjacent cells, and you get this:



The formulat in C2 (and filled down) is =GraphBar(B2,MIN(B$2:B$8),MAX(B$2:B$8),5). The MIN and MAX set the scale, the 5 sets the maximum length of a bar. The maximum length, font size, column width can be tweaked to produce a reasonably attractive result, although I do reccomend using vertical centering.

If you want to get a little fancier, conditional formatting works on plot cells...



...whitespace can possibly improve the appearance...



...and this technique can scale.



1] (The original PC didn't have stanard graphics, it was an option. If you bought the monochrome, non-graphics, video board, characters like this were as close as you could get to a bar chart.)

reddit this! Digg Me!

[/tech/excel] permanent link

Thu, 06 Oct 2005
Excel: 'Repeat', the Simplest Macro
This is a simple little two-bit Excel trick that I find myself using all the time, particularly when formatting worksheets.

In Excel, Control+Y is the 'other half' of the Undo/Redo pair. If you undo an action and want to redo what you just undid, Control+Y undoes the undo, so to speak. However, if you haven't undone anything, and there's nothing on the redo queue, Control+Y repeats the last single action you took.

Repeatable actions can actually be quite complex. For example, opening the Format Cell dialog box and applying a format counts as one repeatable action, regardless of how many format attributes you change. Once you make that format change to one cell and before you do anything else Control+Y has become a key that applies that specific format change to as many other cells as you like.

In a sense, Control+Y is a command that's eternally bound to a simple macro that Excel keeps updating with your last action. If you plan your work to group actions together, this 'automatic' macro can save a lots of time.

reddit this! Digg Me!

[/tech/excel] permanent link

Mon, 03 Oct 2005
Formulas Driven from AutoFilters
I had this written out and then discovered a better way. SUBTOTAL is "sensitive to AutoFilter settings", right? Assuming A1 isn't empty, this formula =subtotal(a1, 2)=1 returns TRUE if row 1 is visible and FALSE otherwise. No VBA necessary.

Not too long ago, I made a post that describes how to replicate some of the behavior of Excel Autofilters using a purely formula based approach. One of the arguments I put forward in support of that technique is that it makes it possible to use filtered result sets to drive other calculations. However, the approach also has two disadvantages: it's slow to compute and can be a little tricky to setup and understand. As a sort of intermediate ground between using the AutoFilter and re-implementing it, this post describes how an Excel formula can determine if a row is a member of an AutoFilter result set. The magic bit is this little user defined function:
Function IsVisible(rng As Range) As Boolean
    
    IsVisible = True
    
    Dim row As Range
    Dim col As Range
           
    For Each row In rng.Rows
        If row.RowHeight = 0 Then
            IsVisible = False
            Exit Function
        End If
    Next
        
    For Each col In rng.Columns
        If col.ColumnWidth = 0 Then
            IsVisible = False
            Exit Function
        End If
    Next
End Function
Given a range, this function returns true if every cell in the range is visible (non-zero row height and column width). The way Excel works, the Row Height of a row hidden by the Autofilter is reported as zero. Therefore, IsVisible returns false when given a reference to a cell in a hidden AutoFiltered row. Of course, it also returns False for cells in manually hidden rows and columns, but if you're careful, you can avoid that.

For a simple use case, this function can be used to generate alternating color bars that always alternate regardless of the AutoFilter settings. To set it up, Put TRUE in the topmost cell of a free column next to the AutoFilter to be colored. Below the TRUE, fill down with a formula like this: =IF(isvisible(D2),NOT(D1),D1). This formula inverts the value in the column, but only for cells that are visible. This guarantees that regardless of the AutoFilter settings, this column will always alternate TRUE/FALSE in the set of visible rows. This column can then be used to drive a conditional format that highlights alternating visible rows.

A couple sidenotes:
  • This function works because adjusting an AutoFilter triggers recalculation, and Excel notices that this function depends on row heights. For hiding columns, it's a lot less reliable. All the calls to IsVisible have to be forced to recompute after the column is hidden or displayed. To do this, IsVisible can be marked as volatile and recalculation forced by pressing F9. This is a lousy solution.
  • To optimize performance, the function short-circuits its search. The Exit Function's bail out of the calculation as soon as the first hidden row or column is discovered.
  • Excel's SUBTOTAL intrinsic function is also sensitive to AutoFilter settings.


reddit this! Digg Me!

[/tech/excel] permanent link

Tue, 20 Sep 2005
The world's most popular functional programming language...
This may be something of a suprise, but Excel has even gotten the attention of Microsoft Research. Simon Peyton Jones, Margaret Burnett, and Alan Blackwell have written a paper that describes "extensions to the Excel spreadsheet that integrate user-defined functions into the spreadsheet grid". Of course, Excel doesn't do this... but, I wonder if that should be "Excel doesn't do this yet".

As a sidenote, this reminds me a little of how LabView handled subfunction definitions: subfunctions are defined using the same visual tools as top-level functions. It worked, but 'felt' a little heavy weight in actual use.

reddit this! Digg Me!

[/tech/excel] permanent link

MoveAfterReturn, OnTime, and the Excel Status Bar
I really liked This post by Dick Kusleika, over on Daily Dose of Excel. I'm a big fan of controlling frequently used options with keyboard shortcuts. To riff on Mr. Kusleika's post a little, here's a refinement I've found useful in the past for macros like these. Basically this allows the same macro to toggle a state as well as non-destructively display the current state.

The first time the macro MaybeToggleMAR is invoked, it displays the current state in the status bar, and sets a timer to expire in 3 seconds. If the macro is invoked a second time before the timer expires (easy to do if it's bound to a keystroke) the state is toggled. Technically speaking, the trickiest bit is that the function that sets the 3 second timer also has to handle cancelling any previous instance of the same timer. It works without the timer cancellation, but without it, the UI behaves oddly after multiple keypresses in rapid succession.

Chip Pearson's website has useful content discussing the Excel API's for Timers and the Status Bar.

Here's the code: to use it, stick it in a module and bind MaybeToggleMAR to the keyboard shortcut of your choice.
Option Explicit
Private MARChangesEnabled As Boolean

Public NextDisableTime As Double

Sub DisableMARChanges()
    Application.StatusBar = False
    MARChangesEnabled = False
End Sub

Sub DisableMARChangesSoon()
    On Error Resume Next
    Application.OnTime NextDisableTime, "DisableMARChanges", , False
    
    NextDisableTime = Now + TimeSerial(0, 0, 3)
    Application.OnTime NextDisableTime, "DisableMARChanges", , True
End Sub


Sub MaybeToggleMAR()
    Dim NewStatusText As String
    
    NewStatusText = ""
    
    If MARChangesEnabled Then
        Application.MoveAfterReturn = Not Application.MoveAfterReturn
        NewStatusText = "Status changed: "
    Else
        MARChangesEnabled = True
        NewStatusText = "Second press will change status: "
    End If
            
    If Application.MoveAfterReturn Then
        NewStatusText = NewStatusText & "MoveAfterReturn Enabled"
    Else
        NewStatusText = NewStatusText & "MoveAfterReturn Disabled"
    End If
                
    Application.StatusBar = NewStatusText
    
    DisableMARChangesSoon
End Sub


reddit this! Digg Me!

[/tech/excel] permanent link

Wed, 24 Aug 2005
I had a dream...
I literally dreamed about this last night. It would be wonderful if Excel supported formulas like this:
=LET(value=MATCH(item,range,0), IF(ISERROR(value), 0, value))
If you're into Lisp-y languages, it'd look like this:
(let ((value (match item range 0)))
  (if (is-error? value) 0 value))
The function call =LET(name=binding, expression) would create a local range name named name, bound (equal) to the value returned by binding, to be used during the evaluation of expression. In the example above, during the evaluation of IF(ISERROR(value), 0, value)), value would be bound to the value returned by MATCH(item, range, 0).

It's worth pointing out that this is slightly different from how normal Excel range names work. Range names in Excel work through textual substitution. With textual substitution, the initial expression would be logically equivalent to this:

=IF(ISERROR(MATCH(item, range, 0)), 0, MATCH(item, range, 0)))
In other words, Excel would treat every instance of value as if MATCH(item, range, 0) was explictly spelled out. This means there are two calls to MATCH and two potential searches through the range. While it's possible that Excel optimizes the second search away, I'm not sure that anybody outside of Microsoft can know for sure how this is handled.

Microsoft's current reccomendation for handling the specific ISERROR scenario in the first expression is this VBA function:

Function IfError(formula As Variant, show As String)

    On Error GoTo ErrorHandler

    If IsError(formula) Then
        IfError = show
    Else
        IfError = formula
    End If

    Exit Function

ErrorHandler:
    Resume Next

End Function
This isn't bad, but it requires that spreadsheet authors and readers understand VBA. It also imposes significant performance costs: calling into VBA from a worksheet takes time.

reddit this! Digg Me!

[/tech/excel] permanent link

Thu, 28 Jul 2005
List filtering with Formulas in Microsoft Excel: How it's done
Now that I've written a little about why you might want to replace Excel AutoFilter, here's how to actually do it. To frame the discussion, there are two problems to solve:
  • Deciding which rows of the input set are part of the result set
  • Displaying the result set in a contiguous sequence of spreadsheet rows.
The first problem is easy: add another column alongside the input set with a formula that evaluates to TRUE if the row belongs in the result. This can be any valid Excel formula: it can include complex logic, it can depend on other cells containing control parameters. In my example spreadsheet, this formula is in column H, labeled In Query?:



The tricky bit of the formula-based filter is the second problem: displaying the result set in a contiguous range of rows with no gaps. Each cell that might display part of the result set has to figure out itself what part of the result set to display, if any, and pull the data from the input set. A simple MATCH or LOOKUP can't handle this, since MATCH or LOOKUP can't be told to return the second, third, or nth match. They return the first match, which isn't quite enough for what we're trying to do.

As it turns out, even though having the result set compute a mapping from the input set is quite hard, solving the reverse problem isn't too bad. Having the input set compute the mapping to the result set is easy. Here's how it works, by column:
  • Ord. - The row ordinal number of the row in the input set, starting with 1.
  • Result Ord. - This column starts at zero, in the row preceeding the first row of the result set, and increments by 1 for each row where In Query? is TRUE. For each row with In Query? of TRUE, this column is the row ordinal number of this row in the result set.... We are almost there.
  • Result Rows. - The input row ordinal of each row in the output set. This is done by using MATCH to find the first row for each number in the Result Ord..
Once the Result Rows. column has been calculated, populating the actual result set is just a matter of using INDEX. ISERROR can be called on cells in Result Rows. to identify rows that don't contain values. After all this is said and done, we have a spreadsheet range that contains only a result set, updates like every other range in Excel, and can be used in formulas like every other range. I have a sample spreadsheet that implements a lot of this here.

reddit this! Digg Me!

[/tech/excel] permanent link

Mon, 18 Jul 2005
PS: I think that AutoFilter is typical of Excel...
I think that the weaknesses of the Excel AutoFilter turn out to be pretty typical of Excel in general.

To me, the brilliance of the spreadsheet was that it took a data model that business people were familar with, the accountant's paper spreadsheet, and layered on automatic computation and reporting facilities in a natural way. There's something very intuitive about going to cell c1, entering =A1+B1, and then having C1 contain the sum of the other two cells, automatically updated as the source cells change. It just makes sense, and is at the very core of every software spreadsheet dating back to the first, VisiCalc.

For years, spreadsheets worked at making this model work better. Lotus 1-2-3 introduced something called natural recalculation order that made it easier to follow the logic of spreadsheet calculation. Somewhere along the way, spreadsheets started doing limited recalculation, where formulas that didn't change weren't recalculated (thus saving time). New intrinsic functions were added, and Excel made a huge stride when it added array formulas: individual formulas that can produce more than one result. The gateway to user defined functions written in VisualBASIC was another huge win.

The core strength of all of these ideas is that they rely on and extend the core concept of the software spreadsheet: the software tracks dependancies between cells and automatically recalculates the appropriate results as necessary. As powerful as that concept is, Microsoft lost the plot somewhere around Excel 4 or 5 and keeps sinking money and effort into features that don't fully participate:
  • Excel has two data filter features: neither one can automatically update a table as a part of recalculation.
  • PivotTables don't update when their source data updates either. (For SQL data sources, this is understandable, but not so much when the source data comes from Excel itself).
  • PivotTables produce tables with missing values (to improve the formatting), which makes them very difficult to query with spreadsheet lookup functions.
  • The historgram function (among others) of the Analysis ToolPak is a one-time thing: you use it, it generates a histogram, and that's it. It's not possible to incorporate histogram generation into the dataflow driven recalculation of a spreadsheet.
  • There's no way to use an Excel formula to determine if a row is excluded or included in an AutoFilter query. Actually, there's no way to have the result set of an AutoFilter query drive spreadsheet recalculation at all.
Maybe this is being picky, but spreadsheets have a real strength in that they made it a lot easier for non-techies to specify how a computer can automatically solve certain types of problems. It's just a shame that so many of Excel's features are excluded from the natural way Excel is programmed.

reddit this! Digg Me!

[/tech/excel] permanent link

List filtering with Formulas in Microsoft Excel: Motivation
One of Excel's more interesting features for querying data sets is the AutoFilter. Applied to a table of data in a spreadsheet, The AutoFilter allows the table to be queried for subsets of data based on combo boxes in the table's header row. It's a simple way to filter out extraneous data and it can support quite elaborate query semantics (since it can filter based on values in computed cells).

However, AutoFilter is not without its problems:
  • AutoFilter imposes its own user interface: if you want a look-and-feel other than stock, you're out of luck.
  • For wide data tables with lots of columns, it can be hard to see the current AutoFilter query. To see the entire query requires horizontal scrolling down the header row.
  • Cell formatting and AutoFilter are independant of each other. If you want position dependant formatting (alternate row formatting, for example), it has to be recreated after each AutoFilter adjustment.
  • An AutoFilter works by selectively 'hiding' rows in the worksheet it's a part of. This means that an AutoFiltered list can't share rows with anything else that you don't also want selectively 'filtered' from view.
  • You can't have more than one AutoFilter on a worksheet tab.
  • AutoFilter isn't part of the natural 'ebb and flow' of the life of a spreadsheet: it doesn't participate in the dependancy driven formula solver that drives Excel's computational capability. This has some profound (bad) implications:
    • As data rows are added and removed from the list being AutoFiltered, the AutoFilter has to be removed and reapplied to the new data list to reflect changes to its source.
    • You can't use AutoFilter to filter a list and then search that list with =LOOKUP() or =MATCH(): the lookup operation will search the entire list, not the filtered list.
    • If you AutoFilter a list that contains calculated cells, and those cells change value, the set of filtered rows is not updated.
Anyway, I could go on, but I hope it's pretty clear by now that there are sometimes good reasons to look for other list filter mechanisms than AutoFilter. (FYI: 'Advanced Filter' has its own limitations, some of which are very similar to AutoFilter's.) I'll post a way to get AutoFilter-like behavior directly from Excel formulas. This technique has its own issues, but it does address lots of the issues I mentioned here.

reddit this! Digg Me!

[/tech/excel] permanent link