Conditional Aggregates

The Problem: A legacy table contains amounts and a char column indicating whether the amount is a credit or a debit. We want an aggregate (over a given range) of the amounts. We need to add the credits and subtract the debits.

The Solution(s): Two possible approaches to the problem suggest themselves. One is to sum all of the credits and all of the debits and then subtract the total debit from the total credit. The second is to include a case statement inside the aggregate function to add or subtract the given row’s amount based on the credit/debit column.

The Code: Implement a cut-down version of the table with some contrived sample data:

create table tbl_ledger
(
	id int identity(1,1),
	credit_debit char(1),
	amount money,
	[description] varchar(128) 
)

insert tbl_ledger(credit_debit,amount,[description])
values 
	('C',10,'Payment'),
	('D',10,'Refund'),
	('C',10,'Payment'),
	('C',10,'Payment'),
	('D',10,'Refund'),
	('C',10,'Payment'),
	('C',10,'Payment'),
	('D',10,'Refund'),
	('C',10,'Payment'),
	('C',10,'Payment')

This table has 10 rows: 7 are $10 credits and 3 are $10 debits, so we would expect the sum to return $40 ($70 credit – $30 debit).
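As a quick sanity check (this query isn’t part of the original walkthrough), grouping on the credit/debit flag should show 7 credit rows totalling $70 and 3 debit rows totalling $30:

select credit_debit,
	count(*) as [row count],
	sum(amount) as [total amount]
from tbl_ledger
group by credit_debit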

Implement option 1, subtracting the aggregates. This could be done a number of ways, including using variables to hold interim results or common table expressions; I’ve chosen a sub query as I think it reads well. The query plans for sub queries and CTEs are similar, while using variables muddies the waters a bit:

select sum(amount) - (select sum(amount) as [debit]
			from tbl_ledger 
			where credit_debit = 'D'
			)
as [credit]
from tbl_ledger 
where credit_debit = 'C'

As hoped, this returns 40.

Implement option 2 using a case statement inside the sum function:

select
	sum(case
			when credit_debit = 'C' then amount
			when credit_debit = 'D' then -1 * amount
		end) as [total cash]
from tbl_ledger

This again returns a correct result.

The Analysis: To me, option 2 is the more elegantly coded solution, but how does it stack up performance-wise? To get some measurable results, rerun the insert multiple times:

insert tbl_ledger(credit_debit,amount,[description])
values 
	('C',10,'Payment'),
	('D',10,'Refund'),
	('C',10,'Payment'),
	('C',10,'Payment'),
	('D',10,'Refund'),
	('C',10,'Payment'),
	('C',10,'Payment'),
	('D',10,'Refund'),
	('C',10,'Payment'),
	('C',10,'Payment')
go 100000

And run the two queries side by side with statistics on:

set statistics io on
set statistics time on

select
	sum(case
			when credit_debit = 'C' then amount
			when credit_debit = 'D' then -1 * amount
		end) as [total cash]
from tbl_ledger

select sum(amount) - (select sum(amount) as [debit]
						from tbl_ledger 
						where credit_debit = 'D'
						)
as [credit]
from tbl_ledger 
where credit_debit = 'C' 

set statistics io off
set statistics time off

I got the following:

(1 row(s) affected)
Table 'tbl_ledger'. Scan count 1, logical reads 4167, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 281 ms, elapsed time = 295 ms.

(1 row(s) affected)
Table 'tbl_ledger'. Scan count 10, logical reads 8334, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 483 ms, elapsed time = 130 ms.

With the execution plans:

[Execution plan image: QPlan1]

The sub query outperforms the conditional aggregate primarily because the query can be parallelised. The conditional aggregate does a single table scan versus two table scans for the sub query – 4167 logical reads versus 8334 logical reads – but when I ran the tests against cold caches the read-ahead mechanism pulled all data pages into cache for both queries, so I didn’t notice a significant change in the comparative result.
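(To reproduce the cold cache runs on a test instance, the buffer pool can be cleared between executions with something along these lines – never against a production server:)

checkpoint
go
dbcc dropcleanbuffers
go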

Running the second query with a maxdop hint brings the results closer:
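(The hinted query itself isn’t shown above; assuming a maxdop of 1 to suppress parallelism, it would look something like this:)

select sum(amount) - (select sum(amount) as [debit]
						from tbl_ledger 
						where credit_debit = 'D'
						)
as [credit]
from tbl_ledger 
where credit_debit = 'C'
option (maxdop 1)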

(1 row(s) affected)
Table 'tbl_ledger'. Scan count 1, logical reads 4167, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

SQL Server Execution Times:
CPU time = 328 ms, elapsed time = 347 ms.

(1 row(s) affected)
Table 'tbl_ledger'. Scan count 2, logical reads 8334, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

SQL Server Execution Times:
CPU time = 249 ms, elapsed time = 296 ms.

And drops the parallel operators:

[Execution plan image: QPlan2]

The Conclusion: After analysis it seems that the sub query outperforms the conditional aggregate option – even if we suppress parallelism. That said, these tests were on a cut-down table, and a wider table with more rows might bring IO more into play – especially if indexes are added to the mix. Adding an index on the credit_debit column with amount as an included column speeds up the sub query even more, replacing the two table scans with index seeks.
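(For reference, the index described would look something like this – the index name is illustrative:)

create nonclustered index ix_tbl_ledger_credit_debit
	on tbl_ledger (credit_debit)
	include (amount)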


6 Responses to Conditional Aggregates

  1. Lotus says:

    I liked your article: crisp and clear. Aesthetically, in any case, I prefer the 2nd query.

  2. Carl says:

    Assuming your credit_debit column is constrained to be either C or D, what if you modified the case statement slightly?

    case when credit_debit = 'C' then amount else -amount end

    I don’t know if it makes any difference to the compiler, but visually it seems like there are fewer decisions. You don’t have to check for the value D (one less comparison) and you don’t have to do the multiplication (-1 * amount).

    • matt.bowler says:

      Good comment. While neither of those changes affects the query plan, they do speed up execution and reduce CPU time slightly.
      Coming from a mathematical background I would have expected -amount to be exactly equivalent to -1 * amount, but it seems SQL Server deals with them slightly differently: CPU drops slightly.
      As for the CASE: how you code this will depend on the exact situation. How confident are you in the data? Do you want to use your ELSE clause to catch any non-'C' or 'D' values? But as you mentioned, assuming the data is appropriately constrained, your suggestion drops CPU use by around 25% on my sample data.
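      For completeness, the full query with the simpler CASE (assuming the data really is constrained to 'C' and 'D') would look something like:

      select
      	sum(case
      			when credit_debit = 'C' then amount
      			else -amount
      		end) as [total cash]
      from tbl_ledger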

  3. Hi Matt, great post. I’ve been playing around with this for an hour or so now. It really sparked my interest.

    I am intrigued by your comment: “option 2 is a more elegantly coded solution” and wonder what in particular draws you to option 2? I agree there is a very nice cleverness to it; it is also succinct and the function is self-evident. However, it strikes me that these two examples are perfect examples of an RBAR (non-relational) approach vs. declarative, set-based SQL.

    Option 2 forces a row-by-row comparison of the credit_debit column, as opposed to Option 1, which leaves the heavy lifting to the optimisation engine. I personally like Option 1, and a sensible covering index on (credit_debit, amount) enhances performance and removes the issue of a potentially wide table.

    What are your thoughts? Perhaps you have encountered situations where Option 2 is better for reasons other than performance? Cheers.

    • matt.bowler says:

      Good comments, Nick. Strictly speaking, neither of the options is RBAR, although I’ll admit option 2 does look suspiciously like it :). Remember that once the optimizer has compiled its plan, the actual data is processed sequentially (ignoring parallel threads).

      I think I favoured option 2 because it is a more abstract framework that easily adapts to other aggregations – or even other functions. Essentially – manipulate the value of column A according to the value of column B.
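      For example (a quick sketch along the same lines), the pattern extends naturally to splitting several aggregates out of a single scan:

      select
      	sum(case when credit_debit = 'C' then amount end) as [total credit],
      	sum(case when credit_debit = 'D' then amount end) as [total debit],
      	count(case when credit_debit = 'D' then 1 end) as [refund count]
      from tbl_ledger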

      But just because I think it’s intellectually elegant doesn’t make it right. Always test.

      • This is pure gold! Thank you for your feedback, it has sparked some really great conversation between a couple of my colleagues and myself.

        I have battled with my understanding of RBAR – you’ve really challenged that and your comments above have helped clarify things. If I were to describe option 2 it would be:

        “For each element in A, yield x or y”

        Which is subtly different to:

        “For all elements in A that satisfy p, yield x.
        And for all elements in A that satisfy q, yield y”

        We have discussed this at length, and I am beginning to appreciate that any statement of the type, “for each…” is not necessarily RBAR. As you point out, in this case all elements meet the criteria and therefore must be parsed.

        So I think more than anything your post has really challenged and clarified my understanding of what the term RBAR encompasses and this has been a good thing.
