17 October 2016

The best DistinctCount practice in SSAS

The DistinctCount aggregation is a semi-addictive aggregation. What does it mean?

First of all, you cannot simply replace with a new value just by knowing the previous one. For example, your aggregation is SUM, your current value is 11 and you want to add a new row, containing value 2, what means that the re-aggregated value will be 13. You do not need to know the full set of values from the already processed rows. The same way SSAS handles Count, MAX, MIN; meanwhile the DistinctCount is different.

The DistinctCount usually is used for the non-numeric strings, like user, client, day or whatever you want to count. It means that in order to add a new value you have to check whether the set is not containing the value already. But, it is not as easy as it seems at the first glance. Sometimes we can avoid using the DistinctCount aggregation: simply replace it with Count aggregation. Let’s say we want to count the number of invoices. So, we just create a fact table with the headers of invoices. But what about filtering by products if there is no product information in the header table?

In my opinion, there are two best practices for the DistinctCount measures.

DistinctCount with MDX

Preparation. You have to have an attribution you‘ll go through the count and any addictive measure with the required relationships. Tip: if you already have an attribution, add an extra one with the same key, filed and set AttributeHierarchyVisible to false, in order to use it for the MDX calculations only (I’ll explain the reasoning behind it in another blog post).

Usage. The first idea is that usually you come up with something like the following code:

        [Measures].[Invoice Count]

That works and that’s fine. However, I recommend using SUM+IIF combination for an improved performance. Furthermore, I have already mentioned that in my previous post.

        [Measures].[Invoice Count] > 0,

As you see, you may dynamically add any conditions you want without any changes in the cube.
Disadvantages. The only disadvantage, I see, is performance. It becomes slow in some rough cases due to using the Set within Sum function which you have to avoid for smooth performance.

DistinctCount as the physical measure

If performance matters to you, then you have to create the measure physically, for sure.
Beware: if you are using a fact table containing the NULL values on the DistinctCount column, you have to filter them out. DistinctCount is forced to be created in the separate measure group anyway. I don’t know the reasoning, but DistinctCount doesn’t support the Preserve option in the NullProcessing property. There are two ways: create a new table/view without the NULL values or set it in the partition section.



Personally, I prefer the MDX way, it’s more flexible and easier to use. When costumers complain about performance, I’m adding a new measure group, but it slows your processing time down. However, there is some information that SSD usage can speed it up on the hardware level — I haven’t tested it yet and I don’t usually use the non-addictive aggregations though.


© Danylo Korostil