Michael Schreuder on Wed, 13 Dec 2017 10:41:14
We have a dimension table which has over 300 million rows of data. We use this table in SQL queries, but not in any SSAS models.
We would like to extend this with some summary stats and dates around the customers activity. I did originally design this as a narrow table with a column identifying what the date or stat was, but found that this table was unusable by the analysts
as it made the code very complicated to get multiple stats out, and required extensive pivoting to make it usable, which made it very slow.
So we are now changing this to a wide table which will have the same primary key as dimension with all the stats as columns.
The question here is why create a 2nd table, why not just add these as columns to the main dimension. This would reduce data duplication (some columns from the dim would be stored in both tables), or if you don't store that data, you would always have to join creating more complex query plans.
I have this urge to create it as a 2nd table, but I can't think of a good technical reason either way. Has anyone got any thoughts on this?
DIEGOCTN on Wed, 13 Dec 2017 10:50:33
Dimension table with over 300 million rows? Dimension table...? No Fact table?
Michael Schreuder on Wed, 13 Dec 2017 14:07:59
Yes, fact table is now approaching 3.5 billion rows
Thomas L. LeBlanc (MVP) on Thu, 14 Dec 2017 16:30:38
When a dimension or fact table becomes large, it is difficult to manage new columns.
Usually, the request is for a different reporting "Data Mart". New fact tables to represent the new requirements is not a bad thing.
In Dimensional Modeling, sometimes you have to leave the "normalized" structure of table behind and build based on the requirements and repeat some data.
Tom Phillips on Thu, 14 Dec 2017 17:32:38
There is no "right" answer to your question. It is really a question of preference.
It could take a long time to add new columns to a table that large. So you might want to simply create a new table with the new fields.
However, the real question is does the new table serve any purpose by itself. If the new table does not serve any purpose by itself and you always need join the original table with the new table, then I would advocate to simply add the columns.
Michael Schreuder on Mon, 18 Dec 2017 11:13:32
Thanks for the replies, that was kind of my thoughts too.
The problem here is that even if we have a separate table it will be just as much of an issue to add new columns and populate them with data.
I still just can't decide as with clustered columnstore indexes the space requirements are much lower for the columns that would be duplicated as there not many unique values (compared to num rows), but then why even have the first dimension table at all as most queries could just use this new table.