Files vs. Managed Tables - how data is stored physically?

Category: azure data lake

Question

JakubKrupa on Thu, 03 Jan 2019 12:53:10


Hello,

I am quite new to ADL and USQL. I went through quite a lot of documentation and presentations but I am afraid that I am still lacking many answers. 

Simplifying a bit, I have a large set of data (with daily increments) but it contains information about many different clients in one file. In most cases the data will be analysed for one client (one report = one client), but I would like to keep the possibility to do a cross-client analysis (much less common scenario). I am aware of the importance of correctly partitioning this data (probably keeping one client data together makes most sense). I was looking into two scenarios:

  1. I will partition the data myself by splitting the files into folder-file structure where I can have full control over how it is done, how big the files are etc.
  2. I will use managed tables and set up table partitioning

I am now looking into pros and cons of both scenarios. Some things that come to my mind are:

  1. The ability to compress data in scenario #1 (at the cost of performance of course)
  2. The ability to build a much more granular security model by using files and ADL security (e.g. give access only to one client's data)
  3. On the other hand, using the tables is much more comfortable as I will be dealing with just one data source and will not have to worry about extracting correct files, only about the correct filters in a query - in theory USQL should do the rest
  4. I would expect that the tables will offer better performance

One very important factor I wanted to investigate, before making my decision, is how the data is stored physically when using tables and partitions. I have read the documentation and I found a statement that confused me (https://docs.microsoft.com/en-us/u-sql/ddl/tables):

First we can read that: 

"U-SQL tables are backed by files. Each table partition is mapped to its own file" - this seems to make perfect sense. I would assume that if I set up partitioning by client I would end up with the same scenario as doing the partitioning myself. Fantastic! U-SQL will do all the work for me! Or.. will it not?

Later we can read that:

"..., and each INSERT statement adds an additional file (unless a table is rebuilt with ALTER TABLE REBUILD)."

Now this makes things more complicated. If I read it correctly, this means that if I will never rebuild a table I will have my data stored physically in exactly the same way as the original raw files and thus experience bad performance. I did some experiments and it seemed to work this way. Unfortunately, I was not able to match the files with partitions as the guids were different (the .ss files in the store had different guids than partitions in usql views) so this is just my guess.

Therefore I have several questions:

  1. Is there some documentation explaining in more detail how TABLE REBUILD works?
  2. What is the performance of TABLE REBUILD? Will it work better than my idea of appending (extract -> union all -> output) just the files that need to be appended?
  3. How can I monitor the size of my partitions? In my case (running local, have not checked it online yet) the guids of files and partitions in the store do not match even after REBUILD (they do for the DB, schema and table)
  4. Is there any documentation explaining in more details how .ss files are created?
  5. Which of the scenarios would you take and why?

Many thanks for your help,

Jakub

Replies

Bill Blakey FA on Fri, 04 Jan 2019 17:21:48


Jakub, I have many of the same questions.  The "Tables" in Data Lake are a bit murky, similar to SQL Server tables but then not really. I like that you can index the tables which should increase performance, but it an ADLA Table is comprised of multiple files...I'm not sure how that works, either.

I also have the scenario where I need to load monthly data into a repository, able to insert/append each month.  I also need to potentially "re-load" a month, however ADLA Tables rows cannot be deleted (unless this has recently changed).

Lastly, data sitting in ADLA Tables are not really exposed to non-ADLS services, etc.  I know, for example, that I can use PowerBI to query FILES in ADL Storage, but not TABLES.

I, too, would like answers to some of your questions!

Thanks for posting this.