How to split the input blob storage files in ADF Process

Category: azure data factory

Question

Murali Krishna V on Thu, 12 Oct 2017 17:56:33


Hi,

My input blob storage files are several GB in size. I want to split them into smaller files and store those in another blob storage container called outputblob. From outputblob, the data will be moved into Azure SQL Data Warehouse.

I want to do the split using HDInsight with Spark/Python scripting.

Please let me know the code for this.

Replies

Gerhard Brueckl on Fri, 13 Oct 2017 10:44:35


Why do you want to split the files?

Azure SQL DW works best with big files (ideally 500MB).
Files bigger than 500MB are split and processed by multiple workers to achieve parallel processing.

The only case where this does not work is when your files are compressed. If yours are not, I do not see any point in splitting them.

-gerhard

Murali Krishna V on Fri, 13 Oct 2017 10:53:25


My files are in the 1 TB range. I want to split them into 200 MB files and then push them to HDInsight.

Please suggest the best practice for splitting them in ADF and for pushing them to HDInsight.
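
As a rough sizing check for the numbers in this post, the sketch below computes how many output files a 1 TB input produces at about 200 MB each, and shows one way that count could be passed to Spark on HDInsight. The function names partition_count and split_with_spark are hypothetical, the input is assumed to be line-delimited text, and the storage paths are left to the caller. Note that repartition only gives roughly equal-sized files, not exactly 200 MB each, and it does not preserve the original line order across files.

```python
import math


def partition_count(total_bytes, target_bytes):
    """Number of output files needed so each is about target_bytes or smaller."""
    return max(1, math.ceil(total_bytes / target_bytes))


def split_with_spark(spark, src_path, dst_path, num_files):
    """Rewrite line-delimited text as num_files roughly equal part files.

    `spark` is an existing SparkSession; src_path and dst_path are
    caller-supplied storage paths (assumptions, not taken from the thread).
    """
    (spark.read.text(src_path)      # one row per input line, column "value"
          .repartition(num_files)   # shuffle rows into num_files partitions
          .write.text(dst_path))    # one part file per partition


TB = 1024 ** 4
MB = 1024 ** 2

n = partition_count(1 * TB, 200 * MB)
print(n)  # 5243
```

On a cluster, the call would look like split_with_spark(spark, input_path, output_path, n), with the two paths pointing at the input and output blob containers.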

ThirdEye Data on Fri, 13 Oct 2017 18:44:15


We have done a lot of such splitting jobs, ranging from using the Linux split command to writing MapReduce jobs.

So you need a program that will break down a 100 GB file into 10 files of 10 GB each?
Be aware that a naive split can cut records in the middle, which may cause data corruption when the split files are processed or combined back together.

Can you please be a bit more specific about your requirement?
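
To illustrate the truncation risk mentioned above, here is a minimal sketch of a splitter that only cuts after a newline, so no record is split across two files. It assumes the data is line-delimited (one record per line) and that the file is available on a local or mounted file system; the function name split_on_lines and the part-file naming are hypothetical. A single line longer than max_bytes is written to its own part, which will then exceed the limit.

```python
import os


def split_on_lines(src_path, out_dir, max_bytes):
    """Split src_path into parts of at most max_bytes, cutting only after a newline.

    Returns the list of part file paths in order. Concatenating the parts
    in that order reproduces the original file byte for byte.
    """
    os.makedirs(out_dir, exist_ok=True)
    part_paths = []
    out = None
    written = 0
    with open(src_path, "rb") as src:
        for line in src:
            # Open a new part if none is open yet, or if adding this line
            # would push a non-empty part over the size limit.
            if out is None or (written > 0 and written + len(line) > max_bytes):
                if out is not None:
                    out.close()
                path = os.path.join(out_dir, f"part-{len(part_paths):05d}")
                out = open(path, "wb")
                part_paths.append(path)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return part_paths
```

For example, split_on_lines("input.csv", "parts", 200 * 1024 ** 2) would produce parts of up to 200 MB each. If the file has a header row, it only appears in the first part, so downstream loaders need to account for that.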

------------------

ThirdEye Data
https://thirdeyedata.io/