HDInsight cluster (Spark) not writing correctly to Azure Storage

Joao1293 on Mon, 20 Feb 2017 19:55:02

This issue started today; the same code was working on Friday. I am saving an RDD as an object file on Azure Storage. Instead of writing wasbs://etc@etc.blob.core.windows.net/myRDD/[part-0000, part-0001, ..., part-000N, _SUCCESS], it writes a _temporary directory structure, wasbs://etc@etc.blob.core.windows.net/myRDD/[_temporary], with some of the "parts" inside it. Any ideas about this problem? Thanks!


AshokPeddakotla-MSFT on Tue, 21 Feb 2017 14:22:59

Just to confirm: when you say it was working, have you made any changes to the cluster since then?

What is the status of the job? Part files under _temporary appear while the job is still running, so you will see them if you check the output before it finishes.

Could you try again and confirm if the issue still occurs?

I would suggest restarting the services once to see if that makes any difference.

Joao1293 on Tue, 21 Feb 2017 20:17:13

Well, when I need to increase its size, I usually delete the current cluster and deploy a bigger one, so I do not believe changes to the cluster are the problem.

Livy reports the job status as success. Those temporary files used to appear before the job ended, but now they are still there afterwards, and the output is not saved in the correct format. I have just deployed a new cluster and run the code again. Same thing: temporary files in the location where rdd.saveAsObjectFile should have written the RDD. Also, there is a _SUCCESS file inside the directory, next to _temporary.
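
For anyone comparing their own output against the layouts described in this thread, here is a minimal sketch of that check. It assumes you already have a plain list of the entry names directly under the output directory; the helper name `classify_output`, the result labels, and the sample names are hypothetical and not part of Spark or HDInsight.

```python
def classify_output(entries):
    """Classify the entry names found directly under an output directory.

    `entries` is a plain list of names such as "part-00000", "_SUCCESS"
    or "_temporary/". How the listing is obtained is left to the caller.
    """
    # Drop a trailing slash so "_temporary/" and "_temporary" compare equal.
    names = {entry.rstrip("/") for entry in entries}

    has_parts = any(name.startswith("part-") for name in names)
    has_success = "_SUCCESS" in names
    has_temporary = "_temporary" in names

    if has_temporary and has_success:
        # The state reported in this thread: job marked successful,
        # but the temporary directory is still present.
        return "temporary-with-success"
    if has_temporary:
        # What is normally seen while a job is still running.
        return "temporary-only"
    if has_success and has_parts:
        # The layout that was expected: part files plus _SUCCESS.
        return "expected"
    return "other"


if __name__ == "__main__":
    print(classify_output(["part-00000", "part-00001", "_SUCCESS"]))
    print(classify_output(["_temporary/", "_SUCCESS"]))
```

The labels only describe what is in the listing; they do not say why a _temporary directory was left behind.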

Joao1293 on Wed, 22 Feb 2017 14:56:51

I've managed to fix it: I moved the rdd.saveAsObjectFile call to the line right after I cache the RDD, and now it is saving properly. Thanks for the help.

AshokPeddakotla-MSFT on Wed, 22 Feb 2017 17:06:34

Glad to hear that your issue is resolved!