What can be a reason for corrupted files on attached VM disks?

Category: azure vm

Question

Hodza Nassredin on Tue, 27 Sep 2016 16:17:48


We have a Cassandra cluster in West Europe, deployed as described here: http://hodzanassredin.github.io/2016/05/06/cassandra_deploy_azure.html

In a few words:

azure group create cassandra-group "West Europe"
azure network vnet create cassandra-group cassandra-net "West Europe" 
azure network vnet subnet create cassandra-group cassandra-net cass-sub -a 10.0.0.0/24
azure network public-ip create "cassandra-group" "cass-ip-1" "West Europe" -a Static
azure network public-ip create "cassandra-group" "cass-ip-2" "West Europe" -a Static
azure network public-ip create "cassandra-group" "cass-ip-3" "West Europe" -a Static
azure network nic create "cassandra-group" "cass-nic-1" "West Europe" --subnet-name "cass-sub" --subnet-vnet-name "cassandra-net" -p "cass-ip-1" -a "10.0.0.4"  
azure network nic create "cassandra-group" "cass-nic-2" "West Europe" --subnet-name "cass-sub" --subnet-vnet-name "cassandra-net" -p "cass-ip-2" -a "10.0.0.6"  
azure network nic create "cassandra-group" "cass-nic-3" "West Europe" --subnet-name "cass-sub" --subnet-vnet-name "cassandra-net" -p "cass-ip-3" -a "10.0.0.8"  
azure network nsg create "cassandra-group" "cass-nsg" "West Europe" 
azure network nsg rule create -g cassandra-group -a cass-nsg -n cass-rule -c Allow -p Tcp -r Inbound -y 100 -f Internet -o "*" -e "*" -u 9042
azure network nsg rule create -g cassandra-group -a cass-nsg -n ssh-rule -c Allow -p Tcp -r Inbound -y 200 -f Internet -o "*" -e "*" -u 22
azure network nic set cassandra-group cass-nic-1 -o cass-nsg
azure vm create cassandra-group cass01 "West Europe" --nic-name "cass-nic-1" -y linux -Q Canonical:UbuntuServer:15.10:15.10.201604050 -u username -M id_rsa.pub -z Standard_D2_v2 --vnet-name cassandra-net --vnet-subnet-name cass-sub
azure vm disk attach-new "cassandra-group" cass01 100 cass-data-01

After several months of running without any trouble, our cluster went down today. The logs contain messages about file corruption on the attached disks. Also, while trying to increase the number of nodes, we found that the Ubuntu package manager downloads packages but cannot install them because of wrong checksums. And sometimes the network speed is extremely slow, around several KB per second.




Replies

Hodza Nassredin on Tue, 27 Sep 2016 16:48:24


I tested the download speed, and it is OK.

Retrieving speedtest.net configuration...
Retrieving speedtest.net server list...
Testing from Microsoft (xxx)...
Selecting best server based on latency...
Hosted by NFOrce Entertainment B.V. (Amsterdam) [2.18 km]: 5.6 ms
Testing download speed........................................
Download: 504.21 Mbit/s
Testing upload speed..................................................
Upload: 395.42 Mbit/s

Ajay Kumar .N on Wed, 28 Sep 2016 13:54:35


Hello Hodza,

Thank you for posting here!

Just for clarification I would like to ask the following: 

  1. Does this issue happen on all the VMs or these specific VMs?
  2. When you say “can't install packages because of wrong checksums”, what packages are you attempting to install? What is the exact error message you receive?
  3. When you say “today our cluster is down”, what exactly happens? Do you receive any error message? Do you recall making any changes prior to this issue?

While Azure Storage provides data resiliency through automated replicas, this does not prevent your application code (or developers/users) from corrupting data through accidental or unintended deletion, update, and so on. Maintaining data fidelity in the face of application or user error requires more advanced techniques, such as copying the data to a secondary storage location with an audit log.
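
As a rough illustration of the "secondary storage location with an audit log" idea, the sketch below copies a file to a backup directory and appends a timestamped SHA-256 record to a log file. The function name `backup_with_audit`, the paths, and the log format are hypothetical and not something Azure provides; it assumes GNU coreutils (`sha256sum`) is installed.

```shell
# Hypothetical helper (not an Azure feature): copy a file to a secondary
# location and append its SHA-256 to an audit log.
# Usage: backup_with_audit <source-file> <backup-dir> <audit-log>
backup_with_audit() {
    src=$1; dest_dir=$2; log=$3
    [ -f "$src" ] || { echo "missing source: $src" >&2; return 1; }
    mkdir -p "$dest_dir" || return 1
    sum=$(sha256sum "$src" | cut -d' ' -f1)
    dest="$dest_dir/$(basename "$src")"
    cp -p "$src" "$dest" || return 1
    # Re-hash the copy so a bad write is caught before it is logged as good.
    copy_sum=$(sha256sum "$dest" | cut -d' ' -f1)
    [ "$sum" = "$copy_sum" ] || { echo "copy mismatch: $dest" >&2; return 1; }
    # One line per backup: UTC timestamp, checksum, original path.
    printf '%s %s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$sum" "$src" >> "$log"
}
```

The log gives a history of what each file looked like at backup time, so a later checksum mismatch can be dated.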

Kindly check the documentation below for more details on this:

Azure resiliency technical guidance: recovery from data corruption or accidental deletion

For detailed analysis of the disk corruption and in-depth troubleshooting on this issue kindly open a technical support ticket.

https://azure.microsoft.com/en-in/support/options/

Hope this points you in the right direction!

 

Regards,
Ajay


Hodza Nassredin on Sat, 01 Oct 2016 08:34:09


The Cassandra logs contain a lot of messages about corrupted files. Everything was fine for several months. And package installation shows that, after downloading, a package cannot be installed because of wrong checksums.

Today we have a problem on our processing farm. Services can't load files because of an unexpected EOF.

So now I understand that the problem is not in the network or the data disks but in the OS disks. Our services keep their data on the OS disk (the files are not that big), and Cassandra stores its data on the data disks but keeps the cluster config on the OS disk.
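
To confirm which disk a corrupted file actually lives on, one can ask `df` for the device backing its path. A small sketch, assuming GNU `df` (for the `--output` option); the helper name `device_of` and the example paths in the comments are hypothetical, since the mount points are not given here.

```shell
# Hypothetical helper: print the device (or filesystem source) backing a path.
device_of() {
    # The first line of df output is the header, the last is the value.
    df --output=source "$1" | tail -n 1
}

# Example calls (paths are assumptions, adjust to the real layout):
#   device_of /var/lib/cassandra/data   # expected to be the attached data disk
#   device_of /etc/cassandra            # expected to be the OS disk
```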

Cassandra was fixed by creating a new cluster on other VMs and importing the data from the old data disks.

Currently I'm trying to fix the farm issues by overwriting the broken files.
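
One way to find which files need overwriting is to compare them against a checksum manifest taken from a known-good copy. A minimal sketch, assuming GNU coreutils and findutils; the function names `make_manifest` and `list_broken` and the manifest location are hypothetical, and the manifest should be stored outside the directory being scanned.

```shell
# Record SHA-256 sums for every regular file under a directory.
# Run this against a known-good copy of the files.
make_manifest() {
    dir=$1; manifest=$2
    ( cd "$dir" && find . -type f -exec sha256sum {} + ) > "$manifest"
}

# Print the paths under a directory that are missing or no longer match
# the manifest. Prints nothing when every file verifies.
list_broken() {
    dir=$1; manifest=$2
    # The manifest is fed on stdin so a relative manifest path still works
    # after the cd; LC_ALL=C keeps the "FAILED" marker untranslated.
    ( cd "$dir" && LC_ALL=C sha256sum -c --quiet - 2>/dev/null ) < "$manifest" \
        | sed -n 's/: FAILED.*$//p'
}
```

Only the paths printed by `list_broken` would then need to be re-uploaded.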

UPDATE: I re-uploaded the files and now everything seems to be fine. But this is only a workaround, and I still do not understand what went wrong with the OS disks.