Failing to launch an MPI job using all cores on Standard A8 instance

Category: azure batch

Question

CarlosFernandezMusoles on Mon, 22 Jan 2018 13:39:30


Hi all,

I have a working framework that spawns a pool on my Azure Batch account and executes a job from a Python script. The job consists of an MPI program written in C++, and it all seems to be working fine.

I am using Standard A8 instances (8 cores) with a custom VM image installed. Since Standard A8 nodes come with 8 cores, I would like the MPI application to run 8 MPI processes per node. The problem is that I do not think this is happening, even when using the --map-by ppr:8:core or -oversubscribe flags in my mpirun command. The reason I believe so is that running my program with 24 MPI processes on 3 nodes performs much worse (43 seconds execution time, 17 seconds for communications) than running the same program with 24 MPI processes on 24 nodes (32 seconds execution time, 9 seconds for communications).

This is the mpirun command I execute:

mpirun --prefix /home/openmpi-2.1.0/build -np $numProc --host $HOST_LIST --map-by ppr:1:core -wdir $AZ_BATCH_TASK_SHARED_DIR $AZ_BATCH_TASK_SHARED_DIR/distSim $RESULTS $COMM_PATTERN $PARTITIONING $SEED $ACTIVITY_FILE

where $HOST_LIST is a string with the addresses of the nodes to use (provided by Azure Batch in the variable $AZ_BATCH_HOST_LIST), each followed by ":8" to specify that I want 8 slots per node. $numProc is set to 24 in this example.
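For illustration, here is a minimal Python sketch of how a host list like the one described above could be built. It assumes $AZ_BATCH_HOST_LIST holds comma-separated addresses; the function name with_slots and the fallback addresses are hypothetical and not part of my actual script.

```python
import os


def with_slots(host_list: str, slots: int = 8) -> str:
    """Append ':<slots>' to every address in a comma-separated host list."""
    hosts = [h.strip() for h in host_list.split(",") if h.strip()]
    return ",".join(f"{h}:{slots}" for h in hosts)


if __name__ == "__main__":
    # AZ_BATCH_HOST_LIST is set by Azure Batch on the node; the fallback
    # addresses below are made up so the sketch also runs elsewhere.
    raw = os.environ.get("AZ_BATCH_HOST_LIST", "10.0.0.4,10.0.0.5,10.0.0.6")
    print(with_slots(raw))
```

The resulting string is what would be passed to mpirun through --host.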

It feels like the cores are being oversubscribed (even though there should be 8 cores per node, and only 1 MPI process per core).

In addition, I have added a line of code to my application that prints the CPU ID used by each MPI process, and I can see that more than one MPI rank (process) often reports the same CPU ID on the same node. (To get the CPU ID I use sched_getcpu() with the header file <utmpx.h>.)
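As a rough sketch of how that printed output can be checked, the Python snippet below groups (host, rank, CPU ID) records and reports every CPU that more than one rank landed on. The record layout, the function name shared_cpus and the sample values are assumptions for illustration, not my real output.

```python
from collections import defaultdict


def shared_cpus(records):
    """Return {(host, cpu_id): [ranks]} for each CPU reported by more than one rank."""
    by_cpu = defaultdict(list)
    for host, rank, cpu_id in records:
        by_cpu[(host, cpu_id)].append(rank)
    return {key: sorted(ranks) for key, ranks in by_cpu.items() if len(ranks) > 1}


# Made-up sample: ranks 0 and 1 both report CPU 0 on the same node.
sample = [
    ("10.0.0.4", 0, 0),
    ("10.0.0.4", 1, 0),
    ("10.0.0.4", 2, 3),
    ("10.0.0.5", 3, 0),
]
print(shared_cpus(sample))
```

Note that a single sched_getcpu() sample only shows where a rank was running at that instant, so a check like this is a hint rather than proof of oversubscription.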

Any suggestions on what I am doing wrong? How can I ensure that only 1 MPI process is executed per core?

I am using OpenMPI 2.1.0, and the custom VM image is a bare-bones Ubuntu 16 OS with a few libraries installed.

Replies

Micah McKittrick on Sat, 24 Feb 2018 03:49:21


Hi Carlos, 

Apologies that you have not received a response yet. I just wanted to check in to see whether you are still having issues and, if so, how I can help.