In terms of functionality, Horovod is akin to MPI (Message Passing Interface). When you have an HPC application and want to run it on multiple processors, you use MPI. When you have an AI workload and want to deploy it on multiple GPUs, you can use Horovod.
Uber developed Horovod by optimising the message-passing algorithms for collective operations.
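To make "collective operation" concrete, here is a toy, single-process simulation of a ring allreduce that averages one vector per worker, as happens with gradients in data-parallel training. It is a sketch for illustration only and is not Horovod's code: the function name ring_allreduce_average is made up, the "workers" are plain Python lists, and real implementations exchange the chunks over the network or GPU interconnect.

```python
def ring_allreduce_average(worker_data):
    """Simulate a ring allreduce; every worker ends with the element-wise mean.

    worker_data: one equal-length list of numbers per simulated worker.
    """
    n = len(worker_data)
    length = len(worker_data[0])
    bufs = [[float(v) for v in vec] for vec in worker_data]
    # Split each vector into n contiguous chunks (some may be empty).
    bounds = [(i * length // n, (i + 1) * length // n) for i in range(n)]

    # Phase 1, scatter-reduce: after n - 1 steps, worker r holds the
    # fully summed chunk (r + 1) % n.
    for step in range(n - 1):
        messages = []
        for r in range(n):
            chunk = (r - step) % n
            lo, hi = bounds[chunk]
            messages.append(((r + 1) % n, lo, bufs[r][lo:hi]))
        # Apply all messages after collecting them, so sends are simultaneous.
        for dst, lo, payload in messages:
            for i, value in enumerate(payload):
                bufs[dst][lo + i] += value

    # Phase 2, allgather: pass the completed chunks around the ring.
    for step in range(n - 1):
        messages = []
        for r in range(n):
            chunk = (r + 1 - step) % n
            lo, hi = bounds[chunk]
            messages.append(((r + 1) % n, lo, hi, bufs[r][lo:hi]))
        for dst, lo, hi, payload in messages:
            bufs[dst][lo:hi] = payload

    # Turn the sums into averages.
    return [[value / n for value in buf] for buf in bufs]


if __name__ == "__main__":
    gradients = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
    print(ring_allreduce_average(gradients))
```

Each worker only ever talks to its neighbour in the ring and sends one chunk per step, which is why this pattern is often used when the vectors are large.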
Some useful Horovod commands, for uninstalling Horovod and reinstalling it without pip's cache:
pip uninstall horovod
pip install --no-cache-dir horovod
Segmentation fault with TensorFlow 1.14 or higher mentioning hwloc
If you are using TensorFlow 1.14 or 1.15 and are getting a segmentation fault, check whether it mentions hwloc:
…
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x99
[ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7f309d34ff20]
[ 1] /usr/lib/x86_64-linux-gnu/libopen-pal.so.20(opal_hwloc_base_free_topology+0x76)[0x7f3042871ca6]
…
If it does, this could be a conflict with the hwloc symbols exported from TensorFlow.
To fix this, locate your hwloc library with ldconfig -p | grep libhwloc.so, and then set LD_PRELOAD. For example:
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libhwloc.so python -c 'import horovod.tensorflow as hvd; hvd.init()'
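If you would rather script this check than type it by hand, the same two steps can be sketched in Python. This is only an illustration: find_hwloc, preload_env and run_check are hypothetical helper names, and the parsing assumes the usual "name (flags) => path" line format printed by ldconfig -p.

```python
import os
import subprocess
import sys


def find_hwloc(ldconfig_output):
    """Return the path of the first libhwloc.so* entry in `ldconfig -p` output, or None."""
    for line in ldconfig_output.splitlines():
        if "=>" not in line:
            continue  # skip the header line and anything unexpected
        name_part, path = line.split("=>", 1)
        fields = name_part.split()
        if fields and fields[0].startswith("libhwloc.so"):
            return path.strip()
    return None


def preload_env(library_path, base_env=None):
    """Copy the environment and put library_path at the front of LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD")
    env["LD_PRELOAD"] = f"{library_path}:{existing}" if existing else library_path
    return env


def run_check():
    """Locate hwloc, then try hvd.init() in a child process with LD_PRELOAD set."""
    listing = subprocess.run(
        ["ldconfig", "-p"], capture_output=True, text=True, check=True
    ).stdout
    library_path = find_hwloc(listing)
    if library_path is None:
        raise RuntimeError("libhwloc.so not found in the ldconfig cache")
    command = [sys.executable, "-c", "import horovod.tensorflow as hvd; hvd.init()"]
    return subprocess.run(command, env=preload_env(library_path)).returncode
```

Calling run_check() returns the exit code of the child Python process, so 0 means hvd.init() completed without crashing. LD_PRELOAD has to be set before the interpreter starts, which is why the check runs in a separate process instead of changing os.environ in the current one.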