
Horovod - Accelerated multi-GPU AI training toolkit

Updated: Jun 15, 2023



In terms of functionality, Horovod is akin to MPI (Message Passing Interface). When you have an HPC application that you want to run on multiple processors, you use MPI. When you have an AI workload that you want to deploy on multiple GPUs, you can use Horovod.


Uber developed Horovod by optimising the message passing algorithms for collective operations.
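
To give a feel for what such a collective operation does, here is a minimal pure-Python simulation of a ring allreduce, the pattern Horovod is known for using to combine gradients across workers. This is only a sketch: the function name ring_allreduce and the list-of-lists data layout are made up for illustration, it sums rather than averages, and real Horovod performs this step in optimised native code over MPI, NCCL or Gloo rather than in Python.

```python
def ring_allreduce(worker_data):
    """Simulate a sum ring-allreduce over equal-length vectors, one per worker.

    Hypothetical helper for illustration only; not part of the Horovod API.
    """
    n = len(worker_data)
    length = len(worker_data[0])
    # Chunk c covers indices bounds[c] .. bounds[c + 1] - 1.
    bounds = [(i * length) // n for i in range(n + 1)]
    bufs = [list(v) for v in worker_data]  # each worker's local buffer

    def chunk(c):
        return range(bounds[c], bounds[c + 1])

    # Phase 1 (scatter-reduce): in step s, worker r sends chunk (r - s) % n
    # to its right-hand neighbour, which adds it to its own copy.
    for s in range(n - 1):
        # Collect all messages first so the sends in one step are simultaneous.
        msgs = []
        for r in range(n):
            c = (r - s) % n
            msgs.append(((r + 1) % n, c, [bufs[r][i] for i in chunk(c)]))
        for dst, c, payload in msgs:
            for i, val in zip(chunk(c), payload):
                bufs[dst][i] += val

    # Worker r now holds the fully reduced chunk (r + 1) % n.
    # Phase 2 (allgather): pass the finished chunks around the ring,
    # overwriting the stale partial sums.
    for s in range(n - 1):
        msgs = []
        for r in range(n):
            c = (r + 1 - s) % n
            msgs.append(((r + 1) % n, c, [bufs[r][i] for i in chunk(c)]))
        for dst, c, payload in msgs:
            for i, val in zip(chunk(c), payload):
                bufs[dst][i] = val

    return bufs
```

Calling ring_allreduce with one gradient vector per simulated worker returns one buffer per worker, each holding the element-wise sum. Each worker only ever talks to its neighbour and sends one chunk per step, which is why the ring pattern scales well as the number of GPUs grows.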


Some useful Horovod commands

pip uninstall horovod
pip install --no-cache-dir horovod

Troubleshooting

[1]. Segmentation fault with TensorFlow 1.14 or higher mentioning hwloc

If you are using TensorFlow 1.14 or 1.15 and are getting a segmentation fault, check whether it mentions hwloc:

… Signal: Segmentation fault (11) Signal code: Address not mapped (1) Failing at address: 0x99 [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7f309d34ff20] [ 1] /usr/lib/x86_64-linux-gnu/libopen-pal.so.20(opal_hwloc_base_free_topology+0x76)[0x7f3042871ca6] …

If it does, this could be a conflict with the hwloc symbols exported from TensorFlow.

To fix this, locate your hwloc library with ldconfig -p | grep libhwloc.so, and then set LD_PRELOAD. For example:

LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libhwloc.so python -c 'import horovod.tensorflow as hvd; hvd.init()'
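
If you would rather script the lookup than run the grep by hand, the following sketch parses the output of ldconfig -p and picks out the libhwloc.so entries. The helper names find_hwloc_paths and read_ldconfig_cache are hypothetical, and the parsing assumes the usual "name (arch) => path" layout of ldconfig -p lines, which may differ on your system.

```python
import subprocess


def find_hwloc_paths(ldconfig_output):
    """Return the paths of all libhwloc.so entries in `ldconfig -p` output.

    Hypothetical helper; assumes lines shaped like
    "<name> (<arch info>) => <path>".
    """
    paths = []
    for line in ldconfig_output.splitlines():
        name, sep, path = line.strip().partition(" => ")
        # Skip header lines and anything that is not a libhwloc.so entry.
        if sep and name.startswith("libhwloc.so"):
            paths.append(path.strip())
    return paths


def read_ldconfig_cache():
    """Run `ldconfig -p` and return its text output (Linux only)."""
    result = subprocess.run(
        ["ldconfig", "-p"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Example use on a Linux machine:
#   for path in find_hwloc_paths(read_ldconfig_cache()):
#       print("LD_PRELOAD=" + path)
```

Any path it prints can then be used as the LD_PRELOAD value in the command above.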

Additional Reading

https://horovod.readthedocs.io/

