
Horovod - Accelerated multi-GPU AI training toolkit

Updated: Jun 15, 2023

In terms of functionality, Horovod is akin to MPI (Message Passing Interface). When you have an HPC application that you want to run on multiple processors, you use MPI. When you have an AI workload that you want to deploy on multiple GPUs, you can use Horovod.

Uber developed Horovod by optimising the message passing algorithms for collective operations.
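To make "collective operation" concrete, here is a minimal sketch of a ring allreduce, the communication pattern usually associated with Horovod. It is a plain-Python simulation in a single process, not Horovod's or MPI's actual implementation. The function name `ring_allreduce` and the list-of-lists representation of workers are assumptions made for illustration only.

```python
def ring_allreduce(worker_data):
    """Simulate a sum allreduce over a ring of workers.

    worker_data: one equally sized list of numbers per (simulated) worker.
    Returns one list per worker; every worker ends up with the element-wise sum.
    """
    if not worker_data:
        raise ValueError("need at least one worker")
    n = len(worker_data)
    length = len(worker_data[0])
    if any(len(d) != length for d in worker_data):
        raise ValueError("all workers must hold vectors of the same length")

    bufs = [list(d) for d in worker_data]  # work on copies
    # Split each vector into n contiguous chunks (some may be empty).
    bounds = [(c * length // n, (c + 1) * length // n) for c in range(n)]

    # Phase 1, reduce-scatter: in step s, worker r sends chunk (r - s) % n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r - step) % n
            lo, hi = bounds[c]
            sends.append(((r + 1) % n, c, bufs[r][lo:hi]))  # snapshot first
        for dst, c, payload in sends:
            lo, _ = bounds[c]
            for i, value in enumerate(payload):
                bufs[dst][lo + i] += value
    # Now worker r holds the fully reduced chunk (r + 1) % n.

    # Phase 2, allgather: in step s, worker r forwards chunk (r + 1 - s) % n
    # to its right neighbour, which overwrites its copy with the final values.
    for step in range(n - 1):
        sends = []
        for r in range(n):
            c = (r + 1 - step) % n
            lo, hi = bounds[c]
            sends.append(((r + 1) % n, c, bufs[r][lo:hi]))
        for dst, c, payload in sends:
            lo, hi = bounds[c]
            bufs[dst][lo:hi] = payload
    return bufs


if __name__ == "__main__":
    grads = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
    for rank, result in enumerate(ring_allreduce(grads)):
        print(rank, result)
```

Each worker only ever talks to its right-hand neighbour, and each step moves one chunk rather than the whole vector. That is why this pattern is attractive when the "vector" is a large set of gradients spread across several GPUs.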

Some useful Horovod commands

pip uninstall horovod
pip install --no-cache-dir horovod


[1]. Segmentation fault with TensorFlow 1.14 or higher mentioning hwloc

If you are using TensorFlow 1.14 or 1.15 and are getting a segmentation fault, check whether it mentions hwloc:

… Signal: Segmentation fault (11) Signal code: Address not mapped (1) Failing at address: 0x99 [ 0] /lib/x86_64-linux-gnu/[0x7f309d34ff20] [ 1] /usr/lib/x86_64-linux-gnu/[0x7f3042871ca6] …

If it does, this could be a conflict with the hwloc symbols exported from TensorFlow.

To fix this, locate your hwloc library with ldconfig -p | grep, and then set LD_PRELOAD. For example:

LD_PRELOAD=/usr/lib/x86_64-linux-gnu/ python -c 'import horovod.tensorflow as hvd; hvd.init()'
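The same two steps (find the library in the `ldconfig -p` listing, then set LD_PRELOAD) can be sketched in Python. This is only an illustration under stated assumptions: the helper names `find_library_paths` and `env_with_preload` are hypothetical, and `SAMPLE_OUTPUT` is made-up text in the usual `ldconfig -p` line format with placeholder paths, not output from a real system.

```python
import os

# Hypothetical sample in the "name (arch) => path" format printed by `ldconfig -p`.
# The paths are placeholders, not real locations.
SAMPLE_OUTPUT = (
    "\tlibhwloc.so.5 (libc6,x86-64) => /hypothetical/lib/libhwloc.so.5\n"
    "\tlibz.so.1 (libc6,x86-64) => /hypothetical/lib/libz.so.1\n"
)


def find_library_paths(ldconfig_output, name_fragment):
    """Return the paths of cached libraries whose name contains name_fragment."""
    paths = []
    for line in ldconfig_output.splitlines():
        name, arrow, path = line.partition("=>")
        if arrow and name_fragment in name:
            paths.append(path.strip())
    return paths


def env_with_preload(library_path, base_env=None):
    """Copy an environment and put library_path at the front of LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD")
    env["LD_PRELOAD"] = f"{library_path}:{existing}" if existing else library_path
    return env


if __name__ == "__main__":
    matches = find_library_paths(SAMPLE_OUTPUT, "hwloc")
    print(matches)
    if matches:
        print(env_with_preload(matches[0], base_env={})["LD_PRELOAD"])
```

On a real machine, the text passed to `find_library_paths` would come from running `ldconfig -p` (for example via `subprocess.run(["ldconfig", "-p"], capture_output=True, text=True).stdout`). The resulting environment would then be passed as `env=` when launching the Python process that imports Horovod.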

