Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. It is hosted by the LF AI & Data Foundation (LF AI & Data).

Spark-Deep-Learning by Databricks supports Horovod on Databricks clusters with the Machine Learning runtime. HorovodRunner runs distributed deep learning training jobs using Horovod. On Databricks Runtime 5.0 ML and above, it launches the Horovod job as a distributed Spark job. A Horovod MPI job is embedded as a Spark job using barrier execution mode. See the sparkdl API documentation and "Use XGBoost on Azure Databricks" for more details.

Databricks Runtime ML includes many external libraries, including TensorFlow, PyTorch, Horovod, scikit-learn, and XGBoost. It also provides extensions to improve performance, including GPU acceleration in XGBoost, distributed deep learning using HorovodRunner, and model checkpointing using a Databricks File System (DBFS) FUSE mount.

class HorovodRunner(object):
    """
    HorovodRunner runs distributed deep learning training jobs using Horovod.
    """

MNIST example: mnist-tensorflow-keras

def get_dataset(num_classes, rank=0, size=1):
    from tensorflow import keras
    ...  # the remainder of the function is not included in this excerpt
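
The `rank` and `size` parameters suggest that each worker loads only its own slice of the data. The body of `get_dataset` is not shown here, so the following is only a sketch of one common way to do this, using a strided slice. The helper name `shard` is hypothetical and does not come from the notebook.

```python
def shard(samples, rank=0, size=1):
    """Return the part of `samples` assigned to worker `rank` out of `size` workers.

    Hypothetical helper: worker `rank` takes every `size`-th sample,
    starting at index `rank`.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    if not 0 <= rank < size:
        raise ValueError("rank must satisfy 0 <= rank < size")
    return samples[rank::size]


# Ten samples split across two workers.
samples = list(range(10))
shard_0 = shard(samples, rank=0, size=2)  # samples at even indices
shard_1 = shard(samples, rank=1, size=2)  # samples at odd indices
```

With the defaults `rank=0, size=1` the helper returns the full dataset, which matches the single-process case. The same slice expression also works on NumPy arrays.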

The goal of Horovod is to make distributed deep learning fast and easy to use. HorovodRunner makes running Horovod easy on Databricks by managing the cluster setup and integrating with Spark. By integrating Horovod with Spark's barrier mode, Databricks is able to provide higher stability for long-running deep learning training jobs on Spark. Previously, to use HorovodRunner you would have to run a driver and at least one worker node.

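from sparkdl import HorovodRunner

# Run only 2 workers (rank 0 and rank 1).
hr = HorovodRunner(np=2)
# train_fn is the training function, defined elsewhere; the keyword arguments are passed to it.
hr.run(
    main=train_fn,
    checkpoint_path="/dbfs/mnt/testblob/horovod_trained_model/checkpoint.ckpt",
    learning_rate=0.01)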
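
HorovodRunner's `run` method invokes the training function as `main(**kwargs)`, so `train_fn` needs to accept `checkpoint_path` and `learning_rate` as keyword arguments. The sketch below shows only that forwarding. `LocalRunner` is a hypothetical stand-in written for illustration: it does not start any Horovod processes, the body of `train_fn` is a placeholder, and the `/tmp` path is an assumption.

```python
class LocalRunner:
    """Hypothetical stand-in for HorovodRunner; it only mimics argument forwarding."""

    def __init__(self, np):
        # Number of processes requested; unused in this single-process sketch.
        self.np = np

    def run(self, main, **kwargs):
        # Forward every keyword argument to the training function.
        return main(**kwargs)


def train_fn(checkpoint_path, learning_rate):
    # Placeholder body: a real function would build and train a model here.
    return {"checkpoint_path": checkpoint_path, "learning_rate": learning_rate}


runner = LocalRunner(np=2)
result = runner.run(
    main=train_fn,
    checkpoint_path="/tmp/checkpoint.ckpt",  # illustrative path (assumption)
    learning_rate=0.01,
)
```

If a keyword passed to `run` does not match a parameter of the training function, the call fails with a `TypeError`, so it helps to keep the two in sync.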

API class sparkdl.HorovodRunner

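def train_hvd():
    # Assumption: the Keras binding is used; the source does not show the import.
    import horovod.tensorflow.keras as hvd
    hvd.init()
    ...  # Horovod training code is elided in the source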
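
After `hvd.init()`, each process can read its rank and the total number of processes with `hvd.rank()` and `hvd.size()`. Two conventions recommended in the Horovod documentation are to scale the learning rate by the number of workers and to write checkpoints from rank 0 only. The sketch below captures just that logic in plain Python so that it runs without Horovod. `worker_settings` and its parameters are hypothetical names, and `rank` and `size` stand in for the values returned by `hvd.rank()` and `hvd.size()`.

```python
def worker_settings(rank, size, base_learning_rate, checkpoint_path):
    """Hypothetical helper: derive per-worker settings from rank and size."""
    return {
        # The effective batch grows with the number of workers,
        # so the learning rate is scaled by `size`.
        "learning_rate": base_learning_rate * size,
        # Only rank 0 writes checkpoints, so workers do not overwrite each other.
        "checkpoint_path": checkpoint_path if rank == 0 else None,
    }


# Settings for a two-worker job; the path is illustrative (assumption).
settings = [
    worker_settings(rank, 2, 0.01, "/tmp/checkpoint.ckpt") for rank in range(2)
]
```

Inside a real training function, these values would feed the optimizer and the checkpoint callback of the chosen framework.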

Enabled HorovodRunner to run on only the driver node.