Mock Data Providers

Prerequisites

Mocking training data requires a basic understanding of how FLOps manages ML data on learner nodes. Read the corresponding documentation before continuing.

FLOps provides a mock data provider (MDP) service, shipped as a container image, that lets newcomers and users without access to edge devices get started quickly, e.g., by running FLOps on a single machine. The MDP also serves as a reference implementation: your edge devices can follow it to ensure they send their data correctly to the orchestrated learner nodes.

Requirements

An MDP has to be deployed as an Oakestra service on the same worker node where training will later take place. That worker node must have the FLOps-learner addon activated.

The MDP and any FLOps project require data tags in their SLAs. These tags have to match; otherwise, the learner will not find the data, and training will fail.

You must ensure that the ML Git repository can correctly transform the requested dataset for training. In other words, always double-check that the dataset is compatible with your ML training code.

MDP SLA Format

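Here is an example SLA for the cifar10 dataset, which the MDP will split into three partitions. The % annotations are explanatory only and must be removed before sending, because JSON does not allow comments.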

{
  % The ID has to match the user’s orchestrator ID.
  "customerID": "Admin",
  "mock_data_configuration": {
    % This value can be any dataset name available on Hugging Face.
    "dataset_name": "cifar10",
    "number_of_partitions": 3,
    % This tag must match the one for your FLOps project,
    % otherwise the learner will not find the data and training will fail.
    "data_tag": "my_tag"
  }
}

To deploy an MDP, send an API request with a suitable SLA to the FLOps manager. The POST endpoint is: /api/flops/mocks
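The request described above could be prepared roughly as follows. This is a minimal sketch, not the official client: the base URL, the bearer-token header, and the helper names `build_mdp_sla` and `build_mock_request` are assumptions made for illustration. Only the endpoint path and the SLA fields come from this page.

```python
import json
import urllib.request


def build_mdp_sla(customer_id, dataset_name, number_of_partitions, data_tag):
    """Assemble an MDP SLA with the fields shown in the example above."""
    if number_of_partitions < 1:
        raise ValueError("number_of_partitions must be at least 1")
    return {
        "customerID": customer_id,
        "mock_data_configuration": {
            "dataset_name": dataset_name,
            "number_of_partitions": number_of_partitions,
            "data_tag": data_tag,
        },
    }


def build_mock_request(base_url, sla, token=None):
    """Prepare (but do not send) the POST request for /api/flops/mocks."""
    headers = {"Content-Type": "application/json"}
    if token is not None:
        # Assumption: the FLOps manager accepts a bearer token.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/flops/mocks",
        data=json.dumps(sla).encode("utf-8"),
        headers=headers,
        method="POST",
    )


sla = build_mdp_sla("Admin", "cifar10", 3, "my_tag")
# Hypothetical address; replace it with the address of your FLOps manager.
request = build_mock_request("http://flops-manager.example:8080", sla)
# To actually send it: urllib.request.urlopen(request)
```

Building the request separately from sending it makes it easy to inspect the payload before it reaches the FLOps manager.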

Make mocking data easy

The oak-cli provides predefined MDP SLAs and can deploy them for you with a single command.

Architecture

[Figure: MDP architecture on a learner node. Labels shown: Mock Data Provider (Flower Datasets, Arrow Flight Client), ML Data Server (Arrow Flight Server), ML Data Volume (DataTag1.hash1, DataTag1.hash2), User ML Code, Final Training Data. Legend entries: FLOps Helper SLA, Parquet Format, Arrow Format.]

The MDP uses Flower Datasets to fetch monolithic datasets from Hugging Face and split them into heterogeneous partitions. Each partition is sent (via Arrow Flight) to the ML Data Server located on the same worker node, just as if multiple edge devices had sent their data to the data server.
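The splitting step can be pictured with a small standard-library sketch. It does not use Flower Datasets or Arrow Flight, and the function `split_into_partitions` is hypothetical. It only illustrates cutting one dataset into several non-overlapping partitions of uneven size; the partitioning strategy the MDP actually applies may differ.

```python
import random


def split_into_partitions(records, number_of_partitions, seed=0):
    """Shuffle the records and cut them into uneven, non-empty partitions."""
    if not 1 <= number_of_partitions <= len(records):
        raise ValueError("need between 1 and len(records) partitions")
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Distinct cut points guarantee that no partition is empty.
    cuts = sorted(rng.sample(range(1, len(shuffled)), number_of_partitions - 1))
    bounds = [0, *cuts, len(shuffled)]
    return [shuffled[start:end] for start, end in zip(bounds, bounds[1:])]


# Stand-in for a dataset: 100 integer "samples" split three ways.
partitions = split_into_partitions(range(100), 3)
```

In the real MDP, each partition would then be sent separately to the ML Data Server, which is what makes a single MDP look like several edge devices.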

Once training starts (assuming the data tags match), the learner service will fetch the stored data and merge it back together.
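The merge step can be sketched in the same spirit. The in-memory dictionary and the `<data_tag>.<hash>` key scheme are assumptions loosely modeled on the labels in the architecture figure, and `merge_partitions` is a hypothetical helper, not the learner's actual code.

```python
def merge_partitions(stored, data_tag):
    """Concatenate every stored partition whose key belongs to data_tag."""
    prefix = data_tag + "."
    matching = [key for key in sorted(stored) if key.startswith(prefix)]
    if not matching:
        # Mirrors the failure case described above: a tag mismatch means no data.
        raise LookupError(f"no data found for tag {data_tag!r}")
    merged = []
    for key in matching:
        merged.extend(stored[key])
    return merged


# Hypothetical contents of the data volume after two partitions arrived.
stored = {
    "DataTag1.hash1": [1, 2],
    "DataTag1.hash2": [3],
    "OtherTag.hash1": [9],
}
training_data = merge_partitions(stored, "DataTag1")
```

Raising an error when no partition matches makes a tag mismatch visible immediately instead of letting training start on empty data.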

Mock Data Provider Implementation

See the source code that powers the mock data providers.