ML Data Management

A noticeable trend in FL is the reliance on virtual simulations with pre-existing datasets. In real deployments, FL operates on previously unseen, heterogeneous data. FLOps aims to make FL more practical and application-oriented. For this reason, FLOps requires either real data from edge devices or “mocked” data that is provided as if it had originated from real devices.

Mock Data Providers

The mock data providers documentation explains how to ‘mock’ real devices and data if you don’t have access to such devices or simply want to try out FLOps on a single machine.

Architecture

Lightweight edge devices tend to lack the computational capabilities to perform machine learning. Instead, they can send their aggregated data to a more powerful learner node nearby. This learner node will collect and store data from different sources.
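The collecting role of the learner node can be illustrated with a minimal sketch. It is not the FLOps implementation: the `DataVolume` class, its `store` method, and the "<tag>.<hash>" naming scheme are assumptions made for illustration (the naming loosely follows the example file names in the workflow figure), and plain Python lists stand in for data that would really be sent over Arrow Flight.

```python
import hashlib
import json


class DataVolume:
    """Hypothetical in-memory stand-in for the learner node's ML data volume."""

    def __init__(self):
        self.files = {}  # file name -> list of rows

    def store(self, data_tag, rows):
        # Assumption: each partition is named "<data_tag>.<content hash>", so
        # different uploads under the same tag do not overwrite each other.
        payload = json.dumps(rows, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()[:8]
        name = f"{data_tag}.{digest}"
        self.files[name] = rows
        return name


volume = DataVolume()

# Two edge devices push data under the same tag; a third uses another tag.
name_a = volume.store("DataTag1", [{"x": 1.0, "y": 0}])
name_b = volume.store("DataTag1", [{"x": 2.0, "y": 1}])
name_c = volume.store("DataTag2", [{"x": 3.0, "y": 1}])

print(sorted(volume.files))
```

In this sketch the learner node ends up with one partition per upload, grouped by data tag through the file name prefix.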

[Figure: architecture diagram. Components shown: Edge Device, Arrow Flight Client, Orchestrated Learner Node, ML Data Server, Arrow Flight Server, ML Data Volume, Execution Runtime, Learner Container, Arrow Flight Client, Data Loading, User ML Code, Data Manager, Model Manager.]

Once training starts, the deployed learner service requests the data that matches the data tags specified in its SLA. The matching data partitions are fetched, merged into a single dataset, and passed to the user-specified data preprocessing. Finally, the preprocessed data is forwarded to the ML model for training.
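The fetch, merge, and preprocess steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not FLOps code: the function names `fetch_matching`, `merge`, and `users_preprocessing` are hypothetical, partitions are assumed to be named "<tag>.<hash>", and plain lists of rows replace the Parquet files and Arrow tables a real deployment would use.

```python
def fetch_matching(volume, data_tags):
    """Return all partitions whose file name prefix is one of the requested tags."""
    return [
        rows
        for name, rows in sorted(volume.items())
        if name.split(".", 1)[0] in data_tags
    ]


def merge(partitions):
    """Concatenate the matching partitions into a single dataset."""
    merged = []
    for partition in partitions:
        merged.extend(partition)
    return merged


def users_preprocessing(rows):
    """Hypothetical user-defined step: split rows into features and labels."""
    features = [[row["x"]] for row in rows]
    labels = [row["y"] for row in rows]
    return features, labels


# Assumed contents of the data volume on the learner node.
volume = {
    "DataTag1.hash1": [{"x": 1.0, "y": 0}],
    "DataTag1.hash2": [{"x": 2.0, "y": 1}],
    "DataTag2.hash3": [{"x": 3.0, "y": 1}],
}

# The tags would come from the learner's SLA; here they are hard-coded.
partitions = fetch_matching(volume, {"DataTag1"})
dataset = merge(partitions)
features, labels = users_preprocessing(dataset)
```

Only the two `DataTag1` partitions are selected here; the `DataTag2` partition stays on the volume and never reaches the model.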

[Figure: workflow diagram.
Labeled steps: A1) init(); A2) prepareData(); A3) loadDataset(); A4) loadDataFromMLDataServer(); A5) Fetch all files with matching DataTags; A6) Stream Matching Files; A7) convertAndMerge(); A8) (no label); A9) usersPreprocessing(); B1) setModelData(); B2) getData(); B3) (no label).
Example files on the ML Data Volume: DataTag1.hash1, DataTag1.hash2, DataTag2.hash3.
Legend entries: Stream File, Parquet Format, Arrow Format, User Format, Final Training Data.]
ML Data Management Workflow

Find out why FLOps uses Arrow Flight here.