XGBoost Inference on the Adult Dataset Using FHE¶

Expected RAM usage: 25 GB. Expected runtime: 3-6 minutes.

Introduction¶

This example demonstrates how to perform inference over encrypted data with an XGBoost model. We work with the UCI adult dataset [1-2], where the task is to predict whether a person's income exceeds 50K a year. First, a plain XGBoost model is trained on the adult dataset in the clear. Then, the trained XGBoost model is encrypted and used to run prediction over an encrypted batch of samples from the adult dataset.

Step 1. Training a plain XGBoost model¶

We train an XGBoost model with the adult dataset, in plaintext.

1.1 Decide whether this demo will be run on GPU¶

Running on GPU is only possible if the machine has a GPU and helayers was compiled with GPU support. If these conditions are satisfied, changing the flag below to True will make the demo run on GPU.

In [1]:
run_with_gpu = False

1.2. We start with some imports:¶

In [2]:
import numpy as np
import math
import os
import pyhelayers
import shutil
from sklearn import datasets
from sklearn.model_selection import train_test_split
from utils import get_used_ram, get_data_sets_dir
from xgboost import XGBClassifier
import pandas as pd

1.3. Load the adult dataset¶

In [3]:
def preprocess(X, y):
    X['marital-status'] = X['marital-status'].str.strip()
    X['marital-status'] = X['marital-status'].replace(['Married-civ-spouse','Married-spouse-absent','Married-AF-spouse'], 'Married')
    X['marital-status'] = X['marital-status'].replace(['Never-married','Divorced','Separated','Widowed'], 'Single')
    X['marital-status'] = X['marital-status'].map({'Married':0, 'Single':1})
    X['marital-status'] = X['marital-status'].astype('int')
    X = X[['age', 'education-num', 'marital-status', 'hours-per-week', 'capital-loss', 'capital-gain']]
    y=y.str.strip().map({'<=50K': 0, '>50K': 1}).astype('int')
    return (X, y)
    
column_names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
                'marital-status', 'occupation', 'relationship', 'race', 'sex',
                'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'label']
INPUT_DIR = os.path.join(get_data_sets_dir(), 'uci_adult')

train_data = pd.read_csv(os.path.join(INPUT_DIR, "adult.data"), names=column_names, header=None,
                         index_col=False, engine='python')
X_train = train_data.iloc[:,:-1]
y_train = train_data.iloc[:,-1]
X_train, y_train = preprocess(X_train, y_train)

test_data = pd.read_csv(os.path.join(INPUT_DIR, "adult.test"), names=column_names, header=None,
                        index_col=False, skiprows=1, sep="[,.]", engine='python')
X_test = test_data.iloc[:,:-1]
y_test = test_data.iloc[:,-1]
X_test, y_test = preprocess(X_test, y_test)
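
As an optional sanity check, the shapes and label balance of the preprocessed data can be inspected. A minimal illustrative sketch, using only the data frames loaded above:

# Optional sanity check on the preprocessed data (illustrative only)
print('training samples:', X_train.shape, 'test samples:', X_test.shape)
print('fraction of positive labels in the training set:', y_train.mean())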

1.4. Train the XGBoost model¶

We use the XGBoost Python library to train an XGBoost model on the adult dataset, and save the resulting model to a JSON file. This JSON file will later be used to initialize the encrypted XGBoost model.

In [4]:
clf = XGBClassifier(eta=0.2, gamma=3.6, max_depth=3,
                    min_child_weight=3, subsample=0.8, objective="binary:logistic",
                    scale_pos_weight=14.978045943588253, eval_metric="aucpr", n_estimators=10)
clf.fit(X_train, y_train)
model_dir = os.path.join('data', 'adult_xgboost')
os.makedirs(model_dir, exist_ok=True)
model_path = os.path.join(model_dir, 'xgb.json')
clf.save_model(model_path)
print('plain XGBoost model saved')
plain XGBoost model saved
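
As a side note on the hyper-parameters above: scale_pos_weight is commonly chosen based on the class imbalance of the training labels. A minimal sketch of computing the negative-to-positive ratio for reference (the value used above may instead come from separate hyper-parameter tuning):

# Ratio of negative to positive training labels; a common starting point for
# scale_pos_weight (the value used in the cell above may have been tuned separately)
neg_pos_ratio = float((y_train == 0).sum()) / float((y_train == 1).sum())
print('negative/positive ratio in the training set:', neg_pos_ratio)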

1.5 Evaluate the XGBoost model¶

We compute the F1 score for the XGBoost model trained above. This score will later be compared with the F1 score of the prediction over encrypted data.

In [22]:
from sklearn.metrics import f1_score
plain_xgb_preds = clf.predict(X_test)
f1_plain = f1_score(y_test, plain_xgb_preds)
print('plain XGBoost f1 score =', f1_plain)
plain XGBoost f1 score = 0.5488528915927582

Step 2. FHE inference¶

In this step, we will encrypt the above trained XGBoost model and the test samples from the adult dataset. The encrypted XGBoost model will be used to run prediction over the encrypted adult samples.

2.1 Compute the feature ranges¶

Our XGBoost implementation requires the user to specify the minimum and maximum values of each feature. Here, we extract this information from the training data and assume it is also representative of the test data.

In [23]:
def get_feature_range(col):
    return (col.min(), col.max())
    
feature_ranges = []
for col in X_train:
    feature_ranges.append(get_feature_range(X_train[col]))
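
As a quick check, the computed ranges can be printed alongside the feature names. A small illustrative sketch:

# Illustrative: print the (min, max) range computed for each feature
for name, (low, high) in zip(X_train.columns, feature_ranges):
    print(f'{name}: [{low}, {high}]')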

2.2. Initialize a PlainXGBoost object¶

We initialize a PlainXGBoost object using the XGBoost model trained above. This object holds the XGBoost weights in plaintext, and will later be encrypted and used for prediction over encrypted data.

In [24]:
hyper_params = pyhelayers.PlainModelHyperParams()
hyper_params.feature_ranges = feature_ranges
hyper_params.grep = 4
hyper_params.frep = 1
plain_xgb = pyhelayers.PlainModel.create(hyper_params, [model_path])

2.3 Define HE run requirements¶

These requirements specify how the HE encryption should be configured. Here, we require the HE encryption to be done with the HEaaN CKKS encryption scheme.

In [25]:
he_run_req = pyhelayers.HeRunRequirements()
he_run_req.set_he_context_options([pyhelayers.HeaanContext()])

2.4 Compile the plain model and HE run requirements into HE profile¶

The compilation produces an HE profile which holds encryption-specific parameters.

In [26]:
profile = pyhelayers.HeModel.compile(plain_xgb, he_run_req)

2.5 Initialize the HE context¶

Once the HE profile is ready, we use it to initialize the context. If the run_with_gpu flag is set, we update the he_context to use a GPU device by default. Otherwise, we use the CPU as usual.

In [27]:
he_context = pyhelayers.HeModel.create_context(profile)
if run_with_gpu:
    he_context.set_default_device(pyhelayers.DeviceType.DEVICE_GPU)
else:
    he_context.set_default_device(pyhelayers.DeviceType.DEVICE_CPU)

2.6. Initialize the XGBoost model and attach output storage¶

We initialize the HE XGBoost model using the plain model created above. We also attach an output directory to the model. This directory will be used to store the encrypted trees and load them on demand upon prediction. Using output storage prevents out-of-memory errors in case the encrypted XGBoost model is too large to hold in memory.

In [28]:
xgb = plain_xgb.get_empty_he_model(he_context)
storage_dir = os.path.join('outputs', 'xgb_storage')
os.makedirs(storage_dir, exist_ok=True)
fstorage = pyhelayers.FileStorage(storage_dir, create=True)
xgb.attach_output_storage(fstorage)

2.7 Encrypt the XGBoost model¶

In [29]:
xgb.encode_encrypt(plain_xgb, profile)
print('FHE XGBoost model encrypted and initialized')
FHE XGBoost model encrypted and initialized

2.8 Get an IoProcessor from the HE XGBoost model¶

The IoProcessor object will be used to encrypt and decrypt the input and output of the FHE prediction.

In [30]:
iop = xgb.create_io_processor()

2.9. Encrypt the test samples¶

We encrypt the test samples using the IoProcessor created above.

In [31]:
X_test_enc = pyhelayers.EncryptedData(he_context)
iop.encode_encrypt_inputs_for_predict(X_test_enc, [X_test])
print('input data encrypted')
input data encrypted

2.10 Flush the XGBoost model to the output storage¶

The flush_to_storage() function stores the XGBoost metadata in the output storage directory.

In [32]:
xgb.flush_to_storage()
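
To get a sense of how large the encrypted model is on disk, the storage directory can be measured after flushing. A minimal sketch using only the standard library (illustrative, not part of the pyhelayers API):

# Illustrative: total size of the encrypted model files in the storage directory
total_bytes = sum(os.path.getsize(os.path.join(root, f))
                  for root, _, files in os.walk(storage_dir) for f in files)
print('encrypted model storage size: %.1f MB' % (total_bytes / 2**20))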

2.11 Load the XGBoost model¶

When we encrypted the model and flushed it to the file storage, all the encrypted trees were stored in the folder we specified. This folder can now be moved to a cloud server and used for prediction. Here we simply create a new XGBoost model and attach the saved folder to it. Upon prediction, the encrypted trees will be loaded on demand from the file system.

In [33]:
xgb_server = plain_xgb.get_empty_he_model(he_context)
xgb_server.attach_input_storage(fstorage)

2.12 Run prediction over the encrypted data¶

We perform FHE prediction on the encrypted test samples, using the encrypted XGBoost model. The resulting predictions are encrypted as well, and will next be decrypted and compared to the expected labels.

In [34]:
res = pyhelayers.EncryptedData(he_context)
xgb_server.predict(res, X_test_enc)
print('prediction ready')
prediction ready

Step 3. Decrypt the prediction results¶

We decrypt the prediction results and then convert the predicted float output values to integer labels.

In [35]:
res_plain = iop.decrypt_decode_output(res)
res_plain = np.where(res_plain > 0, 1, 0)
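
Before computing the F1 score, the FHE predictions can also be compared directly against the plain model's predictions. A small sketch, assuming both arrays flatten to one label per test sample:

# Illustrative: fraction of test samples where the FHE prediction agrees with
# the plain XGBoost prediction (assumes one label per sample after flattening)
agreement = np.mean(res_plain.flatten() == np.array(plain_xgb_preds).flatten())
print('agreement with plain XGBoost predictions:', agreement)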

3.1. Evaluate the FHE prediction¶

We compute the F1 score of the FHE prediction and verify that it is very close to the F1 score of the plain prediction.

In [36]:
f1_fhe = f1_score(y_test, res_plain)
print('FHE XGBoost f1 score =', f1_fhe)
assert(f1_fhe >= f1_plain - 0.1)
FHE XGBoost f1 score = 0.5494344057587777
In [37]:
print("RAM usage:", get_used_ram(), "MB")
RAM usage: 1730.9765625 MB

3.2 Remove the output storage directory¶

We remove the directory that was used to store the encrypted trees of the XGBoost model.

In [38]:
shutil.rmtree('outputs')

Citations¶

[1] Kohavi, R., Becker, B.: UCI Machine Learning Repository - Adult dataset (1996), https://archive.ics.uci.edu/ml/datasets/adult.

[2] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.