Merged
Changes from 3 commits
386 changes: 386 additions & 0 deletions notebooks/XGboost_Demo.ipynb
@@ -0,0 +1,386 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction to XGBoost with RAPIDS\n",
"#### By Paul Hendricks\n",
"-------\n",
"\n",
"While the world’s data doubles each year, CPU computing has hit a brick wall with the end of Moore’s law. For the same reasons that scientific computing and deep learning have turned to NVIDIA GPU acceleration, data analytics and machine learning are also areas where GPU acceleration is ideal.\n",
"\n",
"NVIDIA created RAPIDS – an open-source data analytics and machine learning acceleration platform that leverages GPUs to accelerate computations. RAPIDS is based on Python, has pandas-like and Scikit-Learn-like interfaces, is built on the Apache Arrow in-memory data format, and can scale from a single GPU to multiple GPUs to multiple nodes. RAPIDS integrates easily into the world’s most popular Python-based data science workflows. RAPIDS accelerates data science end-to-end – from data prep, to machine learning, to deep learning. And through Arrow, Spark users can easily move data into the RAPIDS platform for acceleration.\n",
"\n",
"In this notebook, we'll show the acceleration one can gain by using GPUs with XGBoost in RAPIDS.\n",
"\n",
"**Table of Contents**\n",
"\n",
"* Setup\n",
"* Load Libraries\n",
"* Load/Simulate Data\n",
"    * Load Data\n",
"    * Simulate Data\n",
"    * Split Data\n",
"    * Check Dimensions\n",
"* Convert NumPy data to DMatrix format\n",
"* Set Parameters\n",
"* Train Model\n",
"* Conclusion"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"To start, let's see what hardware we're working with."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2018-11-06T21:03:38.237293Z",
"start_time": "2018-11-06T21:03:37.388285Z"
}
},
"outputs": [],
"source": [
"!nvidia-smi"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let's see what CUDA version we have."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2018-11-06T21:03:39.490984Z",
"start_time": "2018-11-06T21:03:39.134608Z"
}
},
"outputs": [],
"source": [
"!nvcc --version"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load Libraries\n",
"\n",
"Let's load some of the libraries within the RAPIDS ecosystem and see which versions we have."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2018-11-06T21:03:41.067879Z",
"start_time": "2018-11-06T21:03:40.256654Z"
}
},
"outputs": [],
"source": [
"import numpy as np; print('numpy Version:', np.__version__)\n",
"import pandas as pd; print('pandas Version:', pd.__version__)\n",
"import xgboost as xgb; print('XGBoost Version:', xgb.__version__)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load/Simulate Data\n",
"\n",
"### Load Data\n",
"\n",
"We can load the data using `pandas.read_csv`.\n",
"\n",
"### Simulate Data\n",
"\n",
"Alternatively, we can simulate data for our train and validation datasets. The simulated dataset is tabular, with `n_rows` rows and `n_columns` feature columns. Values are drawn as floats in [0, 1) when the data is numerical or as integers in {0, 1} when it is categorical, and the helper below casts the result to `np.float32` in both cases. Numerical and categorical features can also be combined; for this experiment, we have ignored this combination."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# helper function for simulating data\n",
"def simulate_data(m, n, k=2, numerical=False):\n",
"    if numerical:\n",
"        # floats drawn uniformly from [0, 1)\n",
"        features = np.random.rand(m, n)\n",
"    else:\n",
"        # binary categorical features in {0, 1}\n",
"        features = np.random.randint(2, size=(m, n))\n",
"    labels = np.random.randint(k, size=m)\n",
"    # column 0 holds the labels; the remaining n columns hold the features\n",
"    return np.c_[labels, features].astype(np.float32)\n",
"\n",
"\n",
"# helper function for loading data\n",
"def load_data(filename, n_rows):\n",
"    if n_rows >= 1e9:\n",
"        # read the whole file when a very large row count is requested\n",
"        df = pd.read_csv(filename)\n",
"    else:\n",
"        df = pd.read_csv(filename, nrows=n_rows)\n",
"    return df.values.astype(np.float32)"
]
},
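{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `simulate_data` helper draws from NumPy's global random state, so each run produces different data. As a minimal sketch, and not part of the original notebook, a seeded variant could look like the following. The name `simulate_data_seeded` and the `seed` parameter are hypothetical, introduced only for illustration; the layout (labels in column 0, features after) mirrors `simulate_data`.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"\n",
"def simulate_data_seeded(m, n, k=2, numerical=False, seed=0):\n",
"    # hypothetical variant of simulate_data that uses a local, seeded generator\n",
"    rng = np.random.default_rng(seed)\n",
"    if numerical:\n",
"        features = rng.random((m, n))            # floats in [0, 1)\n",
"    else:\n",
"        features = rng.integers(2, size=(m, n))  # integers in {0, 1}\n",
"    labels = rng.integers(k, size=m)             # class ids in {0, ..., k - 1}\n",
"    # column 0 holds the labels, the remaining n columns hold the features\n",
"    return np.c_[labels, features].astype(np.float32)\n",
"\n",
"\n",
"a = simulate_data_seeded(1000, 5, seed=42)\n",
"b = simulate_data_seeded(1000, 5, seed=42)\n",
"print(a.shape, a.dtype, np.array_equal(a, b))\n",
"```\n",
"\n",
"Calling it twice with the same seed returns identical arrays, which makes timing comparisons easier to repeat."
]
},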
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# settings\n",
"LOAD = False\n",
"n_rows = int(1e5)\n",
"n_columns = int(100)\n",
"n_categories = 2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"\n",
"if LOAD:\n",
"    dataset = load_data('/tmp', n_rows)\n",
"else:\n",
"    dataset = simulate_data(n_rows, n_columns, n_categories)\n",
"print(dataset.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Split Data\n",
"\n",
"We'll split our dataset into an 80% training dataset and a 20% validation dataset."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# identify shape and indices\n",
"n_rows, n_columns = dataset.shape\n",
"train_size = 0.80\n",
"train_index = int(n_rows * train_size)\n",
"\n",
"# split X, y\n",
"X, y = dataset[:, 1:], dataset[:, 0]\n",
"del dataset\n",
"\n",
"# split train data\n",
"X_train, y_train = X[:train_index, :], y[:train_index]\n",
"\n",
"# split validation data\n",
"X_validation, y_validation = X[train_index:, :], y[train_index:]"
]
},
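{
"cell_type": "markdown",
"metadata": {},
"source": [
"The split above is positional: the first 80% of rows become the training set. That is fine for simulated rows, which are drawn independently, but data loaded from a file may be ordered. A minimal sketch, assuming a shuffle is wanted, is shown below; `shuffled_split` and its `seed` parameter are hypothetical names, not part of the original notebook.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"\n",
"def shuffled_split(X, y, train_size=0.80, seed=0):\n",
"    # hypothetical helper: permute the row indices once, then cut at train_size\n",
"    rng = np.random.default_rng(seed)\n",
"    order = rng.permutation(X.shape[0])\n",
"    cut = int(X.shape[0] * train_size)\n",
"    train_idx, valid_idx = order[:cut], order[cut:]\n",
"    return X[train_idx], y[train_idx], X[valid_idx], y[valid_idx]\n",
"\n",
"\n",
"# tiny demo: row i of X_demo is (2i, 2i + 1) and its label is i\n",
"X_demo = np.arange(20, dtype=np.float32).reshape(10, 2)\n",
"y_demo = np.arange(10, dtype=np.float32)\n",
"X_tr, y_tr, X_va, y_va = shuffled_split(X_demo, y_demo)\n",
"print(X_tr.shape, X_va.shape)\n",
"```\n",
"\n",
"Indexing `X` and `y` with the same permuted indices keeps each label aligned with its feature row."
]
},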
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Check Dimensions\n",
"\n",
"We can check the dimensions and proportions of our training and validation datasets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# check dimensions\n",
"print('X_train:', X_train.shape, X_train.dtype, 'y_train:', y_train.shape, y_train.dtype)\n",
"print('X_validation:', X_validation.shape, X_validation.dtype, 'y_validation:', y_validation.shape, y_validation.dtype)\n",
"\n",
"# check the proportions\n",
"total = X_train.shape[0] + X_validation.shape[0]\n",
"print('X_train proportion:', X_train.shape[0] / total)\n",
"print('X_validation proportion:', X_validation.shape[0] / total)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Convert NumPy data to DMatrix format\n",
"\n",
"With our data simulated and formatted as NumPy arrays, our next step is to convert it to a `DMatrix` object that XGBoost can work with. We can instantiate an `xgboost.DMatrix` object by passing in the feature matrix as the first argument, followed by the label vector using the `label=` keyword argument. To learn more about XGBoost's support for data structures other than NumPy arrays, see the documentation for the Data Interface:\n",
"\n",
"\n",
"https://xgboost.readthedocs.io/en/latest/python/python_intro.html#data-interface\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2018-11-06T21:03:55.278322Z",
"start_time": "2018-11-06T21:03:54.059643Z"
}
},
"outputs": [],
"source": [
"%%time\n",
"\n",
"dtrain = xgb.DMatrix(X_train, label=y_train)\n",
"dvalidation = xgb.DMatrix(X_validation, label=y_validation)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Set Parameters\n",
"\n",
"A number of parameters can be set before running XGBoost.\n",
"\n",
"* General parameters relate to which booster we are using to do boosting, commonly a tree or linear model.\n",
"* Booster parameters depend on which booster you have chosen.\n",
"* Learning task parameters decide on the learning scenario. For example, regression tasks may use different parameters than ranking tasks.\n",
"\n",
"For more information on the configurable parameters within the XGBoost module, see the documentation here:\n",
"\n",
"\n",
"https://xgboost.readthedocs.io/en/latest/parameter.html"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2018-11-06T21:03:57.443698Z",
"start_time": "2018-11-06T21:03:57.438288Z"
}
},
"outputs": [],
"source": [
"# instantiate params\n",
"params = {}\n",
"\n",
"# general params\n",
"general_params = {'verbosity': 0}\n",
"params.update(general_params)\n",
"\n",
"# booster params\n",
"booster_params = {}\n",
"booster_params['tree_method'] = 'hist'\n",
"booster_params['device'] = 'cuda'\n",
"params.update(booster_params)\n",
"\n",
"# learning task params\n",
"learning_task_params = {'eval_metric': 'auc', 'objective': 'binary:logistic'}\n",
"params.update(learning_task_params)\n",
"print(params)"
]
},
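{
"cell_type": "markdown",
"metadata": {},
"source": [
"The parameters above assume a CUDA-capable GPU is present. As a hedged sketch for machines where that may not hold, the device could be chosen at runtime. The helper `pick_device` and the dictionary `params_demo` are hypothetical, and checking for the `nvidia-smi` executable is only a heuristic assumption: it shows that the NVIDIA driver tools are on the `PATH`, not that XGBoost was built with GPU support.\n",
"\n",
"```python\n",
"import shutil\n",
"\n",
"\n",
"def pick_device():\n",
"    # hypothetical heuristic: treat a visible nvidia-smi binary as a sign of a usable GPU\n",
"    return 'cuda' if shutil.which('nvidia-smi') else 'cpu'\n",
"\n",
"\n",
"# same keys as the notebook's params, with the device decided at runtime\n",
"params_demo = {'verbosity': 0, 'tree_method': 'hist', 'device': pick_device()}\n",
"params_demo.update({'eval_metric': 'auc', 'objective': 'binary:logistic'})\n",
"print(params_demo)\n",
"```\n",
"\n",
"Keeping `tree_method` as `hist` in both cases means only the `device` entry differs between a CPU run and a GPU run, which keeps timing comparisons like-for-like."
]
},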
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train Model\n",
"\n",
"Now it's time to train our model! We can use the `xgb.train` function and pass in the parameters, training dataset, the number of boosting iterations, and the list of items to be evaluated during training. For more information on the parameters that can be passed into `xgb.train`, check out the documentation:\n",
"\n",
"\n",
"https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# model training settings\n",
"evallist = [(dvalidation, 'validation'), (dtrain, 'train')]\n",
"num_round = 10"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2018-11-06T21:04:50.201308Z",
"start_time": "2018-11-06T21:04:00.363740Z"
}
},
"outputs": [],
"source": [
"%%time\n",
"\n",
"bst = xgb.train(params, dtrain, num_round, evallist)"
]
},
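{
"cell_type": "markdown",
"metadata": {},
"source": [
"When the simulated data is used, the labels are drawn independently of the features, so the validation AUC reported during training should hover around 0.5, while the training AUC may climb as the trees fit noise. As a sketch of what the `auc` metric measures (a plain-Python illustration, not XGBoost's implementation), AUC can be computed as the fraction of positive/negative pairs that the scores rank correctly, counting ties as half. The function name `pairwise_auc` is hypothetical.\n",
"\n",
"```python\n",
"def pairwise_auc(labels, scores):\n",
"    # hypothetical helper: O(P * N) pairwise definition of AUC, fine for small inputs\n",
"    pos = [s for lab, s in zip(labels, scores) if lab == 1]\n",
"    neg = [s for lab, s in zip(labels, scores) if lab == 0]\n",
"    if not pos or not neg:\n",
"        raise ValueError('AUC needs at least one positive and one negative label')\n",
"    wins = 0.0\n",
"    for p in pos:\n",
"        for n in neg:\n",
"            if p > n:\n",
"                wins += 1.0   # positive ranked above negative\n",
"            elif p == n:\n",
"                wins += 0.5   # ties count as half\n",
"    return wins / (len(pos) * len(neg))\n",
"\n",
"\n",
"# three of the four positive/negative pairs are ordered correctly\n",
"print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # prints 0.75\n",
"```\n",
"\n",
"A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why a value near 0.5 on the validation set is the expected outcome for randomly generated labels."
]
},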
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\n",
"\n",
"To learn more about RAPIDS, be sure to check out: \n",
"\n",
"* [Open Source Website](http://rapids.ai)\n",
"* [GitHub](https://github.com/rapidsai/)\n",
"* [Press Release](https://nvidianews.nvidia.com/news/nvidia-introduces-rapids-open-source-gpu-acceleration-platform-for-large-scale-data-analytics-and-machine-learning)\n",
"* [NVIDIA Blog](https://blogs.nvidia.com/blog/2018/10/10/rapids-data-science-open-source-community/)\n",
"* [Developer Blog](https://devblogs.nvidia.com/gpu-accelerated-analytics-rapids/)\n",
"* [NVIDIA Data Science Webpage](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}