pbdXGB

  • License: License
  • Download: Download
  • Status: Build status
  • Author: See the Authors section below.

In pbdXGB, we write programs in the "Single Program/Multiple Data" (SPMD) style, typically executed via pbdMPI and MPI. Each process (MPI rank) runs the same copy of the program as every other process, but operates on its own portion of the data. Results are collected, compared, and distributed to every process via MPI allreduce-like calls.
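As an illustration of the SPMD style, here is a minimal sketch using only pbdMPI (it is not part of pbdXGB): every rank runs the same script, computes a rank-specific local value, and allreduce() combines the local results across all ranks.

suppressMessages(library(pbdMPI, quietly = TRUE))
init()

### Each rank computes its own local value
local.sum <- sum((comm.rank() + 1) * 1:3)

### Combine the local results across all ranks via an allreduce
total <- allreduce(local.sum, op = "sum")
comm.print(total)  # printed once, from rank 0

finalize()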

Usage

Below is a basic program, similar to the original "xgboost" example, but written in the SPMD style using pbdMPI and MPI:

suppressMessages(library(pbdMPI, quietly = TRUE))
suppressMessages(library(pbdXGB, quietly = TRUE))
init()

### Commonly owned training data (replicated on every rank)
data(agaricus.train, package = 'xgboost')
train.all <- agaricus.train
train.mat.all <- as.matrix(train.all$data)
label.all <- train.all$label

### Locally owned distributed training data
train.rows <- get.jid(nrow(train.mat.all))
train.mat <- train.mat.all[train.rows, ]
label <- label.all[train.rows]

### Train the model from distributed training data
mdl <- xgboost(data = train.mat, label = label,
               max.depth = 2, eta = 1, nthread = 1,
               nrounds = 10, objective = 'binary:logistic',
               verbose = 0)
comm.print(mdl$evaluation_log$train_error, all.rank = TRUE)

### Train with xgboost::xgboost() on all training data
if (comm.rank() == 0){
  mdl.all <- xgboost::xgboost(data = train.mat.all, label = label.all,
                              max.depth = 2, eta = 1, nthread = 2,
                              nrounds = 10, objective = 'binary:logistic',
                              verbose = 0)
  print(mdl.all$evaluation_log$train_error)
}


### Commonly owned testing data (replicated on every rank)
data(agaricus.test, package = 'xgboost')
test.all <- agaricus.test
test.mat.all <- as.matrix(test.all$data)

### Locally owned distributed testing data
test.rows <- get.jid(nrow(test.mat.all))
test.mat <- test.mat.all[test.rows, ]

### Predict the distributed testing data
pmdl <- predict(mdl, test.mat)
comm.print(pmdl[1:5], all.rank = TRUE)  # First five only

### Predict all testing data
if (comm.rank() == 0){
  pmdl.all <- predict(mdl.all, test.mat.all)

  tmp.rows <- get.jid(nrow(test.mat.all), all = TRUE)
  print(lapply(tmp.rows, function(x) pmdl.all[x[1:5]]))
}

finalize()

Save the code in a file, say mpi_xgb.r, and run it with 2 MPI processes via:

mpirun -np 2 Rscript mpi_xgb.r
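If the distributed predictions are needed on every rank as a single vector, pbdMPI's allgather() can combine them. Below is a short sketch assuming the pmdl and test.mat.all objects from the example above; it would be placed before finalize() in the script.

### Gather the per-rank prediction vectors (one list element per rank)
### and flatten; rows were distributed in rank order by get.jid().
pmdl.gathered <- unlist(allgather(pmdl))
comm.print(length(pmdl.gathered))  # should equal nrow(test.mat.all)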

Installation

pbdXGB requires

  • R version 3.0.0 or higher
  • A system installation of MPI:
    • SUN HPC 8.2.1 (OpenMPI) for Solaris
    • OpenMPI for Linux
    • OpenMPI for Mac OS X
    • MS-MPI for Windows
  • pbdR/R packages:
    • pbdMPI
    • xgboost
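Assuming a system MPI library is already installed, one possible way to install the R-side dependencies from CRAN and pbdXGB itself from GitHub is sketched below (the remotes package is an assumption here, not a stated requirement):

install.packages(c("pbdMPI", "xgboost"))
remotes::install_github("RBigData/pbdXGB")  # development version from GitHub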

Authors

pbdXGB is authored and maintained by the pbdR core team:

  • Wei-Chen Chen
  • Drew Schmidt

With additional contributions from:

  • Tianqi Chen (xgboost implementation)
  • Tong He (xgboost implementation)
  • xgboost R package authors and contributors (some functions are modified from the xgboost package)
  • XGBoost contributors (base XGBoost implementation)
  • The R Core team (some functions are modified from R)
