This issue is intended as a roadmap tracker for progress in bringing xgboost's R interface up to date and discussions around these tasks and coordination.
From the previous tasks, I've made a list of potential tasks to take on. I might be missing some things, and I've put the biggest task (the new xgboost() function) under a single bullet point, although in practice it will likely involve multiple rounds of PRs. Please feel free to add more tasks to this list.
I've taken the liberty of classifying these issues in terms of whether they'd be blockers for releasing a new xgboost version or not, though some people might disagree with my assessments.
(Blocker) Enable categorical features for current DMatrix constructors (matrix, dgCMatrix, dgRMatrix).
Note: data.frame objects are lists of arrays that aren't necessarily stored in a single memory chunk, and whose columns can have types int (int32_t), double (float64), and potentially int64_t from package bit64.
I guess data.frame support and the first point could be done in the same PR, since they would likely touch similar code sections.
(Blocker) Enable multi-output input labels and predictions.
(Low priority) Add a mechanism to create a DMatrix object from arrow objects (from package "arrow"). Like for data frames, should automatically recognize categorical columns from the categorical arrow type.
Note: the idea here is to exploit functions that work directly on arrow format, without converting to base R arrays (which do not support all the arrow types) along the way.
Add an interface to create QuantileDMatrix objects from R, accepting the same kinds of inputs as DMatrix (data.frame, matrix, dgCMatrix, dgRMatrix, arrow if implemented, maybe float::float32), and also auto-recognizing categorical features for objects that have them (data frames and arrow tables).
(Low priority) Add methods to get additional info from DMatrix objects that are currently missing from the R package, such as get_quantile_cut (I guess this is just a call to XGDMatrixGetQuantileCut?).
(Blocker) Move more DMatrix parameters that reference data towards xgb.DMatrix() function arguments, such as qid, group, label_lower_bound, label_upper_bound, etc.
Potentially a good reference could be the DMatrix python class.
Switch the current DMatrix creation function for R matrices towards the C function that uses array_interface.
Switch the predict method for the current booster to use "inplace predict" or other more efficient DMatrix creators when appropriate.
(Blocker) Remove all the public interface (functions, docs, tests, examples) around the Booster.handle class, as well as the conversion methods from handle to booster and vice-versa, leaving only the booster for now.
(Blocker) After the task above is done, switch the handle serialization mechanism to ALTREP and remove xgb.Booster.complete, which wouldn't be needed anymore.
This increases the R requirement to >= 4.3, so it requires modifying the CI jobs to update them all to this version of R and drop the older ones.
(Low priority) Implement serialization for DMatrix handles through the same ALTREP system as above. This idea was discarded (thread)
(Blocker) Remove the current xgboost() function, and remove the calls from all the places it gets used (tests, examples, vignettes, etc.).
(Blocker) After support for data.frame and categorical features is added, create a new xgboost() function from scratch that shares no code with the current function of that name. Ideally it would work as a higher-level wrapper over DMatrix + xgb.train while implementing the kind of idiomatic R interface (x/y only, no formula) described in the earlier thread, either with a separate function for the parameters or with everything passed to the main function.
It should return objects of a different class than those returned by xgb.train (perhaps the class could be named "xgboost").
This class should have its own predict method, again with a different interface than the booster's predict, as described in the first message here.
If this class needs to keep additional attributes, perhaps they could be kept as part of the JSON that gets serialized; otherwise, there should be a note about serialization and transferability with other interfaces.
This is probably the largest PR in terms of code (especially tests!!), so it might need to be split into different batches. For example, support for custom objectives could be left out of the first PR.
(Blocker) After the new xgboost() x/y interface is implemented, modify other functions to accept these objects, e.g.:
Plotting function.
Feature importance function.
Serialization functions that are aimed at transferring models between interfaces.
All of these should keep in mind small details like base-1 indexing for tree numbers and similar.
(Blocker) Create examples and vignettes for the new xgboost() function.
(Low priority) Perhaps create a higher-level cv function for the new xgboost() interface.
Support creation of external memory objects with DataIter.
(Blocker) Enable quantile regression with multiple quantiles.
Switch the R package build system to CMake instead of autotools.
(Low priority) Distributed training, perhaps integration with RSpark.
Documentation and unified tests for 1-based indexing.
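To make the 1-based indexing items more concrete, here is a minimal sketch of the kind of conversion layer that unified tests could target. It is written in Python purely for illustration (the real code would live in the R package), every function name is hypothetical, and it assumes the core library addresses trees with 0-based indices and half-open iteration ranges.

```python
from typing import Tuple


def to_zero_based(tree_number: int, n_trees: int) -> int:
    """Convert a user-facing 1-based tree number to a 0-based internal index."""
    if not 1 <= tree_number <= n_trees:
        raise IndexError(f"tree number must be in 1..{n_trees}, got {tree_number}")
    return tree_number - 1


def to_one_based(internal_index: int) -> int:
    """Convert a 0-based internal index back to the 1-based number shown to users."""
    return internal_index + 1


def range_to_core(first: int, last: int, n_trees: int) -> Tuple[int, int]:
    """Map an inclusive 1-based range [first, last] to a half-open 0-based range.

    Assumption (hypothetical convention): the core expects [begin, end).
    """
    if first > last:
        raise ValueError("first must not exceed last")
    begin = to_zero_based(first, n_trees)
    end = to_zero_based(last, n_trees) + 1
    return begin, end
```

Keeping the conversion in one place would let plotting, feature importance, and serialization functions share the same checks instead of each re-implementing the off-by-one arithmetic.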
ref #9734
ref #9475
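As a rough illustration of the categorical-feature and data frame tasks in the list, the sketch below encodes a column-oriented table (one separate array per column, as in a data frame) into numeric columns plus a per-column feature type. It is Python used only for illustration, not the package's implementation. The function name and return layout are hypothetical, and it assumes categorical columns hold strings, that categories are encoded as 0-based codes, and that "c" and "q" mark categorical and numeric features.

```python
from typing import Dict, List, Sequence, Tuple, Union

Value = Union[int, float, str, None]
NAN = float("nan")


def encode_frame(
    frame: Dict[str, Sequence[Value]],
) -> Tuple[List[List[float]], List[str], Dict[str, List[str]]]:
    """Encode each column independently; columns need not share one memory block.

    Returns (encoded columns, feature types, category levels per column).
    Missing values (None) become NaN in both kinds of column.
    """
    encoded: List[List[float]] = []
    feature_types: List[str] = []
    levels: Dict[str, List[str]] = {}
    for name, column in frame.items():
        if any(isinstance(v, str) for v in column):
            # Categorical column (assumed to contain only strings and None):
            # sorted unique levels are mapped to 0-based codes.
            cats = sorted({v for v in column if v is not None})
            code_of = {cat: float(i) for i, cat in enumerate(cats)}
            encoded.append([NAN if v is None else code_of[v] for v in column])
            feature_types.append("c")
            levels[name] = cats
        else:
            # Numeric column: int and double values are both widened to float.
            encoded.append([NAN if v is None else float(v) for v in column])
            feature_types.append("q")
    return encoded, feature_types, levels


# Example input: one numeric column and one categorical column.
columns, types, levels = encode_frame(
    {"age": [31, None, 47], "color": ["red", "blue", None]}
)
```

Note that R factors store 1-based level codes, so an R implementation along these lines would also need to shift the codes if the core expects them to start at 0.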
[...] data.frame objects, automatically setting factor variables to be of categorical type in the DMatrix. (Support dataframe data format in native XGBoost. #9828)
[...] XGDMatrixNumNonMissing.
[...] XGDMatrixGetDataAsCSR.