diff --git a/en/ranking/xgboost.md b/en/ranking/xgboost.md
index 7bdf86488b..37adc9aa8f 100644
--- a/en/ranking/xgboost.md
+++ b/en/ranking/xgboost.md
@@ -5,18 +5,59 @@ redirect_from:
 - /en/xgboost
 ---
 
-Vespa supports importing Gradient Boosting Decision Tree (GBDT) models trained with XGBoost.
-
-{% include warning.html content="Vespa only supports XGBoost models trained with XGBoost version 1.5 or earlier. Using models trained with later versions of XGBoost may result in prediction mismatches between Vespa and XGBoost's native predict functions."%}
+Vespa supports importing Gradient Boosting Decision Tree (GBDT) models trained with
+[XGBoost](https://xgboost.readthedocs.io/).
 
 ## Exporting models from XGBoost
 
-Vespa supports importing XGBoost's JSON model dump, e.g. Python API
-[xgboost.Booster.dump_model](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster.dump_model).
-When dumping the trained model, XGBoost allows users to set the `dump_format` to `json`,
-and users can specify the feature names to be used in `fmap`.
+Vespa supports two XGBoost model formats: UBJ (recommended) and JSON (legacy).
+
+### UBJ format (recommended)
+
+{% include version.html version="8.656.31" %}
+
+The recommended way to export an XGBoost model for Vespa is using
+[`save_model()`](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster.save_model)
+with the `.ubj` (Universal Binary JSON) extension.
+UBJ has been the default XGBoost model format since XGBoost 2.1.0 and
+preserves all model information: tree structure, `base_score`, feature names, and objective.
+
+{% highlight python %}
+import xgboost as xgb
+import numpy as np
+
+# Train a model
+dtrain = xgb.DMatrix(np.random.rand(100, 2), label=np.random.randint(2, size=100),
+ feature_names=["feature_1", "feature_2"])
+param = {"max_depth": 2, "objective": "binary:logistic"}
+model = xgb.train(param, dtrain, num_boost_round=10)
+
+# Export as UBJ
+model.save_model("my_model.ubj")
+{% endhighlight %}
+
+{% include warning.html content='Do **not** use `save_model("model.json")` — this produces a different JSON structure
+(with a `learner` wrapper) that Vespa cannot parse. Only `dump_model()` with `dump_format="json"` is supported for the JSON path.' %}
+
+Since the UBJ format preserves the objective and `base_score`, Vespa accounts for `base_score` automatically
+(converted to its logit for logistic objectives). The final sigmoid is not applied for you; see [XGBoost objective types](#xgboost-objective-types).
+
+### JSON format (legacy)
-Here is an example of an XGBoost JSON model dump with 2 trees and maximum depth 1:
+Vespa also supports importing XGBoost's JSON model dump via
+[`dump_model()`](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.Booster.dump_model)
+with `dump_format="json"`.
+
+{% highlight python %}
+import xgboost as xgb
+
+dtrain = xgb.DMatrix("training-vectors.txt")
+param = {"base_score": 0, "max_depth": 1, "objective": "reg:squarederror"}
+bst = xgb.train(param, dtrain, 2)
+bst.dump_model("trained-model.json", fmap="feature-map.txt", with_stats=False, dump_format="json")
+{% endhighlight %}
+
+This produces a JSON array of tree objects:
{% highlight json %}
[
@@ -31,135 +72,242 @@ Here is an example of an XGBoost JSON model dump with 2 trees and maximum depth
]
{% endhighlight %}
-Notice the `split` attribute which represents the Vespa feature name. The `split` feature must resolve to a Vespa
-[rank feature](../reference/ranking/rank-features.html) defined in the [document schema](../basics/schemas.html). The feature can also
-be user defined features (for example using [functions](ranking-expressions-features.html#function-snippets)).
+The `split` attribute represents the Vespa feature name and must resolve to a Vespa
+[rank feature](../reference/ranking/rank-features.html) defined in the [document schema](../basics/schemas.html),
+or a user-defined [function](ranking-expressions-features.html#function-snippets).
-The above model JSON was produced using the XGBoost Python api with a regression objective:
+The training data is represented using [LibSVM text format](https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html).
+See also a complete [XGBoost training notebook](https://github.com/vespa-engine/sample-apps/blob/master/commerce-product-ranking/notebooks/Train-xgboost.ipynb) using a ranking objective.
-{% highlight python %}
-#!/usr/local/bin/python3
-import xgboost as xgb
+{% include warning.html content='`dump_model()` JSON does **not** preserve `base_score`.
+Set `base_score=0` during training, or accept that Vespa predictions will be offset.
+For logistic objectives, you must manually wrap the expression in `sigmoid()` (see [Objective types](#xgboost-objective-types)).' %}
-dtrain = xgb.DMatrix('training-vectors.txt')
-param = {'base_score':0, 'max_depth':1,'objective':'reg:squarederror'}
-bst = xgb.train(param, dtrain, 2)
-bst.dump_model("trained-model.json",fmap='feature-map.txt', with_stats=False, dump_format='json')
-{% endhighlight %}
+## Feature mappings from XGBoost to Vespa
-The training data is represented using [LibSVM text format](https://xgboost.readthedocs.io/en/latest/tutorials/input_format.html).
-See also a complete [XGBoost training notebook](https://github.com/vespa-engine/sample-apps/blob/master/commerce-product-ranking/notebooks/Train-xgboost.ipynb) using `ranking` objective.
+Model feature names must map to Vespa [rank features](../reference/ranking/rank-features.html).
+The mapping method depends on the model format.
+
+### UBJ feature mapping
+
+For UBJ models, place a features file named `<model-name>-features.txt` alongside the model file,
+listing the feature names one per line:
+
+feature_1
+feature_2
+feature_3
+
+For a model file named `my_model.ubj`, the features file must be named `my_model-features.txt`.
+
+Then define rank profile [functions](ranking-expressions-features.html#function-snippets)
+that match the feature names and map them to Vespa document attributes or query features:
+
+schema my_app {
+    document my_app {
+        field price type double {
+            indexing: summary | attribute
+        }
+        field popularity type double {
+            indexing: summary | attribute
+        }
+    }
+    rank-profile my_rank_profile inherits default {
+        function feature_1() {
+            expression: attribute(price)
+        }
+        function feature_2() {
+            expression: attribute(popularity)
+        }
+        function feature_3() {
+            expression: query(user_context)
+        }
+        first-phase {
+            expression: xgboost("my_model.ubj")
+        }
+    }
+}
+
+
+If the model was trained with feature names that are valid Vespa rank features
+(e.g. `attribute(price)`), the functions are not needed — Vespa resolves them directly.
+
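As a rough sketch of the naming rule above, the features file could be written from Python. The helper name `write_features_file` is hypothetical (it is not part of XGBoost or Vespa). The one-name-per-line layout follows the example above; listing the names in training column order is an assumption.

```python
import tempfile
from pathlib import Path


def write_features_file(model_path, feature_names):
    """Write '<model stem>-features.txt' next to the model file (hypothetical helper)."""
    model_path = Path(model_path)
    # my_model.ubj -> my_model-features.txt
    features_path = model_path.with_name(model_path.stem + "-features.txt")
    # One feature name per line (assumed: same order as the training columns)
    features_path.write_text("\n".join(feature_names) + "\n")
    return features_path


# Demo in a temporary directory so nothing is written into a real application package
models_dir = Path(tempfile.mkdtemp())
features_path = write_features_file(models_dir / "my_model.ubj",
                                    ["feature_1", "feature_2", "feature_3"])
print(features_path.name)
```

In a real export, the list would typically come from the trained booster, for example `model.feature_names`, and the target directory would be the application package's `models` directory.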
+### JSON feature mapping
+
+When using `dump_model()`, XGBoost names features by array index (`f0`, `f1`, ...) unless a feature map file (`fmap`) is provided.
+The `fmap` maps feature indices to named Vespa features:
-## Feature mappings from XGBoost to Vespa
-XGBoost is trained on array or array like data structures
-where features are named based on the index in the array as in the example above.
-To convert the XGBoost features we need to map feature indexes to actual Vespa features
-(native features or custom defined features):
-
-$ cat feature-map.txt |egrep "fieldMatch\(title\).completeness|fieldMatch\(title\).importance"
+$ cat feature-map.txt | egrep "fieldMatch\(title\).completeness|fieldMatch\(title\).importance"
 36 fieldMatch(title).completeness q
 39 fieldMatch(title).importance q
 
-In the feature mapping example, feature at index 36 maps to
+
+In this example, feature at index 36 maps to
 [fieldMatch(title).completeness](../reference/ranking/rank-features.html#fieldMatch(name).completeness)
 and index 39 maps to
 [fieldMatch(title).importance](../reference/ranking/rank-features.html#fieldMatch(name).importance).
-The feature mapping format is not well described in the XGBoost documentation,
-
-but the [sample demo for binary classification](https://github.com/dmlc/xgboost/tree/master/demo) writes:
-Format of ```feature-map.txt: <featureid> <featurename> <q or i or int>\n ```:
- - "Feature id" must be from 0 to number of features, in sorted order.
- - "i" means this feature is binary indicator feature
- - "q" means this feature is a quantitative value, such as age, time, can be missing
- - "int" means this feature is integer value (when int is hinted, the decision boundary will be integer)
+Format of `feature-map.txt: <featureid> <featurename> <q or i or int>\n`:
+ - Feature id must be from 0 to number of features, in sorted order
+ - `i` means this feature is a binary indicator feature
+ - `q` means this feature is a quantitative value, such as age, time, can be missing
+ - `int` means this feature is an integer value (when int is hinted, the decision boundary will be integer)
 
-When using `pandas``DataFrame`'s with columns names, one does not need to provide feature mappings.
-
-See also a complete example of how to train a ranking function, using learning to rank
-with ranking losses, in this
-[notebook](https://github.com/vespa-engine/sample-apps/blob/master/commerce-product-ranking/notebooks/Train-xgboost.ipynb).
+When using Pandas `DataFrame`s with column names, the feature names are embedded directly in the JSON dump
+and a feature map file is not needed.
 
 ## Importing XGBoost models
 
-To import the XGBoost model to Vespa, add the directory containing the
-model to your application package under a specific directory named `models`.
-For instance, if you would like to call the model above as `my_model`,
-you would add it to the application package resulting in a directory structure like this:
+To import an XGBoost model, add the model file to your application package
+under the `models` directory.
+For UBJ models, also include the corresponding `-features.txt` file:
 
 ├── models
-│   └── my_model.json
+│   ├── my_model.ubj
+│   ├── my_model-features.txt
+│   └── legacy_model.json
 ├── schemas
-│   └── main.sd
+│   └── main.sd
 └── services.xml
 
 An application package can have multiple models.
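For the JSON path, a feature map in the `feature-map.txt` format described under "JSON feature mapping" can be generated instead of edited by hand. The sketch below is illustrative only: `write_feature_map` is a hypothetical helper, the feature list is made up, and the tab delimiter is an assumption about what `dump_model()` accepts, so check it against your XGBoost version.

```python
import tempfile
from pathlib import Path

# Type codes from the feature map format: indicator, quantitative, integer
VALID_TYPES = {"i", "q", "int"}


def write_feature_map(path, features):
    """Write '<id> <name> <type>' lines; ids are 0..n-1 in list order (hypothetical helper)."""
    lines = []
    for index, (name, ftype) in enumerate(features):
        if ftype not in VALID_TYPES:
            raise ValueError(f"unknown feature type {ftype!r} for {name}")
        # Tab-separated fields (assumed delimiter)
        lines.append(f"{index}\t{name}\t{ftype}")
    Path(path).write_text("\n".join(lines) + "\n")
    return lines


# Demo: column order here must match the column order of the training data
fmap_path = Path(tempfile.mkdtemp()) / "feature-map.txt"
fmap_lines = write_feature_map(fmap_path, [
    ("fieldMatch(title).completeness", "q"),
    ("fieldMatch(title).importance", "q"),
    ("nativeRank", "q"),
])
print("\n".join(fmap_lines))
```

The resulting file can then be passed as `fmap` to `dump_model()`, as in the JSON export example.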
-
-
 ## Ranking with XGBoost models
 
 Vespa has a `xgboost` [ranking feature](../reference/ranking/rank-features.html).
 This ranking feature specifies the model to use in a ranking expression.
-Consider the following example:
+Both UBJ and JSON models use the same ranking feature:
 
-schema xgboost {
+schema my_app {
     rank-profile prediction inherits default {
         first-phase {
-            expression: nativeRank
+            expression: nativeRank
         }
         second-phase {
-            expression: xgboost("my_model.json")
+            expression: xgboost("my_model.ubj")
         }
     }
 }
 
-Here, we specify that the model `my_model.json` is applied to the top ranking documents by the first-phase ranking expression.
-The query request must specify `prediction` as the [ranking.profile](../reference/api/query.html#ranking.profile).
+Here, we specify that the model `my_model.ubj` is applied to the top ranking documents
+by the first-phase ranking expression.
+The query request must specify `prediction` as the [ranking.profile](../reference/api/query.html#ranking.profile).
 See also [Phased ranking](phased-ranking.html) on how to control number of data points/documents which is exposed to the model.
 
 Generally the run time complexity is determined by:
 
 * The number of documents evaluated [per thread](../performance/sizing-search.html) / number of nodes and the query filter
-* The complexity of computing features. For example `fieldMatch` features are 100x more expensive that `nativeFieldMatch/nativeRank`.
-* The number of XGboost trees and the maximum depth per tree
+* The complexity of computing features. For example `fieldMatch` features are 100x more expensive than `nativeFieldMatch/nativeRank`.
+* The number of XGBoost trees and the maximum depth per tree
+
+Serving latency can be brought down by [using multiple threads per query request](../performance/practical-search-performance-guide.html#multithreaded-search-and-ranking).
+
+## Categorical features
+
+{% include warning.html content="Vespa does **not** support XGBoost's native categorical splits
+(`enable_categorical=True`). Deploying a model with native categorical splits will **silently produce
+wrong predictions**, because Vespa interprets the categorical split condition as a numerical threshold." %}
+
+To use categorical features with XGBoost models in Vespa, integer-encode them before training:
+
+{% highlight python %}
+import xgboost as xgb
+import pandas as pd
+
+# Assumes df (a pandas DataFrame), feature_cols and targets are already defined
+# Integer-encode categorical features
+category_map = {"small": 0, "medium": 1, "large": 2}
+df["size"] = df["size_raw"].map(category_map).astype(float)
+
+# Train without enable_categorical; XGBoost uses numerical splits on the integers
+dtrain = xgb.DMatrix(df[feature_cols], label=targets)
+param = {"max_depth": 4, "objective": "binary:logistic"}
+model = xgb.train(param, dtrain, num_boost_round=100)
+model.save_model("my_model.ubj")
+{% endhighlight %}
+
+In the Vespa schema, store integer-encoded categoricals as `double` attributes
+and map them via rank profile functions like any other numerical feature.
+
+Note: Vespa's [LightGBM](lightgbm.html) importer does support native categorical splits.
+
+## XGBoost objective types
+
+Vespa can import XGBoost models trained with any
+[objective](https://xgboost.readthedocs.io/en/stable/parameter.html#learning-task-parameters).
+Common objectives include:
+
+* Regression `reg:squarederror` / `reg:logistic`
+* Classification `binary:logistic`
+* Ranking `rank:pairwise`, `rank:ndcg` and `rank:map`
+
+Vespa evaluates XGBoost models by summing the tree outputs.
+The only objective-specific behavior is for logistic objectives (`reg:logistic` and `binary:logistic`),
+where the raw tree sum must be passed through a sigmoid function to produce a probability.
+
+### UBJ models
+
+For UBJ models, Vespa reads the objective from the model file.
+For logistic objectives, the `base_score` is automatically transformed (logit)
+so the raw (pre-sigmoid) model output matches XGBoost's margin without manual adjustment:
+
+schema my_app {
+    rank-profile classify inherits default {
+        first-phase {
+            expression: xgboost("my_classifier.ubj")
+        }
+    }
+}
+
+Note that UBJ does not automatically apply a sigmoid to the final output.
+For logistic objectives, wrap the expression in `sigmoid()` if you need a probability:
+
+schema my_app {
+    rank-profile classify inherits default {
+        first-phase {
+            expression: sigmoid(xgboost("my_classifier.ubj"))
+        }
+    }
+}
+
-Serving latency can be brought down by [using multiple threads per query request](../performance/practical-search-performance-guide.html#multithreaded-search-and-ranking).
+For ranking objectives and `reg:squarederror`, the raw tree sum can be used directly.
 
-## XGBoost models
-There are six different [objective](https://xgboost.readthedocs.io/en/stable/parameter.html#learning-task-parameters)
-types that Vespa supports:
+### JSON models
 
-* Regression ```reg:squarederror``` / ```reg:logistic```
-* Classification ```binary:logistic```
-* Ranking ```rank:pairwise```, ```rank:ndcg``` and ```rank:map```
+For JSON models exported with `dump_model()`, the objective and `base_score` are **not** preserved.
 
-For `reg:logistic` and `binary:logistic` the raw margin tree sum (Sum of all trees)
-needs to be passed through the sigmoid function to represent the probability of class 1.
-For regular regression the model can be directly imported
-but the `base_score` should be set 0 as the `base_score` used during the training phase is not dumped with the model.
+For `reg:logistic` and `binary:logistic`, the raw margin tree sum
+needs to be passed through the [sigmoid function](../reference/ranking/ranking-expressions.html)
+to represent the probability of class 1.
+For regression, the model can be directly imported
+but `base_score` should be set to 0 during training as it is not included in the dump.
 
-An example model using the sklearn toy datasets is given below:
+An example using the sklearn toy datasets:
 
 {% highlight python %}
 from sklearn import datasets
 import xgboost as xgb
 breast_cancer = datasets.load_breast_cancer()
-c = xgb.XGBClassifier(n_estimators=20, objective='binary:logistic')
-c.fit(breast_cancer.data,breast_cancer.target)
-c.get_booster().dump_model("binary_breast_cancer.json", fmap='feature-map.txt', dump_format='json')
-c.predict_proba(breast_cancer.data)[:,1]
+c = xgb.XGBClassifier(n_estimators=20, objective="binary:logistic")
+c.fit(breast_cancer.data, breast_cancer.target)
+c.get_booster().dump_model("binary_breast_cancer.json", fmap="feature-map.txt", dump_format="json")
+c.predict_proba(breast_cancer.data)[:, 1]
 {% endhighlight %}
 
-To represent the ```predict_proba``` function of XGBoost for the binary classifier in Vespa,
-we need to use the [sigmoid function](../reference/ranking/ranking-expressions.html):
+To represent the `predict_proba` function of XGBoost for the binary classifier in Vespa,
+use the [sigmoid function](../reference/ranking/ranking-expressions.html):
 
-schema xgboost {
+schema my_app {
     rank-profile prediction-binary inherits default {
         first-phase {
             expression: sigmoid(xgboost("binary_breast_cancer.json"))
@@ -168,20 +316,39 @@ schema xgboost {
 }
 
-## Debugging Vespa inference score versus XGBoost predict score
-
-* When dumping XGBoost models to a JSON representation some of the model information is lost
-  (e.g. the `base_score` or the optimal number of trees if trained with early stopping).
+When the `base_score` is not the default (0.5), the sigmoid alone is insufficient.
+The full formula accounting for `base_score` is:
+
+schema my_app {
+    rank-profile prediction-binary inherits default {
+        constants {
+            base_score: 0.5
+        }
+        first-phase {
+            expression: 1.0 / (1.0 + (1.0 - base_score) / base_score * exp(-(xgboost("binary_breast_cancer.json"))))
+        }
+    }
+}
+
+Replace `0.5` with the actual `base_score` used during training.
+See the [XGBoost System Test](https://github.com/vespa-engine/system-test/tree/master/tests/search/xgboost) for a complete working example.
+
+## Debugging Vespa inference score versus XGBoost predict score
+
+* For JSON models, the `base_score` and optimal number of trees (if trained with early stopping) are lost in the dump.
+  UBJ models preserve this information.
   XGBoost also has different predict functions (e.g. predict/predict_proba).
   The following [XGBoost System Test](https://github.com/vespa-engine/system-test/tree/master/tests/search/xgboost)
-  demonstrates how to represent different type of XGBoost models in Vespa.
+  demonstrates how to represent different types of XGBoost models in Vespa.
 * For training, features should be scraped from Vespa, using either `match-features` or `summary-features` so
   that features from offline training matches the online Vespa computed features.
   Dumping features can also help debug any differences by zooming into specific query,document pairs
-  using [recall](../reference/api/query.html#recall) parameter.
+  using [recall](../reference/api/query.html#recall) parameter.
 * It's also important to use the highest possible precision
-  when reading Vespa features for training as Vespa outputs features using `double` precision.
-  If the training routine rounds features to `float` or other more compact floating number representations, feature split decisions might differ in Vespa versus XGboost.
-* In a distributed setting when multiple nodes uses the model, text matching features such as `nativeRank`, `nativFieldMatch`, `bm25` and `fieldMatch`
-  might differ, depending on which node produced the hit. The reason is that all these features use [term(n).significance](../reference/ranking/rank-features.html#query-features), which is computed locally indexed corpus. The `term(n).significance` feature
+  when reading Vespa features for training as Vespa outputs features using `double` precision.
+  If the training routine rounds features to `float` or other more compact floating number representations, feature split decisions might differ in Vespa versus XGBoost.
+* In a distributed setting when multiple nodes use the model, text matching features such as `nativeRank`, `nativeFieldMatch`, `bm25` and `fieldMatch`
+  might differ, depending on which node produced the hit. The reason is that all these features use [term(n).significance](../reference/ranking/rank-features.html#query-features), which is computed from the locally indexed corpus. The `term(n).significance` feature
   is related to *Inverse Document Frequency (IDF)*. The `term(n).significance` should be set by a searcher in the container for global correctness as each node will estimate the significance values from the local corpus.
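When chasing a score mismatch for a JSON model with a logistic objective, the `base_score` correction from the JSON models section can be checked offline. The sketch below uses plain Python with hypothetical helper names and an arbitrary tree sum. It illustrates that the formula is the same as applying a sigmoid to the raw tree sum shifted by `logit(base_score)`, and that it reduces to a plain sigmoid when `base_score` is 0.5.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def probability_from_tree_sum(tree_sum, base_score):
    """Same arithmetic as the rank-profile expression that corrects for base_score."""
    return 1.0 / (1.0 + (1.0 - base_score) / base_score * math.exp(-tree_sum))


tree_sum = 0.8  # arbitrary raw tree sum, standing in for xgboost("...") in Vespa
for base_score in (0.5, 0.25, 0.9):
    corrected = probability_from_tree_sum(tree_sum, base_score)
    # Equivalent view: shift the tree sum by logit(base_score), then apply sigmoid
    shifted = sigmoid(tree_sum + logit(base_score))
    assert math.isclose(corrected, shifted)
    print(base_score, corrected)
```

Comparing such a value against `predict_proba` for a single document, using the feature values Vespa returned for it, narrows a mismatch down to either the features or the `base_score` handling.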