Building an ML Inference API, Part IV

Background

By the time the FastAPI inference service from Part II was in production, the .NET side around it had all the usual machinery: HTTP clients, retries, timeouts, circuit breakers, tracing. All necessary, but still plumbing. And in some downstream systems, even a clean network call was too expensive.

Most of what we were serving was LightGBM regression. Training, feature engineering, and retraining schedules all belonged in Python. Inference itself was just trees, which means comparisons and additions. There's no reason that has to sit behind another network call.

So I started experimenting with translating trained LightGBM models directly into C#. Then callers could invoke a method instead of making an HTTP request, and the runtime path lost a few things:

the network hop
retry logic around inference
the Python process
a separate service to scale

For regression, the mapping from tree structure to code is almost literal. You generate C# source from the model, compile it into a DLL, and do inference natively inside .NET with no model file at runtime. The catch is that giant generated if/else trees get ugly fast, and at some point it's cleaner to stop generating so much branching and represent the trees as data instead, then walk them with a small runtime. That ended up being the approach I preferred.

This post walks through both:

Why trees can be represented directly as code
How LightGBM trees map to C#
A tiny concrete example
Generating the if/else version
Where that starts to break down
An evaluator that treats the model as data
Extending the same idea to classification

Regression first, because it makes the idea easiest to see.

A regression tree is already code

One mental shift helps a lot:

A decision tree is not really something you translate into if/else. It already is if/else.

Take a toy tree:

If strength_diff <= -2.11
- predict 0.3145
Else
- predict 0.7237

That is literally:

if (features[5] <= -2.11)
    return 0.3145;
else
    return 0.7237;

That is not an approximation. That is the model.

And that’s why tree models are unusually portable compared with a lot of other ML models.

Boosting is just many trees adding corrections

A LightGBM regression model is an ensemble:

prediction =
    tree1(x)
  + tree2(x)
  + tree3(x)
  + ...

Each tree contributes a little correction.

In code:

public static double Predict(double[] f)
{
    double score = 0;

    score += Tree0(f);
    score += Tree1(f);
    score += Tree2(f);

    return score;
}

Each TreeN is nested branching logic.

That is the whole inference loop for regression.

Reading a LightGBM tree

A tree dump might look something like:

split_feature: strength_diff
threshold: -1.18

left_child:
   split_feature: minute
   threshold: 15
   left_child: leaf=0.48
   right_child: leaf=0.62

right_child:
   leaf=1.03

Which maps directly to:

if (strengthDiff <= -1.18)
{
    if (minute <= 15)
        return 0.48;
    else
        return 0.62;
}
else
{
    return 1.03;
}

Pretty much one to one:

LightGBM	C#
split_feature	feature index
threshold	comparison
left child	if branch
right child	else branch
leaf value	returned value
ensemble	sum of trees

Once that clicks, everything else follows naturally.

Tiny example

Suppose a model has:

Tree=0
num_leaves=3
split_feature=5 7
threshold=-1.18 15
left_child=1 -1
right_child=2 -2
leaf_value=0.48 0.62 1.03

Generated C#:

static double Tree0(double[] f)
{
    if (f[5] <= -1.18)
    {
        if (f[7] <= 15)
            return 0.48;
        else
            return 0.62;
    }

    return 1.03;
}

That is the tree.

If you have 300 trees, generate 300 methods. And this is surprisingly workable.

Verifying inference

Using one row from my model:

var result = Model.Predict([
1.2699999809265137,
3.380000114440918,
0.7300000190734863,
1.6299999952316284,
4.650000095367432,
-2.1100001335144043,
1.0,
28.0,
0.0,
0.0,
2.0,
2.0,
-2.0,
0.0,
2.0,
2.0,
-2.0,
7.0,
-0.40880000591278076,
1.0871999263763428,
427465.0
]);

Expected:

0.31458735

Generated predictor matched exactly.

Generating the C#

This should absolutely be generated. Do not hand write trees. For any real model, it gets impossible very quickly. I used Python to generate the C# code, but the language of the exporter does not matter much.

import lightgbm as lgb

booster = lgb.Booster(model_file="model.txt")
model_dump = booster.dump_model()

Trees live in:

model_dump["tree_info"]

Recursive emitter:

def emit_node(node, indent=1):
    pad = "    " * indent

    if "leaf_value" in node:
        return f"{pad}return {node['leaf_value']};\n"

    feature = node["split_feature"]
    threshold = node["threshold"]

    code = []

    code.append(
        f"{pad}if (features[{feature}] <= {threshold})\n"
        f"{pad}{{\n"
    )

    code.append(emit_node(node["left_child"], indent+1))

    code.append(
        f"{pad}}}\n"
        f"{pad}else\n"
        f"{pad}{{\n"
    )

    code.append(emit_node(node["right_child"], indent+1))

    code.append(f"{pad}}}\n")

    return "".join(code)

Generate methods:

for i, tree in enumerate(model_dump["tree_info"]):
    print(f"static double Tree{i}(double[] features)")
    print("{")
    print(emit_node(tree["tree_structure"]))
    print("}")

The basic version is quite small.

Why I liked this

Native inference

The generated predictor does not need a Python runtime or a LightGBM dependency. It is just .NET code.

Speed

Inference becomes a small set of cheap operations:

comparisons
branches
additions

Deployment

The model becomes source. Ship a DLL and you've shipped the model.

Debugging

You can inspect actual decision paths, which is useful in pricing or risk sensitive systems.

Where it starts breaking down

Say:

500 trees
depth 8
~255 nodes each

That’s potentially:

127,500 node checks

Now generated code gets huge.

You start getting:

giant files
ugly diffs
slower compile times
questionable JIT behavior

It still works. For example, one of my models had 150 trees and 25 features, and the generated C# was about 130k lines. If you use Rider or another IDE that parses C#, you will feel it slow down.

That pushed me toward a better representation.

Represent the model as data

Instead of generating giant branch forests, export the model the way it already exists internally:

feature arrays
thresholds
child pointers
leaf values

Then use a tiny evaluator:

private static double Eval(int node, ReadOnlySpan<double> f)
{
    while (true)
    {
        if (IsLeaf[node])
            return Value[node];

        if (f[Feature[node]] <= Threshold[node])
            node = Left[node];
        else
            node = Right[node];
    }
}

Boosting:

public static double Predict(ReadOnlySpan<double> f)
{
    double score = 0;

    for (int i = 0; i < TreeCount; i++)
        score += Eval(Roots[i], f);

    return score;
}

Still exact inference. Just much cleaner.

Why I ended up preferring this

Compared with giant generated if/else:

much smaller generated source
one evaluator method
cleaner diffs
easier codegen
friendlier for JIT

And source size grows mostly with model data, not duplicated branching syntax.

Classification extends naturally

Binary classification

Same trees. Usually sum the logits, then apply sigmoid:

probability = sigmoid(sum)

static double Sigmoid(double x)
{
   return 1 / (1 + Math.Exp(-x));
}

Then threshold. Same traversal, different output transform.

Multiclass

Often:

num_classes × boosting_rounds trees

Accumulate per class:

double[] scores = new double[3];

scores[0] += Tree0(f);
scores[1] += Tree1(f);
scores[2] += Tree2(f);

Then softmax:

static double[] Softmax(double[] x)
{
    var exp = x.Select(Math.Exp).ToArray();
    var sum = exp.Sum();
    return exp.Select(v => v / sum).ToArray();
}

Same trees, different aggregation.

Validate everything

Before optimizing, verify generated inference against the original model. Use multiple test cases, not just one row.

Assert.True(
    diff < 1e-4,
    $"Row {i} mismatch. Expected={expected}, Actual={actual}"
);

What's in a LightGBM dump, and what makes it into C#

Here is one real LightGBM tree from a text dump. Arrays are shortened for readability, but the shape is the same:

Tree=1
num_leaves=5
split_feature=11 2 17 18
split_gain=437375 65423.8 30686.3 15052.8
threshold=1.0000000180025095e-35 0.94499999284744274 15.500000000000002 0.47699999809265142
decision_type=2 2 2 2
left_child=1 -1 -2 -3
right_child=2 3 -4 -5
leaf_value=0.0017963732109250652 -0.0029226866697364827 0.012002031587390959 -0.0066260868638778623 0.0074389293400878229
leaf_weight=419486 548345 401639 1175784 300693
leaf_count=419486 548345 401639 1175784 300693
internal_value=3.11874e-06 0.00803391 -0.00396084 0.00239001
internal_weight=5.49575e+06 1.8162e+06 3.67954e+06 617181
internal_count=5495746 1816203 3679543 617181
is_linear=0
shrinkage=0.02

The dump contains more than the C# runtime needs. The generated code keeps the fields used to walk the tree and return a prediction: split_feature, threshold, left_child, right_child, and leaf_value. A few other fields are useful to understand, but they do not show up in the final arrays.

shrinkage is the tree learning rate. In many LightGBM text dumps, the shrinkage has already been folded into leaf_value, so the C# predictor can just add the tree result directly:

score += Eval(Roots[i], f);

If shrinkage were not already folded into the leaves, the runtime would need to multiply each tree output by a per tree shrinkage value. For the dumped models I was working with, ignoring the explicit shrinkage field was correct because the leaf values already contained it.

is_linear tells you whether the tree uses normal constant leaves or linear leaves. The exporter assumes is_linear=0, where each leaf returns one scalar value. That matches the usual LightGBM tree:

if feature <= threshold:
    return leaf_value

If is_linear=1, each leaf contains a small linear model instead of a single number. That needs a different inference engine. This exporter does not support that case.

internal_value is the value stored at an internal split node. You can think of it as the prediction at that point if the tree stopped there. It is useful for diagnostics, but inference does not return from internal nodes, so the C# code does not need it.

internal_weight is the weighted amount of training data that reached an internal node. LightGBM uses it while training for split decisions, regularization, and pruning. Once the tree is trained, inference only needs to know which branch to take.

leaf_weight is similar, but for a leaf. It tells you how much weighted training data ended up in that leaf. It can be useful when inspecting the model, but prediction only needs leaf_value.

So the flat C# representation is intentionally small:

Feature
Threshold
Left
Right
Value
IsLeaf
Roots

That is basically a tiny runtime for executing tree bytecode. The original LightGBM dump has training metadata too, but the generated predictor only carries what it needs to reproduce inference.

The complete code

The full exporter is on GitHub: haiilong/export_lgbm_universal_cs.