Memory leak still present in VW 9.11 C# bindings #4900

@michael-celani

Description

Describe the bug

It looks like VW still leaks memory in its C# bindings on the latest release.

My batch training method looks like this:

private async Task TrainBatchAsync(
        SlimDeckModel[] batch, 
        Dictionary<ColorIdentity, Dictionary<Guid, ColorIdentityCardCounts>> popularities,
        CancellationToken stoppingToken)
    {
        // Force a full, compacting collection at the start of each batch so
        // managed-heap growth can be ruled out when reading the memory graph.
        GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
        GC.Collect(GC.MaxGeneration, GCCollectionMode.Aggressive);
        GC.WaitForPendingFinalizers();

        Stopwatch stopwatch = new();

        // Always write the model back to the same path with -f; load it with
        // -i only if a previous save exists.
        var exists = File.Exists(Options.Value.VowpalWabbitPath);
        var args = Options.Value.VowpalWabbitArgs + $" -f {Options.Value.VowpalWabbitPath}";
        if (exists)
        {
            Logger.LogInformation("Existing model found, loading.");
            args += $" -i {Options.Value.VowpalWabbitPath}";
        }
        else
        {
            Logger.LogInformation("No existing model found, starting fresh.");
        }

        // A fresh native VW instance per batch, disposed when this method returns.
        using var model = new VowpalWabbit(args);

        foreach (var deck in batch)
        {
            var ciCounts = popularities[deck.ColorIdentity];

            if (deck.LastUpdated > UpdatedTime) UpdatedTime = deck.LastUpdated;

            stopwatch.Restart();
            var context = new RecommenderContext(model, deck, ciCounts);
            context.Learn(model, ciCounts);
            stopwatch.Stop();

            stoppingToken.ThrowIfCancellationRequested();

            TotalMs += stopwatch.ElapsedMilliseconds;
            Count++;

            if (Count % 1000 == 0 || Count == Decks.Count)
                LogProgressUpdate(Count, AvgMs, TotalDecks,
                    TimeSpan.FromMilliseconds(AvgMs * (TotalDecks - Count)));
        }

        Logger.LogInformation("Saving model after pass.");
        model.EndOfPass();
    }

For additional context, this is the relevant code of RecommenderContext:

using Celani.Magic.Model;
using Celani.Magic.Tools.Common.Extensions;
using MathNet.Numerics;
using MathNet.Numerics.Interpolation;
using System.Collections.Frozen;
using VW;
using VW.Labels;

namespace Celani.Magic.Tools.Learning;

public class RecommenderContext
{
    public void Learn(VowpalWabbit vw, Dictionary<Guid, ColorIdentityCardCounts> colorIdentityCounts)
    {
        HashSet<Guid> usedCardIds = [];

        // Learn from the deck:
        foreach (var (id, hash, value, popularity) in CardHashes)
        {
            usedCardIds.Add(id);

            var weight = Math.Pow(1.0 - popularity, 0.5);

            var label = new SimpleLabel
            {
                Label = 1,
                Weight = (float) Math.Clamp(weight, 0.05, 1.0)
            };

            using var example = BuildExample(vw, id, label);
            vw.Learn(example);
        }

        // Take popular cards from the same color identity that haven't been used yet:
        var popularCi = colorIdentityCounts
            .Take(1000)
            .Where(cc => !usedCardIds.Contains(cc.Key))
            .Shuffle(cc => Math.Pow(cc.Value.Count, 0.75))
            .Select(cc => cc.Key)
            .Take(CardHashes.Length)
            .ToList();

        foreach (var cardId in popularCi)
        {
            usedCardIds.Add(cardId);

            var label = new SimpleLabel
            {
                Label = -1.0f,
                Weight = 1.0f
            };

            using var example = BuildExample(vw, cardId, label);
            vw.Learn(example);
        }

        var randomCi = colorIdentityCounts
            .Where(cc => !usedCardIds.Contains(cc.Key))
            .Shuffle(cc => 1.0)
            .Select(cc => cc.Key)
            .Take(CardHashes.Length / 4)
            .ToList();

        foreach (var cardId in randomCi)
        {
            usedCardIds.Add(cardId);

            var label = new SimpleLabel
            {
                Label = -1.0f,
                Weight = 1.0f
            };

            using var example = BuildExample(vw, cardId, label);
            vw.Learn(example);
        }
    }

    private VowpalWabbitExample BuildExample(VowpalWabbit vw, Guid candidate, ILabel? label = null)
    {
        using var exampleBuilder = new VowpalWabbitExampleBuilder(vw);

        using (var ns = exampleBuilder.AddNamespace(VWHashes.AllCardsNamespace))
        {
            // Deck cards:
            foreach (var (id, hash, value, _) in CardHashes)
            {
                if (id == candidate) continue;
                ns.AddFeature(hash, value);
            }

            var commanderWeight = CardHashes.Length > 90 ? 0.1f : 
                (float) Interpolation.Interpolate(CardHashes.Length);

            // Commander cards:
            foreach (var (id, hash) in CommanderHashes)
            {
                if (id == candidate) continue;
                ns.AddFeature(hash, commanderWeight);
            }
        }

        // Card to predict:
        using (var ns = exampleBuilder.AddNamespace(VWHashes.CardNamespace))
        {
            ns.AddFeature(vw.HashFeature($"c_{candidate}", VWHashes.CardNamespaceHash), 1);
        }

        if (label is not null)
        {
            exampleBuilder.ApplyLabel(label);
        }

        return exampleBuilder.CreateExample();
    }
}
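
Shuffle above is a custom weighted-shuffle extension from Celani.Magic.Tools.Common.Extensions. Its exact implementation shouldn't matter for this issue, but for completeness, a minimal sketch of the behavior it provides (weighted ordering via Efraimidis–Spirakis keys; not the actual implementation) would be:

using System;
using System.Collections.Generic;
using System.Linq;

public static class EnumerableExtensions
{
    // Sketch only: items with larger weights tend to sort earlier, so a
    // subsequent Take() acts as a weighted sample without replacement.
    public static IEnumerable<T> Shuffle<T>(this IEnumerable<T> source, Func<T, double> weight)
        => source.OrderByDescending(item => Math.Pow(Random.Shared.NextDouble(), 1.0 / weight(item)));
}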

Here is the memory usage pattern of this code. Each write is followed by a garbage collection, and VW is disposed and reopened:

[Memory usage graph]

VW is disposed every batch in an effort to force the release of the memory taken up by examples. I've confirmed that only one example is actually in the pool, since it is reused and this training is not parallel, so I don't know what the actual source of the problem is.
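
To make the lifecycle explicit, the outer loop amounts to the following (simplified sketch; GetBatches stands in for the real batching code, which is elided):

// Hypothetical driver loop; popularities and stoppingToken are the same
// objects passed to TrainBatchAsync above.
foreach (var batch in GetBatches())
{
    // Each call forces a GC, opens a fresh VowpalWabbit instance, learns
    // the batch, writes the model via EndOfPass(), and disposes the
    // instance when its using scope ends.
    await TrainBatchAsync(batch, popularities, stoppingToken);
}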

At the last memory usage spike in the graph, I restarted the program entirely. It still loaded the model I had been building, but its memory usage reset to the baseline of the original pass.

If it's true that the amount of memory used by a VW model is bounded, I would expect its memory usage to reset to a baseline over time, especially since I'm forcing an aggressive garbage collection every batch. That doesn't seem to be the case: usage scales up roughly linearly over time. Notably, there's a large jump in memory usage the first time the file is written and then reopened.

The model does get bigger with each batch, but not by enough to account for this much memory usage.

How to reproduce

Train a model with the arguments below using the C# bindings, save across passes with EndOfPass(), then dispose and reopen the same model:

--link logistic --loss_function logistic --interactions ac -b 28 --progress 1000 --holdout_off -i file.vw -f file.vw
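
For concreteness, here is a minimal sketch of that cycle using the same binding calls as above (the 'a' namespace, synthetic feature names, loop counts, and the HashSpace call are placeholders, not code I have run as-is):

using System;
using System.IO;
using VW;
using VW.Labels;

const string path = "file.vw";
const string baseArgs = "--link logistic --loss_function logistic --interactions ac -b 28 --progress 1000 --holdout_off";

for (var pass = 0; pass < 100; pass++)
{
    // Load the previous save (if any) and write back to the same file.
    var args = $"{baseArgs} -f {path}" + (File.Exists(path) ? $" -i {path}" : "");
    using var vw = new VowpalWabbit(args);

    for (var i = 0; i < 10_000; i++)
    {
        using var builder = new VowpalWabbitExampleBuilder(vw);
        using (var ns = builder.AddNamespace('a'))
        {
            ns.AddFeature(vw.HashFeature($"f_{i % 500}", vw.HashSpace("a")), 1);
        }
        builder.ApplyLabel(new SimpleLabel { Label = 1, Weight = 1f });

        using var example = builder.CreateExample();
        vw.Learn(example);
    }

    vw.EndOfPass(); // writes the model to file.vw
    // Resident memory keeps growing across iterations even though vw is
    // disposed and reopened each time.
}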

Version

9.11

OS

Linux

Language

C#

Additional context

No response
