Skip to content

Conversation

@sz3
Copy link
Owner

@sz3 sz3 commented Sep 28, 2025

The flip side of #144: complex-ify the encoder. 🫠

We (users of the cimbar.js encoder, including me) have run into situations where the browser hits an out of memory (OOM) error on large input files. Example issues:
#141
#132
#118
#96

While I can't solve all OOM issues -- wirehair needs the compressed representation stored in contiguous memory -- writing the decoder (#134) reminded/convinced me that we can do better when loading the uncompressed data for encoding.

So:

  • we're now loading input files in chunks
  • this requires us to change the c api for encoding
  • as a result, OOM errors in the browser should be less frequent. I won't say solved; but, better.

The new API surface:

  • cimbare_init_encode(filename, encode_id) -> starts an encode, waits for data
  • cimbare_encode(data) -> provides data. When the size of data provided is == cimbare_encode_bufsize(), we treat this as an incremental write and don't finalize anything (and return 1). Otherwise, we consider the write complete.
  • cimbare_encode(nullptr, 0) therefore functions as a sort of "flush()" operation if you want to make sure the data is ready.
  • when cimbare_encode() returns 0, we can start doing our cimbare_next_frame() calls

sz3 added 2 commits September 28, 2025 08:45
The gist is that we currently have to read the whole file into
contiguous memory before handing it off the API. *Additionally*, as a
hard requirement for wirehair, the *compressed* file needs to exist in
contiguous memory.

You might think (as I did) that ~15 MB would not be enough to run afoul
of the browser memory constraints, but apparently this is not so.

We can't do anything about the second (post compressed) requirement, but
the first can be relaxed if the API accepts files in chunks. So:

* cimbare_init_encode() <-- to initialize the encode. Provide filename,
set encode_id (or -1 to auto-increment)
* cimbare_encode_bufsize() <-- the desired chunk size. Currently a
constant.
* cimbare_encode() <-- to provide data. When the data size != the chunk
size, the encode is ready. (this also means we can still achieve the old
behavior of providing the whole file at once)

cimbar_encode(null, 0) can act as a flush() to make sure the encode is
ready -- though this is only relevant when the filesize is a multiple of
the chunk size.
…file

Should help avert (some) OOM errors, hopefully
}
else if (res == 1 and cimbare_encode(nullptr, 0) != 0) // fallback finish encode
{
std::cerr << "failed to encode for file" << filename << std::endl;
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example flow.

++_encodeId; // increment _encodeId every time we change files
else
enc.set_encode_id(static_cast<uint8_t>(encode_id));
_encodeId = encode_id;
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be the strangest side effect of this change -- _encodeId will now persist if you override it, so you can change the auto-increment starting point. You probably shouldn't, though. The right behavior will generally be to use -1 (for auto-increment).

(should probably define a constant for that)

std::optional<cv::Mat> _next;

// compressing the file
std::unique_ptr<cimbar::zstd_compressor<std::stringstream>> _comp;
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The downside here is that the compression is now stateful. But it must be, given that want to persist state across calls.

_comp->set_compression_level(_compressionLevel);

if (fnsize > 0 and filename != nullptr)
_comp->write_header(filename, fnsize);
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this "new" code is refactored from Encoder::create_fountain_encoder() -- that may go away at some point, now that it seems redundant.

_comp->pad(fountainChunkSize - compressedSize + 1);

// create the encoder stream
_fes = fountain_encoder_stream::create(*_comp, fountainChunkSize, _encodeId);
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More from Encoder::create_fountain_encoder()


// create the encoder stream
_fes = fountain_encoder_stream::create(*_comp, fountainChunkSize, _encodeId);
_comp.reset();
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After we return 0 from cimbare_encode, the next call to it will return -1 (as if we never called cimbare_init_encode)

std::string filename = "/tmp/foobar-c语言版.txt";
assertEquals( 0, cimbare_encode(reinterpret_cast<unsigned char*>(contents.data()), contents.size(), filename.data(), filename.size(), 100) );
assertEquals( 0, cimbare_init_encode(filename.data(), filename.size(), 100) );
assertEquals( 0, cimbare_encode(reinterpret_cast<unsigned char*>(contents.data()), contents.size()) );
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another usage example. In this case, we encode the whole file at once, as the old api used to require.

if (datalen > 0) {
// copy to wasm buff and write
const uint8View = new Uint8Array(event.target.result);
_compressBuff.set(uint8View);
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reads awkwardly due to JS apis being awkward. But all we're doing here is copying bytes somewhere the wasm code can work with them.

// this null call is functionally a flush()
// so a no-op, unless it isn't
const nullBuff = new Uint8Array(Module.HEAPU8.buffer, _compressBuff.byteOffset, 0);
Main.encode_bytes(nullBuff);
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the flush/no-op path for cimbare_encode()

Let's have cimbare_init_encode() reset the fountain stream as well.

If we don't, there's a weird (failure path) state where it looks like
we've switched the encoder to a new file, but every encoded frame will
be for the previous file (since the fountain stream setup occurs during
the final cimbare_encode() call.
assertEquals( 0, cimbare_encode(nullptr, 0) );
assertEquals( 1, cimbare_next_frame() );

assertEquals( -1, cimbare_encode(nullptr, 0) );
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Demonstrating the corner case.

@sz3 sz3 merged commit 98fca1c into master Sep 29, 2025
11 checks passed
@sz3 sz3 deleted the improve-encode-capi branch September 29, 2025 03:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants