ChangesetReader performance improvements

Had a brief chat with the Jobs and Transformer people and checked our new "happy path" for getting changes instance IDs, I found a few issues that we should tackle because they cost performance. Not all of them confirmed, the biggest concern is about crossing the N-API boundary too often.

Details:

Consumers like the iModel transformer and the imodels-jobs changed-elements service only need the
**instance keys** of changed rows - `{ ECInstanceId, ECClassId, op }` - not the decoded property values.
The new `ChangesetReader` (native `PreparedChangesetReader`, TS wrapper in
`core/backend/src/ChangesetReader.ts`) already supports this cheaply via `PropertyFilter.InstanceKey`,
which short-circuits `ChangesetValueFactory::Create` before `BuildPropertyFields`:

```cpp
fields.emplace_back(std::move(instanceIdField));
fields.emplace_back(std::move(classIdField));
if (propertyFilter == PropertyFilter::InstanceKey)
    return SUCCESS; // skip user-property materialization
```

That removes the expensive value-decoding work (nav props, DateTime conversion, structs, arrays, GUID
/geometry). The remaining cost is **not** decoding - it is per-row overhead, dominated by N-API boundary
crossings and repeated PRAGMA lookups. For a workload of hundreds of changesets x millions of rows this
fixed cost dominates.

## Problem 1 (primary): per-row N-API chatter

The TS `step()` wrapper crosses the native boundary **3-4 times per row**:

```ts
public step(): boolean {
  ...
  if (this._nativeReader.step()) {                                  // crossing #1
    const meta = this._nativeReader.getChangeMetadata();           // crossing #2 (builds a JS object)
    ...
    if (op === "Inserted" || op === "Updated")
      const rowValue = this._nativeReader.getValue(New, opts);     // crossing #3 (builds {data,key,changeFetchedPropNames})
    if (op === "Deleted" || op === "Updated")
      const rowValue = this._nativeReader.getValue(Old, opts);     // crossing #4 (Update only)
  }
}
```

- Insert -> 3 crossings
- Delete -> 3 crossings
- Update -> 4 crossings

Each of `getChangeMetadata`/`getValue` not only crosses but **allocates and marshals a transient napi
object** (`getValue` returns `{ data, key, changeFetchedPropNames }`, where `data` is itself an object and
`changeFetchedPropNames` an array). For a 1M-row changeset that is ~3-4M boundary crossings plus ~2-3M
throwaway JS objects. The N-API boundary is the known-expensive part here, and we pay it several times per
row.

### Proposed fix: batch/chunk results like `ECSqlReader` / ConcurrentQuery's `QueryReader`

`ECSqlReader` never crosses per row. It drains a local page buffer (`_localRows`) and only refills it when
empty (`fetchRows()` -> `readRows()`), so the boundary cost is amortized across a whole page:

```ts
// ECSqlReader.ts (paraphrased)
private _localRows: any[] = [];
public async step(): Promise<boolean> {
  if (this._localOffset < this._localRows.length - 1) {
    ++this._localOffset;                 // pure local advance, NO crossing
  } else {
    await this.fetchRows();              // page exhausted -> ONE transfer for the next chunk
    this._localOffset = 0;
  }
}
```

Note: ConcurrentQuery's page size is **quota-sized**, not a fixed row count - the backend returns a
`Partial` status with as many rows as fit the time/memory quota, and the reader keeps requesting until
`Done`. The transfer is also request/response (RPC/IPC on the frontend, native boundary on the backend)
rather than strictly N-API. The relevant property is what we want to copy: **one transfer per chunk, not
per row.** For the changeset batch API below we can pick a fixed `maxRows` instead of a quota.

Mirror that for the lightweight changeset path. Add a native **batch drain** that advances the reader and
returns up to `maxRows` instance keys in a single crossing, with the TS wrapper iterating the local buffer:

```ts
// native addon
nextInstanceKeys(maxRows: number): ChangesetInstanceKeyBatch | undefined;

interface ChangesetInstanceKeyBatch {
  // option A: array of small objects
  rows: { id: Id64String; classId: Id64String; op: SqliteChangeOp }[];
  // option B (cheaper, preferred for millions of rows): columnar typed buffers
  //   ids: BigUint64Array;       // ECInstanceId
  //   classIds: BigUint64Array;  // ECClassId
  //   ops: Uint8Array;           // opcode per row
  done: boolean;                  // true when the underlying walk is exhausted
}
```

- **Columnar (option B)** is the cheapest: it avoids per-row JS object allocation entirely and packs the
  data into a couple of typed arrays, which marshal across N-API as buffers. This is the shape that makes
  "millions of rows" genuinely cheap.
- Under `PropertyFilter.InstanceKey` the native side already has `ECInstanceId` + `ECClassId` materialized
  as fixed slots after `Create`, so assembling the batch is just reading those two slots + the opcode per
  row - no extra decode.
- TS wrapper keeps a `_localBuffer`, drains it, and only calls `nextInstanceKeys()` when empty. Result:
  **one crossing per batch** (e.g. a fixed `maxRows` of a few thousand) instead of 3-4 per row.

This is purely additive - the existing per-row `step()`/`getValue()` API is untouched for consumers that
need full row data.

## Problem 2: `PRAGMA_TABLE_INFO` runs 2x per stage, every row

`PreparedChangesetReader::GetColumnValues` re-derives static table metadata on every row:

```cpp
GetColumnCountForCurrentChangedTable(...);  // SELECT COUNT(*) FROM PRAGMA_TABLE_INFO(?)
... "SELECT [name] FROM PRAGMA_TABLE_INFO(?) ORDER BY [cid]"
```

The statements are cached but **executed** every call. Column count and names for a given table are
constant for the entire walk. An UPDATE processes both stages, so that is **4 PRAGMA scans per row**
re-computing constant metadata. Cache column count + ordered names keyed by table name (populated lazily on
first encounter) to eliminate this.

## Problem 3: redundant Old-stage processing on UPDATE for instance keys

`ReFetchValues` builds both New and Old stages for an UPDATE. For `PropertyFilter.InstanceKey` the key is
identical in both stages (PK and ClassId do not change on update), so the entire Old-stage pass
(another `GetColumnValues` + 2 PRAGMA + `ResolveClassId` + `Create`) plus the extra `getValue(Old)`
crossing produces no new information. Under InstanceKey, skip Old-stage processing for updates (or have the
batch drain emit a single key per update row).

## Minor: full column map built even for InstanceKey

`GetColumnValues` maps every present column into an `unordered_map<Utf8String, DbValue>` per stage, when
InstanceKey only needs the PK + ClassId columns. No value decode happens, but it is O(columns) string-keyed
allocations per row. Lower priority than 1-3, but relevant at scale.

## Priority

1. **Batch/chunk drain (Problem 1)** - biggest win, removes the per-row N-API cost that dominates the
   lightweight path. Model after `ECSqlReader._localRows` paging; prefer columnar typed buffers.
2. **PRAGMA caching (Problem 2)** - cheap native-side fix, removes 2-4 redundant pragma scans per row.
3. **Skip Old stage on UPDATE for InstanceKey (Problem 3)** - halves per-update work for key-only
   consumers.

All are native-side and backward compatible; none change the existing `ChangesetReader` API surface.

## Files referenced

- `itwinjs-core/core/backend/src/ChangesetReader.ts` - `step()`, `getValue` wrapper
- `itwinjs-core/core/common/src/ECSqlReader.ts` - `_localRows` paging pattern to mirror
- `imodel-native/iModelJsNodeAddon/IModelJsNative.cpp` - `NativeChangesetReader` (step/getValue/getChangeMetadata bindings)
- `imodel-native/iModelCore/ECDb/ECDb/PreparedChangesetReader.cpp` - `Step`, `ReFetchValues`, `GetColumnValues`, `GetInstanceKey`
- `imodel-native/iModelCore/ECDb/ECDb/ChangesetValueFactory.cpp` - `Create` (InstanceKey short-circuit)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ChangesetReader performance improvements #9415

Problem 1 (primary): per-row N-API chatter

Proposed fix: batch/chunk results like `ECSqlReader` / ConcurrentQuery's `QueryReader`

Problem 2: `PRAGMA_TABLE_INFO` runs 2x per stage, every row

Problem 3: redundant Old-stage processing on UPDATE for instance keys

Minor: full column map built even for InstanceKey

Priority

Files referenced

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

ChangesetReader performance improvements #9415

Description

Problem 1 (primary): per-row N-API chatter

Proposed fix: batch/chunk results like ECSqlReader / ConcurrentQuery's QueryReader

Problem 2: PRAGMA_TABLE_INFO runs 2x per stage, every row

Problem 3: redundant Old-stage processing on UPDATE for instance keys

Minor: full column map built even for InstanceKey

Priority

Files referenced

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Proposed fix: batch/chunk results like `ECSqlReader` / ConcurrentQuery's `QueryReader`

Problem 2: `PRAGMA_TABLE_INFO` runs 2x per stage, every row