
Conversation

@jerch (Member) commented Oct 8, 2018

A first attempt to recycle buffer lines.

Can be tested by switching experimentalBufferLineImpl to TypedArray.

Note: Recycling is only active for the typed array version to keep the comparison fair (the JS array version is much slower with it). It is still buggy, with faulty behavior in some tests (hard to test at the moment, since the tests rely on the hardcoded JS array type). More to come...

The recycling is now active for both buffer line versions.

Part of #791

@jerch jerch added the work-in-progress Do not merge label Oct 8, 2018
@jerch jerch force-pushed the reuse_bufferlines branch from ceb955b to 33a3580 Compare October 9, 2018 21:58
@jerch (Member, Author) commented Oct 9, 2018

@Tyriar Here comes the new approach: a callback version of push. It was the fastest thing I could come up with. I first tried your trim idea with another method, but its runtime was much worse.

Still up for other ideas.

NB: I kinda screwed up the branch and had to reset it lol.
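For illustration, a rough sketch of what such a callback-style push could look like (names and shape are mine, not the actual xterm.js API):

```typescript
// Hypothetical sketch of a callback-style push: instead of receiving a
// finished line, push takes a factory callback and hands it the trimmed
// line (if any), so the caller can reuse that object instead of allocating.
class RecyclingList<T> {
  private _items: T[] = [];
  constructor(private _maxLength: number) {}

  push(create: (recycled?: T) => T): void {
    if (this._items.length === this._maxLength) {
      const trimmed = this._items.shift()!; // trim the oldest entry
      this._items.push(create(trimmed));    // caller may recycle it
    } else {
      this._items.push(create());           // nothing to recycle yet
    }
  }

  get(i: number): T { return this._items[i]; }
  get length(): number { return this._items.length; }
}
```

A caller would then do something like `list.push(old => old ? resetLine(old) : makeBlankLine())`, keeping the recycle decision inside the list.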

@jerch (Member, Author) commented Oct 10, 2018

Added an experimentalPushRecycling option, so it can be tested in the demo with both buffer line implementations. Also made a small change to BufferLine.copyFrom; the JSArray version now benefits from the recycling, too.

For my typical benchmark ls -lR /usr/lib I see the following runtime numbers (range over 5 runs):

  • no recycling:

    • JSArray: 1900 - 2300 ms (corresponds to current master)
    • TypedArray: 2200 - 2600 ms
  • with recycling:

    • JSArray: 1600 - 2000 ms
    • TypedArray: 1700 - 2100 ms

With that small change in BufferLine.copyFrom, the JSArray version is again the fastest. The TypedArray version shows a greater benefit from recycling and is close behind (~10% faster than JSArray without recycling, still ~5% slower than JSArray with recycling).

Edit: The numbers above are the total JS runtime for my benchmark, which also contains the renderer runtime and the websocket overhead (unbuffered in the server script). Currently I have no isolated input-only test setup, but we can approximate the boost by subtracting these numbers:

  • cost of the unbuffered websocket overhead: ~300 ms (tends to be bigger though, and also causes the wide ranges above)
  • cost of the renderer: ~700 ms (450 ms for drawImage plus some JS calls)

Final speedup for the input handler code, on average:

  • JSArray: 1100 vs. 800 => ~27% faster
  • TypedArray: 1400 vs. 900 => ~35% faster

Speedup master vs. recycled TypedArray: 1100 vs. 900 => ~18%
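The percentages above are the relative runtime reduction against the baseline, which can be written as:

```typescript
// Speedup as relative runtime reduction against the baseline, in percent.
function speedup(baselineMs: number, newMs: number): number {
  return ((baselineMs - newMs) / baselineMs) * 100;
}

// speedup(1100, 800) ≈ 27.3  (JSArray)
// speedup(1400, 900) ≈ 35.7  (TypedArray)
// speedup(1100, 900) ≈ 18.2  (master vs. recycled TypedArray)
```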

Guess I should do a real input chain benchmark to get more reliable numbers.

@jerch (Member, Author) commented Oct 10, 2018

Here are the numbers for the input chain alone (obtained with this script: https://gist.github.com/jerch/31f23538c5ca1a5079a78bbd627398ce; ./benchmark_data1 contains the output of ls -lR /usr/lib):

{ BufferLineType: 'JsArray',
  Recycling: false,
  Throughput: '10.45 MB/s',
  File: './benchmark_data1',
  Duration: 4596,
  Size: 50361113 }
{ BufferLineType: 'TypedArray',
  Recycling: false,
  Throughput: '11.44 MB/s',
  File: './benchmark_data1',
  Duration: 4199,
  Size: 50361113 }
{ BufferLineType: 'JsArray',
  Recycling: true,
  Throughput: '12.04 MB/s',
  File: './benchmark_data1',
  Duration: 3990,
  Size: 50361113 }
{ BufferLineType: 'TypedArray',
  Recycling: true,
  Throughput: '19.15 MB/s',
  File: './benchmark_data1',
  Duration: 2508,
  Size: 50361113 }

Seems the typed array + recycling doubles the throughput 😄
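For reference, the Throughput field is just Size over Duration; the reported values line up if "MB" is read as MiB (a reading I am inferring from the numbers, not from the script):

```typescript
// Derive the Throughput value (in "MB/s", assumed to mean MiB/s) from
// Size (bytes) and Duration (milliseconds), matching the output above.
function throughputMBs(sizeBytes: number, durationMs: number): number {
  return sizeBytes / 1048576 / (durationMs / 1000);
}

// throughputMBs(50361113, 4596) ≈ 10.45  (JsArray, no recycling)
// throughputMBs(50361113, 2508) ≈ 19.15  (TypedArray, recycling)
```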

The question here is what to make of these numbers, and why they differ that much from the numbers in the browser:

  • Testing real data from the pty is a bit wonky in the browser and leads to unreliable numbers due to the websocket in between (no clue why it behaves so nondeterministically; it causes all kinds of hiccups in browser tests).
  • Now the typed array version is always faster: kinda what I expected in the first place due to reduced GC pressure, but the browser always said - nope, it's slower, hmm. Maybe this is related to the rendering in the browser; the green boxes look quite different and show higher frame rates most of the time for the typed array.
  • Maybe v8 in nodejs does some fundamentally different things than Chrome's v8 - imho unlikely, but not impossible.
  • We have not yet sliced xterm.js into a separately working offscreen part, so the numbers could be totally wrong due to the missing DOM and internal errors - yeah, well, the data were eaten without errors. I also tested data files with more complicated stuff like the midnight commander startup - it won't run.

Perfwise I think the numbers from the nodejs tests are closer to the truth - they only measure the input chain that is affected most by the changes, while the browser tests seem to measure some side effects as well.

@jerch (Member, Author) commented Oct 22, 2018

@Tyriar Did a version without the callback, as you suggested (see last commit). Well, it seems to have several flaws:

  • it needs a precheck on the caller side whether a trim would occur; imho that's bad API design, as it leaks the trim functionality and will even lead to wrong results if the check is forgotten or faulty
  • it's actually slower than the callback variant (14 vs. 17 MB/s). With slight code changes it's faster (~19 MB/s).

I don't like the callback thing here either, it's just that I had no better idea how to do it without exposing too many internals. Still up for other ideas.

Edit: Whoops, forgot to push - see commit below.

@jerch jerch closed this Oct 26, 2018
@jerch jerch reopened this Oct 26, 2018
@jerch (Member, Author) commented Oct 26, 2018

@Tyriar Removed the callback variant; trimAndRecycle is a lot faster. The drawback is the precondition that it may only be used on a full circular list, or all hell will break loose. I commented it accordingly, so we should be on the safe side as long as we don't neglect the comments/docs.

Now it works as follows for recycling:

  • a terminal instance holds a blank-line blueprint (a buffer line instance with blank content)
  • the blueprint is read upon entering Terminal.scroll and compared against .cols and the current attrs; if they differ, the blueprint is recreated
  • any new line is fast-cloned from this blueprint (newLine = blueprint.clone();)
  • if the buffer is at max length, trimAndRecycle steps in: it sets the old trimmed line as the new active one and returns it, and this line gets its cell content from the blueprint by fast copy (trimAndRecycle().copyFrom(blueprint);)
  • done
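The steps above can be sketched roughly like this, with simplified stand-in classes (not the actual xterm.js BufferLine/CircularList implementations):

```typescript
// Simplified stand-ins for the recycling flow described above.
class Line {
  constructor(public cells: number[]) {}
  clone(): Line { return new Line(this.cells.slice()); }
  copyFrom(other: Line): void { this.cells = other.cells.slice(); }
}

class LineBuffer {
  private _lines: Line[] = [];
  constructor(private _maxLength: number) {}
  get isFull(): boolean { return this._lines.length === this._maxLength; }
  at(i: number): Line { return this._lines[i]; }
  push(line: Line): void { this._lines.push(line); }
  // Precondition: only call this on a full buffer - there must be a
  // trimmed line to hand back for reuse.
  trimAndRecycle(): Line {
    const trimmed = this._lines.shift()!; // drop the oldest line...
    this._lines.push(trimmed);            // ...and reattach it as the newest
    return trimmed;
  }
}

// Rough shape of the scroll path: recycle when full, clone otherwise.
function scroll(buffer: LineBuffer, blueprint: Line): void {
  if (buffer.isFull) {
    buffer.trimAndRecycle().copyFrom(blueprint); // reuse the line object
  } else {
    buffer.push(blueprint.clone());              // fast clone of the blank line
  }
}
```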

The blueprint is needed to have something to copy the cell content from without recreating a line every time. The blueprint alone already gives a nice speedup, since the cached blank line is unlikely to change very often.
Overall, the benchmark numbers for the input chain are:

  • JSArray no recycling: 7 - 9 MB/s
  • JSArray with recycling: 9 - 11 MB/s
  • TypedArray no recycling: 10 - 12 MB/s
  • TypedArray with recycling: 18 - 20 MB/s

Up for another review.

@Tyriar (Member) left a comment

Looks pretty solid, just a few things and I think we can merge this!

@jerch (Member, Author) commented Oct 30, 2018

@Tyriar Something like this should work:

  public pushFromBlueprint(blueprint: IBufferLine, allowRecycle?: boolean): void {
    if (this._length === this._maxLength) {
      this._startIndex = ++this._startIndex % this._maxLength;
      this.emit('trim', 1);
      if (allowRecycle) {
        (this._array[this._getCyclicIndex(this._length - 1)] as unknown as IBufferLine).copyFrom(blueprint);
      } else {
        this._array[this._getCyclicIndex(this._length - 1)] = blueprint.clone() as unknown as T;
      }
    } else {
      this._array[this._getCyclicIndex(this._length)] = blueprint.clone() as unknown as T;
      this._length++;
    }
  }

But this has several issues:

  • it basically removes the type system
  • the BufferLine impl is pulled into CircularList
  • it also pulls the blueprint thing into CircularList
  • not sure whether this.emit('trim', 1); is at the right position (same goes for the current push)

Note it is not possible to give push the final line object beforehand (cloned or recycled), since push has to decide first whether to recycle at all (the final object is either a clone or a copyFrom-recycled object). To encapsulate this nicely we imho have only two options: either the callback variant, or giving up the generic <T> and doing an inherited version of CircularList for IBufferLine to clean up the type system above.
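The second option could look roughly like this - a hypothetical sketch with simplified stand-in types, not a proposal from the PR:

```typescript
// Hypothetical sketch of option 2: a CircularList subclass specialized to
// IBufferLine, avoiding the `as unknown as T` casts of the generic variant.
// All types here are simplified stand-ins for the xterm.js ones.
interface IBufferLine {
  clone(): IBufferLine;
  copyFrom(other: IBufferLine): void;
}

class CircularList<T> {
  protected _array: (T | undefined)[];
  protected _startIndex = 0;
  protected _length = 0;
  constructor(protected _maxLength: number) {
    this._array = new Array<T | undefined>(_maxLength);
  }
  get length(): number { return this._length; }
  get(index: number): T | undefined {
    return this._array[this._getCyclicIndex(index)];
  }
  protected _getCyclicIndex(index: number): number {
    return (this._startIndex + index) % this._maxLength;
  }
}

class BufferLineList extends CircularList<IBufferLine> {
  public pushFromBlueprint(blueprint: IBufferLine, allowRecycle?: boolean): void {
    if (this._length === this._maxLength) {
      this._startIndex = (this._startIndex + 1) % this._maxLength;
      const slot = this._getCyclicIndex(this._length - 1);
      const line = this._array[slot];
      if (allowRecycle && line) {
        line.copyFrom(blueprint);          // reuse the trimmed line's memory
      } else {
        this._array[slot] = blueprint.clone();
      }
    } else {
      this._array[this._getCyclicIndex(this._length)] = blueprint.clone();
      this._length++;
    }
  }
}
```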

@jerch (Member, Author) commented Oct 30, 2018

@Tyriar The last commit reverts to a cleaner precheck version and does the recycling explicitly in Terminal.scroll. This seems much cleaner to me than trying to merge it into push.
It is slightly slower than pushAndRecycle, but less hazardous to use.

@jerch jerch closed this Oct 30, 2018
@jerch jerch reopened this Oct 30, 2018
@jerch (Member, Author) commented Nov 7, 2018

@Tyriar The last commit is a compromise between speed and code safety. I was not able to remove the double check of (this._length === this._maxLength) on the recycle control flow path. recycle now always works but returns undefined for non-full ring buffers; still, the full precheck is faster than handling the undefined (due to a deopt in the ring buffer when reading out of bounds).
Speed decreased only slightly (18.5 MB/s vs. 17.5 MB/s seems negligible).
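The two caller patterns being compared might look like this - a sketch of the described behavior with stand-in names, not the actual xterm.js code:

```typescript
// recycle() works in any state but returns undefined for a non-full ring
// buffer; per the measurements above, the full precheck (variant A) stays
// faster than branching on the undefined result (variant B).
interface ILine { clone(): ILine; copyFrom(other: ILine): void; }
interface IRing {
  readonly isFull: boolean;
  push(line: ILine): void;
  recycle(): ILine | undefined; // undefined while the ring is not full
}

// Variant A: full precheck - the non-null assertion is safe afterwards.
function scrollPrechecked(ring: IRing, blueprint: ILine): void {
  if (ring.isFull) {
    ring.recycle()!.copyFrom(blueprint);
  } else {
    ring.push(blueprint.clone());
  }
}

// Variant B: branch on the return value - measured slower here.
function scrollOnUndefined(ring: IRing, blueprint: ILine): void {
  const recycled = ring.recycle();
  if (recycled) {
    recycled.copyFrom(blueprint);
  } else {
    ring.push(blueprint.clone());
  }
}
```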

@jerch jerch self-assigned this Nov 15, 2018
@jerch jerch added this to the 3.9.0 milestone Nov 15, 2018
@Tyriar Tyriar changed the title first attempt to recycle buffer lines Recycle buffer lines Nov 18, 2018
@jerch jerch force-pushed the reuse_bufferlines branch from 9ee0b51 to 895d6fc Compare November 20, 2018 21:05
@jerch (Member, Author) commented Nov 20, 2018

Now averages ~18.5 MB/s - a nice improvement compared to ~7.5 MB/s in the current master.

@Tyriar (Member) left a comment

🎉

@jerch jerch merged commit b4faaef into xtermjs:master Nov 22, 2018