Conversation
This should be working and there is little to no perf penalty, so I'll move it out of draft. It's not a super elegant solution, but I don't see a better way; open to ideas.
angeloskath left a comment
Just to be clear: assuming `simplify` and `fuse` are disabled, previously we would reuse the computation of `b`, since both compiles would use the same graph, but now the computation will be done twice.
Personally I think this is completely fine, and a pretty good solution actually. If a user is that conscious about the evaluation of these constants, they need only evaluate them manually before compiling and the computation will be shared.
A minor caveat, in which one would need to pre-evaluate some constants, is something of the following form:

```python
c = mx.ones((1024, 1024, 1024)) * 4
fs = [mx.compile(lambda x: x * c * i) for i in range(10)]
```

which would keep a copy of `c` per function in `fs` (after each is called once), while previously it wouldn't.
Let me know if I am missing something.
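The per-function-copy behavior described in the caveat can be mimicked in plain Python. This is a hedged toy sketch, not how `mx.compile` is implemented: `toy_compile` is a hypothetical stand-in that deep-copies closure-captured values on first call, the way each compiled function would end up owning its own copy of `c`:

```python
import copy

def toy_compile(fn):
    # Hypothetical stand-in for mx.compile (NOT the real API): on the
    # first call, snapshot every closure-captured value with a deep
    # copy, so each "compiled" function owns a private copy.
    def wrapper(x):
        if wrapper.consts is None:
            wrapper.consts = {
                name: copy.deepcopy(cell.cell_contents)
                for name, cell in zip(fn.__code__.co_freevars,
                                      fn.__closure__ or ())
            }
        return fn(x)
    wrapper.consts = None
    return wrapper

def make_fns():
    c = [4.0] * 1024  # stand-in for a large constant array
    fs = [toy_compile(lambda x: sum(c) * x * i) for i in range(10)]
    return c, fs

c, fs = make_fns()
for f in fs:
    f(2.0)  # first call triggers the snapshot

# Every function now holds its own deep copy of c (10x the memory);
# sharing a single traced graph would keep one copy instead.
assert all(f.consts["c"] == c and f.consts["c"] is not c for f in fs)
```

With real MLX arrays the duplicated buffers would be 4 GB each in the example above, which is why pre-evaluating `c` (or sharing the graph) matters.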
I don't think that's the case. Previously we would still recompute the captured part of the graph for both, because when you actually evaluate the compiled tape there is no short-circuit for already-evaluated subparts of the graph (maybe there should be?).
I think for your example it would also recompute. In Python we can attempt to fix it by inspecting the closure and treating captured arrays as implicit inputs (I had something like that implemented a while back, if you recall, but it gets a bit messy). In C++ I don't think we can distinguish between constants created inside the compiled function and captured inputs without some serious shenanigans.
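The closure-inspection idea could look roughly like this in plain Python. A hedged sketch only: `implicit_inputs` and its `is_array` predicate are hypothetical names (plain lists stand in for arrays), and MLX compilation does not actually go through such a helper:

```python
def implicit_inputs(fn, is_array=lambda v: isinstance(v, list)):
    # Walk the function's closure and collect captured values that
    # look like arrays, so a compiler could treat them as extra
    # inputs instead of baking them into the graph as constants.
    cells = fn.__closure__ or ()
    names = fn.__code__.co_freevars
    return {n: cell.cell_contents
            for n, cell in zip(names, cells)
            if is_array(cell.cell_contents)}

def outer():
    b = [1, 2, 3]  # "captured array"
    scale = 2      # plain Python constant, left alone
    return lambda x: [scale * v + x for v in b]

fn = outer()
assert implicit_inputs(fn) == {"b": [1, 2, 3]}
```

The messy part alluded to above: closures can nest, globals are captured differently from cells, and mutating a captured array after compilation changes what the "constant" means.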
A possible solution to #2674.
Basically, deep copy the graph before applying optimizations in `compile_simplify`. Would like to measure the overhead of this before merging it. I don't see any noticeable difference in compile times on the MNIST 100-layer benchmark.
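The deep-copy idea can be illustrated with a toy expression graph in Python. This is only an analogy under assumed names (`Node`, `simplify`, and `compile_graph` are made up for the sketch); the actual change is a deep copy inside the C++ `compile_simplify` pass:

```python
import copy

class Node:
    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = list(inputs)
        self.value = value

def simplify(root):
    # Toy in-place optimization: splice out "mul_one" nodes,
    # rewiring each parent directly to the mul_one's input.
    for i, inp in enumerate(root.inputs):
        simplify(inp)
        if inp.op == "mul_one":
            root.inputs[i] = inp.inputs[0]

def compile_graph(root):
    # Deep copy first, so the optimization never mutates the user's
    # original graph, which other compiled functions may share.
    work = copy.deepcopy(root)
    simplify(work)
    return work

x = Node("input")
shared = Node("mul_one", [x])           # subgraph shared with other users
g = Node("add", [shared, Node("const", value=4)])

optimized = compile_graph(g)
assert g.inputs[0] is shared            # original graph untouched
assert optimized.inputs[0].op == "input"  # only the copy was simplified
```

Without the `deepcopy`, `simplify` would rewrite `shared` in place and every other function holding that subgraph would silently see the optimized version; that isolation is what the fix buys, at the cost of re-doing captured computation per compiled function.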