ggml: x64: implement AMX dot product by ReinForce-II · Pull Request #7487 · ggml-org/llama.cpp

ReinForce-II · 2024-05-23T07:04:03Z

This PR adds an AMX (Advanced Matrix Extensions) kernel for the vector dot product on the x64 architecture.
AMX is enabled when LLAMA_AMX=ON or LLAMA_NATIVE=ON is set in CMake on supported platforms.
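
To check whether a Linux machine is one of the "supported platforms" before building, one option is to look for the AMX feature flags the kernel reports. The sketch below is not part of the PR; the helper names `amx_flags` and `supports_amx_bf16` are hypothetical, and it assumes the Linux `/proc/cpuinfo` flag names `amx_tile` and `amx_bf16`. Flag presence is a necessary condition only; it says nothing about how the build itself detects AMX.

```python
def amx_flags(cpuinfo_text):
    """Return the AMX-related flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = line.split(":", 1)[1].split()
            return {f for f in flags if f.startswith("amx")}
    return set()


def supports_amx_bf16(cpuinfo_text):
    # A bf16 tile kernel needs both the tile registers and the bf16 extension.
    return {"amx_tile", "amx_bf16"} <= amx_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("AMX bf16 reported:", supports_amx_bf16(f.read()))
    except OSError:
        print("/proc/cpuinfo is not readable on this system")
```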

Instead of the quantized vector dot product, it performs a 16x16x16 tile dot product in bf16 over 4-byte packed elements (two bf16 values per element).
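
The arithmetic of such a tile operation can be emulated in scalar code. The following is a minimal sketch, not the PR's kernel: it assumes the operand layout of the AMX bf16 tile multiply, where each 4-byte element holds a pair of bf16 values, and it accumulates in Python floats rather than reproducing the hardware's exact FP32 rounding. The names `to_bf16` and `tile_dot_bf16` are hypothetical.

```python
import struct


def to_bf16(x):
    """Round a finite float to the nearest bf16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round on the 16 dropped bits
    bits &= 0xFFFF0000                    # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def tile_dot_bf16(a, b, c):
    """c[m][n] += sum over k of the 2-element dot product of packed pairs.

    a: M rows, each with 2*K bf16 values (K packed pairs per row)
    b: K rows, each with 2*N bf16 values (pair n belongs to output column n)
    c: M x N accumulator, updated in place
    """
    M, K, N = len(a), len(b), len(c[0])
    for m in range(M):
        for k in range(K):
            for n in range(N):
                c[m][n] += (a[m][2 * k] * b[k][2 * n]
                            + a[m][2 * k + 1] * b[k][2 * n + 1])
    return c


if __name__ == "__main__":
    M = K = N = 16
    a = [[to_bf16(0.5) for _ in range(2 * K)] for _ in range(M)]
    b = [[to_bf16(2.0) for _ in range(2 * N)] for _ in range(K)]
    c = [[0.0] * N for _ in range(M)]
    tile_dot_bf16(a, b, c)
    print(c[0][0])
```

With M = K = N = 16, one call consumes 32 bf16 values per row of `a` and produces a 16x16 block of accumulated outputs, which is the "16x16x16" shape mentioned above.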

Here is the performance measured on a w9-3475X capped to a fixed frequency of 2.2 GHz.

Q4_0

PR

./llama-bench -m ./Llama-2-7b-chat-q40.gguf -pg 0,0 -t 4,8,16

model	size	params	backend	threads	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CPU	4	pp512	21.45 ± 0.54
llama 7B Q4_0	3.56 GiB	6.74 B	CPU	4	tg128	6.83 ± 0.26
llama 7B Q4_0	3.56 GiB	6.74 B	CPU	8	pp512	42.74 ± 2.58
llama 7B Q4_0	3.56 GiB	6.74 B	CPU	8	tg128	12.32 ± 0.71
llama 7B Q4_0	3.56 GiB	6.74 B	CPU	16	pp512	84.09 ± 7.45
llama 7B Q4_0	3.56 GiB	6.74 B	CPU	16	tg128	20.50 ± 1.52

build: ba1987f2 (2975)

master

./llama-bench -m ./Llama-2-7b-chat-q40.gguf -pg 0,0 -t 4,8,16

model	size	params	backend	threads	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CPU	4	pp512	14.46 ± 0.02
llama 7B Q4_0	3.56 GiB	6.74 B	CPU	4	tg128	6.08 ± 0.24
llama 7B Q4_0	3.56 GiB	6.74 B	CPU	8	pp512	28.34 ± 1.19
llama 7B Q4_0	3.56 GiB	6.74 B	CPU	8	tg128	9.75 ± 0.62
llama 7B Q4_0	3.56 GiB	6.74 B	CPU	16	pp512	55.79 ± 3.40
llama 7B Q4_0	3.56 GiB	6.74 B	CPU	16	tg128	18.07 ± 1.35

build: cd93a28 (2972)

IQ4_XS

PR

./llama-bench -m ./Llama-2-7b-chat-iq4xs.gguf -pg 0,0 -t 4,8,16

model	size	params	backend	threads	test	t/s
llama 7B IQ4_XS - 4.25 bpw	3.40 GiB	6.74 B	CPU	4	pp512	18.95 ± 0.44
llama 7B IQ4_XS - 4.25 bpw	3.40 GiB	6.74 B	CPU	4	tg128	7.13 ± 0.35
llama 7B IQ4_XS - 4.25 bpw	3.40 GiB	6.74 B	CPU	8	pp512	37.58 ± 2.00
llama 7B IQ4_XS - 4.25 bpw	3.40 GiB	6.74 B	CPU	8	tg128	12.71 ± 0.73
llama 7B IQ4_XS - 4.25 bpw	3.40 GiB	6.74 B	CPU	16	pp512	74.06 ± 5.89
llama 7B IQ4_XS - 4.25 bpw	3.40 GiB	6.74 B	CPU	16	tg128	21.08 ± 1.91

build: ba1987f2 (2975)

master

./llama-bench -m ./Llama-2-7b-chat-iq4xs.gguf -pg 0,0 -t 4,8,16

model	size	params	backend	threads	test	t/s
llama 7B IQ4_XS - 4.25 bpw	3.40 GiB	6.74 B	CPU	4	pp512	9.28 ± 0.12
llama 7B IQ4_XS - 4.25 bpw	3.40 GiB	6.74 B	CPU	4	tg128	7.04 ± 0.29
llama 7B IQ4_XS - 4.25 bpw	3.40 GiB	6.74 B	CPU	8	pp512	18.03 ± 0.40
llama 7B IQ4_XS - 4.25 bpw	3.40 GiB	6.74 B	CPU	8	tg128	12.24 ± 0.87
llama 7B IQ4_XS - 4.25 bpw	3.40 GiB	6.74 B	CPU	16	pp512	34.11 ± 1.66
llama 7B IQ4_XS - 4.25 bpw	3.40 GiB	6.74 B	CPU	16	tg128	20.76 ± 1.84

build: cd93a28 (2972)
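
To compare the two builds, the PR-to-master throughput ratios can be computed directly from the tables above. This is a small sketch using only the mean t/s values copied from those tables; it ignores the reported ± spread, so small differences (especially for tg128) should be read with caution. The names `RESULTS` and `speedups` are hypothetical.

```python
# Mean tokens/s copied from the tables above: threads -> (PR, master).
RESULTS = {
    ("Q4_0", "pp512"): {4: (21.45, 14.46), 8: (42.74, 28.34), 16: (84.09, 55.79)},
    ("Q4_0", "tg128"): {4: (6.83, 6.08), 8: (12.32, 9.75), 16: (20.50, 18.07)},
    ("IQ4_XS", "pp512"): {4: (18.95, 9.28), 8: (37.58, 18.03), 16: (74.06, 34.11)},
    ("IQ4_XS", "tg128"): {4: (7.13, 7.04), 8: (12.71, 12.24), 16: (21.08, 20.76)},
}


def speedups(results):
    """Return the PR/master throughput ratio per (quant, test) and thread count."""
    return {
        key: {threads: pr / master for threads, (pr, master) in rows.items()}
        for key, rows in results.items()
    }


if __name__ == "__main__":
    for (quant, test), rows in speedups(RESULTS).items():
        line = ", ".join(f"{t} threads: {r:.2f}x" for t, r in rows.items())
        print(f"{quant} {test}: {line}")
```

On these numbers, prompt processing (pp512) improves by roughly 1.5x for Q4_0 and roughly 2x for IQ4_XS, while token generation (tg128) ratios fall between about 1.01x and 1.26x.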

github-actions · 2024-05-23T09:21:45Z

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 554 iterations 🚀

Concurrent users: 8, duration: 10m
HTTP request : avg=8403.85ms p(95)=22214.97ms fails=, finish reason: stop=505 truncated=49
Prompt processing (pp): avg=101.51tk/s p(95)=482.56tk/s
Token generation (tg): avg=46.61tk/s p(95)=46.65tk/s
ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=amx_bf16 commit=0adedd712ed3959952db5147cbc271a2a42c2c7f

[Charts omitted: four xychart-beta time series titled "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 duration=10m 554 iterations", one each for llamacpp:prompt_tokens_seconds, llamacpp:predicted_tokens_seconds, llamacpp:kv_cache_usage_ratio, and llamacpp:requests_processing, over timestamps 1716785306 to 1716785928.]

teaalltr · 2024-10-29T21:56:45Z

@ReinForce-II any news on this?

ggerganov · 2024-10-30T12:15:58Z

I think this work is superseded by the recent #8998

mofosyne added the labels "Review Complexity : High" (Generally require indepth knowledge of LLMs or GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) on May 23, 2024

github-actions bot added the "build" (Compilation issues) label on May 23, 2024

ReinForce-II added 4 commits May 27, 2024 10:15

3047229	basic implementation
9a16633	use larger block size
c812542	better toolchain compability
0adedd7	move unsed variable

ReinForce-II force-pushed the amx_bf16 branch from 21a44e9 to 0adedd7 Compare May 27, 2024 02:40

mingfeima closed this Nov 8, 2024

ReinForce-II wants to merge 4 commits into ggml-org:master from ReinForce-II:amx_bf16


5 participants
