Description
It would be nice if we provided a macro that replaces functions with their vectorized versions. For example, @ivm @. sin(x) would replace sin with the IntelVectorMath function, and @applacc @. sin(x) would call AppleAccelerate.
We could provide such macros from IntelVectorMath.jl itself, or else keep all of them in one place, such as inside LoopVectorization.jl.
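To make the idea concrete, here is a rough, untested sketch of what the simplest form of such a macro might look like. The @ivm name, the IVM_FUNCS list, and the rewrite rule (turn f.(x) into IVM.f(x) for supported f) are all placeholders for illustration; the real macro would need to cover IntelVectorMath's full function list and handle nested broadcasts and @. expressions properly.

```julia
import IntelVectorMath
const IVM = IntelVectorMath

# Placeholder subset of the functions IntelVectorMath provides.
const IVM_FUNCS = (:sin, :cos, :exp, :log)

# Recursively rewrite `f.(x)` into `IVM.f(x)` for supported `f`,
# leaving everything else (e.g. `sum.(a)`) untouched.
function _swap_ivm(ex)
    ex isa Expr || return ex
    if ex.head === :. && length(ex.args) == 2 &&
       ex.args[1] isa Symbol && ex.args[1] in IVM_FUNCS &&
       ex.args[2] isa Expr && ex.args[2].head === :tuple
        f = ex.args[1]
        args = map(_swap_ivm, ex.args[2].args)
        return :(IVM.$f($(args...)))
    end
    return Expr(ex.head, map(_swap_ivm, ex.args)...)
end

macro ivm(ex)
    esc(_swap_ivm(ex))
end
```

With that sketch, @ivm sin.(a) .* cos.(a) .* sum.(a) would expand to IVM.sin(a) .* IVM.cos(a) .* sum.(a), which matches the first syntax option discussed in my response below.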
Quoting @chriselrod:
The major improvement these provide is that they're vectorized. If x is a scalar, then there isn't much benefit, if there is any at all.
An earlier version of LoopVectorization provided an @vectorize macro (since removed) which naively swapped calls and incremented loops by the vector width (i.e., instead of 1:N, it would iterate 1:W:N, plus some code to handle the remainder). @avx does this better.
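For illustration, the 1:W:N transformation with a scalar remainder loop can be sketched by hand like this; it is only an outline of the pattern, not the code @vectorize actually generated:

```julia
function logsum_chunked(x::Vector{Float64})
    W = 8                          # pretend vector width
    N = length(x)
    s = 0.0
    i = 1
    @inbounds while i + W - 1 <= N
        for j in i:(i + W - 1)     # in the real macro this inner chunk became one SIMD call
            s += log(x[j])
        end
        i += W
    end
    @inbounds for j in i:N         # scalar remainder loop for the last N % W elements
        s += log(x[j])
    end
    s
end
```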
If they are a vector, calling @avx sin.(x) or IntelVectorMath.sin(x) works (although a macro could search a whole block of code and swap calls to use IntelVectorMath).
I've been planning on adding "loop splitting" support in LoopVectorization for a little while now (splitting one loop into several).
It would be possible to extend this to moving special functions into their own "loop" (a single vectorized call) and using VML (or some other library).
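Loosely, such a split would turn a single reduction loop into a gather loop, one batched library call, and a reduction loop. A rough sketch under that reading (split_logsum! is a made-up name; the triangle_logdet_vml! benchmark further down does essentially the same thing by hand):

```julia
import IntelVectorMath

# Original single loop:    s = sum(log(x[i]) for i in eachindex(x))
# Split into three pieces: gather -> one batched log! call -> reduce
function split_logsum!(buf::Vector{Float64}, x::AbstractVector{Float64})
    copyto!(buf, x)                      # loop 1: copy inputs into a contiguous buffer
    IntelVectorMath.log!(buf, buf)       # "loop" 2: a single vectorized library call
    s = 0.0
    @inbounds for i in eachindex(buf)    # loop 3: sum the results
        s += buf[i]
    end
    s
end
```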
I would prefer "short vector" functions in general. They wouldn't require any changes to the library to support, nor would they require special casing. E.g., this works well with AVX2:
```julia
julia> using LinearAlgebra, LoopVectorization, BenchmarkTools

julia> U = randn(200, 220) |> x -> cholesky(Symmetric(x * x')).U;

julia> function triangle_logdet(A::Union{LowerTriangular,UpperTriangular})
           ld = zero(eltype(A))
           @avx for i in 1:size(A,1)
               ld += log(A[i,i])
           end
           ld
       end
triangle_logdet (generic function with 1 method)

julia> @btime logdet($U)
  2.131 μs (0 allocations: 0 bytes)
462.0132368439299

julia> @btime triangle_logdet($U)
  1.076 μs (0 allocations: 0 bytes)
462.0132368439296

julia> Float64(sum(log ∘ big, diag(U)))
462.0132368439296
```

Presumably, VML does not handle vectors with a stride other than 1, which would force me to copy the elements, log them, and then sum them if I wanted to use it there.
Assuming it's able to use some pre-allocated buffer...
```julia
julia> y3 = similar(diag(U));

julia> function triangle_logdet_vml!(y, A::Union{LowerTriangular, UpperTriangular})
           @avx for i ∈ 1:size(A,1)
               y[i] = A[i,i]
           end
           IntelVectorMath.log!(y, y)
           ld = zero(eltype(y))
           @avx for i ∈ eachindex(y)
               ld += y[i]
           end
           ld
       end
triangle_logdet_vml! (generic function with 1 method)

julia> @btime triangle_logdet_vml!($y3, $U)
  697.691 ns (0 allocations: 0 bytes)
462.0132368439296
```

It looks like all that effort would pay off, so I'm open to it.
Long term I would still be in favor of implementing more of these special functions in Julia or LLVM, but this may be the better short term move. I also don't see many people jumping at the opportunity to implement SIMD versions of special functions (myself included).
Too bad VML isn't more expansive. Adding it wouldn't do much to increase the number of special functions currently supported by SLEEFPirates/LoopVectorization.
I've been wanting a digamma function, for example. I'll probably try the approach suggested by Wikipedia.
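For reference, the usual route (presumably what the Wikipedia article suggests) is the recurrence psi(x) = psi(x+1) - 1/x to push the argument up, followed by the asymptotic series psi(x) ~ log(x) - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6). A rough, untested scalar sketch of that, assuming x > 0 (digamma_sketch is a made-up name):

```julia
# Shift x up with psi(x) = psi(x+1) - 1/x, then apply the asymptotic expansion.
# Valid only for x > 0; no reflection formula for negative arguments.
function digamma_sketch(x::Float64)
    s = 0.0
    while x < 6.0            # apply the recurrence until the series is accurate enough
        s -= 1.0 / x
        x += 1.0
    end
    inv2 = 1.0 / (x * x)
    # log(x) - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6), evaluated in Horner form
    s + log(x) - 0.5 / x - inv2 * (1/12 - inv2 * (1/120 - inv2 / 252))
end
```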
How well does VML perform on AMD? Is that something I'd have to worry about?
EDIT:
With AVX512:
```julia
julia> using LinearAlgebra, LoopVectorization, IntelVectorMath, BenchmarkTools

julia> U = randn(200, 220) |> x -> cholesky(Symmetric(x * x')).U;

julia> function triangle_logdet(A::Union{LowerTriangular,UpperTriangular})
           ld = zero(eltype(A))
           @avx for i in 1:size(A,1)
               ld += log(A[i,i])
           end
           ld
       end
triangle_logdet (generic function with 1 method)

julia> @btime logdet($U)
  1.426 μs (0 allocations: 0 bytes)
463.5193875385334

julia> @btime triangle_logdet($U)
  234.677 ns (0 allocations: 0 bytes)
463.5193875385336

julia> Float64(sum(log ∘ big, diag(U)))
463.51938753853364

julia> y3 = similar(diag(U));

julia> function triangle_logdet_vml!(y, A::Union{LowerTriangular, UpperTriangular})
           @avx for i ∈ 1:size(A,1)
               y[i] = A[i,i]
           end
           IntelVectorMath.log!(y, y)
           ld = zero(eltype(y))
           @avx for i ∈ eachindex(y)
               ld += y[i]
           end
           ld
       end
triangle_logdet_vml! (generic function with 1 method)

julia> @btime triangle_logdet_vml!($y3, $U)
  411.110 ns (0 allocations: 0 bytes)
463.51938753853364
```

With AVX512, it uses this log definition. I'd be more inclined to add something similar for AVX2. For this benchmark, the Intel compilers produce faster code.
My response:
I just want to clarify what I mean in this issue, so everyone is on the same page.
We can consider 3 kinds of syntax for the macro (I use @ivm to avoid confusion):
- A simple macro that only searches the given Expr for the functions that IntelVectorMath provides and adds IVM. before their name:

  ```julia
  a = rand(100)
  @ivm sin.(a) .* cos.(a) .* sum.(a)
  ```

  should be translated to:

  ```julia
  IVM.sin(a) .* IVM.cos(a) .* sum.(a)
  ```

- A macro that converts broadcasts to IVM calls (which I think is more in line with your example):

  ```julia
  a = rand(100)
  @ivm sin.(a) .* cos.(a)
  ```

  which, similar to 1, is translated to:

  ```julia
  IVM.sin(a) .* IVM.cos(a)
  ```

  But in this case other functions can use a for loop with @avx on them:

  ```julia
  a = rand(100)
  @ivm sin.(a) .* cos.(a) .* sum.(a)
  ```

  should be translated to:

  ```julia
  out = Vector{eltype(a)}(undef, length(a))
  temp = IVM.sin(a) .* IVM.cos(a)
  @avx for i = 1:length(a)
      out[i] = temp[i] * sum(a[i])
  end
  out
  ```

- Or, similar to (2) but probably more efficient: we can fuse the loops (the internal IntelVectorMath loop and the for loop) together and use IntelVectorMath on only one element at a time:

  ```julia
  out = Vector{eltype(a)}(undef, length(a))
  @avx for i = 1:length(a)
      out[i] = IVM.sin([a[i]])[1] * IVM.cos([a[i]])[1] * sum(a[i])
  end
  out
  ```

When someone uses @ivm, that means they want to transform sin to IVM.sin.

Multiple library usage:

```julia
(@ivm sin.(a) .* sin.(b)) .* Base.sin.(a)
```

So which one of these syntaxes do we want to consider?
Related:
- Scalar methods: Scalar Calculation? JuliaMath/IntelVectorMath.jl#44
Came up in: JuliaMath/IntelVectorMath.jl#22 (comment), JuliaMath/IntelVectorMath.jl#43, JuliaMath/IntelVectorMath.jl#42,