// Initialize level-3 blocksize objects with architecture-specific values.
//                                           s      d     c      z
bli_blksz_init_easy( &blkszs[ BLIS_MR ],     8,     6,    -1,    -1 );
Hi, it's great to see Neoverse N1 tuning. Can I ask how you came up with these block size values?
Hi! To be honest, with this patch I just wanted the compiler to generate code tuned for Neoverse N1, so the block size values were taken from the ThunderX2 configuration. If BLIS has a standard procedure for generating those values, I am all for it; please let me know.
I value what you did; however, I don't have the answer to this.
@devinamatthews any pointer you could share ?
@egaudry Do you think fine-tuning is essential for this to be merged?
Having a clear interface and architecture detection does make sense; however, without proper tuning, maintainers and reviewers might not see this as a priority.
Just guessing.
Jeff Diamond has better tuning parameters for N1.
@jeffhammond Thanks for commenting. Could you please point me to Jeff Diamond so I can ask him whether he is able to share his parameters?
"The establishment" here. @everton1984 thanks for your work, but @egaudry is pretty much right: it is best to have specifically tuned block sizes and/or kernels, with performance numbers, before creating a new sub-configuration. Otherwise it is simply easier to use the thunderx2 subconfig directly. I'll ask Jeff Diamond about the status of the tuned N1 parameters, since that code may still be in the clutches of Oracle's lawyers.
@devinamatthews Thanks for answering. No problem, it makes sense. I can generate the parameters; I just wanted to know, before trying something ad hoc, whether there is a well-defined procedure for obtaining them.
The block sizes can, to some extent, be determined analytically; see https://www.cs.utexas.edu/users/flame/pubs/TOMS-BLIS-Analytical.pdf. A basic non-analytical strategy is:
Final note: the block sizes must satisfy MC % MR == 0 and NC % NR == 0. If possible, it doesn't hurt to make all three cache block sizes (MC, KC, NC) multiples of both MR and NR, unless that choice is too restrictive. It may also help to avoid large powers of 2.
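As a rough illustration of the constraints above, here is a minimal C sketch of a simplified capacity heuristic. It is neither the analytical model from the paper nor the step-by-step strategy referred to above; the half-of-cache budgets, the cache sizes you feed it, and every type and function name (`cache_sizes_t`, `choose_blocksizes`, `avoid_power_of_two`, etc.) are hypothetical. It sizes KC from L1, MC from L2 and NC from L3, rounds each down to a multiple of lcm(MR, NR) so that MC % MR == 0 and NC % NR == 0 hold, and steps away from exact powers of 2.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cache description; fill in from the target's documentation. */
typedef struct {
    size_t l1_bytes;  /* per-core L1 data cache */
    size_t l2_bytes;  /* per-core L2 cache */
    size_t l3_bytes;  /* last-level cache */
} cache_sizes_t;

typedef struct {
    size_t mr, nr, kc, mc, nc;
} blocksizes_t;

static size_t gcd_sz(size_t a, size_t b) {
    while (b != 0) { size_t t = a % b; a = b; b = t; }
    return a;
}

static size_t lcm_sz(size_t a, size_t b) {
    return a / gcd_sz(a, b) * b;
}

static bool is_power_of_two(size_t x) {
    return x != 0 && (x & (x - 1)) == 0;
}

/* Round x down to a multiple of m, but never below m itself. */
static size_t round_down_multiple(size_t x, size_t m) {
    size_t r = x - x % m;
    return r < m ? m : r;
}

/* If x (a multiple of m) is an exact power of 2 larger than m, step down by m;
 * the result is still a multiple of m. */
static size_t avoid_power_of_two(size_t x, size_t m) {
    return (is_power_of_two(x) && x > m) ? x - m : x;
}

static size_t to_block_size(size_t x, size_t step) {
    return avoid_power_of_two(round_down_multiple(x, step), step);
}

/* Simplified heuristic (an assumption, not the TOMS analytical model):
 *  KC: an MR x KC panel of A plus a KC x NR panel of B fill about half of L1;
 *  MC: an MC x KC packed block of A fills about half of L2;
 *  NC: a KC x NC packed panel of B fills about half of L3. */
static blocksizes_t choose_blocksizes(cache_sizes_t c, size_t mr, size_t nr,
                                      size_t elem_bytes) {
    blocksizes_t b = { mr, nr, 0, 0, 0 };
    size_t step = lcm_sz(mr, nr);  /* multiple of both MR and NR */

    b.kc = to_block_size((c.l1_bytes / 2) / ((mr + nr) * elem_bytes), step);
    b.mc = to_block_size((c.l2_bytes / 2) / (b.kc * elem_bytes), step);
    b.nc = to_block_size((c.l3_bytes / 2) / (b.kc * elem_bytes), step);
    return b;
}

/* The hard requirement from BLIS: MC % MR == 0 and NC % NR == 0. */
static bool blocksizes_valid(blocksizes_t b) {
    return b.mc % b.mr == 0 && b.nc % b.nr == 0;
}
```

For example, with placeholder (not N1-specific) values of a 64 KiB L1, 1 MiB L2, 4 MiB L3, MR = 6, NR = 8 and 8-byte doubles, `choose_blocksizes` yields KC = 288, MC = 216 and NC = 888. Numbers like these are only starting points for a benchmark sweep, not tuned parameters.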
@devinamatthews Thanks a lot! Let me find the correct parameters then.
This PR adds a valid Arm Neoverse N1 compilation target using Armv8 kernels. It adds the appropriate registry information and can autodetect an N1 CPU.
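For context on the autodetection: Arm cores are commonly identified through the MIDR_EL1 register, whose implementer field is 0x41 for Arm and whose part number for Neoverse N1 is 0xD0C (Linux reports the same fields as "CPU implementer" and "CPU part" in /proc/cpuinfo). The C sketch below only decodes a MIDR value that has already been obtained; it is not BLIS's detection code, and how this PR actually reads and matches the ID is left open. The names `midr_fields_t`, `decode_midr` and `is_neoverse_n1` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define ARM_IMPLEMENTER_ARM   0x41u   /* implementer code for Arm Ltd. */
#define ARM_PART_NEOVERSE_N1  0xD0Cu  /* primary part number for Neoverse N1 */

/* Field layout of the Armv8-A MIDR_EL1 register. */
typedef struct {
    uint32_t implementer;   /* bits [31:24] */
    uint32_t variant;       /* bits [23:20], the N in rNpM */
    uint32_t architecture;  /* bits [19:16] */
    uint32_t part_num;      /* bits [15:4]  */
    uint32_t revision;      /* bits [3:0], the M in rNpM */
} midr_fields_t;

static midr_fields_t decode_midr(uint32_t midr) {
    midr_fields_t f;
    f.implementer  = (midr >> 24) & 0xFFu;
    f.variant      = (midr >> 20) & 0xFu;
    f.architecture = (midr >> 16) & 0xFu;
    f.part_num     = (midr >> 4)  & 0xFFFu;
    f.revision     =  midr        & 0xFu;
    return f;
}

/* True when the MIDR value identifies an Arm-designed Neoverse N1 core. */
static bool is_neoverse_n1(uint32_t midr) {
    midr_fields_t f = decode_midr(midr);
    return f.implementer == ARM_IMPLEMENTER_ARM &&
           f.part_num == ARM_PART_NEOVERSE_N1;
}
```

For instance, a MIDR value of 0x413FD0C1 decodes to implementer 0x41, part 0xD0C, revision r3p1, so `is_neoverse_n1` returns true for it; matching on implementer and part number while ignoring variant and revision keeps every N1 stepping on the same sub-configuration.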