LLVM / project / 1008539 / [SLP]Reduce number of alternate instruction, where possible

LLVM/project 1008539 — llvm/include/llvm/Analysis TargetTransformInfo.h TargetTransformInfoImpl.h, llvm/lib/Analysis TargetTransformInfo.cpp

main

[SLP]Reduce number of alternate instruction, where possible

Previous version was reviewed here https://github.com/llvm/llvm-project/pull/123360
It is mostly the same, adjusted after graph-to-tree transformation

Patch tries to remove wide alternate operations.
Currently SLP vectorizer emits something like this:
```
%0 = add i32
%1 = sub i32
%2 = add i32
%3 = sub i32
%4 = add i32
%5 = sub i32
%6 = add i32
%7 = sub i32

transformes to

%v1 = add <8 x i32>
%v2 = sub <8 x i32>
%res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>
```
i.e. half of the results are just unused. This leads to increased
register pressure and potentially doubles number of operations.

Patch introduces SplitVectorize mode, where it splits the operations by
opcodes and produces instead something like this:
```
%v1 = add <4 x i32>
%v2 = sub <4 x i32>
%res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>
```
It allows to improve the performance by reducing number of ops. Also, it
turns on some other improvements, like improved graph reordering.

-O3+LTO, AVX512
Metric: size..text
Program                                                                         size..text
                                                                                results     results0    diff
                       test-suite :: MultiSource/Benchmarks/Olden/tsp/tsp.test     2788.00     2820.00  1.1%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test   278168.00   280904.00  1.0%
               test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test    82682.00    83258.00  0.7%
                    test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test   139344.00   139712.00  0.3%
     test-suite :: MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.test    27149.00    27197.00  0.2%
               test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test  1008188.00  1009948.00  0.2%
          test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test    39226.00    39290.00  0.2%
   test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test    39229.00    39293.00  0.2%
 test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  2074533.00  2076549.00  0.1%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  2074533.00  2076549.00  0.1%
             test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test   798440.00   798952.00  0.1%
     test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test    44123.00    44139.00  0.0%
                       test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   318942.00   319038.00  0.0%
        test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  1159880.00  1160152.00  0.0%
     test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test    73595.00    73611.00  0.0%
                test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test  1146124.00  1146348.00  0.0%
       test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test   203831.00   203847.00  0.0%
 test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   207662.00   207678.00  0.0%
                test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test   589851.00   589883.00  0.0%
      test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  1398543.00  1398559.00  0.0%
     test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  1398543.00  1398559.00  0.0%
        test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2050990.00  2051006.00  0.0%

      test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12559687.00 12559591.00 -0.0%
                     test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test  3074157.00  3074125.00 -0.0%
         test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1092252.00  1092188.00 -0.0%
            test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test   779763.00   779715.00 -0.0%
         test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test   253517.00   253485.00 -0.0%
                  test-suite :: MultiSource/Applications/JM/lencod/lencod.test   848259.00   848035.00 -0.0%
     test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test    93064.00    93016.00 -0.1%
                  test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   383747.00   383475.00 -0.1%
          test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   673051.00   662907.00 -1.5%
           test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   673051.00   662907.00 -1.5%

Olden/tsp - small variations
Prolangs-C/TimberWolfMC - small variations, some code not inlined
FreeBench/pifft - extra store <8 x double> vectorized, some other extra
vectorizations
CFP2006/433.milc - better vector code
FreeBench/fourinarow - better vector code
Benchmarks/tramp3d-v4 - extra vector code, small variations
mediabench/gsm/toast - small variations
MiBench/telecomm-gsm - small variations
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - better vector code, small variations
CINT2006/464.h264ref - some smaller code + changes similar to x264
DOE-ProxyApps-C/miniGMG - small variations
Benchmarks/Bullet - small variations
CFP2017rate/511.povray_r - small variations
DOE-ProxyApps-C/miniAMR - small variations
CFP2006/453.povray - small variations
DOE-ProxyApps-C++/CLAMR - small variations
MiBench/consumer-lame - small variations
CFP2006/447.dealII - small variations
CFP2017rate/538.imagick_r
CFP2017speed/638.imagick_s - small variations
CFP2017rate/510.parest_r - better vector code, small variations
CFP2017rate/526.blender_r - small variations
CINT2006/403.gcc - small variations
CINT2006/400.perlbench - small variations
CFP2017rate/508.namd_r - small variations
ASCI_Purple/SMG2000 - small variations
JM/lencod - extra store <16 x i32>, small variations
DOE-ProxyApps-C++/miniFE - small variations
JM/ldecod - extra vector code, small variations, less shuffles
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - the number of instructions increased, but
looks like they are more performant. E.g., for function
x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the
current version and 59 for the new version.

-O3+LTO, mcpu=sifive-p470

Metric: size..text

                                                                               results    results0   diff
                                 test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test  580768.00  581118.00   0.1%
                                        test-suite :: MultiSource/Applications/d/make_dparser.test   78854.00   78894.00   0.1%
                                      test-suite :: MultiSource/Applications/JM/lencod/lencod.test  633448.00  633750.00   0.0%
                                           test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  277002.00  277080.00   0.0%
                             test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  931938.00  931960.00   0.0%
                                         test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test 2512806.00 2512822.00   0.0%
                                test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 7659880.00 7659876.00  -0.0%
                                 test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 7659880.00 7659876.00  -0.0%
                            test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 1602448.00 1602434.00  -0.0%
                          test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9496664.00 9496542.00  -0.0%
                     test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test  147424.00  147422.00  -0.0%
                    test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1764608.00 1764578.00  -0.0%
                     test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1764608.00 1764578.00  -0.0%
                                     test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  841656.00  841632.00  -0.0%
                                    test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test  949026.00  948962.00  -0.0%
                            test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  946348.00  946284.00  -0.0%
                                      test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  279794.00  279764.00  -0.0%
                       test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test    4776.00    4772.00  -0.1%
                              test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test   25074.00   25028.00  -0.2%
                       test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test   25074.00   25028.00  -0.2%
                         test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test   29336.00   29184.00  -0.5%
                               test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  535390.00  510124.00  -4.7%
                              test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  535390.00  510124.00  -4.7%
test-suite :: SingleSource/Regression/C/gcc-c-torture/execute/ieee/GCC-C-execute-ieee-pr50310.test     886.00     608.00 -31.4%

CINT2006/464.h264ref - extra v16i32 reduction
d/make_dparser - better vector code
JM/lencod - extra v16i32 reduction
Benchmarks/Bullet - smaller vector code
CINT2006/400.perlbench - better vector code
CINT2006/403.gcc - small variations
CINT2017speed/602.gcc_s
CINT2017rate/502.gcc_r - small variations
CFP2017rate/510.parest_r - small variations
CFP2017rate/526.blender_r - small variations
MiBench/consumer-lame - small variations
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - small variations
Benchmarks/7zip - small variations
CFP2017rate/511.povray_r - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - extra vector code
mediabench/gsm - extra vector code
MiBench/telecomm-gsm - extra vector code
DOE-ProxyApps-C/miniGMG - extra vector code
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - reduced number of wide operations and
shuffles, saving the registers, similar to X86, extra code in
pixel_hadamard_ac vectorized
ieee/GCC-C-execute-ieee-pr50310 - extra code vectorized

CINT2006/464.h264ref - extra vector code in find_sad_16x16
JM/lencod - extra vector code in find_sad_16x16
d/make_dparser - smaller vector code
Benchmarks/Bullet - small variations
CINT2006/400.perlbench - smaller vector code
CFP2017rate/526.blender_r - small variations, extra store <8 x float> in
the loop, extra store <8 x i8> in loop
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - small variations
MiBench/consumer-lame - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - small variations

Reviewers: hiraditya

Reviewed By: hiraditya

Pull Request: https://github.com/llvm/llvm-project/pull/128907

Delta		File
+702	-73	llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
+230	-142	llvm/test/Transforms/SLPVectorizer/RISCV/complex-loads.ll
+222	-0	llvm/test/Transforms/SLPVectorizer/X86/split-node-reorder-node-with-ops.ll
+138	-26	llvm/test/Transforms/SLPVectorizer/X86/alternate-cast.ll
+138	-26	llvm/test/Transforms/SLPVectorizer/X86/alternate-cast-inseltpoison.ll
+86	-20	llvm/test/Transforms/SLPVectorizer/X86/alternate-int.ll
+86	-20	llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll
+82	-16	llvm/test/Transforms/SLPVectorizer/X86/alternate-fp.ll
+82	-16	llvm/test/Transforms/SLPVectorizer/X86/alternate-fp-inseltpoison.ll
+72	-0	llvm/test/Transforms/SLPVectorizer/X86/split-node-no-reorder-copy.ll
+20	-16	llvm/test/Transforms/SLPVectorizer/X86/same-values-sub-node-with-poisons.ll
+15	-13	llvm/test/Transforms/SLPVectorizer/X86/matched-shuffled-entries.ll
+9	-13	llvm/test/Transforms/SLPVectorizer/X86/splat-score-adjustment.ll
+5	-8	llvm/test/Transforms/SLPVectorizer/AArch64/tsc-s116.ll
+4	-8	llvm/test/Transforms/SLPVectorizer/addsub.ll
+5	-5	llvm/test/Transforms/SLPVectorizer/X86/non-load-reduced-as-part-of-bv.ll
+2	-8	llvm/test/Transforms/SLPVectorizer/X86/scatter-vectorize-reused-pointer.ll
+8	-0	llvm/include/llvm/Analysis/TargetTransformInfo.h
+4	-4	llvm/test/Transforms/SLPVectorizer/X86/long-full-reg-stores.ll
+3	-4	llvm/test/Transforms/SLPVectorizer/X86/reorder-phi-operand.ll
+3	-3	llvm/test/Transforms/SLPVectorizer/X86/reorder_diamond_match.ll
+2	-4	llvm/test/Transforms/SLPVectorizer/RISCV/reductions.ll
+2	-2	llvm/test/Transforms/SLPVectorizer/X86/phi.ll
+4	-0	llvm/lib/Analysis/TargetTransformInfo.cpp
+2	-2	llvm/test/Transforms/SLPVectorizer/X86/gathered-shuffle-resized.ll
+3	-1	llvm/test/Transforms/SLPVectorizer/X86/buildvector-schedule-for-subvector.ll
+1	-1	llvm/test/Transforms/SLPVectorizer/X86/vec_list_bias.ll
+1	-1	llvm/test/Transforms/SLPVectorizer/X86/lookahead.ll
+2	-0	llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
+1	-1	llvm/test/Transforms/SLPVectorizer/X86/vec_list_bias-inseltpoison.ll
+1	-1	llvm/test/Transforms/SLPVectorizer/X86/vec_list_bias_external_insert_shuffled.ll
+1	-0	llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+1	-0	llvm/lib/Target/X86/X86TargetTransformInfo.h
+1,937	-434	33 files

Unified Split Raw