[LoopInterchange] Constrain number of load/stores in a loop (#118973)
In the current state of the code, the transform computes entries for the
dependency matrix until `MaxMemInstrCount` which is 100. After 99th
entry, it terminates and thus overall wastes compile-time.
It would be nice if we can compute total number of entries upfront and
early exit if the number of entries > 100. However, computing the number
of entries is not always possible as it depends on two factors:
1. Number of load-store pairs in a loop.
2. Number of common loop levels for each of the pair.
This patch constrains the whole computation on the number of loads and
stores instructions in the loop.
In another approach, I experimented with computing 1 and constraining
the number of pairs, but that did not lead to any additional benefit in
terms of compile time. However, when other issues are fixed, I can
revisit this approach.
[OpenMP] Make each atomic helper take an atomic scope argument (#122786)
Summary:
Right now we just default to device for each type, and mix an ad-hoc
scope with the one used by the compiler's builtins. Unify this can make
each version take the scope optionally.
For @ronlieb, this will remove the need for `add_system` in the fork as
well as the extra `cas` with system scope, just pass `system`.
[OpenMP] Adjust 'printf' handling in the OpenMP runtime (#123670)
Summary:
We used to avoid a lot of this stuff because we didn't properly handle
variadics in device code. That's been solved for now, so we can just
make an internal printf handler that forwards to the external `vprintf`
function. This is either provided by NVIDIA's SDK or by the GPU libc
implementation.
The main reason for doing this is because it prevents the stupid AMDGPU
printf pass from mangling our beautiful printfs!
DAG: Fix vector_shuffle -> splat fold defining undef lanes
For shuffle vector splats with undef lanes in the mask,
this was introducing real values. Filter out build_vector
results based on the undef elements in the mask.
This avoids AMDGPU test regressions in a future change.
test/CodeGen/X86/urem-seteq-illegal-types.ll looks worse
but I didn't investigate.
[Flang] Add LLVM lowering support for UNTIED clause in Task (#121052)
Implementation details:
The UNTIED clause is recognized by setting the flag=0 for the default
case or performing logical OR to flag if other clauses are specified,
and this flag is passed as an argument to the `__kmpc_omp_task_alloc`
runtime call.
Resubmitting the PR with fix for the failure, as it was reverted here:
927a70daf31b1610627f346b0dc140eda72144b9
and previously merged here: https://github.com/llvm/llvm-project/pull/115283
AMDGPU: Expand shuffle testing with generated tests (#123574)
Add some generated tests with every shuffle permutation
for relevant vector element types and sizes. Not sure if this
is going overboard with the number of tests. I pruned out the largest
cases (16 and 32-bit cases are impractically large), and there's
redundancy when testing the pointer cases (at least for SelectionDAG).
This uses inline assembly to produce sample values because of how the
ABI is lowered when using a function argument. Since we break all
arguments into 32-bit pieces, a shuffle never ends up forming. We
need separate handling to reconstruct shuffles in contexts involving
physical registers in ABI contexts.
I wrote a small tool to generate these, so I can easily change the
exact test body. Not sure if it's worth posting anywhere.
This is in preparation for making better use of v_pk_mov_b32,
v_mov_b64 and s_mov_b64 in shuffles.
[SPIR-V] Fix type compatibility in memory order comparisons (#123676)
Fixed a type mismatch issue in the comparison of std::memory_order with
integers.
This fixes an issue reported by clang-debian-cpp20 buildbot for
https://github.com/llvm/llvm-project/pull/123654
[OffloadBundler] Compress bundles over 4GB (#122307)
Added initial support for version 3 of the compressed offload bundle
format, which uses 64-bit fields for Total File Size and Uncompressed
Binary Size. This enables support for files larger than 4GB. The support
is currently experimental and can be enabled by setting the environment
variable `COMPRESSED_BUNDLE_FORMAT_VERSION=3`.
[TableGen][NFC] Factor early-out range check. (#123645)
Combine the EarlyOut and IsContiguous range check.
Also avoid "comparison is always false" warnings in emitted code when
the lower-bound check is against 0.
[Github] Fix container push job
This patch fixes a typo impacting functionality and also adds the relevant
variables to the step outputs list so they can actually get picked up by the
push container step.
Workflows: Drop Windows release builds and use more powerful runners for others (#117111)
We have community provided Windows builds that are better than what we
can build on GitHub. For the Linux/X86 builds and Mac/Aarch64 builds we
will use depot runners, for Mac/X86 we will use the larger GitHub
runners.
[DirectX] Set the EnableRawAndStructuredBuffers shader flag (#122667)
When raw or structured buffers are used, we need to set the DXIL flag
saying so.
Fixes #122663.
[SLP]Fix vector factor for repeated node for bv
When adding a node vector, when it is used already in the shuffle for
buildvector, need to calculate vector factor from all vector, not only
this single vector, to avoid incorrect result. Also, need to increase
stability of the reused entries detection to avoid mismatch in cost
estimation/codegen.
Fixes #123639
Reapply "[AMDGPU] Handle natively unsupported types in addrspace(7) lowering" (#123660)
(#123657)
This reverts commit 64749fb01538fba2b56d9850497d5f3a626cabc2.
Adds a constructor to VecSlice to address the failure
[LLD][COFF] Simplify creation of .edata chunks (NFC) (#123651)
Since commit dadc6f2488684, only the constructor of the `EdataContents`
class is used. Replace it with a function and skip the call when using a
custom `.edata` section.